I don't think it's ever been discussed - your Q below is #1 hit currently:
http://search-lucene.com/?q=%2B%28dih+OR+dataimporthandler%29+hdfs

Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/
----- Original Message -----
> From: Jon Baer <jonb...@gmail.com>
> To: solr-user@lucene.apache.org
> Sent: Tue, June 22, 2010 12:47:14 PM
> Subject: Re: solr with hadoop
>
> I was playing around w/ Sqoop the other day; it's a simple Cloudera tool for
> imports (mysql -> hdfs):
> http://www.cloudera.com/developers/downloads/sqoop/
>
> It seems to me it would be pretty efficient to dump to HDFS and have
> something like Data Import Handler be able to read from hdfs:// directly ...
> Has this route been discussed / developed before (ie DIH w/ an hdfs://
> handler)?
>
> - Jon
>
> On Jun 22, 2010, at 12:29 PM, MitchK wrote:
>
> I wanted to add a Jira issue about exactly what Otis is asking here.
> Unfortunately, I haven't had time for it because of my exams.
>
> However, I'd like to add a question to Otis's:
> If you distribute the indexing process this way, are you able to replicate
> the different documents correctly?
>
> Thank you.
> - Mitch
>
> Otis Gospodnetic-2 wrote:
>>
>> Stu,
>>
>> Interesting! Can you provide more details about your setup? By "load
>> balance the indexing stage" you mean "distribute the indexing process",
>> right? Do you simply take your content to be indexed, split it into N
>> chunks where N matches the number of TaskNodes in your Hadoop cluster, and
>> provide a map function that does the indexing? What does the reduce
>> function do? Does that call IndexWriter.addAllIndexes, or do you do that
>> outside Hadoop?
>>
>> Thanks,
>> Otis
>> --
>> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>>
>> ----- Original Message -----
>> From: Stu Hood <stuh...@webmail.us>
>> To: solr-user@lucene.apache.org
>> Sent: Monday, January 7, 2008 7:14:20 PM
>> Subject: Re: solr with hadoop
>>
>> As Mike suggested, we use Hadoop to organize our data en route to Solr.
>> Hadoop allows us to load balance the indexing stage, and then we use
>> the raw Lucene IndexWriter.addAllIndexes method to merge the data to be
>> hosted on Solr instances.
>>
>> Thanks,
>> Stu
>>
>>
>> -----Original Message-----
>> From: Mike Klaas <mike.kl...@gmail.com>
>> Sent: Friday, January 4, 2008 3:04pm
>> To: solr-user@lucene.apache.org
>> Subject: Re: solr with hadoop
>>
>> On 4-Jan-08, at 11:37 AM, Evgeniy Strokin wrote:
>>
>>> I have a huge index base (about 110 million documents, 100 fields
>>> each), but the size of the index base is reasonable, about 70 GB.
>>> All I need is to increase performance, since some queries, which match
>>> a big number of documents, are running slow.
>>> So I was wondering: are there any benefits to using Hadoop for this?
>>> And if so, what direction should I go? Has anybody done something to
>>> integrate Solr with Hadoop? Does it give any performance boost?
>>>
>> Hadoop might be useful for organizing your data en route to Solr, but
>> I don't see how it could be used to boost performance over a huge
>> Solr index. To accomplish that, you need to split it up over two
>> machines (for which you might find Hadoop useful).
>>
>> -Mike
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/solr-with-hadoop-tp482688p914589.html
> Sent from the Solr - User mailing list archive at Nabble.com.
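[Editor's note] The pipeline Stu and Otis discuss above — split the corpus into N chunks matching the task nodes, have each map task index its chunk, then merge the partial indexes in the reduce step — can be sketched concretely. Below is a minimal, hypothetical illustration in plain Python, not the thread's actual code: dicts stand in for Lucene index segments, and `merge_indexes` plays the role the thread assigns to `IndexWriter.addAllIndexes` (in Lucene's API the merge method is spelled `IndexWriter.addIndexes`). All function names are illustrative.

```python
from collections import defaultdict

def partition(docs, n_nodes):
    """Split the corpus into N chunks, one per task node (Otis's question)."""
    chunks = [[] for _ in range(n_nodes)]
    for doc_id, text in docs:
        chunks[hash(doc_id) % n_nodes].append((doc_id, text))
    return chunks

def index_chunk(chunk):
    """Map phase: each task builds a partial inverted index for its chunk."""
    index = defaultdict(set)
    for doc_id, text in chunk:
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def merge_indexes(partials):
    """Reduce phase: merge partial indexes into one (cf. IndexWriter.addIndexes)."""
    merged = defaultdict(set)
    for partial in partials:
        for term, postings in partial.items():
            merged[term] |= postings
    return merged

docs = [("d1", "solr with hadoop"),
        ("d2", "hadoop indexing"),
        ("d3", "solr replication")]

# Two "task nodes": partition, index each chunk independently, then merge.
merged = merge_indexes(index_chunk(c) for c in partition(docs, 2))
print(sorted(merged["hadoop"]))  # every doc containing "hadoop", whichever chunk it landed in
```

This also shows why Mitch's replication question matters: each partial index only knows about its own chunk, so correctness depends entirely on the merge step seeing every chunk exactly once.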