Check out Katta: it can pull indexes from HDFS and deploy them into your search cluster. Katta also handles index directories that have been packed into a zip file, and it can pull indexes from any file system that Hadoop supports: hdfs, s3, hftp, file, etc.
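The zip handling mentioned above is just ordinary archive packing of the index directory before it is pushed out. As a rough illustration (class and method names here are made up for the sketch, not Katta's API), packing an index directory with only the JDK might look like:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public class PackIndex {
    // Pack every regular file under indexDir into a single zip archive,
    // preserving relative paths. This mirrors the "index directory packed
    // into a zip file" layout that a deployer can unpack on the search
    // node; names here are illustrative only.
    public static void pack(Path indexDir, Path zipFile) throws IOException {
        try (ZipOutputStream zos = new ZipOutputStream(Files.newOutputStream(zipFile));
             Stream<Path> files = Files.walk(indexDir)) {
            for (Path p : (Iterable<Path>) files.filter(Files::isRegularFile)::iterator) {
                zos.putNextEntry(new ZipEntry(indexDir.relativize(p).toString()));
                Files.copy(p, zos);
                zos.closeEntry();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Fake a tiny "index" directory with a couple of segment-like files.
        Path dir = Files.createTempDirectory("index");
        Files.writeString(dir.resolve("_0.cfs"), "segment data");
        Files.writeString(dir.resolve("segments_1"), "segments file");
        Path zip = dir.resolveSibling("index.zip");
        pack(dir, zip);
        System.out.println(Files.exists(zip) && Files.size(zip) > 0);
    }
}
```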
We have been doing this with our Solr (SOLR-1301) indexes and getting an 80% reduction in size, which is a big gain for us. I need to feed a 2-line change back into SOLR-1301: in some situations the close method can currently fail to heartbeat while the optimize is happening.

On Tue, Oct 6, 2009 at 9:30 PM, ctam <ctamra...@gmail.com> wrote:
>
> hi Ning, I am also looking at different approaches to indexing with Hadoop.
> I could index into HDFS using the contrib package for Hadoop, but since HDFS
> is not designed for random access, what would be the recommended ways to
> move the indexes to the local file system?
>
> Also, what would be the best approach to begin with? Should we look into
> Katta or the Solr integrations?
>
> thanks in advance.
>
>
> Ning Li-5 wrote:
> >
> >> I'm missing why you would ever want the Lucene index in HDFS for
> >> reading.
> >
> > The Lucene indexes are written to HDFS, but that does not mean you
> > conduct search on the indexes stored in HDFS directly. HDFS is not
> > designed for random access. Usually the indexes are copied to the
> > nodes where search will be served. With
> > http://issues.apache.org/jira/browse/HADOOP-4801, however, it may
> > become feasible to search on HDFS directly.
> >
> > Cheers,
> > Ning
> >
> >
> > On Mon, Mar 16, 2009 at 4:52 PM, Ian Soboroff <ian.sobor...@nist.gov> wrote:
> >>
> >> Does anyone have stats on how multiple readers on an optimized Lucene
> >> index in HDFS compare with a ParallelMultiReader (or whatever it's
> >> called) over RPC on a local filesystem?
> >>
> >> I'm missing why you would ever want the Lucene index in HDFS for
> >> reading.
> >>
> >> Ian
> >>
> >> Ning Li <ning.li...@gmail.com> writes:
> >>
> >>> I should have pointed out that the Nutch index build and contrib/index
> >>> target different applications. The latter is for applications that
> >>> simply want to build a Lucene index from a set of documents - e.g. no
> >>> link analysis.
> >>>
> >>> As to writing Lucene indexes, both work the same way - write the final
> >>> results to the local file system and then copy to HDFS. In contrib/index,
> >>> the intermediate results are kept in memory and not written to HDFS.
> >>>
> >>> Hope this clarifies things.
> >>>
> >>> Cheers,
> >>> Ning
> >>>
> >>>
> >>> On Mon, Mar 16, 2009 at 2:57 PM, Ian Soboroff <ian.sobor...@nist.gov> wrote:
> >>>>
> >>>> I understand why you would index in the reduce phase, because the
> >>>> anchor text gets shuffled to be next to the document. However, when
> >>>> you index in the map phase, don't you just have to reindex later?
> >>>>
> >>>> The main point to the OP is that HDFS is a bad FS for writing Lucene
> >>>> indexes because of how Lucene works. The simple approach is to write
> >>>> your index outside of HDFS in the reduce phase, and then merge the
> >>>> indexes from each reducer manually.
> >>>>
> >>>> Ian
> >>>>
> >>>> Ning Li <ning.li...@gmail.com> writes:
> >>>>
> >>>>> Or you can check out the index contrib. The difference between the
> >>>>> two is that:
> >>>>> - In Nutch's indexing map/reduce job, indexes are built in the
> >>>>> reduce phase. Afterwards, they are merged into a smaller number of
> >>>>> shards if necessary. The last time I checked, the merge process does
> >>>>> not use map/reduce.
> >>>>> - In contrib/index, small indexes are built in the map phase. They
> >>>>> are merged into the desired number of shards in the reduce phase. In
> >>>>> addition, they can be merged into existing shards.
> >>>>>
> >>>>> Cheers,
> >>>>> Ning
> >>>>>
> >>>>>
> >>>>> On Fri, Mar 13, 2009 at 1:34 AM, 王红宝 <imcap...@126.com> wrote:
> >>>>>> you can see the nutch code.
> >>>>>>
> >>>>>> 2009/3/13 Mark Kerzner <markkerz...@gmail.com>
> >>>>>>
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>> How do I allow multiple nodes to write to the same index file in
> >>>>>>> HDFS?
> >>>>>>>
> >>>>>>> Thank you,
> >>>>>>> Mark
> >>>>>>>
>
> --
> View this message in context:
> http://www.nabble.com/Creating-Lucene-index-in-Hadoop-tp22490120p25780366.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>

--
Pro Hadoop, a book to guide you from beginner to hadoop mastery,
http://www.amazon.com/dp/1430219424?tag=jewlerymall
www.prohadoopbook.com a community for Hadoop Professionals
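(For what it's worth, the "write locally, then copy up" pattern Ning describes can be sketched with plain JDK I/O. In the sketch below a local shared directory stands in for the HDFS destination - a real job would do the copy through Hadoop's FileSystem API instead - and all names are illustrative.)

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.stream.Stream;

public class PublishIndex {
    // Each reducer writes its index segment to fast local disk first, then
    // copies the finished files to shared storage in one pass. sharedDir
    // stands in for an HDFS path; in a real job this copy would go through
    // Hadoop's FileSystem API rather than java.nio.
    public static void publish(Path localIndexDir, Path sharedDir) throws IOException {
        Files.createDirectories(sharedDir);
        try (Stream<Path> files = Files.list(localIndexDir)) {
            for (Path p : (Iterable<Path>) files::iterator) {
                Files.copy(p, sharedDir.resolve(p.getFileName().toString()),
                           StandardCopyOption.REPLACE_EXISTING);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Simulate one reducer's locally written index, then "publish" it.
        Path local = Files.createTempDirectory("reducer-index");
        Files.writeString(local.resolve("segments_1"), "segments");
        Path shared = Files.createTempDirectory("shared").resolve("shard-0");
        publish(local, shared);
        System.out.println(Files.exists(shared.resolve("segments_1")));
    }
}
```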