Re: Separating the document dataset and the index dataset

Ramprakash Ramamoorthy Tue, 11 Dec 2012 02:36:05 -0800

On Tue, Dec 11, 2012 at 3:14 PM, Uwe Schindler <[email protected]> wrote:


> You can use Lucene 4.1 nightly builds from http://goo.gl/jZ6YD. It is
> not yet released, but upgrading from Lucene 4.0 is easy. If you are not yet
> on Lucene 4.0, there is more work to do. In that case, a solution to your
> problem would be to save the stored fields in a separate database (or
> similar) and add only *one* stored field to your index, containing the
> document's ID in this external database.
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: [email protected]


Thank you, Uwe. I already tried the nightly build, but its codecs.jar does
not contain a compressing codec at all. I also tried checking out trunk and
compiling it myself, with the same result: the package
*org.apache.lucene.codecs.compressing* is missing. Any pointers?
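
For anyone following the thread, the fallback Uwe describes above (keep the
stored fields outside the index and store only an ID in it) can be sketched
in plain Java. This is only an illustration under stated assumptions, not
Lucene code: the class ExternalStoreSketch and its methods add, search and
fetch are hypothetical names, an in-memory map of terms stands in for the
Lucene index, and another in-memory map stands in for the external database.
Each document is compressed on its own with java.util.zip, so a search or a
count never decompresses anything, and a fetch decompresses one document only.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical sketch: none of these names come from Lucene.
class ExternalStoreSketch {
    // Stand-in for the index: term -> IDs of the documents containing it.
    private final Map<String, SortedSet<String>> postings = new HashMap<>();
    // Stand-in for the external database: document ID -> compressed text.
    private final Map<String, byte[]> store = new HashMap<>();

    // Index the terms and keep the original text, compressed, outside the index.
    void add(String id, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(id);
            }
        }
        store.put(id, compress(text.getBytes(StandardCharsets.UTF_8)));
    }

    // Searching and counting touch only the postings; nothing is decompressed.
    SortedSet<String> search(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), new TreeSet<>());
    }

    // Only the requested document is decompressed; returns null for unknown IDs.
    String fetch(String id) {
        byte[] packed = store.get(id);
        return packed == null ? null : new String(decompress(packed), StandardCharsets.UTF_8);
    }

    static byte[] compress(byte[] raw) {
        Deflater deflater = new Deflater();
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return out.toByteArray();
    }

    static byte[] decompress(byte[] packed) {
        Inflater inflater = new Inflater();
        inflater.setInput(packed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && !inflater.finished()) {
                    throw new IllegalStateException("truncated compressed data");
                }
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException("corrupt compressed data", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }
}
```

In this sketch, search("error").size() gives a hit count without reading
the store at all, and fetch(id) is called only for the hits that are
actually displayed. With a real index, the single stored field would carry
the same ID that keys the external store.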

>
>
>
> > -----Original Message-----
> > From: Ramprakash Ramamoorthy [mailto:[email protected]]
> > Sent: Tuesday, December 11, 2012 10:32 AM
> > To: [email protected]
> > Subject: Re: Separating the document dataset and the index dataset
> >
> > On Fri, Dec 7, 2012 at 1:11 PM, Jain Rahul <[email protected]> wrote:
> >
> > > If you are using Lucene 4.0 and can afford to compress your document
> > > dataset while indexing, it will be a huge saving in terms of disk
> > > space and also in I/O (resulting in better indexing throughput).
> > >
> > > In our case, it has helped us a lot, as the compressed data was
> > > roughly 3 times smaller than the original document dataset.
> > >
> > > You may want to check the link below.
> > >
> > >
> > > http://blog.jpountz.net/post/33247161884/efficient-compressed-stored-fields-with-lucene
> > >
> > > Regards,
> > > Rahul
> > >
> >
> > Thank you, Rahul. That indeed seems promising. Just one question: how do
> > I plug this CompressingStoredFieldsFormat into my app? I tried bundling
> > it in a codec, but I am not sure whether I am on the right path. Any
> > pointers would be of great help!
> >
> > >
> > >
> > > -----Original Message-----
> > > From: Ramprakash Ramamoorthy [mailto:[email protected]]
> > > Sent: 07 December 2012 13:03
> > > To: [email protected]
> > > Subject: Separating the document dataset and the index dataset
> > >
> > > Greetings,
> > >
> > > We are using Lucene in our log analysis tool. We receive around
> > > 35 GB of data a day, and our practice is to zip indices once they
> > > are a week old and unzip them when the need arises.
> > >
> > > Though the compression saves a lot of disk space, the decompression
> > > becomes an overhead. At times it takes around 10 minutes
> > > (decompression takes 95% of the time) to search across a month-long
> > > set of logs. We need to unzip an index fully even just to get the
> > > total count from it.
> > >
> > > My question is: we are setting Index.Store to true. Is there a way
> > > to split the index dataset from the document dataset? In my
> > > understanding, if such a separation is possible, the document
> > > dataset alone can be zipped, leaving the index dataset on disk. Is
> > > it feasible to do this? Any pointers?
> > >
> > > Or is adding more disks the only solution? Thanks in advance!
> > >
> > > --
> > > With Thanks and Regards,
> > > Ramprakash Ramamoorthy,
> > > +91 9626975420
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: [email protected]
> > > For additional commands, e-mail: [email protected]
> > >
> > >
> >
> >
> > --
> > With Thanks and Regards,
> > Ramprakash Ramamoorthy,
> > Engineer Trainee,
> > Zoho Corporation.
> > +91 9626975420
>
>
>
>


-- 
With Thanks and Regards,
Ramprakash Ramamoorthy,
Engineer Trainee,
Zoho Corporation.
+91 9626975420
