RE: Separating the document dataset and the index dataset

Jain Rahul Tue, 11 Dec 2012 01:48:47 -0800

Hi Ram,

You need lucene-codec.jar on the classpath; it contains CompressingCodec and 
the related classes.


If your application is built directly on Lucene, you can set the codec by 
calling setCodec(Codec codec) on IndexWriterConfig.

If you are using Solr, however, I could not find a clean way to do it, so I 
made the small hack below in Codec.java. Perhaps someone from the community 
can suggest a neater solution.

org.apache.lucene.codecs.Codec sets Lucene40 as the default codec. I changed 
it so that the "Compressing" codec can be selected with a system property, 
for example -Dlucene.codec=Compressing:

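  // Original default:
  //private static Codec defaultCodec = Codec.forName("Lucene40");
  // Patched: read the codec name from -Dlucene.codec, falling back to Lucene40.
  private static Codec defaultCodec =
      Codec.forName(System.getProperty("lucene.codec", "Lucene40"));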
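
The fallback behaviour of that one-line change can be tried outside Lucene. 
This is only a minimal sketch: the class CodecSelection and the method 
resolveCodecName are hypothetical names made up for illustration, and the 
code resolves just the codec name. In the patched Codec.java the resolved 
name is what gets passed to Codec.forName(...).

```java
import java.util.Properties;

// Hypothetical stand-in for the patched line in Codec.java. It only picks
// the codec *name*; it does not load or register any Lucene codec.
final class CodecSelection {
    static final String PROPERTY = "lucene.codec";   // property name used in the patch
    static final String DEFAULT_CODEC = "Lucene40";  // default used in the patch

    // Returns the configured codec name, or the default when the property is unset.
    static String resolveCodecName(Properties props) {
        return props.getProperty(PROPERTY, DEFAULT_CODEC);
    }

    public static void main(String[] args) {
        // Prints the default name unless the JVM was started with -Dlucene.codec=<name>.
        System.out.println(resolveCodecName(System.getProperties()));
    }
}
```

Because the property is read once when the static field is initialised, it 
has to be passed on the JVM command line (or set before the Codec class is 
first loaded) to take effect.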

Regards,
Rahul

-----Original Message-----
From: Ramprakash Ramamoorthy [mailto:[email protected]]
Sent: 11 December 2012 15:02
To: [email protected]
Subject: Re: Separating the document dataset and the index dataset

On Fri, Dec 7, 2012 at 1:11 PM, Jain Rahul <[email protected]> wrote:

> If you are using Lucene 4.0 and can afford to compress your document
> dataset while indexing, it will be a huge saving in terms of disk
> space and also in IO (resulting in better indexing throughput).
>
> In our case it has helped us a lot: the compressed data was roughly
> one third the size of the original document dataset.
>
> You may want to check the link below.
>
> http://blog.jpountz.net/post/33247161884/efficient-compressed-stored-fields-with-lucene
>
> Regards,
> Rahul
>

Thank you Rahul. That indeed seems promising. Just one question: how do I plug 
CompressingStoredFieldsFormat into my app? I tried bundling it in a codec, but 
I am not sure I am on the right path. Any pointers would be of great help!

>
>
> -----Original Message-----
> From: Ramprakash Ramamoorthy [mailto:[email protected]]
> Sent: 07 December 2012 13:03
> To: [email protected]
> Subject: Separating the document dataset and the index dataset
>
> Greetings,
>
> We are using Lucene in our log analysis tool. We get around 35Gb of
> data a day, and our practice is to zip week-old indices and unzip
> them when the need arises.
>
> Though the compression offers a huge saving in disk space, the
> decompression becomes an overhead. At times it takes around 10
> minutes (decompression takes 95% of the time) to search across a
> month-long set of logs. We need to unzip fully, at least to get the
> total count from the index.
>
> My question: we are setting Index.Store to true. Is there a way to
> split the index dataset and the document dataset? In my
> understanding, if separation is possible, the document dataset alone
> could be zipped, leaving the index dataset on disk. Would it be
> feasible to do this? Any pointers?
>
>            Or is adding more disks the only solution? Thanks in advance!
>
> --
> With Thanks and Regards,
> Ramprakash Ramamoorthy,
> +91 9626975420
> This email and any attachments are confidential, and may be legally
> privileged and protected by copyright. If you are not the intended
> recipient dissemination or copying of this email is prohibited. If you
> have received this in error, please notify the sender by replying by
> email and then delete the email completely from your system. Any views
> or opinions are solely those of the sender. This communication is not
> intended to form a binding contract unless expressly indicated to the
> contrary and properly authorised. Any actions taken on the basis of
> this email are at the recipient's own risk.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>
>


--
With Thanks and Regards,
Ramprakash Ramamoorthy,
Engineer Trainee,
Zoho Corporation.
+91 9626975420

