
We're indexing a lot of dirty OCR, so the index is huge, mostly because of the size of the position file. We still get OK response times, with a median of 100ms; phrase queries are a different matter, obviously. But we're seeing a really large increase in index size after adding a couple of fields, and it does not make sense.

Our 500,000-document index is 120G. Its simple schema is:

<field name="id" type="string" indexed="true" stored="true" required="true"/>
<field name="ocr" type="Ocr" indexed="true" stored="false" required="true"/>
<field name="title" type="Ocr" indexed="true" stored="true" required="true"/>
<field name="author" type="Ocr" indexed="true" stored="true" required="true"/>
<field name="rights" type="sint" indexed="true" stored="true" required="true"/>

We added the following 2 fields to the above schema:

<field name="date" type="date" indexed="true" stored="true" required="true"/>
<field name="hlb" type="string" indexed="true" stored="true" multiValued="true"/>

where the "hlb" field consists of no more than 3-4 strings such as "Social Science".

Our 500,000-document index size increased to 166G! This seems completely wrong. Looking at the directory listings for each case, it appears that nearly every one of the files grew in size, including the .frq and .prx files.

How can this be?



120G index:

-rw-r--r--  1 tomcat admin     81023261 Sep 24 06:00 _fj.fdt
-rw-r--r--  1 tomcat admin      4000072 Sep 24 06:00 _fj.fdx
-rw-r--r--  1 tomcat admin           33 Sep 24 06:00 _fj.fnm
-rw-r--r--  1 tomcat admin  14069125169 Sep 24 06:16 _fj.frq
-rw-r--r--  1 tomcat admin      1500031 Sep 24 06:16 _fj.nrm
-rw-r--r--  1 tomcat admin 109247382360 Sep 24 08:25 _fj.prx
-rw-r--r--  1 tomcat admin     58677668 Sep 24 08:25 _fj.tii
-rw-r--r--  1 tomcat admin   4319853217 Sep 24 08:32 _fj.tis
-rw-r--r--  1 tomcat admin           42 Sep 24 08:32 segments_fo
-rw-r--r--  1 tomcat admin           20 Sep 24 08:32 segments.gen

166G index (+ 2 fields):

-rw-r--r-- 1 tomcat admin    113530692 Oct 21 10:42 _fh.fdt
-rw-r--r-- 1 tomcat admin      3960256 Oct 21 10:42 _fh.fdx
-rw-r--r-- 1 tomcat admin           44 Oct 21 10:42 _fh.fnm
-rw-r--r-- 1 tomcat admin  15242830112 Oct 21 12:58 _fh.frq
-rw-r--r-- 1 tomcat admin      1485100 Oct 21 12:58 _fh.nrm
-rw-r--r-- 1 tomcat admin 117927610810 Oct 21 12:58 _fh.prx
-rw-r--r-- 1 tomcat admin     72760439 Oct 21 12:58 _fh.tii
-rw-r--r-- 1 tomcat admin   5337669551 Oct 21 12:58 _fh.tis
-rw-r--r-- 1 tomcat admin           42 Oct 21 12:58 segments_fk
-rw-r--r-- 1 tomcat admin           20 Oct 21 12:58 segments.gen
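
To make the comparison easier to read, here is a minimal Python sketch that parses the two listings above and reports the per-file change in bytes and percent. The helper names `parse_listing` and `growth` are made up for illustration and are not part of Solr or Lucene. The sketch assumes `ls -l` style lines with the size in the fifth column, and that files with the same extension in the two segments (_fj and _fh) correspond to each other.

```python
# Sizes copied from the two directory listings (segment files only).
BEFORE = """
-rw-r--r--  1 tomcat admin     81023261 Sep 24 06:00 _fj.fdt
-rw-r--r--  1 tomcat admin      4000072 Sep 24 06:00 _fj.fdx
-rw-r--r--  1 tomcat admin           33 Sep 24 06:00 _fj.fnm
-rw-r--r--  1 tomcat admin  14069125169 Sep 24 06:16 _fj.frq
-rw-r--r--  1 tomcat admin      1500031 Sep 24 06:16 _fj.nrm
-rw-r--r--  1 tomcat admin 109247382360 Sep 24 08:25 _fj.prx
-rw-r--r--  1 tomcat admin     58677668 Sep 24 08:25 _fj.tii
-rw-r--r--  1 tomcat admin   4319853217 Sep 24 08:32 _fj.tis
"""

AFTER = """
-rw-r--r-- 1 tomcat admin    113530692 Oct 21 10:42 _fh.fdt
-rw-r--r-- 1 tomcat admin      3960256 Oct 21 10:42 _fh.fdx
-rw-r--r-- 1 tomcat admin           44 Oct 21 10:42 _fh.fnm
-rw-r--r-- 1 tomcat admin  15242830112 Oct 21 12:58 _fh.frq
-rw-r--r-- 1 tomcat admin      1485100 Oct 21 12:58 _fh.nrm
-rw-r--r-- 1 tomcat admin 117927610810 Oct 21 12:58 _fh.prx
-rw-r--r-- 1 tomcat admin     72760439 Oct 21 12:58 _fh.tii
-rw-r--r-- 1 tomcat admin   5337669551 Oct 21 12:58 _fh.tis
"""


def parse_listing(text):
    """Map file extension -> size in bytes from `ls -l` style lines."""
    sizes = {}
    for line in text.strip().splitlines():
        fields = line.split()
        size, name = int(fields[4]), fields[-1]
        sizes[name.rsplit(".", 1)[1]] = size
    return sizes


def growth(before, after):
    """Map extension -> (delta in bytes, delta in percent of the old size)."""
    return {
        ext: (after[ext] - before[ext],
              100.0 * (after[ext] - before[ext]) / before[ext])
        for ext in before
    }


before = parse_listing(BEFORE)
after = parse_listing(AFTER)
delta = growth(before, after)

# Print the files sorted by absolute growth, largest first.
for ext, (nbytes, pct) in sorted(delta.items(), key=lambda kv: -kv[1][0]):
    print(f".{ext:<4} {nbytes:>15,d} bytes  {pct:+7.2f}%")

print(f"total before: {sum(before.values()):,d} bytes")
print(f"total after:  {sum(after.values()):,d} bytes")
```

Sorting by absolute growth shows which files account for most of the difference, and the totals can be checked against the reported index sizes.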
