Hi,
We're indexing a lot of dirty OCR, so the index is really huge due to
the size of the position file. We still get OK response times, though,
with a median of 100ms. Phrase queries are a different matter,
obviously. But after adding a couple of fields we're seeing a really
large increase in index size that does not make sense.
Our 500,000 document index is 120G. Its simple schema is:
<field name="id" type="string" indexed="true" stored="true"
required="true"/>
<field name="ocr" type="Ocr" indexed="true" stored="false" required="true"/>
<field name="title" type="Ocr" indexed="true" stored="true"
required="true"/>
<field name="author" type="Ocr" indexed="true" stored="true"
required="true"/>
<field name="rights" type="sint" indexed="true" stored="true"
required="true"/>
We added the following 2 fields to the above schema:
<field name="date" type="date" indexed="true" stored="true"
required="true"/>
<field name="hlb" type="string" indexed="true" stored="true"
multiValued="true"/>
where the "hlb" field consists of no more than 3-4 strings such as
"Social Science".
Our 500,000 document index size increased to 166G! This seems
completely wrong. Looking at the directory listings for each case, it
appears nearly every one of the files grew in size.
How can this be?
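
To show why this looks wrong, here is a rough back-of-envelope sketch in
Python of how many stored bytes the two new fields might be expected to
add. The per-value sizes are assumptions for illustration only (a
20-character ISO date string, and "Social Science" as a typical "hlb"
value); it ignores per-field overhead and the indexed terms.

```python
# Rough estimate of stored-field bytes added by the "date" and "hlb" fields.
# The sample values below are assumptions, not measurements from the index.
NUM_DOCS = 500_000                     # document count quoted in the post
SAMPLE_DATE = "1995-12-31T23:59:59Z"   # assumed stored form of a date value
SAMPLE_HLB = "Social Science"          # example "hlb" value from the post
HLB_VALUES_PER_DOC = 4                 # upper end of "3-4 strings"


def stored_bytes_estimate(num_docs, date_bytes, values_per_doc, bytes_per_value):
    """Return num_docs * (one date value + values_per_doc hlb values)."""
    per_doc = date_bytes + values_per_doc * bytes_per_value
    return num_docs * per_doc


estimate = stored_bytes_estimate(
    NUM_DOCS, len(SAMPLE_DATE), HLB_VALUES_PER_DOC, len(SAMPLE_HLB)
)
print(f"estimated extra stored bytes: {estimate / 1e6:.0f} MB")
```

Under those assumptions the two fields account for tens of megabytes of
stored data, nowhere near tens of gigabytes.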
Phil
===
120G index:
-rw-r--r-- 1 tomcat admin 81023261 Sep 24 06:00 _fj.fdt
-rw-r--r-- 1 tomcat admin 4000072 Sep 24 06:00 _fj.fdx
-rw-r--r-- 1 tomcat admin 33 Sep 24 06:00 _fj.fnm
-rw-r--r-- 1 tomcat admin 14069125169 Sep 24 06:16 _fj.frq
-rw-r--r-- 1 tomcat admin 1500031 Sep 24 06:16 _fj.nrm
-rw-r--r-- 1 tomcat admin 109247382360 Sep 24 08:25 _fj.prx
-rw-r--r-- 1 tomcat admin 58677668 Sep 24 08:25 _fj.tii
-rw-r--r-- 1 tomcat admin 4319853217 Sep 24 08:32 _fj.tis
-rw-r--r-- 1 tomcat admin 42 Sep 24 08:32 segments_fo
-rw-r--r-- 1 tomcat admin 20 Sep 24 08:32 segments.gen
166G index (+ 2 fields):
-rw-r--r-- 1 tomcat admin 113530692 Oct 21 10:42 _fh.fdt
-rw-r--r-- 1 tomcat admin 3960256 Oct 21 10:42 _fh.fdx
-rw-r--r-- 1 tomcat admin 44 Oct 21 10:42 _fh.fnm
-rw-r--r-- 1 tomcat admin 15242830112 Oct 21 12:58 _fh.frq
-rw-r--r-- 1 tomcat admin 1485100 Oct 21 12:58 _fh.nrm
-rw-r--r-- 1 tomcat admin 117927610810 Oct 21 12:58 _fh.prx
-rw-r--r-- 1 tomcat admin 72760439 Oct 21 12:58 _fh.tii
-rw-r--r-- 1 tomcat admin 5337669551 Oct 21 12:58 _fh.tis
-rw-r--r-- 1 tomcat admin 42 Oct 21 12:58 segments_fk
-rw-r--r-- 1 tomcat admin 20 Oct 21 12:58 segments.gen
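
For anyone who wants to compare the two listings above file by file,
here is a minimal Python sketch. The byte counts are copied from the
listings; the dictionary and function names are made up for this
example, and the tiny segments_* and segments.gen files are left out.

```python
# Per-extension sizes in bytes, copied from the two directory listings.
BEFORE = {  # "120G index", segment _fj
    "fdt": 81023261, "fdx": 4000072, "fnm": 33, "frq": 14069125169,
    "nrm": 1500031, "prx": 109247382360, "tii": 58677668, "tis": 4319853217,
}
AFTER = {  # "166G index (+ 2 fields)", segment _fh
    "fdt": 113530692, "fdx": 3960256, "fnm": 44, "frq": 15242830112,
    "nrm": 1485100, "prx": 117927610810, "tii": 72760439, "tis": 5337669551,
}


def total(sizes):
    """Sum of all listed file sizes, in bytes."""
    return sum(sizes.values())


def growth(before, after):
    """Map each extension to (delta in bytes, percent change)."""
    return {
        ext: (after[ext] - before[ext],
              100.0 * (after[ext] - before[ext]) / before[ext])
        for ext in before
    }


deltas = growth(BEFORE, AFTER)
shrunk = sorted(ext for ext, (delta, _) in deltas.items() if delta < 0)

if __name__ == "__main__":
    for ext, (delta, pct) in sorted(deltas.items()):
        print(f".{ext:<4} {delta:>15,d} bytes  {pct:+7.2f}%")
    print(f"before: {total(BEFORE) / 2**30:.1f} GiB")
    print(f"after:  {total(AFTER) / 2**30:.1f} GiB")
    print("smaller in second listing:", shrunk)
```

With the sizes as listed, the per-segment files sum to roughly 119 GiB
and 129 GiB. Most of the difference is in the .prx file, and the .fdx
and .nrm files are slightly smaller in the second listing. The second
total is below the 166G quoted above, so that figure may include files
not shown in the listing.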