Dear reader,

I'm trying to use Solr for a hierarchical search:
metadata from the higher-level elements is copied down to the lower-level
ones, and each element carries the complete OCR text that belongs to it.

At volume level, of course, this means the complete OCR text of the volume
sits in one <doc>, and we need to store it for highlighting.

My Solr instance is started like this:
java -Xms12000m -Xmx12000m -jar start.jar
[ imported with 4.7.0, performance tests with 4.8.0 ]

Solr index files are of this size:
   0.013 GB  .tip  The index into the Term Dictionary
   0.017 GB  .nvd  Encodes length and boost factors for docs and fields
   0.546 GB  .tim  The term dictionary, stores term info
   1.332 GB  .doc  Contains the list of docs which contain each term along with frequency
   4.943 GB  .pos  Stores position information about where a term occurs in the index
  12.743 GB  .tvd  Contains information about each document that has term vectors
  17.340 GB  .fdt  The stored fields for documents ("ocr")

With the ocr field configured as non-stored, I get these performance
figures (see docs/s) after warmup:

jb@serv7:~> perl solr-performance.pl zeit 6
http://127.0.0.1:58983/solr/collection1/select
?wt=json
&q={%21q.op%3dAND}ocr%3A%28zeit%29
&fq=mashed_b%3Afalse
&fl=id
&sort=sort_name_s asc,id+asc
&rows=1000000
time: 3.96 s
bytes: 1.878 MB
64768 docs found; got 64768 docs
16353 docs/s; 0.474 MB/s
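
For anyone who wants to reproduce the measurement: below is a minimal Python
sketch of how such a timing could be taken. It is not the solr-performance.pl
script used above (its source is not shown here). The function names
build_url, throughput and measure are hypothetical, and the assumption that
1 MB means 10^6 bytes is mine. The URL and parameters are copied from the
request above.

```python
import json
import time
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint copied from the request shown above.
BASE_URL = "http://127.0.0.1:58983/solr/collection1/select"


def build_url(term):
    """Build the select URL for one OCR search term (hypothetical helper)."""
    params = [
        ("wt", "json"),
        ("q", "{!q.op=AND}ocr:(%s)" % term),
        ("fq", "mashed_b:false"),
        ("fl", "id"),
        ("sort", "sort_name_s asc,id asc"),
        ("rows", "1000000"),
    ]
    # urlencode percent-encodes the values and joins them with "&".
    return BASE_URL + "?" + urlencode(params)


def throughput(docs, nbytes, seconds):
    """Return (docs per second, MB per second); assumes 1 MB = 10**6 bytes."""
    return docs / seconds, nbytes / 1e6 / seconds


def measure(term):
    """Time one request against a running Solr and report its throughput."""
    start = time.perf_counter()
    with urlopen(build_url(term)) as response:
        body = response.read()
    seconds = time.perf_counter() - start
    result = json.loads(body.decode("utf-8"))["response"]
    docs_per_s, mb_per_s = throughput(len(result["docs"]), len(body), seconds)
    print("time: %.2f s" % seconds)
    print("%d docs found; got %d docs" % (result["numFound"], len(result["docs"])))
    print("%.0f docs/s; %.3f MB/s" % (docs_per_s, mb_per_s))
```

Calling measure("zeit") needs the Solr instance to be running; build_url and
throughput work without it.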

... and with ocr stored, even when _not_ requesting ocr via fl=..., with
<documentCache class="solr.LRUCache" ... /> disabled and
<enableLazyFieldLoading>false</enableLazyFieldLoading>
[ with documentCache and enableLazyFieldLoading enabled, results are even worse ]

... using solr-4.7.0 and ubuntu12.04 openjdk7 (...u51):
jb@serv7:~> perl solr-performance.pl zeit 6
http://127.0.0.1:58983/solr/collection1/select
?wt=json
&q={%21q.op%3dAND}ocr%3A%28zeit%29
&fq=mashed_b%3Afalse
&fl=id
&sort=sort_name_s asc,id+asc
&rows=1000000
time: 61.58 s
bytes: 1.878 MB
64768 docs found; got 64768 docs
1052 docs/s; 0.030 MB/s

... using solr-4.8.0 and oracle-jdk1.7.0_55 :
jb@serv7:~> perl solr-performance.pl zeit 6
http://127.0.0.1:58983/solr/collection1/select
?wt=json
&q={%21q.op%3dAND}ocr%3A%28zeit%29
&fq=mashed_b%3Afalse
&fl=id
&sort=sort_name_s asc,id+asc
&rows=1000000
time: 58.80 s
bytes: 1.878 MB
64768 docs found; got 64768 docs
1102 docs/s; 0.032 MB/s
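
The slowdown can be read off the three runs above. A small Python sketch of
the arithmetic follows. The figures are copied from the output above; the
labels and the function name slowdowns are mine.

```python
# (label, seconds, docs per second), copied from the three runs above.
runs = [
    ("ocr not stored", 3.96, 16353),
    ("ocr stored, solr-4.7.0 / openjdk7", 61.58, 1052),
    ("ocr stored, solr-4.8.0 / oracle-jdk", 58.80, 1102),
]


def slowdowns(runs):
    """Compare every later run with the first one (the non-stored baseline)."""
    _, base_seconds, base_rate = runs[0]
    return [
        (label, seconds / base_seconds, base_rate / rate)
        for label, seconds, rate in runs[1:]
    ]


for label, by_time, by_rate in slowdowns(runs):
    print("%-36s %.1fx by time, %.1fx by docs/s" % (label, by_time, by_rate))
```

Both stored runs come out at roughly 15 to 16 times slower than the
non-stored run.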

Is there any reason why the same query is about 16 times slower when the
ocr field is stored than when it is not stored?
Is there a way to keep the stored ocr field in a separate index, or
something like that?

Kind regards,
J. Barth




-- 
J. Barth * IT, Universitaetsbibliothek Heidelberg * 06221 / 54-2580

pgp public key:
http://digi.ub.uni-heidelberg.de/barth%40ub.uni-heidelberg.de.asc