Couple of random thoughts:
1) The latest Solr (4.8) supports nested documents as well as the
expand component. Maybe that will let you have a more efficient
architecture: http://heliosearch.org/expand-block-join/

2) Do you return the OCR text to the client, or just search it? If you
only search it, you don't need to store it.

3) If you do need to store it and return it, do you always have to
return it? If not, you could look at lazy-loading the field (the
enableLazyFieldLoading setting in solrconfig.xml).

4) Is the OCR field text or an image? The stored fields are compressed
by default; I wonder if the compression/decompression of a large image
is an issue.

5) JDK 8 apparently makes Lucene much happier (some operations are
faster). Might be something to test if all else fails.
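
For points 2 and 3, here is a rough sketch of the two settings involved,
not a tested configuration. The field type name "text_general" is an
assumption; keep whatever type your ocr field already uses. As far as I
know, highlighting needs the field to be stored, so stored="false" only
applies if you search the text without returning or highlighting it.

```xml
<!-- schema.xml: index the OCR text for search only, without storing it.
     The type name "text_general" is an assumption; keep your existing type. -->
<field name="ocr" type="text_general" indexed="true" stored="false"/>

<!-- solrconfig.xml, inside the <query> section: stored fields that are not
     listed in fl are loaded only when they are actually requested. -->
<query>
  <enableLazyFieldLoading>true</enableLazyFieldLoading>
</query>
```

Changing stored on an existing field requires a full reindex before the
.fdt file shrinks.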

Regards,
   Alex.
Personal website: http://www.outerthoughts.com/
Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency


On Tue, Apr 29, 2014 at 3:28 PM, Jochen Barth
<ba...@ub.uni-heidelberg.de> wrote:
> Dear reader,
>
> I'm trying to use solr for a hierarchical search:
> metadata from the higher-levelled elements is copied to the lower ones,
> and each element has the complete ocr text which it belongs to.
>
> At volume level, of course, we will have the complete ocr text in one
> <doc> and we need to store it for highlighting.
>
> My solr instance is configured like this:
> java -Xms12000m -Xmx12000m -jar start.jar
> [ imported with 4.7.0, performance tests with 4.8.0 ]
>
> Solr index files are of this size:
>   0.013gb .tip The index into the Term Dictionary
>   0.017gb .nvd Encodes length and boost factors for docs and fields
>   0.546gb .tim The term dictionary, stores term info
>   1.332gb .doc Contains the list of docs which contain each term along
> with frequency
>   4.943gb .pos Stores position information about where a term occurs in
> the index
>  12.743gb .tvd Contains information about each document that has term
> vectors
>  17.340gb .fdt The stored fields for documents "ocr"
>
> With the ocr field configured as non-stored, I get these performance
> figures (see docs/s) after warmup:
>
> jb@serv7:~> perl solr-performance.pl zeit 6
> http://127.0.0.1:58983/solr/collection1/select
> ?wt=json
> &q={%21q.op%3dAND}ocr%3A%28zeit%29
> &fq=mashed_b%3Afalse
> &fl=id
> &sort=sort_name_s asc,id+asc
> &rows=1000000
> time: 3.96 s
> bytes: 1.878 MB
> 64768 docs found; got 64768 docs
> 16353 docs/s; 0.474 MB/s
>
> ... and with ocr stored, even _not_ requesting ocr with fl=... with
> disabled <documentCache class="solr.LRUCache" ... /> and
> <enableLazyFieldLoading>false</enableLazyFieldLoading>
> [ with <documentCache and <enableLazyFieldLoading results are even worse ]
>
> ... using solr-4.7.0 and ubuntu12.04 openjdk7 (...u51):
> jb@serv7:~> perl solr-performance.pl zeit 6
> http://127.0.0.1:58983/solr/collection1/select
> ?wt=json
> &q={%21q.op%3dAND}ocr%3A%28zeit%29
> &fq=mashed_b%3Afalse
> &fl=id
> &sort=sort_name_s asc,id+asc
> &rows=1000000
> time: 61.58 s
> bytes: 1.878 MB
> 64768 docs found; got 64768 docs
> 1052 docs/s; 0.030 MB/s
>
> ... using solr-4.8.0 and oracle-jdk1.7.0_55 :
> jb@serv7:~> perl solr-performance.pl zeit 6
> http://127.0.0.1:58983/solr/collection1/select
> ?wt=json&q={%21q.op%3dAND}ocr%3A%28zeit%29
> &fq=mashed_b%3Afalse
> &fl=id
> &sort=sort_name_s asc,id+asc
> &rows=1000000
> time: 58.80 s
> bytes: 1.878 MB
> 64768 docs found; got 64768 docs
> 1102 docs/s; 0.032 MB/s
>
> Is there any reason why stored vs non-stored is 16 times slower?
> Is there a way to store the "ocr" field in a separate index or something
> like this?
>
> Kind regards,
> J. Barth
>
>
>
>
> --
> J. Barth * IT, Universitaetsbibliothek Heidelberg * 06221 / 54-2580
>
> pgp public key:
> http://digi.ub.uni-heidelberg.de/barth%40ub.uni-heidelberg.de.asc
