Re: Stored vs non-stored very large text fields

2014-05-05 Thread Jochen Barth
I found out that storing documents as separate docs+id does not
help either.

You must have a completely separate collection/core to get things working fast.
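
(A sketch of what I mean, with hypothetical core names: a core "meta" holds the
indexed ocr plus the sort/filter fields, and a core "ocrstore" holds only id
plus the stored ocr; query the first, then fetch snippets for the handful of
ids you actually display, e.g. via hl.q:)

http://127.0.0.1:58983/solr/meta/select?q=ocr%3Azeit&fl=id&sort=sort_name_s+asc,id+asc&rows=100
http://127.0.0.1:58983/solr/ocrstore/select?q=id%3A%28id1+OR+id2%29&fl=id&hl=true&hl.fl=ocr&hl.q=ocr%3Azeit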

Kind regards,
Jochen




Stored vs non-stored very large text fields

2014-04-29 Thread Jochen Barth
Dear reader,

I'm trying to use Solr for a hierarchical search:
metadata from the higher-level elements is copied to the lower ones,
and each element has the complete ocr text that belongs to it.

At volume level, of course, we will have the complete ocr text in one
doc and we need to store it for highlighting.
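
For reference, the relevant schema.xml entries look roughly like this
(field types are guesses; term vectors on ocr are inferred from the
.tvd size below):

<field name="id"          type="string"  indexed="true" stored="true"/>
<field name="sort_name_s" type="string"  indexed="true" stored="false"/>
<field name="mashed_b"    type="boolean" indexed="true" stored="false"/>
<field name="ocr"         type="text_de" indexed="true" stored="true"
       termVectors="true" termPositions="true" termOffsets="true"/>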

My solr instance is configured like this:
java -Xms12000m -Xmx12000m -jar start.jar
[ imported with 4.7.0, performance tests with 4.8.0 ]

Solr index files are of this size:
  0.013gb .tip The index into the Term Dictionary
  0.017gb .nvd Encodes length and boost factors for docs and fields
  0.546gb .tim The term dictionary, stores term info
  1.332gb .doc Contains the list of docs which contain each term along
with frequency
  4.943gb .pos Stores position information about where a term occurs in
the index
 12.743gb .tvd Contains information about each document that has term
vectors
 17.340gb .fdt The stored fields for documents (the ocr field)

Configuring the ocr field as non-stored, I get these performance
measures (see docs/s) after warmup:

jb@serv7:~ perl solr-performance.pl zeit 6
http://127.0.0.1:58983/solr/collection1/select
?wt=json
q={%21q.op%3dAND}ocr%3A%28zeit%29
fq=mashed_b%3Afalse
fl=id
sort=sort_name_s+asc,id+asc
rows=100
time: 3.96 s
bytes: 1.878 MB
64768 docs found; got 64768 docs
16353 docs/s; 0.474 MB/s

... and with ocr stored, even _not_ requesting ocr via fl=..., with the
documentCache (<documentCache class="solr.LRUCache" ... />) disabled and
<enableLazyFieldLoading>false</enableLazyFieldLoading>:
[ with documentCache and enableLazyFieldLoading the results are even worse ]

... using solr-4.7.0 and ubuntu12.04 openjdk7 (...u51):
jb@serv7:~ perl solr-performance.pl zeit 6
http://127.0.0.1:58983/solr/collection1/select
?wt=json
q={%21q.op%3dAND}ocr%3A%28zeit%29
fq=mashed_b%3Afalse
fl=id
sort=sort_name_s+asc,id+asc
rows=100
time: 61.58 s
bytes: 1.878 MB
64768 docs found; got 64768 docs
1052 docs/s; 0.030 MB/s

... using solr-4.8.0 and oracle-jdk1.7.0_55:
jb@serv7:~ perl solr-performance.pl zeit 6
http://127.0.0.1:58983/solr/collection1/select
?wt=json
q={%21q.op%3dAND}ocr%3A%28zeit%29
fq=mashed_b%3Afalse
fl=id
sort=sort_name_s+asc,id+asc
rows=100
time: 58.80 s
bytes: 1.878 MB
64768 docs found; got 64768 docs
1102 docs/s; 0.032 MB/s

Is there any reason why stored vs. non-stored is 16 times slower?
Is there a way to store the ocr field in a separate index or something
like this?

Kind regards,
J. Barth




-- 
J. Barth * IT, Universitaetsbibliothek Heidelberg * 06221 / 54-2580

pgp public key:
http://digi.ub.uni-heidelberg.de/barth%40ub.uni-heidelberg.de.asc


Re: Stored vs non-stored very large text fields

2014-04-29 Thread Alexandre Rafalovitch
Couple of random thoughts:
1) The latest (4.8) Solr has support for nested documents, as well as
for expand components. Maybe that will let you have more efficient
architecture: http://heliosearch.org/expand-block-join/

2) Do you return OCR text to the client? Or just search it? If just
search it, you don't need to store it

3) If you do need to store it and return it, do you always have to
return it? If not, you could look at lazy-loading the field (a setting
in solrconfig.xml; see the sketch after this list).

4) Is the OCR text or an image? The stored fields are compressed by
default; I wonder if the compression/decompression of a large image is
an issue.

5) JDK 8 apparently makes Lucene much happier (speed of some
operations). Might be something to test if all else fails.
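
(Sketch for point 3, matching the stock solrconfig.xml layout: with lazy
loading enabled, a request whose fl list omits ocr should not have to
materialize the large stored value:)

<enableLazyFieldLoading>true</enableLazyFieldLoading>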

Regards,
   Alex.
Personal website: http://www.outerthoughts.com/
Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency




Re: Stored vs non-stored very large text fields

2014-04-29 Thread Jochen Barth
On 29.04.2014 11:19, Alexandre Rafalovitch wrote:
 Couple of random thoughts:
 1) The latest (4.8) Solr has support for nested documents, as well as
 for expand components. Maybe that will let you have more efficient
 architecture: http://heliosearch.org/expand-block-join/

Yes, I've seen this, but as far as I understand it, you have to know at
which nesting level you run your query.
My search should work on any level, say,

volume title 1986
chapter 1.1 author marc
chapter 1.1.3 title does not matter
chapter 1.1.3.1 title abc
chapter 1.1.3.2 title xyz

should match by querying +author:marc +title:abc // or // +author:marc
+title:xyz
but // not // +title:abc +title:xyz

(we'll have an unknown number of levels)
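
(So with the metadata copied down, the flattened doc for chapter 1.1.3.1
would look roughly like this -- hypothetical id, title multivalued:)

<doc>
  <field name="id">vol1-1.1.3.1</field>
  <field name="title">1986</field>            <!-- copied down from the volume -->
  <field name="title">does not matter</field> <!-- copied down from chapter 1.1.3 -->
  <field name="title">abc</field>             <!-- its own title -->
  <field name="author">marc</field>           <!-- copied down from chapter 1.1 -->
  <field name="ocr">... this chapter's ocr text ...</field>
</doc>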


 2) Do you return OCR text to the client? Or just search it? If just
 search it, you don't need to store it

I want to get highlighted snippets.

 3) If you do need to store it and return it, do you always have to
 return it? If not, you could look at lazy-loading the field (setting
 in solrconfig.xml).

Let's see, perhaps this is a sorting problem which could be solved by
setting the fields id and sort_... to docValues=true.
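
(i.e., roughly, in schema.xml:)

<field name="id"          type="string" indexed="true" stored="true"  docValues="true"/>
<field name="sort_name_s" type="string" indexed="true" stored="false" docValues="true"/>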

 4) Is the OCR text or an image? The stored fields are compressed by
 default; I wonder if the compression/decompression of a large image is
 an issue.

Text.


 5) JDK 8 apparently makes Lucene much happier (speed of some
 operations). Might be something to test if all else fails.

Ok...

Thanks,
J. Barth



-- 
J. Barth * IT, Universitaetsbibliothek Heidelberg * 06221 / 54-2580

pgp public key:
http://digi.ub.uni-heidelberg.de/barth%40ub.uni-heidelberg.de.asc


Re: Stored vs non-stored very large text fields

2014-04-29 Thread Jochen Barth
BTW: stored field compression:
are all stored fields within a document put into one compressed chunk,
or is compression applied on a per-field basis?

Kind regards,
J. Barth



 

-- 
J. Barth * IT, Universitaetsbibliothek Heidelberg * 06221 / 54-2580

pgp public key:
http://digi.ub.uni-heidelberg.de/barth%40ub.uni-heidelberg.de.asc


Re: Stored vs non-stored very large text fields

2014-04-29 Thread Shawn Heisey
On 4/29/2014 4:20 AM, Jochen Barth wrote:
 BTW: stored field compression:
 are all stored fields within a document put into one compressed chunk,
 or is compression applied on a per-field basis?

Here's the issue that added the compression to Lucene:

https://issues.apache.org/jira/browse/LUCENE-4226

It was made the default stored field format for Lucene, which also made
it the default for Solr.  At this time, there is no way to remove
compression in Solr without writing custom code.  I filed an issue to
make it configurable, but I don't know how to do it.  Nobody else has
offered a solution either.  One day I might find some time to take a
look at the issue and see if I can solve it myself.
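
(For illustration, the custom-code route might look roughly like this -- an
untested sketch that swaps in the old uncompressed Lucene40 stored fields
format, much like the patch posted later in this thread. Class and package
names are made up, and the new codec name would also have to be registered
via Lucene's SPI so segments can be read back:)

import org.apache.lucene.codecs.Codec;
import org.apache.lucene.codecs.FilterCodec;
import org.apache.lucene.codecs.StoredFieldsFormat;
import org.apache.lucene.codecs.lucene40.Lucene40StoredFieldsFormat;
import org.apache.lucene.codecs.lucene46.Lucene46Codec;
import org.apache.solr.core.CodecFactory;

public class UncompressedCodecFactory extends CodecFactory {
  @Override
  public Codec getCodec() {
    // Delegate everything to the current codec except stored fields,
    // which fall back to the uncompressed Lucene 4.0 format.
    return new FilterCodec("Uncompressed46", new Lucene46Codec()) {
      @Override
      public StoredFieldsFormat storedFieldsFormat() {
        return new Lucene40StoredFieldsFormat();
      }
    };
  }
}

// solrconfig.xml:
//   <codecFactory class="com.example.UncompressedCodecFactory"/>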

https://issues.apache.org/jira/browse/SOLR-4375

Here's the author's blog post that goes into more detail than the LUCENE
issue:

http://blog.jpountz.net/post/33247161884/efficient-compressed-stored-fields-with-lucene

Thanks,
Shawn



Re: Stored vs non-stored very large text fields

2014-04-29 Thread Jochen Barth

Dear Shawn,

see attachment for my first brute-force no-compression attempt.

Kind regards,
Jochen



diff -c -r solr-4.8.0.original/lucene/core/src/java/org/apache/lucene/codecs/lucene41/Lucene41Codec.java solr-4.8.0/lucene/core/src/java/org/apache/lucene/codecs/lucene41/Lucene41Codec.java
*** solr-4.8.0.original/lucene/core/src/java/org/apache/lucene/codecs/lucene41/Lucene41Codec.java	2013-11-01 07:03:52.0 +0100
--- solr-4.8.0/lucene/core/src/java/org/apache/lucene/codecs/lucene41/Lucene41Codec.java	2014-04-29 13:58:27.0 +0200
***************
*** 38,43 ****
--- 38,44 ----
  import org.apache.lucene.codecs.lucene40.Lucene40NormsFormat;
  import org.apache.lucene.codecs.lucene40.Lucene40SegmentInfoFormat;
  import org.apache.lucene.codecs.lucene40.Lucene40TermVectorsFormat;
+ import org.apache.lucene.codecs.lucene40.Lucene40StoredFieldsFormat;
  import org.apache.lucene.codecs.perfield.PerFieldPostingsFormat;
  import org.apache.lucene.index.SegmentInfo;
  import org.apache.lucene.store.Directory;
***************
*** 56,62 ****
  @Deprecated
  public class Lucene41Codec extends Codec {
// TODO: slightly evil
!   private final StoredFieldsFormat fieldsFormat = new CompressingStoredFieldsFormat("Lucene41StoredFields", CompressionMode.FAST, 1 << 14) {
  @Override
  public StoredFieldsWriter fieldsWriter(Directory directory, SegmentInfo si, IOContext context) throws IOException {
throw new UnsupportedOperationException("this codec can only be used for reading");
--- 57,63 ----
  @Deprecated
  public class Lucene41Codec extends Codec {
// TODO: slightly evil
!   private final StoredFieldsFormat fieldsFormat = new Lucene40StoredFieldsFormat() {
  @Override
  public StoredFieldsWriter fieldsWriter(Directory directory, SegmentInfo si, IOContext context) throws IOException {
throw new UnsupportedOperationException("this codec can only be used for reading");
diff -c -r solr-4.8.0.original/lucene/core/src/java/org/apache/lucene/codecs/lucene42/Lucene42Codec.java solr-4.8.0/lucene/core/src/java/org/apache/lucene/codecs/lucene42/Lucene42Codec.java
*** solr-4.8.0.original/lucene/core/src/java/org/apache/lucene/codecs/lucene42/Lucene42Codec.java	2013-11-01 07:03:52.0 +0100
--- solr-4.8.0/lucene/core/src/java/org/apache/lucene/codecs/lucene42/Lucene42Codec.java	2014-04-29 13:57:08.0 +0200
***************
*** 32,38 ****
  import org.apache.lucene.codecs.TermVectorsFormat;
  import org.apache.lucene.codecs.lucene40.Lucene40LiveDocsFormat;
  import org.apache.lucene.codecs.lucene40.Lucene40SegmentInfoFormat;
! import org.apache.lucene.codecs.lucene41.Lucene41StoredFieldsFormat;
  import org.apache.lucene.codecs.perfield.PerFieldDocValuesFormat;
  import org.apache.lucene.codecs.perfield.PerFieldPostingsFormat;
  import org.apache.lucene.index.SegmentWriteState;
--- 32,38 ----
  import org.apache.lucene.codecs.TermVectorsFormat;
  import org.apache.lucene.codecs.lucene40.Lucene40LiveDocsFormat;
  import org.apache.lucene.codecs.lucene40.Lucene40SegmentInfoFormat;
! import org.apache.lucene.codecs.lucene40.Lucene40StoredFieldsFormat;
  import org.apache.lucene.codecs.perfield.PerFieldDocValuesFormat;
  import org.apache.lucene.codecs.perfield.PerFieldPostingsFormat;
  import org.apache.lucene.index.SegmentWriteState;
***************
*** 53,59 ****
  // (it writes a minor version, etc).
  @Deprecated
  public class Lucene42Codec extends Codec {
!   private final StoredFieldsFormat fieldsFormat = new Lucene41StoredFieldsFormat();
private final TermVectorsFormat vectorsFormat = new Lucene42TermVectorsFormat();
private final FieldInfosFormat fieldInfosFormat = new Lucene42FieldInfosFormat();
private final SegmentInfoFormat infosFormat = new Lucene40SegmentInfoFormat();
--- 53,59 ----
  // (it writes a minor version, etc).
  @Deprecated
  public class Lucene42Codec extends Codec {
!   private final StoredFieldsFormat fieldsFormat = new Lucene40StoredFieldsFormat();
  private final TermVectorsFormat vectorsFormat = new Lucene42TermVectorsFormat();

Re: Stored vs non-stored very large text fields

2014-04-29 Thread Jochen Barth

Something is really strange here:

even when configuring the fields id + sort_... with docValues=true -- so
there's nothing to fetch from the stored-documents file -- performance is
still terrible with ocr stored=true, _even_ with my patch which stores
uncompressed like Solr 4.0.0 (checked with strings -a on *.fdt).


Just reading
http://lucene.472066.n3.nabble.com/Can-Solr-handle-large-text-files-td3439504.html
... perhaps things will clear up soon (will check if splitting into an
indexed+non-stored and a non-indexed+stored field could help here; see
the sketch below)
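
(The split would look roughly like this in schema.xml -- field and type
names are placeholders; input goes to ocr_stored, and the copyField
populates the searchable, non-stored ocr:)

<field name="ocr"        type="text_de" indexed="true"  stored="false"/>
<field name="ocr_stored" type="string"  indexed="false" stored="true"/>
<copyField source="ocr_stored" dest="ocr"/>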



Kind regards,
J. Barth







Re: Stored vs non-stored very large text fields

2014-04-29 Thread Jochen Barth

Ok, https://wiki.apache.org/solr/SolrPerformanceFactors

states that: "Retrieving the stored fields of a query result can be a
significant expense. This cost is affected largely by the number of
bytes stored per document--the higher byte count, the sparser the
documents will be distributed on disk and more I/O is necessary to
retrieve the fields (usually this is a concern when storing large
fields, like the entire contents of a document)."


But in my case (with docValues=true) there should be no reason to  
access *.fdt.


Kind regards,
Jochen
