Re: Stored vs non-stored very large text fields
I found out that storing documents as separate docs+id does not help either. You must have a completely separate collection/core to get things working fast.

Kind regards, Jochen

Quoting Jochen Barth <ba...@ub.uni-heidelberg.de>:
> [...]
Stored vs non-stored very large text fields
Dear reader,

I'm trying to use Solr for a hierarchical search: metadata from the higher-level elements is copied to the lower ones, and each element carries the complete OCR text that belongs to it. At volume level, of course, we will have the complete OCR text in one doc, and we need to store it for highlighting.

My Solr instance is started like this:

  java -Xms12000m -Xmx12000m -jar start.jar

[ imported with 4.7.0, performance tests with 4.8.0 ]

The Solr index files are of this size:

   0.013 GB  .tip  the index into the term dictionary
   0.017 GB  .nvd  encodes length and boost factors for docs and fields
   0.546 GB  .tim  the term dictionary, stores term info
   1.332 GB  .doc  the list of docs which contain each term, along with frequency
   4.943 GB  .pos  position information about where a term occurs in the index
  12.743 GB  .tvd  information about each document that has term vectors
  17.340 GB  .fdt  the stored fields for documents (ocr)

With the ocr field configured as non-stored, I get these performance measures (see docs/s) after warmup:

  jb@serv7:~$ perl solr-performance.pl zeit 6 'http://127.0.0.1:58983/solr/collection1/select?wt=json&q={%21q.op%3dAND}ocr%3A%28zeit%29&fq=mashed_b%3Afalse&fl=id&sort=sort_name_s+asc,id+asc&rows=100'
  time: 3.96 s; bytes: 1.878 MB
  64768 docs found; got 64768 docs
  16353 docs/s; 0.474 MB/s

... and with ocr stored, even when _not_ requesting ocr via fl=..., with the documentCache (<documentCache class="solr.LRUCache" ... />) disabled and <enableLazyFieldLoading>false</enableLazyFieldLoading> [ with documentCache and enableLazyFieldLoading the results are even worse ]:

using solr-4.7.0 on ubuntu 12.04 with openjdk7 (...u51):

  jb@serv7:~$ perl solr-performance.pl zeit 6 'http://127.0.0.1:58983/solr/collection1/select?wt=json&q={%21q.op%3dAND}ocr%3A%28zeit%29&fq=mashed_b%3Afalse&fl=id&sort=sort_name_s+asc,id+asc&rows=100'
  time: 61.58 s; bytes: 1.878 MB
  64768 docs found; got 64768 docs
  1052 docs/s; 0.030 MB/s

using solr-4.8.0 and oracle-jdk1.7.0_55:

  jb@serv7:~$ perl solr-performance.pl zeit 6 'http://127.0.0.1:58983/solr/collection1/select?wt=json&q={%21q.op%3dAND}ocr%3A%28zeit%29&fq=mashed_b%3Afalse&fl=id&sort=sort_name_s+asc,id+asc&rows=100'
  time: 58.80 s; bytes: 1.878 MB
  64768 docs found; got 64768 docs
  1102 docs/s; 0.032 MB/s

Is there any reason why stored vs non-stored is 16 times slower? Is there a way to store the ocr field in a separate index, or something like this?

Kind regards, J. Barth

--
J. Barth * IT, Universitaetsbibliothek Heidelberg * 06221 / 54-2580
pgp public key: http://digi.ub.uni-heidelberg.de/barth%40ub.uni-heidelberg.de.asc
Re: Stored vs non-stored very large text fields
Couple of random thoughts:

1) The latest (4.8) Solr has support for nested documents, as well as for the expand component. Maybe that will let you have a more efficient architecture: http://heliosearch.org/expand-block-join/

2) Do you return the OCR text to the client, or just search it? If you just search it, you don't need to store it.

3) If you do need to store it and return it, do you always have to return it? If not, you could look at lazy-loading the field (a setting in solrconfig.xml).

4) Is the OCR text or image? The stored fields are compressed by default; I wonder if the compression/decompression of a large image is an issue.

5) JDK 8 apparently makes Lucene much happier (speed of some operations). Might be something to test if all else fails.

Regards,
   Alex.
Personal website: http://www.outerthoughts.com/
Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency

On Tue, Apr 29, 2014 at 3:28 PM, Jochen Barth <ba...@ub.uni-heidelberg.de> wrote:
> [...]
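The two solrconfig.xml settings touched on in points 2) and 3) look roughly like this; a minimal sketch, where the cache size values are illustrative rather than recommendations:

```xml
<!-- solrconfig.xml (sketch): cache fetched documents, and only load
     large stored fields when they are actually requested via fl=... -->
<documentCache class="solr.LRUCache"
               size="512"
               initialSize="512"
               autowarmCount="0"/>

<enableLazyFieldLoading>true</enableLazyFieldLoading>
```

With lazy loading enabled, a field not named in fl is represented by a placeholder and only read from the index if the code actually asks for it.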
Re: Stored vs non-stored very large text fields
On 29.04.2014 11:19, Alexandre Rafalovitch wrote:
> 1) The latest (4.8) Solr has support for nested documents, as well as
> for expand components. Maybe that will let you have more efficient
> architecture: http://heliosearch.org/expand-block-join/

Yes, I've seen this, but as far as I understood you have to know on which nesting level you do your query. My search should work on any level, say:

  volume              title: 1986
    chapter 1.1       author: marc
      chapter 1.1.3     title: (does not matter)
        chapter 1.1.3.1   title: abc
        chapter 1.1.3.2   title: xyz

should match by querying +author:marc +title:abc // or // +author:marc +title:xyz, but // not // +title:abc +title:xyz (we'll have an unknown number of levels).

> 2) Do you return OCR text to the client? Or just search it? If just
> search it, you don't need to store it

I want to get highlighted snippets.

> 3) If you do need to store it and return it, do you always have to
> return it? If not, you could look at lazy-loading the field (setting
> in solrconfig.xml).

Let's see; perhaps this is a sorting problem which could be solved by setting the fields id and sort_... to docValues=true.

> 4) Is OCR text or image? The stored fields are compressed by default,
> I wonder if the compression/decompression of a large image is an issue.

Text.

> 5) JDK 8 apparently makes Lucene much happier (speed of some
> operations). Might be something to test if all else fails.

Ok...

Thanks, J. Barth

--
J. Barth * IT, Universitaetsbibliothek Heidelberg * 06221 / 54-2580
pgp public key: http://digi.ub.uni-heidelberg.de/barth%40ub.uni-heidelberg.de.asc
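The flattening described in the thread (each element inheriting its ancestors' metadata so that an AND query can match at any nesting level) can be sketched in plain Java. All names here (Element, flatten, matches, the field names) are illustrative, not taken from the actual indexing code:

```java
import java.util.*;

// Sketch: every element becomes one flat doc that carries its own metadata
// plus everything inherited from its ancestors, so +author:marc +title:abc
// can match a deeply nested chapter without block-join knowledge.
public class Flatten {
    static class Element {
        final String id;
        final Map<String, List<String>> meta = new HashMap<>();
        final List<Element> children = new ArrayList<>();
        Element(String id) { this.id = id; }
        Element put(String field, String value) {
            meta.computeIfAbsent(field, k -> new ArrayList<>()).add(value);
            return this;
        }
        Element add(Element child) { children.add(child); return this; }
    }

    /** Flatten a tree into per-element docs, each inheriting ancestor metadata. */
    static List<Map<String, List<String>>> flatten(Element root) {
        List<Map<String, List<String>>> docs = new ArrayList<>();
        walk(root, new HashMap<>(), docs);
        return docs;
    }

    private static void walk(Element e, Map<String, List<String>> inherited,
                             List<Map<String, List<String>>> docs) {
        Map<String, List<String>> doc = new HashMap<>();
        inherited.forEach((k, v) -> doc.put(k, new ArrayList<>(v)));
        e.meta.forEach((k, v) ->
            doc.computeIfAbsent(k, x -> new ArrayList<>()).addAll(v));
        doc.put("id", List.of(e.id));
        docs.add(doc);
        // children inherit this element's accumulated metadata (but not its id)
        Map<String, List<String>> down = new HashMap<>();
        doc.forEach((k, v) -> { if (!k.equals("id")) down.put(k, new ArrayList<>(v)); });
        for (Element c : e.children) walk(c, down, docs);
    }

    /** AND query over field:value pairs, as in +author:marc +title:abc. */
    static boolean matches(Map<String, List<String>> doc, String... fieldValuePairs) {
        for (int i = 0; i < fieldValuePairs.length; i += 2) {
            List<String> vals = doc.get(fieldValuePairs[i]);
            if (vals == null || !vals.contains(fieldValuePairs[i + 1])) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        Element vol = new Element("volume").put("title", "1986");
        Element ch = new Element("chapter 1.1").put("author", "marc");
        ch.add(new Element("chapter 1.1.3.1").put("title", "abc"));
        vol.add(ch);
        for (Map<String, List<String>> d : flatten(vol)) System.out.println(d);
    }
}
```

Note the cost this sketch makes visible: the OCR text at volume level gets repeated down the tree unless it is kept out of the inherited metadata, which is exactly why the stored-field size explodes.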
Re: Stored vs non-stored very large text fields
BTW, about stored field compression: are all stored fields within a document put into one compressed chunk, or is it done on a per-field basis?

Kind regards, J. Barth

On Tue, Apr 29, 2014 at 3:28 PM, Jochen Barth <ba...@ub.uni-heidelberg.de> wrote:
> [...]

--
J. Barth * IT, Universitaetsbibliothek Heidelberg * 06221 / 54-2580
pgp public key: http://digi.ub.uni-heidelberg.de/barth%40ub.uni-heidelberg.de.asc
Re: Stored vs non-stored very large text fields
On 4/29/2014 4:20 AM, Jochen Barth wrote:
> BTW: stored field compression: are all stored fields within a document
> put into one compressed chunk, or by per-field basis?

Here's the issue that added the compression to Lucene:
https://issues.apache.org/jira/browse/LUCENE-4226

It was made the default stored field format for Lucene, which also made it the default for Solr. At this time, there is no way to remove compression in Solr without writing custom code. I filed an issue to make it configurable, but I don't know how to do it. Nobody else has offered a solution either. One day I might find some time to take a look at the issue and see if I can solve it myself.

https://issues.apache.org/jira/browse/SOLR-4375

Here's the author's blog post that goes into more detail than the LUCENE issue:
http://blog.jpountz.net/post/33247161884/efficient-compressed-stored-fields-with-lucene

Thanks,
Shawn
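As background to LUCENE-4226: the compressing stored fields format groups the stored bytes of consecutive documents into chunks and compresses each chunk as a unit, so fetching any one document's fields first decompresses the chunk around it. The following standalone sketch illustrates that consequence using java.util.zip's DEFLATE in place of Lucene's actual codec; the class and method names are made up for the illustration:

```java
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Illustration only: one "chunk" holds the concatenated stored fields of
// several documents; reading any single document's fields requires
// decompressing the whole chunk it lives in.
public class ChunkCompression {

    static byte[] compress(byte[] input) {
        Deflater d = new Deflater(Deflater.BEST_SPEED);
        d.setInput(input);
        d.finish();
        byte[] buf = new byte[Math.max(64, input.length)];
        int n = 0;
        while (!d.finished()) {
            if (n == buf.length) buf = Arrays.copyOf(buf, buf.length * 2);
            n += d.deflate(buf, n, buf.length - n);
        }
        d.end();
        return Arrays.copyOf(buf, n);
    }

    static byte[] decompress(byte[] compressed) throws DataFormatException {
        Inflater inf = new Inflater();
        inf.setInput(compressed);
        byte[] buf = new byte[64];
        int n = 0;
        while (!inf.finished()) {
            if (n == buf.length) buf = Arrays.copyOf(buf, buf.length * 2);
            n += inf.inflate(buf, n, buf.length - n);
        }
        inf.end();
        return Arrays.copyOf(buf, n);
    }

    /** One chunk = the concatenated stored fields of several documents. */
    static byte[] makeChunk(String... docs) {
        StringBuilder sb = new StringBuilder();
        for (String doc : docs) sb.append(doc).append('\u0000'); // separator
        return compress(sb.toString().getBytes());
    }

    /** Reading doc i still decompresses the entire chunk first. */
    static String readDoc(byte[] chunk, int i) throws DataFormatException {
        String all = new String(decompress(chunk));
        return all.split("\u0000")[i];
    }

    public static void main(String[] args) throws Exception {
        byte[] chunk = makeChunk("doc 1 ocr text", "doc 2 ocr text");
        System.out.println(readDoc(chunk, 1)); // prints "doc 2 ocr text"
    }
}
```

This is why per-field selectivity (fl=id) does not, by itself, avoid the decompression work once the chunk has to be read at all.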
Re: Stored vs non-stored very large text fields
Dear Shawn,

see attachment for my first brute-force no-compression attempt.

Kind regards, Jochen

Quoting Shawn Heisey <s...@elyograg.org>:
> [...]

diff -c -r solr-4.8.0.original/lucene/core/src/java/org/apache/lucene/codecs/lucene41/Lucene41Codec.java solr-4.8.0/lucene/core/src/java/org/apache/lucene/codecs/lucene41/Lucene41Codec.java
*** solr-4.8.0.original/lucene/core/src/java/org/apache/lucene/codecs/lucene41/Lucene41Codec.java  2013-11-01 07:03:52.0 +0100
--- solr-4.8.0/lucene/core/src/java/org/apache/lucene/codecs/lucene41/Lucene41Codec.java  2014-04-29 13:58:27.0 +0200
***************
*** 38,43 ****
--- 38,44 ----
    import org.apache.lucene.codecs.lucene40.Lucene40NormsFormat;
    import org.apache.lucene.codecs.lucene40.Lucene40SegmentInfoFormat;
    import org.apache.lucene.codecs.lucene40.Lucene40TermVectorsFormat;
+   import org.apache.lucene.codecs.lucene40.Lucene40StoredFieldsFormat;
    import org.apache.lucene.codecs.perfield.PerFieldPostingsFormat;
    import org.apache.lucene.index.SegmentInfo;
    import org.apache.lucene.store.Directory;
***************
*** 56,62 ****
  @Deprecated
  public class Lucene41Codec extends Codec {
    // TODO: slightly evil !
!   private final StoredFieldsFormat fieldsFormat = new CompressingStoredFieldsFormat("Lucene41StoredFields", CompressionMode.FAST, 1 << 14) {
      @Override
      public StoredFieldsWriter fieldsWriter(Directory directory, SegmentInfo si, IOContext context) throws IOException {
        throw new UnsupportedOperationException("this codec can only be used for reading");
--- 57,63 ----
  @Deprecated
  public class Lucene41Codec extends Codec {
    // TODO: slightly evil !
!   private final StoredFieldsFormat fieldsFormat = new Lucene40StoredFieldsFormat() {
      @Override
      public StoredFieldsWriter fieldsWriter(Directory directory, SegmentInfo si, IOContext context) throws IOException {
        throw new UnsupportedOperationException("this codec can only be used for reading");
diff -c -r solr-4.8.0.original/lucene/core/src/java/org/apache/lucene/codecs/lucene42/Lucene42Codec.java solr-4.8.0/lucene/core/src/java/org/apache/lucene/codecs/lucene42/Lucene42Codec.java
*** solr-4.8.0.original/lucene/core/src/java/org/apache/lucene/codecs/lucene42/Lucene42Codec.java  2013-11-01 07:03:52.0 +0100
--- solr-4.8.0/lucene/core/src/java/org/apache/lucene/codecs/lucene42/Lucene42Codec.java  2014-04-29 13:57:08.0 +0200
***************
*** 32,38 ****
    import org.apache.lucene.codecs.TermVectorsFormat;
    import org.apache.lucene.codecs.lucene40.Lucene40LiveDocsFormat;
    import org.apache.lucene.codecs.lucene40.Lucene40SegmentInfoFormat;
!   import org.apache.lucene.codecs.lucene41.Lucene41StoredFieldsFormat;
    import org.apache.lucene.codecs.perfield.PerFieldDocValuesFormat;
    import org.apache.lucene.codecs.perfield.PerFieldPostingsFormat;
    import org.apache.lucene.index.SegmentWriteState;
--- 32,38 ----
    import org.apache.lucene.codecs.TermVectorsFormat;
    import org.apache.lucene.codecs.lucene40.Lucene40LiveDocsFormat;
    import org.apache.lucene.codecs.lucene40.Lucene40SegmentInfoFormat;
!   import org.apache.lucene.codecs.lucene40.Lucene40StoredFieldsFormat;
    import org.apache.lucene.codecs.perfield.PerFieldDocValuesFormat;
    import org.apache.lucene.codecs.perfield.PerFieldPostingsFormat;
    import org.apache.lucene.index.SegmentWriteState;
***************
*** 53,59 ****
  // (it writes a minor version, etc).
  @Deprecated
  public class Lucene42Codec extends Codec {
!   private final StoredFieldsFormat fieldsFormat = new Lucene41StoredFieldsFormat();
    private final TermVectorsFormat vectorsFormat = new Lucene42TermVectorsFormat();
    private final FieldInfosFormat fieldInfosFormat = new Lucene42FieldInfosFormat();
    private final SegmentInfoFormat infosFormat = new Lucene40SegmentInfoFormat();
--- 53,59 ----
  // (it writes a minor version, etc).
  @Deprecated
  public class Lucene42Codec extends Codec {
!   private final StoredFieldsFormat fieldsFormat = new Lucene40StoredFieldsFormat();
    private final TermVectorsFormat vectorsFormat
Re: Stored vs non-stored very large text fields
Something is really strange here: even when configuring the fields id + sort_... with docValues=true -- so there's nothing to get from the stored documents file -- performance is still terrible with ocr stored=true, _even_ with my patch which stores uncompressed like Solr 4.0.0 (checked with strings -a on *.fdt).

Just reading http://lucene.472066.n3.nabble.com/Can-Solr-handle-large-text-files-td3439504.html ... perhaps things will clear up soon (I will check if splitting into indexed+non-stored and non-indexed+stored fields could help here).

Kind regards, J. Barth

Quoting Shawn Heisey <s...@elyograg.org>:
> [...]
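The splitting experiment mentioned above would look roughly like this in schema.xml; the field names and types are assumptions, and whether highlighting can still be driven from the stored-only copy would need checking:

```xml
<!-- schema.xml (sketch): searchable copy is not stored;
     stored copy is not indexed and is only read for display/highlighting -->
<field name="ocr"        type="text_general" indexed="true"  stored="false"/>
<field name="ocr_stored" type="text_general" indexed="false" stored="true"/>
<copyField source="ocr" dest="ocr_stored"/>
```

The idea is to keep the huge stored bytes out of the documents that every query touches, so .fdt reads only happen when the stored copy is explicitly requested.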
Re: Stored vs non-stored very large text fields
Ok, https://wiki.apache.org/solr/SolrPerformanceFactors states that:

> Retrieving the stored fields of a query result can be a significant
> expense. This cost is affected largely by the number of bytes stored
> per document -- the higher byte count, the sparser the documents will
> be distributed on disk and more I/O is necessary to retrieve the
> fields (usually this is a concern when storing large fields, like the
> entire contents of a document).

But in my case (with docValues=true) there should be no reason to access *.fdt.

Kind regards, Jochen

Quoting Jochen Barth <ba...@ub.uni-heidelberg.de>:
> [...]
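For reference, the docValues setup this reasoning assumes would look like this in schema.xml (the types are assumptions); the expectation in the thread is that with sorting and fl=id served from docValues, the *.fdt file need not be touched:

```xml
<!-- schema.xml (sketch): serve the uniqueKey and the sort field from docValues -->
<field name="id"          type="string" indexed="true" stored="true"  docValues="true"/>
<field name="sort_name_s" type="string" indexed="true" stored="false" docValues="true"/>
```

Whether the Solr 4.x response writer actually bypasses stored-field retrieval for fl=id in this configuration is exactly what the measurements above call into question.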