[ 
https://issues.apache.org/jira/browse/LUCENE-4583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13658088#comment-13658088
 ] 

David Smiley commented on LUCENE-4583:
--------------------------------------

I can understand that an all-in-RAM codec is sensitive to size.  In that 
light, I can also understand that 32KB per document is a lot.  But the 
_average_ per-document variable-length byte size for Barakat's index is a 
measly 10 bytes; the maximum is around 69KB.  Likewise, for the user Shai 
referenced on the list who was using it for faceting, only the worst-case 
document(s) exceeded 32KB.

Might the 16 in "new PagedBytes(16)" in Lucene42DocValuesProducer.loadBinary() 
be made configurable?  And/or perhaps loadBinary() could be made protected, so 
that another codec extending this one can keep its changes somewhat minimal.
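
To illustrate what I mean, here is a rough sketch only.  It uses made-up 
stand-in classes, not the real Lucene42DocValuesProducer or PagedBytes; 
SketchProducer, LargeValueProducer, binaryBlockBits() and binaryBlockSize() 
are hypothetical names, and the value 20 is picked purely for illustration.

```java
// Hypothetical stand-in for a producer; NOT the real Lucene42DocValuesProducer.
class SketchProducer {
    /** Mirrors the hard-coded 16 in "new PagedBytes(16)". */
    static final int DEFAULT_BLOCK_BITS = 16;

    /** Protected hook (an assumption): subclasses override it to change the block size. */
    protected int binaryBlockBits() {
        return DEFAULT_BLOCK_BITS;
    }

    /** Size in bytes of one in-RAM block, i.e. 2^blockBits. */
    final int binaryBlockSize() {
        int bits = binaryBlockBits();
        if (bits < 1 || bits > 30) {
            throw new IllegalStateException("blockBits out of range: " + bits);
        }
        return 1 << bits;
    }
}

// A codec that needs larger values would override a single method.
class LargeValueProducer extends SketchProducer {
    @Override
    protected int binaryBlockBits() {
        return 20; // 1MB blocks; chosen only for illustration
    }
}
```

With something along these lines the default codec keeps its current block 
size, and an extending codec accepts the extra RAM only if it asks for it.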

Mike, one improvement that could be made to your latest patch: instead of 
Lucene42DocValuesConsumer assuming the limit is "ByteBlockPool.BYTE_BLOCK_SIZE 
- 2" (which it technically is, _but only by coincidence_), it could reference 
a calculated constant shared with the code that actually imposes this limit, 
which is Lucene42DocValuesProducer.loadBinary().  For example, set the constant 
to 2^16-2 and then add an assert in loadBinary() that the constant is 
consistent with the PagedBytes instance's config.  Or something like that.
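
Roughly, the shared-constant idea could look like the sketch below.  None of 
these classes exist in Lucene; SharedBinaryLimit, SketchConsumer and 
SketchLoader are hypothetical stand-ins, and reading "consistent" as "the 
limit equals the block size minus 2" is my assumption for the sake of the 
example.

```java
// Hypothetical holder for the shared limit; not an existing Lucene class.
final class SharedBinaryLimit {
    /** Block bits the producer side uses (16, as in "new PagedBytes(16)"). */
    static final int BLOCK_BITS = 16;

    /** The figure suggested in the comment: 2^16 - 2. */
    static final int MAX_LENGTH = (1 << BLOCK_BITS) - 2;

    private SharedBinaryLimit() {}

    /** Assumed meaning of "consistent": the limit equals blockSize - 2. */
    static boolean isConsistentWith(int blockBits) {
        return MAX_LENGTH == (1 << blockBits) - 2;
    }
}

// Stand-in for the consumer side: rejects oversized values at write time.
final class SketchConsumer {
    static void checkLength(int length) {
        if (length > SharedBinaryLimit.MAX_LENGTH) {
            throw new IllegalArgumentException(
                "value too large: " + length + " > " + SharedBinaryLimit.MAX_LENGTH);
        }
    }
}

// Stand-in for the producer side: asserts that its paging config matches the constant.
final class SketchLoader {
    static int loadBinary(int blockBits) {
        assert SharedBinaryLimit.isConsistentWith(blockBits)
            : "shared limit is out of sync with blockBits=" + blockBits;
        return 1 << blockBits; // block size the paged store would use
    }
}
```

The point is that the consumer and the loader both read the same constant, so 
changing the block bits without updating the limit trips the assert in tests 
instead of silently diverging.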

bq. David can you open a separate issue about changing the limit for existing 
codecs?

Uh... all the discussion has been here, so that seems too late to me.  And I'm 
probably done making my arguments; I can't be more convincing than pointing 
out the 10-byte average figure for my use case.
                
> StraightBytesDocValuesField fails if bytes > 32k
> ------------------------------------------------
>
>                 Key: LUCENE-4583
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4583
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: core/index
>    Affects Versions: 4.0, 4.1, 5.0
>            Reporter: David Smiley
>            Priority: Critical
>             Fix For: 4.4
>
>         Attachments: LUCENE-4583.patch, LUCENE-4583.patch, LUCENE-4583.patch, 
> LUCENE-4583.patch, LUCENE-4583.patch
>
>
> I didn't observe any limitations on the size of a bytes based DocValues field 
> value in the docs.  It appears that the limit is 32k, although I didn't get 
> any friendly error telling me that was the limit.  32k is kind of small IMO; 
> I suspect this limit is unintended and as such is a bug.    The following 
> test fails:
> {code:java}
>   public void testBigDocValue() throws IOException {
>     Directory dir = newDirectory();
>     IndexWriter writer = new IndexWriter(dir, writerConfig(false));
>     Document doc = new Document();
>     BytesRef bytes = new BytesRef((4+4)*4097);//4096 works
>     bytes.length = bytes.bytes.length;//byte data doesn't matter
>     doc.add(new StraightBytesDocValuesField("dvField", bytes));
>     writer.addDocument(doc);
>     writer.commit();
>     writer.close();
>     DirectoryReader reader = DirectoryReader.open(dir);
>     DocValues docValues = MultiDocValues.getDocValues(reader, "dvField");
>     //FAILS IF BYTES IS BIG!
>     docValues.getSource().getBytes(0, bytes);
>     reader.close();
>     dir.close();
>   }
> {code}
