[ https://issues.apache.org/jira/browse/LUCENE-4583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13658088#comment-13658088 ]
David Smiley commented on LUCENE-4583:
--------------------------------------

I can understand that an all in-RAM codec has size sensitivities. In that light, I can also understand that 32KB per document is a lot. The _average_ per-document variable byte length size for Barakat's index is a measly 10 bytes. The maximum is around 69k. Likewise for the user Shai referenced on the list who was using it for faceting, it's only the worst-case document(s) that exceeded 32KB.

Might the "new PagedBytes(16)" in Lucene42DocValuesProducer.loadBinary() be made configurable? i.e. Make 16 configurable? And/or perhaps make loadBinary() protected so another codec extending this one can keep the change somewhat minimal.

Mike, in your latest patch, one improvement that could be made is instead of Lucene42DocValuesConsumer assuming the limit is "ByteBlockPool.BYTE_BLOCK_SIZE - 2" (which it technically is _but only by coincidence_), you could instead reference a calculated constant shared with the actual code that has this limit which is Lucene42DocValuesProducer.loadBinary(). For example, set the constant to 2^16-2 but then add an assert in loadBinary that the constant is consistent with the PagedBytes instance's config. Or something like that.

bq. David can you open a separate issue about changing the limit for existing codecs?

Uh... all the discussion has been here so seems too late to me. And I'm probably done making my arguments. I can't be more convincing than pointing out the 10-byte average figure for my use case.
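To make the shared-constant suggestion above concrete, here is a rough, standalone sketch. It does not use the real Lucene classes: BinaryLengthLimit, PAGE_BITS, MAX_BINARY_LENGTH, checkLength() and isConsistent() are all hypothetical names, and the rule that the limit has to fit inside a single page of 2^PAGE_BITS bytes (less 2 bytes) is an assumption about what "consistent with the PagedBytes instance's config" would mean.

```java
// Illustrative sketch only. All names here are hypothetical; this is not
// the Lucene42DocValuesConsumer / Lucene42DocValuesProducer code.
final class BinaryLengthLimit {

    // Assumed page-size exponent, mirroring the "new PagedBytes(16)" call.
    static final int PAGE_BITS = 16;

    // The one shared limit, using the 2^16-2 figure suggested above.
    static final int MAX_BINARY_LENGTH = (1 << 16) - 2;

    // Writer side: fail early with a readable message instead of an
    // obscure error at read time.
    static void checkLength(String field, int length) {
        if (length > MAX_BINARY_LENGTH) {
            throw new IllegalArgumentException("DocValues field \"" + field
                + "\" is too large: " + length + " bytes, limit is "
                + MAX_BINARY_LENGTH);
        }
    }

    // Reader side: assumed consistency rule, namely that the largest
    // allowed value fits in one page of 2^pageBits bytes, less 2 bytes.
    static boolean isConsistent(int pageBits) {
        return MAX_BINARY_LENGTH <= (1 << pageBits) - 2;
    }

    public static void main(String[] args) {
        // What a loadBinary()-style method could assert before reading.
        // Note that Java asserts only run when the JVM is started with -ea.
        assert isConsistent(PAGE_BITS) : "limit does not match page size";

        checkLength("dvField", 10);             // typical value: accepted
        try {
            checkLength("dvField", 69 * 1024);  // worst case: rejected
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

The point is only that the writer-side check and the reader-side assert derive from one constant, so changing the page size without updating the limit trips the assert rather than silently breaking large values.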
> StraightBytesDocValuesField fails if bytes > 32k
> ------------------------------------------------
>
>                 Key: LUCENE-4583
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4583
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: core/index
>    Affects Versions: 4.0, 4.1, 5.0
>            Reporter: David Smiley
>            Priority: Critical
>             Fix For: 4.4
>
>         Attachments: LUCENE-4583.patch, LUCENE-4583.patch, LUCENE-4583.patch,
> LUCENE-4583.patch, LUCENE-4583.patch
>
>
> I didn't observe any limitations on the size of a bytes based DocValues field
> value in the docs. It appears that the limit is 32k, although I didn't get
> any friendly error telling me that was the limit. 32k is kind of small IMO;
> I suspect this limit is unintended and as such is a bug. The following
> test fails:
> {code:java}
>   public void testBigDocValue() throws IOException {
>     Directory dir = newDirectory();
>     IndexWriter writer = new IndexWriter(dir, writerConfig(false));
>     Document doc = new Document();
>     BytesRef bytes = new BytesRef((4+4)*4097);//4096 works
>     bytes.length = bytes.bytes.length;//byte data doesn't matter
>     doc.add(new StraightBytesDocValuesField("dvField", bytes));
>     writer.addDocument(doc);
>     writer.commit();
>     writer.close();
>     DirectoryReader reader = DirectoryReader.open(dir);
>     DocValues docValues = MultiDocValues.getDocValues(reader, "dvField");
>     //FAILS IF BYTES IS BIG!
>     docValues.getSource().getBytes(0, bytes);
>     reader.close();
>     dir.close();
>   }
> {code}