IndexWriter croaks on large file

John Cecere Fri, 14 Feb 2014 10:37:27 -0800

I'm having a problem with Lucene 4.5.1. Whenever I attempt to index a file larger
than 2GB, it dies with the following exception:

java.lang.IllegalArgumentException: startOffset must be non-negative, and endOffset must be >= startOffset,startOffset=-2147483648,endOffset=-2147483647


Essentially, I'm doing this:

Directory directory = new MMapDirectory(indexPath);
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_45);
IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_45, analyzer);
IndexWriter iw = new IndexWriter(directory, iwc);

InputStream is = <my input stream>;
InputStreamReader reader = new InputStreamReader(is);

Document doc = new Document();
doc.add(new StoredField("fileid", fileid));
doc.add(new StoredField("pathname", pathname));
doc.add(new TextField("content", reader));

iw.addDocument(doc);

It's the IndexWriter addDocument method that throws the exception. Looking at the Lucene source code, it appears that the offsets used internally are ints, which makes it fairly obvious why this is happening: once the offset passes 2147483647 it wraps around to a negative value.
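
To illustrate the suspected cause, here is a minimal sketch in plain Java (no Lucene involved). The class name OffsetOverflowDemo and the advance helper are hypothetical and only stand in for whatever the tokenizer does internally; the assumption is simply that a character offset held in an int is incremented past Integer.MAX_VALUE.

```java
public class OffsetOverflowDemo {

    // Hypothetical helper: moves an int offset forward by a number of chars.
    // Java int arithmetic wraps silently instead of throwing on overflow.
    static int advance(int offset, int chars) {
        return offset + chars;
    }

    public static void main(String[] args) {
        int lastValid = Integer.MAX_VALUE; // 2147483647, the largest int offset

        // A one-character token that starts just past the int range.
        int startOffset = advance(lastValid, 1);
        int endOffset = advance(lastValid, 2);

        // Prints: startOffset=-2147483648,endOffset=-2147483647
        System.out.println("startOffset=" + startOffset + ",endOffset=" + endOffset);

        // The same arithmetic done in long does not wrap.
        long wideOffset = (long) lastValid + 1;
        System.out.println("as long: " + wideOffset); // as long: 2147483648
    }
}
```

The wrapped values are the same ones reported in the exception message, which is consistent with an int offset overflowing at the 2^31 character mark.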

This issue never happened when I used Lucene 3.6.0, which was perfectly capable of handling a file over 2GB in this manner. What has changed, and how do I get around this? Is Lucene no longer capable of handling files this large, or is there some other way I should be doing this?


Here's the full stack trace sans my code:

java.lang.IllegalArgumentException: startOffset must be non-negative, and endOffset must be >= startOffset,startOffset=-2147483648,endOffset=-2147483647

        at org.apache.lucene.analysis.tokenattributes.OffsetAttributeImpl.setOffset(OffsetAttributeImpl.java:45)
        at org.apache.lucene.analysis.standard.StandardTokenizer.incrementToken(StandardTokenizer.java:183)
        at org.apache.lucene.analysis.standard.StandardFilter.incrementToken(StandardFilter.java:49)
        at org.apache.lucene.analysis.core.LowerCaseFilter.incrementToken(LowerCaseFilter.java:54)
        at org.apache.lucene.analysis.util.FilteringTokenFilter.incrementToken(FilteringTokenFilter.java:82)
        at org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:174)
        at org.apache.lucene.index.DocFieldProcessor.processDocument(DocFieldProcessor.java:248)
        at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:254)
        at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:446)
        at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1551)
        at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1221)
        at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1202)

Thanks,
John

--
John Cecere
Principal Engineer - Oracle Corporation
732-987-4317 / john.cec...@oracle.com

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
