John,

Sure, you can add identical documents to the index if you like. I don't think Lucene requires a unique ID field; only Solr does. Lucene documents have internal doc IDs that are auto-generated when indexing or when index segments are merged.

If I remember correctly, Lucene 4.1 started doing cross-document compression of stored fields, so if you can manage to index similar documents in the same chunk, it may help reduce the size of your stored fields.

Hope this helps,
Tri

On Feb 19, 2014, at 04:51 AM, John Cecere <john.cec...@oracle.com> wrote:

Thanks Tri. I've tried a variation of the approach you suggested here and it appears to work well. Just one question: will there be a problem with adding multiple Document objects to the IndexWriter that have the same field names and values for the StoredFields? They all have different TextFields (the content). I've tried doing this and haven't found any problems with it, but I'm just wondering if there's anything I should be aware of.

Regards,
John

On 2/14/14 4:37 PM, Tri Cao wrote:
As docIDs are ints too, it's most likely he'll hit the limit of 2B documents per index with that approach though :)
I do agree that indexing huge documents doesn't seem to have a lot of value, even when you
know a doc is a hit for a certain query, how are you going to display the results to users?
John, for huge data sets, it's usually a good idea to roll your own distributed indexes, and model
your data schema very carefully. For example, if you are going to index log files, one reasonable
idea is to make every 5 minutes of logs a document, as in the sketch below.
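A minimal, untested sketch of that idea. It assumes log lines start with a "yyyy-MM-dd HH:mm:ss" timestamp; the field names, the timestamp parsing, and the class name are illustrative assumptions, not John's actual schema:

import java.io.BufferedReader;
import java.io.FileReader;
import java.text.SimpleDateFormat;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;

public class LogWindowIndexer {
    private static final long WINDOW_MS = 5 * 60 * 1000L; // 5-minute buckets

    // Index each 5-minute window of a log file as one Lucene Document.
    public static void indexByWindow(IndexWriter iw, String path) throws Exception {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
        BufferedReader in = new BufferedReader(new FileReader(path));
        StringBuilder window = new StringBuilder();
        long current = -1;
        String line;
        while ((line = in.readLine()) != null) {
            if (line.length() < 19) { // continuation line: keep it in the current window
                window.append(line).append('\n');
                continue;
            }
            // assumes the first 19 chars are the timestamp; adapt to your format
            long bucket = fmt.parse(line.substring(0, 19)).getTime() / WINDOW_MS;
            if (bucket != current && window.length() > 0) {
                addWindow(iw, path, current * WINDOW_MS, window.toString());
                window.setLength(0);
            }
            current = bucket;
            window.append(line).append('\n');
        }
        if (window.length() > 0) {
            addWindow(iw, path, current * WINDOW_MS, window.toString());
        }
        in.close();
    }

    private static void addWindow(IndexWriter iw, String path, long startMs, String text)
            throws Exception {
        Document doc = new Document();
        doc.add(new StoredField("pathname", path));
        doc.add(new StoredField("windowStart", startMs)); // epoch millis, for display
        doc.add(new TextField("content", text, Field.Store.NO));
        iw.addDocument(doc); // each 5-minute chunk stays far below the 2GB offset limit
    }
}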
Regards,
Tri
On Feb 14, 2014, at 01:20 PM, Glen Newton <glen.new...@gmail.com> wrote:
You should consider making each _line_ of the log file a (Lucene)
document (assuming it is a log-per-line log file)
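For example (a quick untested sketch; the field names are only placeholders):

import java.io.BufferedReader;
import java.io.FileReader;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;

// Index each line of a log file as its own small Document.
public class PerLineIndexer {
    public static void indexLines(IndexWriter iw, String path) throws Exception {
        BufferedReader in = new BufferedReader(new FileReader(path));
        String line;
        int lineNo = 0;
        while ((line = in.readLine()) != null) {
            Document doc = new Document();
            doc.add(new StoredField("pathname", path));
            doc.add(new StoredField("lineno", ++lineNo)); // recover original position
            doc.add(new TextField("content", line, Field.Store.NO));
            iw.addDocument(doc);
        }
        in.close();
    }
}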
-Glen
On Fri, Feb 14, 2014 at 4:12 PM, John Cecere <john.cec...@oracle.com> wrote:
I'm not sure in today's world I would call 2GB 'immense' or 'enormous'. At
any rate, I don't have control over the size of the documents that go into
my database. Sometimes my customer's log files end up really big. I'm
willing to have huge indexes for these things.
Wouldn't just changing from int to long for the offsets solve the problem? I'm
sure it would probably have to be changed in a lot of places, but why
impose such a limitation? Especially since it's using an InputStream and
only dealing with a block of data at a time.
I'll take a look at your suggestion.
Thanks,
John
On 2/14/14 3:20 PM, Michael McCandless wrote:
Hmm, why are you indexing such immense documents?
In 3.x Lucene never sanity checked the offsets, so we would silently
index negative (int overflow'd) offsets into e.g. term vectors.
But in 4.x, we now detect this and throw the exception you're seeing,
because it can lead to index corruption when you index the offsets
into the postings.
If you really must index such enormous documents, maybe you could
create a custom tokenizer (derived from StandardTokenizer) that
"fixes" the offsets before setting them? Or maybe just doesn't even
set them.
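If subclassing StandardTokenizer turns out to be awkward, one alternative, untested sketch (a different mechanism than deriving a tokenizer, and the class names here are made up) is to hand StandardTokenizer a lenient OffsetAttribute through its AttributeFactory constructor, so overflowed offsets get clamped instead of rejected:

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttributeImpl;
import org.apache.lucene.util.Attribute;
import org.apache.lucene.util.AttributeImpl;
import org.apache.lucene.util.AttributeSource;
import org.apache.lucene.util.Version;

// Analyzer using StandardTokenizer (without StandardAnalyzer's full filter
// chain) that clamps overflowed offsets to 0 instead of throwing.
public final class LenientOffsetAnalyzer extends Analyzer {

    static final class LenientOffsetAttributeImpl extends OffsetAttributeImpl {
        @Override
        public void setOffset(int startOffset, int endOffset) {
            if (startOffset < 0 || endOffset < startOffset) {
                startOffset = 0; // int overflow past 2GB of input
                endOffset = 0;
            }
            super.setOffset(startOffset, endOffset);
        }
    }

    static final AttributeSource.AttributeFactory LENIENT =
            new AttributeSource.AttributeFactory() {
                @Override
                public AttributeImpl createAttributeInstance(Class<? extends Attribute> attClass) {
                    return attClass == OffsetAttribute.class
                            ? new LenientOffsetAttributeImpl()
                            : AttributeSource.AttributeFactory.DEFAULT_ATTRIBUTE_FACTORY
                                    .createAttributeInstance(attClass);
                }
            };

    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        return new TokenStreamComponents(
                new StandardTokenizer(Version.LUCENE_45, LENIENT, reader));
    }
}

The clamped offsets are garbage, of course, so this only makes sense if nothing downstream (highlighting, term vectors with offsets) consumes them.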
Note that position can also overflow, if your documents get too large.
Mike McCandless
http://blog.mikemccandless.com
On Fri, Feb 14, 2014 at 1:36 PM, John Cecere <john.cec...@oracle.com> wrote:
I'm having a problem with Lucene 4.5.1. Whenever I attempt to index a file > 2GB in size, it dies with the following exception:

java.lang.IllegalArgumentException: startOffset must be non-negative, and endOffset must be >= startOffset, startOffset=-2147483648,endOffset=-2147483647
Essentially, I'm doing this:
Directory directory = new MMapDirectory(indexPath);
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_45);
IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_45, analyzer);
IndexWriter iw = new IndexWriter(directory, iwc);

InputStream is = <my input stream>;
InputStreamReader reader = new InputStreamReader(is);

Document doc = new Document();
doc.add(new StoredField("fileid", fileid));
doc.add(new StoredField("pathname", pathname));
doc.add(new TextField("content", reader));
iw.addDocument(doc);
It's the IndexWriter addDocument method that throws the exception. Looking at the Lucene source code, it appears that the offsets used internally are ints, which makes it somewhat obvious why this is happening.

This issue never happened when I used Lucene 3.6.0, which was perfectly capable of handling a file over 2GB in this manner. What has changed, and how do I get around this? Is Lucene no longer capable of handling files this large, or is there some other way I should be doing this?
Here's the full stack trace sans my code:

java.lang.IllegalArgumentException: startOffset must be non-negative, and endOffset must be >= startOffset, startOffset=-2147483648,endOffset=-2147483647
    at org.apache.lucene.analysis.tokenattributes.OffsetAttributeImpl.setOffset(OffsetAttributeImpl.java:45)
    at org.apache.lucene.analysis.standard.StandardTokenizer.incrementToken(StandardTokenizer.java:183)
    at org.apache.lucene.analysis.standard.StandardFilter.incrementToken(StandardFilter.java:49)
    at org.apache.lucene.analysis.core.LowerCaseFilter.incrementToken(LowerCaseFilter.java:54)
    at org.apache.lucene.analysis.util.FilteringTokenFilter.incrementToken(FilteringTokenFilter.java:82)
    at org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:174)
    at org.apache.lucene.index.DocFieldProcessor.processDocument(DocFieldProcessor.java:248)
    at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:254)
    at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:446)
    at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1551)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1221)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1202)
Thanks,
John

--
John Cecere
Principal Engineer - Oracle Corporation
732-987-4317 / john.cec...@oracle.com

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
