[jira] Updated: (LUCENE-561) ParallelReader fails on deletes and on seeks of previously unused fields

2006-05-02 Thread Chuck Williams (JIRA)
[ http://issues.apache.org/jira/browse/LUCENE-561?page=all ] Chuck Williams updated LUCENE-561. Attachment: ParallelReaderBugs.patch. This new version of the patch provides a more general solution for the unknown field NPE problem (the solution for

MemoryIndex

2006-05-02 Thread Robert Engels
Along the lines of LUCENE-550, what about having a MemoryIndex that accepts multiple documents, then writes the index once at the end in the Lucene file format (so it could be merged) during close? When adding documents using an IndexWriter, a new segment is created for each document, and then the

Re: storing term text internally as byte array and bytecount as prefix, etc.

2006-05-02 Thread Chuck Williams
Hi Jian, I agree with you about Microsoft. It's a standard ploy to put window dressing on stuff to combat competition, in this case from the open document standard. So the UTF-8 concern is interoperability with other programs at the index level. An interesting question here is whether the

[jira] Updated: (LUCENE-436) [PATCH] TermInfosReader, SegmentTermEnum Out Of Memory Exception

2006-05-02 Thread kieran (JIRA)
[ http://issues.apache.org/jira/browse/LUCENE-436?page=all ] kieran updated LUCENE-436. Attachment: Lucene-436-TestCase.tar.gz. Test case for recreating this issue on (at least) Sun JVM HotSpot 1.4.2 on Linux. [PATCH] TermInfosReader, SegmentTermEnum Out Of

[jira] Created: (LUCENE-562) Allow Unstored AND Unindexed Fields as in 1.4

2006-05-02 Thread Sam Hough (JIRA)
Allow Unstored AND Unindexed Fields as in 1.4. Key: LUCENE-562; URL: http://issues.apache.org/jira/browse/LUCENE-562; Project: Lucene - Java; Type: Bug; Versions: 1.9; Reporter: Sam Hough; Priority: Minor. In

Re: storing term text internally as byte array and bytecount as prefix, etc.

2006-05-02 Thread Doug Cutting
Chuck Williams wrote: For lazy fields, there would be a substantial benefit to having the count on a String be an encoded byte count rather than a Java char count, but this has the same problem. If there is a way to beat this problem, then I'd start arguing for a byte count. I think the way

Re: MemoryIndex

2006-05-02 Thread Wolfgang Hoschek
MemoryIndex was designed to maximize performance for a specific use case: pure in-memory data structure, at most one document per MemoryIndex instance, any number of fields, high frequency reads, high frequency index writes, no thread-safety required, optional support for storing offsets.

Re: storing term text internally as byte array and bytecount as prefix, etc.

2006-05-02 Thread jian chen
Hi, Doug, I totally agree with what you said. Yeah, I think it is more of a file format issue, less of an API issue. It seems that we just need to add an extra constructor to Term.java to take in a UTF-8 byte array. Lucene 2.0 is going to break backward compatibility anyway, right? So, maybe

Re: storing term text internally as byte array and bytecount as prefix, etc.

2006-05-02 Thread Chuck Williams
The benefits to a byte count are substantial, including: 1. Lazy fields can skip strings without reading them, as they do for all other value types. 2. The file format could be changed to standard UTF-8 without any significant performance cost. 3. Any other index operation
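
The first benefit listed above (skipping a string without decoding it) can be illustrated with a minimal sketch. This is not Lucene's file format or API: the class name, the method names, and the 4-byte length prefix are all assumptions chosen for illustration. The point is only that when the prefix counts encoded bytes rather than Java chars, a reader can jump over the value without looking at its contents.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not Lucene code and not Lucene's on-disk layout.
class BytePrefixedStrings {

    // Write the UTF-8 byte count (assumed here to be a plain 4-byte int), then the bytes.
    static void writeString(DataOutputStream out, String s) throws IOException {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        out.writeInt(utf8.length);
        out.write(utf8);
    }

    // Read the byte count, then decode exactly that many bytes.
    static String readString(DataInputStream in) throws IOException {
        int byteCount = in.readInt();
        byte[] utf8 = new byte[byteCount];
        in.readFully(utf8);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    // Skip a string without decoding it. This works only because the
    // prefix is a byte count; with a char count the reader would have to
    // decode each character to find where the string ends.
    static void skipString(DataInputStream in) throws IOException {
        int remaining = in.readInt();
        while (remaining > 0) {
            int skipped = in.skipBytes(remaining);
            if (skipped <= 0) {
                throw new EOFException("stream ended inside a string");
            }
            remaining -= skipped;
        }
    }

    // Round trip: write two strings, skip the first, decode the second.
    static String skipFirstReadSecond(String first, String second) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buffer);
            writeString(out, first);
            writeString(out, second);
            out.flush();
            DataInputStream in =
                new DataInputStream(new ByteArrayInputStream(buffer.toByteArray()));
            skipString(in);
            return readString(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // The first string contains multi-byte characters, so its byte
        // count differs from its char count.
        System.out.println(skipFirstReadSecond("na\u00efve caf\u00e9", "lucene"));
    }
}
```

A real index format would more likely use a variable-length integer for the prefix; the fixed-width int here just keeps the sketch short.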

Re: this == that

2006-05-02 Thread Tatu Saloranta
--- jian chen wrote: I am wondering if interning Strings will be really that critical for performance. The biggest bottleneck is still disk. So, maybe we can use String.equals(...) instead of ==. I would bet big bucks on it saving a significant amount of time, even with
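
For readers unfamiliar with the trade-off under discussion, here is a small sketch of what interning buys. It is not taken from Lucene; the class and method names are hypothetical. `String.intern()` returns one canonical instance per distinct value, so two interned strings with the same content can be compared by reference with `==`, while non-interned strings need `equals(...)`.

```java
// Hypothetical illustration only; not Lucene code.
class InternedFieldNames {

    // Reference comparison: correct only if both arguments were interned.
    static boolean sameByIdentity(String a, String b) {
        return a == b;
    }

    // Value comparison: always correct, but may compare character by character.
    static boolean sameByValue(String a, String b) {
        return a.equals(b);
    }

    public static void main(String[] args) {
        // new String(...) forces distinct objects with equal contents.
        String raw1 = new String("contents");
        String raw2 = new String("contents");
        String interned1 = raw1.intern();
        String interned2 = raw2.intern();

        System.out.println(sameByIdentity(raw1, raw2));           // distinct objects
        System.out.println(sameByValue(raw1, raw2));              // equal contents
        System.out.println(sameByIdentity(interned1, interned2)); // same canonical instance
    }
}
```

Whether the identity check is measurably faster in practice is exactly what the thread is debating; the sketch shows only the semantics, not a benchmark.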

Re: Returning a minimum number of clusters

2006-05-02 Thread carp
Marvin Humphrey wrote: BTW, clustering in Information Retrieval usually implies grouping by vector distance using statistical methods: http://en.wikipedia.org/wiki/Data_clustering In general, all you need is objects with a pairwise similarity (dissimilarity) measure. With (term) vectors,

Re: storing term text internally as byte array and bytecount as prefix, etc.

2006-05-02 Thread Tatu Saloranta
--- jian chen wrote: Plus, as open source and open standard advocates, we don't want to be like Micros$ft, who claims to use industry-standard XML as the next generation Word file format. However, it is very hard to write your own Word reader, because their Word file

IndexWriter mergeSegments

2006-05-02 Thread Karel Tejnora
Hi, I found a small issue when I add a 10GB index to a 20GB index using addIndexes when useCompoundFile == true. Before the compound file is created, the segments info is written but points to a non-existing compound file; then a new .tmp is created and renamed to .cfs. Between time when new segments