[ http://issues.apache.org/jira/browse/LUCENE-561?page=all ]
Chuck Williams updated LUCENE-561:
--
Attachment: ParallelReaderBugs.patch
This new version of the patch provides a more general solution for the unknown
field NPE problem (the solution for
Along the lines of Lucene-550, what about having a MemoryIndex that accepts
multiple documents, then writes the index once at the end, in the Lucene file
format (so it could be merged), during close?
When adding documents using an IndexWriter, a new segment is created for
each document, and then the
Hi Jian,
I agree with you about Microsoft. It's a standard ploy to put window
dressing on stuff to combat competition, in this case from the
OpenDocument standard.
So the UTF-8 concern is interoperability with other programs at the
index level. An interesting question here is whether the
[ http://issues.apache.org/jira/browse/LUCENE-436?page=all ]
kieran updated LUCENE-436:
--
Attachment: Lucene-436-TestCase.tar.gz
Test case for reproducing this issue on (at least) Sun HotSpot JVM 1.4.2 on Linux
[PATCH] TermInfosReader, SegmentTermEnum Out Of
Allow Unstored AND Unindexed Fields as in 1.4
-
Key: LUCENE-562
URL: http://issues.apache.org/jira/browse/LUCENE-562
Project: Lucene - Java
Type: Bug
Versions: 1.9
Reporter: Sam Hough
Priority: Minor
In
Chuck Williams wrote:
For lazy fields, there would be a substantial benefit to having the
count on a String be an encoded byte count rather than a Java char
count, but this has the same problem. If there is a way to beat this
problem, then I'd start arguing for a byte count.
I think the way
MemoryIndex was designed to maximize performance for a specific use
case: a pure in-memory data structure, at most one document per
MemoryIndex instance, any number of fields, high-frequency reads,
high-frequency index writes, no thread-safety required, and optional
support for storing offsets.
Hi, Doug,
I totally agree with what you said. Yeah, I think it is more of a file
format issue, less of an API issue. It seems that we just need to add an
extra constructor to Term.java that takes in a UTF-8 byte array.
Lucene 2.0 is going to break backward compatibility anyway, right? So,
maybe
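No such constructor exists in Term.java today; as a purely hypothetical sketch (the class name and fields below are illustrative, not Lucene's), the body of such a constructor would amount to decoding the bytes with the JDK's standard charset machinery:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch only: Lucene's real Term constructors take Strings.
// A UTF-8 constructor would essentially decode the bytes up front.
public class Utf8TermSketch {
    private final String field;
    private final String text;

    public Utf8TermSketch(String field, byte[] utf8Text) {
        this.field = field;
        this.text = new String(utf8Text, StandardCharsets.UTF_8);
    }

    public String field() { return field; }
    public String text() { return text; }

    public static void main(String[] args) {
        byte[] bytes = "caf\u00E9".getBytes(StandardCharsets.UTF_8);
        System.out.println(new Utf8TermSketch("body", bytes).text()); // café
    }
}
```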
The benefits to a byte count are substantial, including:
1. Lazy fields can skip strings without reading them, as they do for
all other value types.
2. The file format could be changed to standard UTF-8 without any
significant performance cost.
3. Any other index operation
--- jian chen [EMAIL PROTECTED] wrote:
I am wondering if interning Strings will really be that critical for
performance. The biggest bottleneck is still disk. So, maybe we can use
String.equals(...) instead of ==. I would bet big bucks on it saving a
significant amount of time, even with
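The trade-off under discussion can be shown with plain JDK code; nothing here is Lucene-specific. Interning buys a cheap reference comparison, while equals() is correct with or without interning:

```java
public class InternDemo {
    public static void main(String[] args) {
        // Two distinct String objects with identical contents.
        String a = new String("title");
        String b = new String("title");

        System.out.println(a == b);                   // false: different objects
        System.out.println(a.equals(b));              // true: same contents

        // After interning, both resolve to the single pooled instance,
        // so a plain reference comparison suffices.
        System.out.println(a.intern() == b.intern()); // true

        // equals() works whether or not the strings were interned: only
        // the speed of the comparison, not its correctness, depends on
        // the interning discipline.
    }
}
```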
Marvin Humphrey wrote:
BTW, clustering in Information Retrieval usually implies grouping by
vector distance using statistical methods:
http://en.wikipedia.org/wiki/Data_clustering
In general, all you need is objects with
a pairwise similarity (dissimilarity) measure.
With (term) vectors,
--- jian chen [EMAIL PROTECTED] wrote:
Plus, as open source and open standard advocates, we
don't want to be like
Micros$ft, who claims to use industry-standard
XML as the next-generation
Word file format. However, it is very
hard to write your own Word
reader, because their Word file
Hi,
I found a small issue when I add a 10GB index to a 20GB index using
addIndexes when useCompoundFile == true.
Before the compound file is created, the segments info is written but points
to a non-existing compound file; then a new .tmp is created and renamed to .cfs.
Between the time when the new segments
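The hazard described here — metadata published while the file it points at does not yet exist under its final name — is the classic motivation for the write-to-temp-then-rename pattern. A minimal JDK sketch (java.nio.file, not Lucene's actual IO code; the rename is atomic on typical local filesystems when source and target share a directory):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class SafePublish {
    // Write the full contents to a .tmp sibling first, then rename it
    // into place, so no reader ever observes the final name missing or
    // half-written while other metadata already refers to it.
    static void publish(Path target, byte[] contents) throws IOException {
        Path tmp = target.resolveSibling(target.getFileName() + ".tmp");
        Files.write(tmp, contents);
        Files.move(tmp, target, StandardCopyOption.REPLACE_EXISTING,
                   StandardCopyOption.ATOMIC_MOVE);
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("publish-demo");
        Path cfs = dir.resolve("_1.cfs"); // illustrative file name
        publish(cfs, "segment data".getBytes(StandardCharsets.UTF_8));
        System.out.println(Files.exists(cfs));                        // true
        System.out.println(Files.exists(dir.resolve("_1.cfs.tmp"))); // false
    }
}
```

The bug report above is the mirror image: the segments info was written before the rename completed, leaving a window where it referenced a name that did not yet exist.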