[
https://issues.apache.org/jira/browse/LUCENE-510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Michael McCandless updated LUCENE-510:
--------------------------------------
Attachment: LUCENE-510.take2.patch
New rev of the patch. I think it's ready to commit. I'll wait a few
days.
I made some performance improvements by factoring out a new
UnicodeUtil class that does not allocate new objects for every
conversion to/from UTF8.
One new issue I fixed is the handling of invalid UTF-16 strings.
Specifically, if the UTF-16 text contains invalid (unpaired)
surrogates, UTF-8 cannot represent it (unlike the modified UTF-8
format Lucene uses currently). I changed DocumentsWriter and
UnicodeUtil to substitute the replacement char U+FFFD for each such
invalid surrogate character. This affects terms, stored String fields
and term vectors.
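To illustrate the substitution, here is a minimal sketch of a UTF-16 to
UTF-8 encoder that emits U+FFFD for any unpaired surrogate. It is not
the code in the patch: the class name Utf16ToUtf8Sketch and its encode
method are hypothetical, and unlike UnicodeUtil it allocates a new
buffer on every call, so it only shows the replacement rule, not the
performance work.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical illustration only; not the UnicodeUtil class from the patch.
public class Utf16ToUtf8Sketch {
    private static final int REPLACEMENT = 0xFFFD;

    /** Encodes s as standard UTF-8, replacing each unpaired surrogate with U+FFFD. */
    public static byte[] encode(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream(s.length() * 3);
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i++);
            int cp = c;
            if (Character.isHighSurrogate(c)) {
                if (i < s.length() && Character.isLowSurrogate(s.charAt(i))) {
                    // Valid pair: combine into one supplementary code point.
                    cp = Character.toCodePoint(c, s.charAt(i++));
                } else {
                    // High surrogate with no low surrogate after it.
                    cp = REPLACEMENT;
                }
            } else if (Character.isLowSurrogate(c)) {
                // Low surrogate with no high surrogate before it.
                cp = REPLACEMENT;
            }
            if (cp < 0x80) {
                out.write(cp);
            } else if (cp < 0x800) {
                out.write(0xC0 | (cp >> 6));
                out.write(0x80 | (cp & 0x3F));
            } else if (cp < 0x10000) {
                out.write(0xE0 | (cp >> 12));
                out.write(0x80 | ((cp >> 6) & 0x3F));
                out.write(0x80 | (cp & 0x3F));
            } else {
                out.write(0xF0 | (cp >> 18));
                out.write(0x80 | ((cp >> 12) & 0x3F));
                out.write(0x80 | ((cp >> 6) & 0x3F));
                out.write(0x80 | (cp & 0x3F));
            }
        }
        return out.toByteArray();
    }
}
```

Note that the returned array's length is the string's length in bytes,
which can differ from String.length(): a valid surrogate pair is two
Java chars but four UTF-8 bytes, and a lone surrogate is one char that
becomes the three bytes of U+FFFD.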
Indexing performance has a small slowdown (3.5%); details are below.
Unfortunately, the time to enumerate terms was affected more. I wrote
a simple test that enumerates all terms (~3.3 million) in the index
built by the indexing test described below:
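import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermEnum;

public class TestTermEnum {
  public static void main(String[] args) throws Exception {
    // args[0] is the path to the index directory
    IndexReader r = IndexReader.open(args[0]);
    TermEnum terms = r.terms();
    int count = 0;
    long t0 = System.currentTimeMillis();
    // Walk every term in the index, counting as we go
    while (terms.next())
      count++;
    long t1 = System.currentTimeMillis();
    System.out.println(count + " terms in " + (t1-t0) + " millis");
    r.close();
  }
}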
On trunk with the current index format this takes 3104 msec (best of
5). With the patch and the UTF-8 index format it takes 3443 msec,
which is 10.9% slower. I don't see any further ways to make this
faster.
Details on the indexing performance test:
analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer
doc.maker=org.apache.lucene.benchmark.byTask.feeds.LineDocMaker
docs.file=/Volumes/External/lucene/wiki.txt
doc.stored = true
doc.term.vector = true
doc.add.log.step=2000
directory=FSDirectory
autocommit=false
compound=false
ram.flush.mb=64
{ "Rounds"
  ResetSystemErase
  { "BuildIndex"
    CreateIndex
    { "AddDocs" AddDoc > : 200000
    - CloseIndex
  }
  NewRound
} : 5
RepSumByPrefRound BuildIndex
I ran it on a quad-core Intel Mac Pro with a 4-drive RAID 0 array,
running OS X 10.4.11 and Java 1.5, with these command-line args:
-server -Xbatch -Xms1024m -Xmx1024m
Best of 5 with current trunk is 921.2 docs/sec; with the patch it is
888.7 docs/sec, a 3.5% slowdown.
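For anyone checking the percentages: the two tests report different
kinds of numbers, elapsed time for term enumeration (larger is worse)
and throughput for indexing (smaller is worse), so the slowdown is
computed slightly differently in each case. The small sketch below
reproduces both figures; the class and method names are made up for
this example.

```java
import java.util.Locale;

// Hypothetical helper for the arithmetic only; not part of the patch.
public class SlowdownSketch {
    /** Percent slowdown when the metric is elapsed time (larger is worse). */
    public static double slowdownFromTime(double baseline, double patched) {
        return (patched - baseline) / baseline * 100.0;
    }

    /** Percent slowdown when the metric is throughput (smaller is worse). */
    public static double slowdownFromThroughput(double baseline, double patched) {
        return (baseline - patched) / baseline * 100.0;
    }

    public static void main(String[] args) {
        // Term enumeration: 3104 msec on trunk vs 3443 msec with the patch.
        System.out.println(String.format(Locale.ROOT,
            "term enumeration: %.1f%% slower", slowdownFromTime(3104, 3443)));
        // Indexing: 921.2 docs/sec on trunk vs 888.7 docs/sec with the patch.
        System.out.println(String.format(Locale.ROOT,
            "indexing: %.1f%% slower", slowdownFromThroughput(921.2, 888.7)));
    }
}
```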
> IndexOutput.writeString() should write length in bytes
> ------------------------------------------------------
>
> Key: LUCENE-510
> URL: https://issues.apache.org/jira/browse/LUCENE-510
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Store
> Affects Versions: 2.1
> Reporter: Doug Cutting
> Assignee: Michael McCandless
> Attachments: LUCENE-510.patch, LUCENE-510.take2.patch,
> SortExternal.java, strings.diff, TestSortExternal.java
>
>
> We should change the format of strings written to indexes so that the length
> of the string is in bytes, not Java characters. This issue has been
> discussed at:
> http://www.mail-archive.com/[email protected]/msg01970.html
> We must increment the file format number to indicate this change. At least
> the format number in the segments file should change.
> I'm targeting this for 2.1, i.e., we shouldn't commit it to trunk until
> after 2.0 is released, to minimize incompatible changes between 1.9 and 2.0
> (other than removal of deprecated features).