[jira] Updated: (LUCENE-510) IndexOutput.writeString() should write length in bytes

Michael McCandless (JIRA) Mon, 17 Mar 2008 13:06:22 -0700

     [ 
https://issues.apache.org/jira/browse/LUCENE-510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Michael McCandless updated LUCENE-510:
--------------------------------------

    Attachment: LUCENE-510.take2.patch

New rev of the patch.  I think it's ready to commit.  I'll wait a few
days.

I made some performance improvements by factoring out a new
UnicodeUtil class that does not allocate new objects for every
conversion to/from UTF8.

One new issue I fixed is the handling of invalid UTF-16 strings.
Specifically, if the UTF-16 text has invalid surrogate pairs (for
example, an unpaired surrogate), standard UTF-8 cannot represent it
(unlike the current modified UTF-8 Lucene format).  I changed
DocumentsWriter and UnicodeUtil to substitute the replacement char
U+FFFD for each such invalid surrogate character.  This affects terms,
stored String fields and term vectors.
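
To illustrate the two points above (no per-conversion allocation, and
U+FFFD substitution for invalid surrogates), here is a minimal sketch.
It is not the code from the patch: the names Utf8Sketch, Utf8Result and
utf16ToUtf8 are hypothetical, and the sketch assumes the caller keeps
one result object around and reuses it across conversions.

```java
final class Utf8Sketch {

  /** Hypothetical reusable output holder; the caller keeps one and reuses it. */
  static final class Utf8Result {
    byte[] bytes = new byte[16];
    int length;
  }

  /** Encodes src[offset, offset+len) as UTF-8 into out, growing out.bytes only when needed. */
  static void utf16ToUtf8(char[] src, int offset, int len, Utf8Result out) {
    // Worst case is 3 bytes per UTF-16 code unit
    // (a valid surrogate pair is 2 units and needs only 4 bytes).
    int max = len * 3;
    if (out.bytes.length < max) {
      out.bytes = new byte[max];
    }
    byte[] b = out.bytes;
    int upto = 0;
    int end = offset + len;

    for (int i = offset; i < end; i++) {
      int code = src[i];
      if (code < 0x80) {
        b[upto++] = (byte) code;
      } else if (code < 0x800) {
        b[upto++] = (byte) (0xC0 | (code >> 6));
        b[upto++] = (byte) (0x80 | (code & 0x3F));
      } else if (code < 0xD800 || code > 0xDFFF) {
        b[upto++] = (byte) (0xE0 | (code >> 12));
        b[upto++] = (byte) (0x80 | ((code >> 6) & 0x3F));
        b[upto++] = (byte) (0x80 | (code & 0x3F));
      } else {
        // Surrogate range: only a high surrogate followed by a low surrogate is valid.
        if (code < 0xDC00 && i + 1 < end) {
          int low = src[i + 1];
          if (low >= 0xDC00 && low <= 0xDFFF) {
            int cp = 0x10000 + ((code - 0xD800) << 10) + (low - 0xDC00);
            i++; // consume the low surrogate as well
            b[upto++] = (byte) (0xF0 | (cp >> 18));
            b[upto++] = (byte) (0x80 | ((cp >> 12) & 0x3F));
            b[upto++] = (byte) (0x80 | ((cp >> 6) & 0x3F));
            b[upto++] = (byte) (0x80 | (cp & 0x3F));
            continue;
          }
        }
        // Unpaired surrogate: write U+FFFD, which is EF BF BD in UTF-8.
        b[upto++] = (byte) 0xEF;
        b[upto++] = (byte) 0xBF;
        b[upto++] = (byte) 0xBD;
      }
    }
    out.length = upto;
  }

  public static void main(String[] args) {
    Utf8Result result = new Utf8Result(); // allocated once, reused below
    char[][] inputs = { "abc".toCharArray(), {'a', '\uD800', 'b'} };
    for (char[] input : inputs) {
      utf16ToUtf8(input, 0, input.length, result);
      StringBuilder hex = new StringBuilder();
      for (int i = 0; i < result.length; i++) {
        hex.append(String.format("%02X ", result.bytes[i] & 0xFF));
      }
      System.out.println(hex.toString().trim());
    }
  }
}
```

With this sketch, the three chars 'a', a lone U+D800 and 'b' encode to
five bytes (61 EF BF BD 62), since the lone surrogate becomes U+FFFD.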

Indexing performance has a small slowdown (3.5%); details are below.

Unfortunately, time to enumerate terms was more affected.  I made a
simple test that enumerates all terms (~3.3 million) from the index
created by the indexing test described below:

  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.index.TermEnum;

  // Times a full enumeration of every term in the index at args[0].
  public class TestTermEnum {
    public static void main(String[] args) throws Exception {
      IndexReader r = IndexReader.open(args[0]);
      TermEnum terms = r.terms();
      int count = 0;
      long t0 = System.currentTimeMillis();
      while (terms.next()) {
        count++;
      }
      long t1 = System.currentTimeMillis();
      System.out.println(count + " terms in " + (t1-t0) + " millis");
      terms.close();
      r.close();
    }
  }

On trunk with the current index format this takes 3104 msec (best of
5).  With the patch and the UTF-8 index format it takes 3443 msec,
which is 10.9% slower.  I don't see any further ways to make this
faster.

Details on the indexing performance test:

  analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer
  
  doc.maker=org.apache.lucene.benchmark.byTask.feeds.LineDocMaker
  
  docs.file=/Volumes/External/lucene/wiki.txt
  doc.stored = true
  doc.term.vector = true
  doc.add.log.step=2000
  
  directory=FSDirectory
  autocommit=false
  compound=false
  
  ram.flush.mb=64
  
  { "Rounds"
    ResetSystemErase
    { "BuildIndex"
      CreateIndex
      { "AddDocs" AddDoc > : 200000
      - CloseIndex
    }
    NewRound
  } : 5
  
  RepSumByPrefRound BuildIndex

I ran it on a quad-core Intel Mac Pro with a 4-drive RAID 0 array,
running OS X 10.4.11 and Java 1.5, with these command-line args:

  -server -Xbatch -Xms1024m -Xmx1024m

Best of 5 with current trunk is 921.2 docs/sec and with the patch it's
888.7 docs/sec, a 3.5% slowdown.



> IndexOutput.writeString() should write length in bytes
> ------------------------------------------------------
>
>                 Key: LUCENE-510
>                 URL: https://issues.apache.org/jira/browse/LUCENE-510
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Store
>    Affects Versions: 2.1
>            Reporter: Doug Cutting
>            Assignee: Michael McCandless
>         Attachments: LUCENE-510.patch, LUCENE-510.take2.patch, 
> SortExternal.java, strings.diff, TestSortExternal.java
>
>
> We should change the format of strings written to indexes so that the length 
> of the string is in bytes, not Java characters.  This issue has been 
> discussed at:
> http://www.mail-archive.com/[email protected]/msg01970.html
> We must increment the file format number to indicate this change.  At least 
> the format number in the segments file should change.
> I'm targeting this for 2.1, i.e., we shouldn't commit it to trunk until 
> after 2.0 is released, to minimize incompatible changes between 1.9 and 2.0 
> (other than removal of deprecated features).
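
As an illustration of why the distinction in the issue description
matters, the sketch below compares a string's length in Java chars
with its length in UTF-8 bytes.  The class and method names are
hypothetical, and the sketch uses the JDK's own encoder rather than
anything from the patch.

```java
import java.nio.charset.StandardCharsets;

final class StringLengthSketch {

  /** Number of bytes the string occupies when encoded as standard UTF-8. */
  static int utf8Length(String s) {
    return s.getBytes(StandardCharsets.UTF_8).length;
  }

  public static void main(String[] args) {
    // ASCII, a 2-byte char, a 3-byte char plus ASCII, and one surrogate pair.
    String[] samples = {"lucene", "caf\u00E9", "\u20AC5", "\uD83D\uDE00"};
    for (String s : samples) {
      System.out.println(s.length() + " chars, " + utf8Length(s) + " UTF-8 bytes");
    }
  }
}
```

Only the pure-ASCII sample has equal counts, so a reader given a
length in chars has to decode the string to find where it ends,
whereas a length in bytes lets it read or skip the bytes directly.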

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


