[ https://issues.apache.org/jira/browse/LUCENE-4889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13615122#comment-13615122 ]
Dawid Weiss commented on LUCENE-4889:
-------------------------------------

Just pushed a version that doesn't do a table lookup for ASCII.

{code}
 implementation dataType          ns linear runtime
         LUCENE  UNICODE 167374240.6 ===============
         LUCENE    ASCII 333944799.0 ==============================
    LUCENE_MOD1  UNICODE 167449028.1 ===============
    LUCENE_MOD1    ASCII  77172139.4 ======
           JAVA  UNICODE   5755140.1 =
           JAVA    ASCII        23.9 =
    NOLOOKUP_IF  UNICODE  90220440.6 ========
    NOLOOKUP_IF    ASCII  29145155.3 ==
{code}

This cuts the time by 75% but is still far above the Java decoder, so it's not a single loop, I think, Uwe :)

Also: I'm not comparing full decoding, only codepoint counting. I also assumed valid UTF-8 input (which is what UnicodeUtil assumes anyway). Finally: I'm not advocating changing it; I'm just saying it's interesting *by how much* these timings differ.

> UnicodeUtil.codePointCount microbenchmarks (wtf)
> ------------------------------------------------
>
>                 Key: LUCENE-4889
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4889
>             Project: Lucene - Core
>          Issue Type: Task
>            Reporter: Dawid Weiss
>            Assignee: Dawid Weiss
>            Priority: Trivial
>             Fix For: 5.0
>
>
> This is interesting. I posted a link to a state-machine-based UTF-8
> parser/recognizer:
> http://bjoern.hoehrmann.de/utf-8/decoder/dfa/
> I spent some time thinking about whether the lookup table could be converted
> into a stateless computational function, which would avoid a table lookup
> (which in Java incurs an additional bounds check that will be hard to
> eliminate, I think). This didn't turn out to be easy (it boils down to
> finding a simple function that maps a set of integers to a concrete
> permutation of itself; a generalization of minimal perfect hashing).
> But out of curiosity I thought it'd be fun to compare how Lucene's codepoint
> counting compares to Java's built-in one (Decoder) and to a sequence of ifs.
> I've put together a Caliper benchmark that processes 50 million Unicode
> codepoints; one data set is ASCII only, the other full Unicode.
> The results are interesting. On my Win/i7:
>
> {code}
>  implementation dataType          ns linear runtime
>          LUCENE  UNICODE 167359502.6 ===============
>          LUCENE    ASCII 334015746.5 ==============================
> NOLOOKUP_SWITCH  UNICODE 154294141.8 =============
> NOLOOKUP_SWITCH    ASCII 119500892.8 ==========
>     NOLOOKUP_IF  UNICODE  90149072.6 ========
>     NOLOOKUP_IF    ASCII  29151411.4 ==
> {code}
>
> Disregard the switch lookup -- it's for fun only. But a sequence of ifs is
> significantly faster than Lucene's current table lookup, especially on
> ASCII input. And now compare this to Java's built-in decoder...
>
> {code}
> JAVA  UNICODE 5753930.1 =
> JAVA    ASCII      23.8 =
> {code}
>
> Yes, it's the same benchmark. Wtf? I realize buffers are partially native
> and probably so is the UTF-8 decoder, but by so much?! Again, to put it in
> context:
>
> {code}
>  implementation dataType          ns linear runtime
>          LUCENE  UNICODE 167359502.6 ===============
>          LUCENE    ASCII 334015746.5 ==============================
>            JAVA  UNICODE   5753930.1 =
>            JAVA    ASCII        23.8 =
>     NOLOOKUP_IF  UNICODE  90149072.6 ========
>     NOLOOKUP_IF    ASCII  29151411.4 ==
> NOLOOKUP_SWITCH  UNICODE 154294141.8 =============
> NOLOOKUP_SWITCH    ASCII 119500892.8 ==========
> {code}
>
> Wtf? The code is here if you want to experiment:
> https://github.com/dweiss/utf8dfa
> I realize the Java version needs to allocate a temporary buffer, but if
> these numbers hold across different VMs it may actually be worth it...

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira
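For readers following along: the "sequence of ifs" approach benchmarked above could be sketched roughly like this. This is my reconstruction, not the actual benchmark or Lucene code; like the benchmark, it assumes the input is valid UTF-8, so it only inspects lead bytes and skips continuation bytes by length.

{code}
import java.nio.charset.StandardCharsets;

public final class Utf8CodePointCounter {

  /**
   * Counts codepoints in a valid UTF-8 byte sequence using a
   * chain of ifs on the lead byte instead of a lookup table.
   */
  public static int codePointCount(byte[] utf8, int off, int len) {
    int count = 0;
    int i = off;
    final int end = off + len;
    while (i < end) {
      final int b = utf8[i] & 0xFF;
      if (b < 0x80) {          // 0xxxxxxx: 1-byte sequence (ASCII)
        i += 1;
      } else if (b < 0xE0) {   // 110xxxxx: 2-byte sequence
        i += 2;
      } else if (b < 0xF0) {   // 1110xxxx: 3-byte sequence
        i += 3;
      } else {                 // 11110xxx: 4-byte sequence
        i += 4;
      }
      count++;
    }
    return count;
  }

  public static void main(String[] args) {
    byte[] ascii = "hello".getBytes(StandardCharsets.UTF_8);
    byte[] mixed = "h\u00E9llo \u20AC".getBytes(StandardCharsets.UTF_8);
    System.out.println(codePointCount(ascii, 0, ascii.length)); // 5
    System.out.println(codePointCount(mixed, 0, mixed.length)); // 7 ("héllo €")
  }
}
{code}

Because each branch only adds a constant to the index, the JIT can keep the hot ASCII path (the first if) essentially branch-predicted and bounds-check-free, which is consistent with the large ASCII speedup reported in the tables above.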