[ https://issues.apache.org/jira/browse/LUCENE-4889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13615122#comment-13615122 ]
Dawid Weiss commented on LUCENE-4889:
-------------------------------------

Just pushed a version that doesn't do a table lookup for ASCII.

{code}
 implementation dataType          ns linear runtime
         LUCENE  UNICODE 167374240.6 ===============
         LUCENE    ASCII 333944799.0 ==============================
    LUCENE_MOD1  UNICODE 167449028.1 ===============
    LUCENE_MOD1    ASCII  77172139.4 ======
           JAVA  UNICODE   5755140.1 =
           JAVA    ASCII        23.9 =
    NOLOOKUP_IF  UNICODE  90220440.6 ========
    NOLOOKUP_IF    ASCII  29145155.3 ==
{code}

This cuts the time by 75% but is still far above the Java decoder, so it's not a single loop, I think, Uwe :)

Also: I'm not comparing full decoding, only codepoint counting. I also assumed valid UTF-8 input (which is what UnicodeUtil assumes anyway). Finally: I'm not advocating changing it; I'm just saying it's interesting *by how much* these timings differ.

> UnicodeUtil.codePointCount microbenchmarks (wtf)
> ------------------------------------------------
>
>                 Key: LUCENE-4889
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4889
>             Project: Lucene - Core
>          Issue Type: Task
>            Reporter: Dawid Weiss
>            Assignee: Dawid Weiss
>            Priority: Trivial
>             Fix For: 5.0
>
>
> This is interesting. I posted a link to a state-machine-based UTF-8
> parser/recognizer:
> http://bjoern.hoehrmann.de/utf-8/decoder/dfa/
> I spent some time thinking about whether the lookup table could be converted
> into a stateless computational function, which would avoid a table lookup
> (which in Java incurs an additional bounds check that will be hard to
> eliminate, I think). This didn't turn out to be easy (it boils down to
> finding a simple function that maps a set of integers to a concrete
> permutation of itself; a generalization of minimal perfect hashing).
> But out of curiosity I thought it'd be fun to compare how Lucene's codepoint
> counting compares to Java's built-in one (Decoder) and to a sequence of ifs.
> I've put together a Caliper benchmark that processes 50 million Unicode
> codepoints; one data set is ASCII only, the other full Unicode.
> The results are interesting. On my Win/i7:
>
> {code}
>  implementation dataType          ns linear runtime
>          LUCENE  UNICODE 167359502.6 ===============
>          LUCENE    ASCII 334015746.5 ==============================
> NOLOOKUP_SWITCH  UNICODE 154294141.8 =============
> NOLOOKUP_SWITCH    ASCII 119500892.8 ==========
>     NOLOOKUP_IF  UNICODE  90149072.6 ========
>     NOLOOKUP_IF    ASCII  29151411.4 ==
> {code}
>
> Disregard the switch lookup -- it's for fun only. But a sequence of ifs is
> significantly faster than Lucene's current table lookup, especially on
> ASCII input. And now compare this to Java's built-in decoder...
>
> {code}
> JAVA  UNICODE 5753930.1 =
> JAVA    ASCII      23.8 =
> {code}
>
> Yes, it's the same benchmark. Wtf? I realize buffers are partially native
> and probably so is the UTF-8 decoder, but by so much?! Again, to put it in
> context:
>
> {code}
>  implementation dataType          ns linear runtime
>          LUCENE  UNICODE 167359502.6 ===============
>          LUCENE    ASCII 334015746.5 ==============================
>            JAVA  UNICODE   5753930.1 =
>            JAVA    ASCII        23.8 =
>     NOLOOKUP_IF  UNICODE  90149072.6 ========
>     NOLOOKUP_IF    ASCII  29151411.4 ==
> NOLOOKUP_SWITCH  UNICODE 154294141.8 =============
> NOLOOKUP_SWITCH    ASCII 119500892.8 ==========
> {code}
>
> Wtf? The code is here if you want to experiment:
> https://github.com/dweiss/utf8dfa
> I realize the Java version needs to allocate a temporary buffer, but if
> these numbers hold across different VMs it may actually be worth it...

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira
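For readers following along: the "sequence of ifs" approach benchmarked above could be sketched roughly like this. This is my reconstruction, not the actual benchmark or Lucene code; like the benchmark, it assumes the input is valid UTF-8, so it only inspects lead bytes and skips continuation bytes by length.

{code}
import java.nio.charset.StandardCharsets;

public final class Utf8CodePointCounter {

  /**
   * Counts codepoints in a valid UTF-8 byte sequence using a
   * chain of ifs on the lead byte instead of a lookup table.
   */
  public static int codePointCount(byte[] utf8, int off, int len) {
    int count = 0;
    int i = off;
    final int end = off + len;
    while (i < end) {
      final int b = utf8[i] & 0xFF;
      if (b < 0x80) {          // 0xxxxxxx: 1-byte sequence (ASCII)
        i += 1;
      } else if (b < 0xE0) {   // 110xxxxx: 2-byte sequence
        i += 2;
      } else if (b < 0xF0) {   // 1110xxxx: 3-byte sequence
        i += 3;
      } else {                 // 11110xxx: 4-byte sequence
        i += 4;
      }
      count++;
    }
    return count;
  }

  public static void main(String[] args) {
    byte[] ascii = "hello".getBytes(StandardCharsets.UTF_8);
    byte[] mixed = "h\u00E9llo \u20AC".getBytes(StandardCharsets.UTF_8);
    System.out.println(codePointCount(ascii, 0, ascii.length)); // 5
    System.out.println(codePointCount(mixed, 0, mixed.length)); // 7 ("héllo €")
  }
}
{code}

Because each branch only adds a constant to the index, the JIT can keep the hot ASCII path (the first if) essentially branch-predicted and bounds-check-free, which is consistent with the large ASCII speedup reported in the tables above.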