[ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781746#action_12781746
 ] 

Robert Muir commented on LUCENE-1606:
-------------------------------------

Yonik, maybe we can use this trick?

UTF-8 in UTF-16 Order
The following comparison function for UTF-8 yields the same results as UTF-16
binary comparison. In the code, notice that it is necessary to do extra work
only once per string, not once per byte. That work can consist of simply
remapping through a small array; there are no extra conditional branches that
could slow down the processing.
{code}
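/* Indexed by a UTF-8 byte; yields that byte's sort key. Lead bytes 0xF0..0xF4
   (U+10000..U+10FFFF) get keys 0xEE..0xF2, and lead bytes 0xEE, 0xEF
   (U+E000..U+FFFF) get keys 0xF3, 0xF4. All other bytes map to themselves. */
static const unsigned char rotate[256] =
{0x00, ..., 0x0F,
0x10, ..., 0x1F,
. .
. .
. .
0xD0, ..., 0xDF,
0xE0, ..., 0xED, 0xF3, 0xF4,
0xEE, 0xEF, 0xF0, 0xF1, 0xF2, 0xF5, ..., 0xFF};

int strcmp8like16(const unsigned char* a, const unsigned char* b) {
  for (;;) {
    int ac = *a++;
    int bc = *b++;
    /* The remap happens only at the first differing byte, once per call. */
    if (ac != bc) return rotate[ac] - rotate[bc];
    if (ac == 0) return 0;
  }
}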
{code}

The rotate array is formed by taking an array of 256 bytes from 0x00 to 0xFF
and rotating 0xEE and 0xEF to a position after the bytes 0xF0..0xF4 in the
sort order. When this rotation is performed on the initial bytes of UTF-8, it
has the effect of making code points U+10000..U+10FFFF sort below
U+E000..U+FFFF, thus mimicking the ordering of UTF-16.
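
As a rough, self-contained sketch of the trick quoted above (not code from the patch), the table can be filled in at startup instead of being written out as a literal. The helper name init_rotate is hypothetical, and the table is treated as a map from byte value to sort key, so the lead bytes 0xF0..0xF4 receive keys below those of 0xEE and 0xEF:

```c
#include <stddef.h>

/* Sort key for each UTF-8 byte; filled in by init_rotate(). */
static unsigned char rotate[256];

/* Hypothetical helper: build the byte -> sort-key table once at startup. */
static void init_rotate(void) {
    for (int i = 0; i < 256; i++) {
        rotate[i] = (unsigned char)i;          /* identity for most bytes */
    }
    /* Lead bytes of U+10000..U+10FFFF move down by two positions... */
    for (int i = 0xF0; i <= 0xF4; i++) {
        rotate[i] = (unsigned char)(i - 2);    /* 0xEE..0xF2 */
    }
    /* ...and the lead bytes of U+E000..U+FFFF move just above them. */
    rotate[0xEE] = 0xF3;
    rotate[0xEF] = 0xF4;
}

/* Compare two NUL-terminated UTF-8 strings in UTF-16 code unit order.
   Assumes well-formed UTF-8 and that init_rotate() has been called. */
static int strcmp8like16(const unsigned char *a, const unsigned char *b) {
    for (;;) {
        int ac = *a++;
        int bc = *b++;
        if (ac != bc) return rotate[ac] - rotate[bc];
        if (ac == 0) return 0;
    }
}
```

After one call to init_rotate(), comparing the bytes EE 80 80 (U+E000) with F0 90 80 80 (U+10000) should give a positive result, the opposite of a plain strcmp. That matches UTF-16, where U+10000 starts with the lead surrogate 0xD800 and so sorts below the code unit 0xE000.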

> Automaton Query/Filter (scalable regex)
> ---------------------------------------
>
>                 Key: LUCENE-1606
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1606
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Search
>            Reporter: Robert Muir
>            Assignee: Robert Muir
>            Priority: Minor
>             Fix For: 3.1
>
>         Attachments: automaton.patch, automatonMultiQuery.patch, 
> automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, 
> automatonWithWildCard.patch, automatonWithWildCard2.patch, 
> BenchWildcard.java, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606_nodep.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if it's not 
> suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large 
> indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. 
> Additionally all of the existing RegexQuery implementations in Lucene are 
> really slow if there is no constant prefix. This implementation does not 
> depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
>  1. lexicography/etc on large text corpora
>  2. looking for things such as URLs where the prefix is not constant (http:// 
> or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert 
> regular expressions into a DFA. Then, the filter "enumerates" terms in a 
> special way, by using the underlying state machine. Here is my short 
> description from the comments:
>      The algorithm here is pretty basic. Enumerate terms but instead of a 
> binary accept/reject do:
>       
>      1. Look at the portion that is OK (did not enter a reject state in the 
> DFA)
>      2. Generate the next possible String and seek to that.
> The Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded 
> from http://www.brics.dk/automaton/ and is BSD-licensed.
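
The two-step enumeration in the description can be illustrated with a much simplified sketch. It does not use BRICS and is not the patch's code: the DFA is a hand-built table for (http|ftp)://.*, the term dictionary is a sorted array of C strings, and every name below is hypothetical. Instead of generating the next string the DFA accepts, it only seeks past all terms that share the rejected prefix, which is a weaker but still valid skip:

```c
#include <stddef.h>
#include <string.h>

#define MAX_STATES 32
#define MAX_TERM 63
#define DEAD (-1)

static int delta[MAX_STATES][256]; /* transition table; DEAD means reject */
static int accepting[MAX_STATES];
static int nstates;

static int new_state(void) {
    for (int b = 0; b < 256; b++) delta[nstates][b] = DEAD;
    accepting[nstates] = 0;
    return nstates++;
}

/* Add a path spelling the non-empty literal `lit` from `from` to `to`. */
static void add_literal(int from, const char *lit, int to) {
    int st = from;
    for (; lit[1] != '\0'; lit++) {
        unsigned char c = (unsigned char)*lit;
        if (delta[st][c] == DEAD) {
            int fresh = new_state();
            delta[st][c] = fresh;
        }
        st = delta[st][c];
    }
    delta[st][(unsigned char)*lit] = to;
}

/* Hand-built stand-in for the DFA of (http|ftp)://.* ; state 0 is the start. */
static void build_example_dfa(void) {
    nstates = 0;
    int start = new_state();
    int tail = new_state(); /* accepting state that loops on every byte */
    accepting[tail] = 1;
    for (int b = 1; b < 256; b++) delta[tail][b] = tail;
    add_literal(start, "http://", tail);
    add_literal(start, "ftp://", tail);
}

/* Step 1: length of the prefix of `term` that never enters a reject state. */
static size_t viable_prefix(const char *term, int *accepted) {
    int st = 0;
    size_t i = 0;
    for (; term[i] != '\0'; i++) {
        int next = delta[st][(unsigned char)term[i]];
        if (next == DEAD) { *accepted = 0; return i; }
        st = next;
    }
    *accepted = accepting[st];
    return i;
}

/* Index of the first term >= target in a strcmp-sorted array. */
static size_t seek_term(const char *const *terms, size_t n, const char *target) {
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (strcmp(terms[mid], target) < 0) lo = mid + 1; else hi = mid;
    }
    return lo;
}

/* Step 2: on a reject, seek to the smallest string after the dead prefix.
   Stores indexes of matching terms in out[] and returns how many matched;
   *visited counts the terms that were actually run through the DFA. */
static size_t enumerate(const char *const *terms, size_t n,
                        size_t *out, size_t *visited) {
    unsigned char target[MAX_TERM + 1];
    size_t pos = 0, hits = 0;
    *visited = 0;
    while (pos < n) {
        const char *t = terms[pos];
        int ok;
        size_t k = viable_prefix(t, &ok);
        unsigned char bad = (unsigned char)t[k]; /* 0 if no byte was rejected */
        (*visited)++;
        if (ok) { out[hits++] = pos++; continue; }
        if (bad == 0 || bad == 0xFF || k >= MAX_TERM) { pos++; continue; }
        memcpy(target, t, k);                    /* the portion that is OK */
        target[k] = (unsigned char)(bad + 1);    /* step over the bad byte */
        target[k + 1] = '\0';
        size_t next = seek_term(terms, n, (const char *)target);
        pos = next > pos ? next : pos + 1;       /* always make progress */
    }
    return hits;
}
```

With a dictionary such as apple, apricot, avocado, ftp://a, ftp://b, gopher://x, http://a, https://a, https://b, zebra, the rejected first byte of "apple" leads to a seek to "b", so "apricot" and "avocado" are never examined. A real implementation would use the automaton itself to pick the seek target and would therefore skip further.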

