If you show the automaton with toDot or toString it should be clear where those codepoints come from.
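
One way to read the extra values: the array holds interval start points, so each single-character transition contributes both its own codepoint and the codepoint just after it, where the next interval begins. The plain-Java sketch below does not use Lucene. It assumes start points are collected as 0, every transition's min, and every transition's max + 1, which is an assumption about the library's internals. The class StartPointsSketch and the helper startPoints are hypothetical names used only for illustration.

```java
import java.util.TreeSet;

public class StartPointsSketch {

    // Hypothetical helper: each row is an inclusive {min, max} codepoint range,
    // standing in for one transition label of the automaton.
    static int[] startPoints(int[][] ranges) {
        TreeSet<Integer> points = new TreeSet<>();
        // The first interval always starts at the lowest codepoint.
        points.add(Character.MIN_CODE_POINT);
        for (int[] r : ranges) {
            // A range starts a new interval at its min ...
            points.add(r[0]);
            // ... and the codepoint after its max starts the following interval.
            if (r[1] < Character.MAX_CODE_POINT) {
                points.add(r[1] + 1);
            }
        }
        return points.stream().mapToInt(Integer::intValue).toArray();
    }

    public static void main(String[] args) {
        // The literal characters of the regex from the question,
        // each treated as a single-codepoint range {c, c}.
        String literals = "ij\uE001k789opq";
        int[][] ranges = literals.codePoints()
                .mapToObj(c -> new int[] {c, c})
                .toArray(int[][]::new);

        for (int p : startPoints(ranges)) {
            System.out.println(Integer.toHexString(p));
        }
    }
}
```

Under that assumption, 0x3a, 0x6c, 0x72 and 0xe002 are simply where the intervals following 9, k, q and 0xe001 begin. The other "+1" points (after 7, 8, i, j, o, p) coincide with characters that already appear in the regex, so they do not show up as extra entries.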

- Anders

On 04-08-2012 02:34, Ashwin Jayaprakash wrote:
Hi, I was playing with the RunAutomaton class and I was not sure about
the meaning of the results returned by the
RunAutomaton.getCharIntervals() method.

The JavaDoc for that method says "Returns array of codepoint class
interval start points.". I tried it on a simple regex string
("ij{2,5}\uE001k789opq") and I couldn't explain why there were 4 extra
values returned - 0x3a (:), 0x6c (l), 0x72 (r) and 0xe002 (Unicode
private use codepoint). These 4 characters were +1 step from the
characters 9, k, q and 0xe001 respectively, all of which are in the
regex from which the automaton was built.

Does anyone know why this is happening? All the codepoints in the regex
pattern have a length of just 1 char. So, why the extra chars?

What I was really trying to do was to extract the identifiers in the
pattern, which this method almost does except for some inexplicable,
extra values. I was really looking for an array with "7, 8, 9, i, j, k,
o, p, q, 0xe001".

Code:
   import org.apache.lucene.util.automaton.Automaton;
   import org.apache.lucene.util.automaton.RegExp;
   import org.apache.lucene.util.automaton.RunAutomaton;

   ... ..

       public static void main(String[] args) {
           String s = "ij{2,5}\uE001k789opq";

           RegExp r = new RegExp(s);
           Automaton a = r.toAutomaton();
           RunAutomaton ra = new RunAutomaton(a, Character.MAX_CODE_POINT, false) {
           };

           System.out.println("Char intervals for: " + s);
           for (int i : ra.getCharIntervals()) {
               System.out.println("  " + Integer.toHexString(i) + " = "
                       + new String(Character.toChars(i)));
           }
       }

Output:
   Char intervals for: ij{2,5}?k789opq
     0 =
     37 = 7
     38 = 8
     39 = 9
     3a = :
     69 = i
     6a = j
     6b = k
     6c = l
     6f = o
     70 = p
     71 = q
     72 = r
     e001 = ?
     e002 = ?


Thanks,
Ashwin.


--
Anders Moeller
[email protected]
http://cs.au.dk/~amoeller
