Mike, could you clarify what you meant by the comment about int labels at the end of your last message? I fail to see the significance of having multibyte transition labels for the format of the payloads the automaton will run on...
Thanks!
Jta

On Mon, Oct 2, 2017, 12:41 Cristian Lorenzetto <cristian.lorenze...@gmail.com> wrote:

> It sounds a good way :) Maybe the code to develop it is not so huge. Thanks
> for the suggestions :)
>
> 2017-10-02 12:27 GMT+02:00 Michael McCandless <luc...@mikemccandless.com>:
>
> > I'm not sure this is exactly what you are asking, but Lucene's terms are
> > already byte[] (default UTF-8 encoded from char[] terms), and the automata
> > that are created for searching (e.g. by WildcardQuery, PrefixQuery,
> > FuzzyQuery, AutomatonQuery) are also byte based (see the crazy
> > UTF32ToUTF8.java conversion class). Lucene's Automaton class uses integer
> > labels on the transitions, so as long as you ensure those ints never fall
> > outside of an unsigned byte (0-255) then it's byte-based.
> >
> > Mike McCandless
> >
> > http://blog.mikemccandless.com
> >
> > On Sat, Sep 30, 2017 at 2:58 PM, Dawid Weiss <dawid.we...@gmail.com> wrote:
> >
> > > > Preface: I don't know how automaton is implemented deeply inside lucene,
> > >
> > > Well, you can take a look, it's open source. :) There are two
> > > different finite state automata inside Lucene: one is pretty much a
> > > "read-only" transducer from unique input sequences (of bytes) into an
> > > output. This is the FST<?> class. The other is Automaton class which
> > > has been ported from the Brics library [1].
> > >
> > > I can't really relate to your comment about fast querying for
> > > sub-automata; sounds interesting though. Dig in the code and suggest a
> > > patch (or even demonstrate what you came up with!).
> > >
> > > Dawid
> > >
> > > [1] http://www.brics.dk/automaton/
> > >
> > > > but (considering automaton is built on the fly when index is already
> > > > present) i imagine that the automaton is scanning the lexicons/tokens
> > > > present in the lucene index for finding the document references
> > > > (solution 1).
> > > > I think there are 2 different generic solutions for using automata for
> > > > my opinion.
> > > > 1) to create an automaton for parsing the token present in the lucene
> > > > table as described above.
> > > > 2) to create a pattern matching automaton (on binary, or better of an
> > > > abstract stream could be more generic) and put these states directly in
> > > > an index. In this case you can receive very fastly the documents matching
> > > > a specific automaton built when you created the index (or a sub-automaton
> > > > representing a subset of the same states). The second solution could
> > > > maybe be used for mapping inside a single lucene document field a complex
> > > > structure and then you can find nested information embedded. In this way
> > > > i need not to use multiple lucene documents (this could create
> > > > performance and scalability problems)
> > > > In many cases this solution could be fastest of actual joins for example,
> > > > be useful in bioinformatic or all those cases where data is not a basic
> > > > ADT.
> > > >
> > > > Cristian
> > > >
> > > > 2017-09-30 12:24 GMT+02:00 Dawid Weiss <dawid.we...@gmail.com>:
> > > >
> > > >> > Hi, it is possible to create a Automaton in lucene parsing not a
> > > >> > string but a byte array?
> > > >>
> > > >> Can you state what problem are you trying to solve? This seems to be a
> > > >> question stripped of a more general context -- why do you need those
> > > >> byte-based automata?
> > > >>
> > > >> Dawid
> > > >>
> > > >> ---------------------------------------------------------------------
> > > >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > > >> For additional commands, e-mail: java-user-h...@lucene.apache.org

--
sent from a phone. please excuse terseness and tpyos.
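
For readers following the thread, here is a minimal sketch of the point about integer transition labels staying inside the unsigned byte range. It deliberately does not use Lucene: the class `ByteAutomatonSketch` and its methods `createState`, `addTransition`, `accepts`, `forTerm` and `stateCount` are hypothetical names made up for illustration, and the real `Automaton` class may be organized quite differently. The only thing it tries to show is that labels are plain ints, that restricting them to 0-255 makes the automaton run over raw bytes, and that a non-ASCII character such as "é" then costs two transitions because its UTF-8 encoding is two bytes.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch for illustration only; this is NOT Lucene's Automaton API.
final class ByteAutomatonSketch {
    // One map per state: int label -> destination state.
    private final List<Map<Integer, Integer>> transitions = new ArrayList<>();
    private final List<Boolean> accepting = new ArrayList<>();

    // Adds a state and returns its number; state 0 is the start state.
    int createState(boolean accept) {
        transitions.add(new HashMap<>());
        accepting.add(accept);
        return transitions.size() - 1;
    }

    // Labels are ints, but only values that fit in an unsigned byte are allowed.
    void addTransition(int source, int label, int dest) {
        if (label < 0 || label > 255) {
            throw new IllegalArgumentException("label outside 0-255: " + label);
        }
        transitions.get(source).put(label, dest);
    }

    // Runs the automaton over raw bytes; each byte is read as unsigned.
    boolean accepts(byte[] input) {
        int state = 0;
        for (byte b : input) {
            Integer next = transitions.get(state).get(b & 0xFF);
            if (next == null) {
                return false;
            }
            state = next;
        }
        return accepting.get(state);
    }

    int stateCount() {
        return transitions.size();
    }

    // Builds an automaton that accepts exactly the UTF-8 bytes of one term.
    static ByteAutomatonSketch forTerm(String term) {
        ByteAutomatonSketch a = new ByteAutomatonSketch();
        byte[] bytes = term.getBytes(StandardCharsets.UTF_8);
        int state = a.createState(bytes.length == 0);
        for (int i = 0; i < bytes.length; i++) {
            int next = a.createState(i == bytes.length - 1);
            a.addTransition(state, bytes[i] & 0xFF, next);
            state = next;
        }
        return a;
    }

    public static void main(String[] args) {
        // "caf\u00e9" has 4 characters but 5 UTF-8 bytes (the last char is 0xC3 0xA9).
        ByteAutomatonSketch a = forTerm("caf\u00e9");
        System.out.println(a.stateCount()); // prints 6
        System.out.println(a.accepts("caf\u00e9".getBytes(StandardCharsets.UTF_8))); // prints true
        System.out.println(a.accepts("cafe".getBytes(StandardCharsets.UTF_8))); // prints false
    }
}
```

Under these assumptions the payload format does not matter to the automaton itself: whatever produced the byte[] (UTF-8 text or some binary encoding), matching is just a walk over labels in 0-255.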