That sounds like a good way :) Maybe the code needed to develop it is not so large. Thanks for the suggestions :)

2017-10-02 12:27 GMT+02:00 Michael McCandless <luc...@mikemccandless.com>:

> I'm not sure this is exactly what you are asking, but Lucene's terms are
> already byte[] (default UTF-8 encoded from char[] terms), and the automata
> that are created for searching (e.g. by WildcardQuery, PrefixQuery,
> FuzzyQuery, AutomatonQuery) are also byte based (see the crazy
> UTF32ToUTF8.java conversion class). Lucene's Automaton class uses integer
> labels on the transitions, so as long as you ensure those ints never fall
> outside of an unsigned byte (0-255) then it's byte-based.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> On Sat, Sep 30, 2017 at 2:58 PM, Dawid Weiss <dawid.we...@gmail.com>
> wrote:
>
> > > Preface: I don't know how the automaton is implemented deep inside
> > > Lucene,
> >
> > Well, you can take a look, it's open source. :) There are two
> > different finite state automata inside Lucene: one is pretty much a
> > "read-only" transducer from unique input sequences (of bytes) into an
> > output. This is the FST<?> class. The other is the Automaton class,
> > which has been ported from the Brics library [1].
> >
> > I can't really relate to your comment about fast querying for
> > sub-automata; sounds interesting though. Dig in the code and suggest a
> > patch (or even demonstrate what you came up with!).
> >
> > Dawid
> >
> > [1] http://www.brics.dk/automaton/
> >
> > > but (considering the automaton is built on the fly when the index is
> > > already present) I imagine that the automaton scans the
> > > lexicons/tokens present in the Lucene index to find the document
> > > references (solution 1).
> > > In my opinion there are 2 different generic solutions for using
> > > automata.
> > > 1) to create an automaton for parsing the tokens present in the
> > > Lucene table, as described above.
> > > 2) to create a pattern-matching automaton (over binary data, or
> > > better, over an abstract stream, which could be more generic) and put
> > > its states directly in an index. In this case you can very quickly
> > > retrieve the documents matching a specific automaton built when you
> > > created the index (or a sub-automaton representing a subset of the
> > > same states). The second solution could maybe be used for mapping a
> > > complex structure inside a single Lucene document field, so that you
> > > can then find nested, embedded information. In this way I need not
> > > use multiple Lucene documents (which could create performance and
> > > scalability problems).
> > > In many cases this solution could be faster than actual joins, for
> > > example, and be useful in bioinformatics or in all those cases where
> > > the data is not a basic ADT.
> > >
> > > Cristian
> > >
> > > 2017-09-30 12:24 GMT+02:00 Dawid Weiss <dawid.we...@gmail.com>:
> > >
> > >> > Hi, is it possible to create an Automaton in Lucene by parsing
> > >> > not a string but a byte array?
> > >>
> > >> Can you state what problem you are trying to solve? This seems to
> > >> be a question stripped of a more general context -- why do you need
> > >> those byte-based automata?
> > >>
> > >> Dawid
> > >>
> > >> ---------------------------------------------------------------------
> > >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > >> For additional commands, e-mail: java-user-h...@lucene.apache.org
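
For reference, Mike's point about integer labels kept within an unsigned byte (0-255) can be sketched in plain Java. This is only a rough illustration and not Lucene's Automaton class: the class name ByteAutomatonSketch and its methods are hypothetical, chosen for this example. It also assumes the label ranges leaving a state do not overlap, so the first matching transition is the only one.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, minimal byte-labelled automaton; NOT Lucene's Automaton class. */
final class ByteAutomatonSketch {

    /** One transition: any label in [min, max] moves source -> dest. */
    private static final class Transition {
        final int source, dest, min, max;

        Transition(int source, int dest, int min, int max) {
            this.source = source;
            this.dest = dest;
            this.min = min;
            this.max = max;
        }
    }

    private final List<Transition> transitions = new ArrayList<>();
    private final List<Boolean> accept = new ArrayList<>();

    /** Adds a state and returns its id; state 0 is the initial state. */
    int createState() {
        accept.add(Boolean.FALSE);
        return accept.size() - 1;
    }

    void setAccept(int state, boolean isAccept) {
        accept.set(state, isAccept);
    }

    /** Labels are ints, but anything outside an unsigned byte is rejected. */
    void addTransition(int source, int dest, int min, int max) {
        if (min < 0 || max > 255 || min > max) {
            throw new IllegalArgumentException("label range must stay within 0-255");
        }
        if (source < 0 || source >= accept.size() || dest < 0 || dest >= accept.size()) {
            throw new IllegalArgumentException("unknown state");
        }
        transitions.add(new Transition(source, dest, min, max));
    }

    /** Runs the automaton over raw bytes; assumes non-overlapping ranges per state. */
    boolean run(byte[] input) {
        if (accept.isEmpty()) {
            return false; // no states at all, nothing can match
        }
        int state = 0;
        for (byte b : input) {
            int label = b & 0xFF; // Java bytes are signed; map to 0-255
            int next = -1;
            for (Transition t : transitions) {
                if (t.source == state && label >= t.min && label <= t.max) {
                    next = t.dest;
                    break;
                }
            }
            if (next < 0) {
                return false; // no transition for this byte
            }
            state = next;
        }
        return accept.get(state);
    }

    public static void main(String[] args) {
        ByteAutomatonSketch a = new ByteAutomatonSketch();
        int start = a.createState();
        int mid = a.createState();
        int end = a.createState();
        a.setAccept(end, true);
        a.addTransition(start, mid, 0xCA, 0xCA); // first byte must be 0xCA
        a.addTransition(mid, end, 0x80, 0xFF);   // second byte: any value 128-255
        System.out.println(a.run(new byte[] {(byte) 0xCA, (byte) 0xFE})); // true
        System.out.println(a.run(new byte[] {(byte) 0xCA, 0x01}));        // false
    }
}
```

The `b & 0xFF` step is the part that matters: it keeps every label inside 0-255, so the same int-labelled structure works on arbitrary binary input rather than on characters.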