[jira] [Commented] (LUCENE-5014) ANTLR Lucene query parser
[ https://issues.apache.org/jira/browse/LUCENE-5014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13698970#comment-13698970 ]

Erik Hatcher commented on LUCENE-5014:
--------------------------------------

Roman - I see SOLR mentions in the patch, but this patch is purely at the Lucene module level, right? At the least, those mentions should be removed; does anything else need adjusting? Is there Solr QParserPlugin stuff you're contributing as well?

ANTLR Lucene query parser
-------------------------

Key: LUCENE-5014
URL: https://issues.apache.org/jira/browse/LUCENE-5014
Project: Lucene - Core
Issue Type: Improvement
Components: core/queryparser, modules/queryparser
Affects Versions: 4.3
Environment: all
Reporter: Roman Chyla
Labels: antlr, query, queryparser
Attachments: LUCENE-5014.txt, LUCENE-5014.txt, LUCENE-5014.txt, LUCENE-5014.txt

I would like to propose a new way of building query parsers for Lucene. Currently, most Lucene parsers are hard to extend because they are either written in Java (i.e. the SOLR query parser, or edismax) or the parsing logic is 'married' with the query building logic (i.e. the standard lucene parser, generated by JavaCC) - which makes any extension really hard.

A few years back, Lucene got the contrib/modern query parser (later renamed to 'flexible'), yet that parser didn't become a star (it must be very confusing for many users). However, that parsing framework is very powerful! And it is a real pity that there aren't more parsers already using it - because it allows us to add/extend/change almost any aspect of the query parsing.

So, if we combine ANTLR + queryparser.flexible, we can get a very powerful framework for building almost any query language one can think of. And I hope this extension can become useful.

The details:
- every new query syntax is written in EBNF; it lives in separate files (and can be tested/developed independently - using 'gunit')
- ANTLR generates the parsing code (and it can generate parsers in several languages; the main target is Java, but it can also do Python - which may be interesting for pylucene)
- the parser generates an AST (abstract syntax tree) which is consumed by a 'pipeline' of processors; users can easily modify this pipeline to add the desired functionality
- the new parser contains a few (very important) debugging functions; it can print the results of every stage of the build and generate ASTs as graphical charts; ant targets help to build/test/debug grammars
- I've tried to reuse the existing queryparser.flexible components as much as possible, only adding new processors when necessary

Assumptions about the grammar:
- every grammar must have one top parse rule called 'mainQ'
- parsers must generate an AST (Abstract Syntax Tree)

The structure of the AST is left open. There are components which make assumptions about the shape of the AST (i.e. that MODIFIER is the parent of a FIELD); however, users are free to choose/write different processors with different assumptions about the AST shape.

More documentation on how to use the parser can be seen here: http://29min.wordpress.com/category/antlrqueryparser/

The parser was created more than a year ago and is used in production (http://labs.adsabs.harvard.edu/adsabs/). Different dialects of query languages (with proximity operators, functions, special logic etc.) can be seen here:
https://github.com/romanchyla/montysolr/tree/master/contrib/adsabs
https://github.com/romanchyla/montysolr/tree/master/contrib/invenio

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

--
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
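To make the 'pipeline of processors' idea from the issue description a bit more concrete, here is a minimal plain-Java sketch. It does not use the queryparser.flexible classes or anything from the patch: AstNode, AstPipeline and both example steps are hypothetical names invented for illustration, and the node labels are only borrowed from the MODIFIER/FIELD example in the description. The point is just that each step takes a tree and returns a tree, so steps can be added, removed or reordered without touching the grammar.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.UnaryOperator;

// Hypothetical, simplified AST node; NOT a class from the patch.
final class AstNode {
    final String label;                       // e.g. "MODIFIER", "FIELD", "TERM"
    final String text;                        // token text, may be empty
    final List<AstNode> children = new ArrayList<>();

    AstNode(String label, String text, AstNode... kids) {
        this.label = label;
        this.text = text;
        for (AstNode k : kids) {
            children.add(k);
        }
    }

    @Override
    public String toString() {
        StringBuilder sb = new StringBuilder("(" + label);
        if (!text.isEmpty()) {
            sb.append(' ').append(text);
        }
        for (AstNode c : children) {
            sb.append(' ').append(c);
        }
        return sb.append(')').toString();
    }
}

// Hypothetical pipeline: an ordered list of tree-to-tree steps.
final class AstPipeline {
    private final List<UnaryOperator<AstNode>> steps = new ArrayList<>();

    AstPipeline add(UnaryOperator<AstNode> step) {
        steps.add(step);
        return this;
    }

    AstNode run(AstNode root) {
        AstNode current = root;
        for (UnaryOperator<AstNode> step : steps) {
            current = step.apply(current);
        }
        return current;
    }

    // Example step: copy the tree, lower-casing the text of TERM nodes.
    static AstNode lowercaseTerms(AstNode n) {
        String text = n.label.equals("TERM") ? n.text.toLowerCase(Locale.ROOT) : n.text;
        AstNode copy = new AstNode(n.label, text);
        for (AstNode c : n.children) {
            copy.children.add(lowercaseTerms(c));
        }
        return copy;
    }

    // Example step: replace a MODIFIER that carries no modifier text by its only child.
    static AstNode unwrapEmptyModifiers(AstNode n) {
        if (n.label.equals("MODIFIER") && n.text.isEmpty() && n.children.size() == 1) {
            return unwrapEmptyModifiers(n.children.get(0));
        }
        AstNode copy = new AstNode(n.label, n.text);
        for (AstNode c : n.children) {
            copy.children.add(unwrapEmptyModifiers(c));
        }
        return copy;
    }

    public static void main(String[] args) {
        AstNode ast = new AstNode("MODIFIER", "",
                new AstNode("FIELD", "title", new AstNode("TERM", "Dog")));
        AstNode out = new AstPipeline()
                .add(AstPipeline::lowercaseTerms)
                .add(AstPipeline::unwrapEmptyModifiers)
                .run(ast);
        System.out.println(out);
    }
}
```

A processor written against a different AST shape would simply be another step added with add(); nothing in the sketch depends on a particular grammar.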
[jira] [Commented] (LUCENE-5014) ANTLR Lucene query parser
[ https://issues.apache.org/jira/browse/LUCENE-5014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13698981#comment-13698981 ]

Roman Chyla commented on LUCENE-5014:
-------------------------------------

Hi Erik, I'll add a Solr qparser plugin too. Thanks for reminding me.
[jira] [Commented] (LUCENE-5014) ANTLR Lucene query parser
[ https://issues.apache.org/jira/browse/LUCENE-5014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13698985#comment-13698985 ]

Roman Chyla commented on LUCENE-5014:
-------------------------------------

Will it be OK to include the Solr parts in this ticket? Apart from the JIRA name, that seems like the best option to me.
[jira] [Commented] (LUCENE-5014) ANTLR Lucene query parser
[ https://issues.apache.org/jira/browse/LUCENE-5014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13699005#comment-13699005 ]

Erik Hatcher commented on LUCENE-5014:
--------------------------------------

bq. will it be OK to include the solr parts in this ticket?

Seems the best way to do it to me as well. It's probably not more than a few lines of code as a thin shim factory.
[jira] [Commented] (LUCENE-5014) ANTLR Lucene query parser
[ https://issues.apache.org/jira/browse/LUCENE-5014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13699417#comment-13699417 ]

Roman Chyla commented on LUCENE-5014:
-------------------------------------

New addition: Solr qparser plugin. It is unfortunately not as easy as one may think, because of various defaults - e.g. a user may want to specify a different defaultField, whether wildcards are allowed at the beginning, what the maximum range for proximity values is... Some of these should live only in solrconfig.xml, and some should also be settable in query params. So here is a stab at it; it works, but may require more config options. There is also a new unit test.

Unfortunately the Ivy mirrors decided not to work just now (ughhh), so I could not run the Solr unit tests - I hope they work. Lucene's 'ant test' went fine.

If somebody wants to try it in Solr, please make sure you have antlr-runtime.jar in your Solr libs; this should go inside solrconfig.xml:

{code}
<queryParser name="lucene2" class="AqpLuceneQParserPlugin">
  <lst name="defaults">
    <str name="defaultField">text</str>
  </lst>
</queryParser>
{code}
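As an illustration of the "some only in solrconfig.xml, some also in query params" problem, here is a small plain-Java sketch of one way such defaults could be resolved. It is not taken from the patch and uses no Solr classes: ParserOptions is a hypothetical helper, and apart from defaultField the option names (allowLeadingWildcard, maxProximity) are made up. Which options a request may override is also only an assumption for the example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical helper; NOT part of the patch or of Solr.
final class ParserOptions {
    // Assumption for this sketch: only these options may be overridden per request.
    private static final Set<String> REQUEST_OVERRIDABLE =
            new HashSet<>(Arrays.asList("defaultField"));

    // Start from the configured defaults, then apply the permitted request overrides.
    static Map<String, String> resolve(Map<String, String> configDefaults,
                                       Map<String, String> requestParams) {
        Map<String, String> resolved = new TreeMap<>(configDefaults);
        for (Map.Entry<String, String> e : requestParams.entrySet()) {
            if (REQUEST_OVERRIDABLE.contains(e.getKey())) {
                resolved.put(e.getKey(), e.getValue());
            }
        }
        return resolved;
    }

    public static void main(String[] args) {
        Map<String, String> config = new TreeMap<>();
        config.put("defaultField", "text");
        config.put("allowLeadingWildcard", "false");   // made-up option name
        config.put("maxProximity", "5");               // made-up option name

        Map<String, String> request = new TreeMap<>();
        request.put("defaultField", "title");          // allowed override
        request.put("allowLeadingWildcard", "true");   // ignored: config-only in this sketch

        System.out.println(resolve(config, request));
    }
}
```

Keeping the list of overridable options in one place makes it easy to see, and to test, what a query string can and cannot change.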
[jira] [Commented] (LUCENE-5014) ANTLR Lucene query parser
[ https://issues.apache.org/jira/browse/LUCENE-5014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13695149#comment-13695149 ]

Roman Chyla commented on LUCENE-5014:
-------------------------------------

Adding an example: the standard Lucene grammar extended with NEAR operators (as discussed above). This should illustrate how easy it is to extend/modify/add a new query dialect. Handling of NEAR operators is not at all trivial, so I hope you will have some fun realizing it can be done in two lines ;)

{code}
setGrammarName("ExtendedLuceneGrammar");
((AqpQueryTreeBuilder) qp.getQueryBuilder()).setBuilder(AqpNearQueryNode.class, new AqpNearQueryNodeBuilder());
{code}

Have a look at TestAqpExtendedLGSimple.
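For readers who have not seen the builder mechanism, the setBuilder(...) call above registers a builder under a node class, so the tree builder can look up the right one for each node it visits. Below is a self-contained plain-Java sketch of that idea only. BuilderRegistry, TermNode and NearNode are hypothetical stand-ins (not the Aqp* classes from the patch), the distance field is an assumption, and the builders return strings where a real builder would return a Lucene query object.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical node types; NOT the classes from the patch.
final class TermNode {
    final String text;

    TermNode(String text) {
        this.text = text;
    }
}

final class NearNode {
    final Object left;
    final Object right;
    final int distance;   // assumed field, for illustration only

    NearNode(Object left, Object right, int distance) {
        this.left = left;
        this.right = right;
        this.distance = distance;
    }
}

// Hypothetical registry: one builder per node class, looked up at build time.
final class BuilderRegistry {
    private final Map<Class<?>, Function<Object, String>> builders = new HashMap<>();

    <T> void setBuilder(Class<T> nodeType, Function<? super T, String> builder) {
        // The cast is safe because build() only dispatches on the exact node class.
        builders.put(nodeType, node -> builder.apply(nodeType.cast(node)));
    }

    String build(Object node) {
        Function<Object, String> builder = builders.get(node.getClass());
        if (builder == null) {
            throw new IllegalStateException(
                    "No builder registered for " + node.getClass().getSimpleName());
        }
        return builder.apply(node);
    }

    public static void main(String[] args) {
        BuilderRegistry registry = new BuilderRegistry();
        registry.setBuilder(TermNode.class, t -> t.text);
        // Supporting a new operator is one extra registration.
        registry.setBuilder(NearNode.class, n ->
                "near(" + registry.build(n.left) + ", " + registry.build(n.right)
                        + ", " + n.distance + ")");
        System.out.println(
                registry.build(new NearNode(new TermNode("dog"), new TermNode("cat"), 5)));
    }
}
```

Without the second registration, building a NearNode fails with an explicit error rather than silently producing a wrong query, which is one reason to keep the mapping explicit.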
[jira] [Commented] (LUCENE-5014) ANTLR Lucene query parser
[ https://issues.apache.org/jira/browse/LUCENE-5014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13667908#comment-13667908 ]

Roman Chyla commented on LUCENE-5014:
-------------------------------------

Hi David,

In practical terms ANTLR can do exactly the same thing as PEG (i.e. lookahead, backtracking, memoization) - see http://stackoverflow.com/questions/8816759/ll-versus-peg-parsers-what-is-the-difference

But it is also capable of doing more than PEG (e.g. better error recovery - a PEG parser needs to parse the whole tree before it discovers an error, so its error recovery is not the same thing).

PEGs can be easier *especially* because of the first-choice operator; in fact, at times I wished that ANTLR just chose the first available option (well, it does, but it reports an error, and I didn't want to have a grammar with errors). So, in the CFG/ANTLR world, ambiguity is resolved using syntactic predicates (lookahead).

So far this has been theoretical; here are a few more points:

Clarity
=======

I looked at the presentation and the parser contains the operator precedence; however, there it is spread across several screens of Java code. I find the following much more readable:

{code}
mainQ : clauseOr+ EOF ;
clauseOr : clauseAnd (or clauseAnd)* ;
clauseAnd : clauseNot (and clauseNot)* ;
{code}

It is essentially the same thing, but it is independent of Java, I can see it in a few lines, and I can extend it by adding a few more lines. The patch I wrote makes the handling of the separate grammar and the generated code seamless. So the 2/3 advantages of PEG over ANTLR disappear.

Syntax vs semantics (business logic)
====================================

The example from the presentation needs to be much more involved if it is to be used in real life. Consider this query:

{noformat}
dog NEAR cat
{noformat}

This is going to work only in the simplest case, where each term is a single TermQuery. Yet if there is synonym expansion (where it would go inside the PEG parser is one question), the parser needs to *rewrite* the query into something like:

{noformat}
(dog|canin) NEAR cat --> (dog NEAR cat) OR (canin NEAR cat)
{noformat}

So there you get the 'spaghetti problem': in the example presented, the logic that rewrites the query must reside in the same place as the query parsing. That is not an improvement IMO; it is the same thing as the old Lucene parsers written in JavaCC, which are very difficult to extend or debug.

I think I'll add a new grammar with the proximity operators so that you can see how easy it is to solve the same situation with ANTLR (but you will need to read the patch this time ;)).

Btw, the patch is big because I included the HTML with SVG charts of the generated parse trees and one Excel file (that one helps in writing unit tests for the grammar).

Developer vs user experience
============================

I think PEG definitely looks simpler (in the presented example) and its main advantage is the first-choice operator. But since ANTLR can do the same and has a grammar that is independent of the programming language, it can do the same job. The difference may be in the maturity of the project, the tools available (i.e. debuggers) - and of course the implementation (see the link above for details).

I can imagine that for PEG you can use your IDE of choice, while with ANTLR there is this 'pesky' level of abstraction - but there are tools that make life bearable, such as ANTLRWorks or the Eclipse ANTLR debugger (though I have not liked that one), plus grammar unit tests, and I added ways to debug/view the grammar. Again, I recommend trying it, e.g.:

{code}
ant -f aqp-build.xml gunit
# edit StandardLuceneGrammar and save as 'mytestgrammar'
ant -f aqp-build.xml try-view -Dquery="foo NEAR bar" -Dgrammar=mytestgrammar
{code}

There may of course be more things to consider, but I believe the 3 issues above present some interesting vantage points.
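The rewrite discussed above, (dog|canin) NEAR cat into (dog NEAR cat) OR (canin NEAR cat), can be written as a step that runs on the tree after parsing, separate from the grammar. Here is a minimal plain-Java sketch of such a step. It is not the code from the patch: QNode and NearDistributor are hypothetical names, the tree has only TERM, OR and NEAR nodes, and NEAR is assumed to be binary.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical query tree node; NOT a class from the patch.
final class QNode {
    final String op;            // "TERM", "OR" or "NEAR"
    final String text;          // only used by TERM
    final List<QNode> kids;

    private QNode(String op, String text, List<QNode> kids) {
        this.op = op;
        this.text = text;
        this.kids = kids;
    }

    static QNode term(String text) {
        return new QNode("TERM", text, new ArrayList<QNode>());
    }

    static QNode or(QNode... kids) {
        return new QNode("OR", "", Arrays.asList(kids));
    }

    static QNode near(QNode left, QNode right) {
        return new QNode("NEAR", "", Arrays.asList(left, right));
    }

    @Override
    public String toString() {
        if (op.equals("TERM")) {
            return text;
        }
        StringBuilder sb = new StringBuilder("(");
        for (int i = 0; i < kids.size(); i++) {
            if (i > 0) {
                sb.append(' ').append(op).append(' ');
            }
            sb.append(kids.get(i));
        }
        return sb.append(')').toString();
    }
}

// Distributes NEAR over OR: (a OR b) NEAR c becomes (a NEAR c) OR (b NEAR c).
final class NearDistributor {
    static QNode rewrite(QNode n) {
        if (n.op.equals("TERM")) {
            return n;
        }
        List<QNode> kids = new ArrayList<>();
        for (QNode k : n.kids) {
            kids.add(rewrite(k));   // rewrite bottom-up
        }
        if (n.op.equals("OR")) {
            return QNode.or(kids.toArray(new QNode[0]));
        }
        // NEAR: pair every alternative on the left with every alternative on the right.
        List<QNode> left = new ArrayList<>();
        List<QNode> right = new ArrayList<>();
        collectAlternatives(kids.get(0), left);
        collectAlternatives(kids.get(1), right);
        List<QNode> pairs = new ArrayList<>();
        for (QNode l : left) {
            for (QNode r : right) {
                pairs.add(QNode.near(l, r));
            }
        }
        return pairs.size() == 1 ? pairs.get(0) : QNode.or(pairs.toArray(new QNode[0]));
    }

    // Flattens nested ORs into a list of alternatives; anything else is a single alternative.
    private static void collectAlternatives(QNode n, List<QNode> out) {
        if (n.op.equals("OR")) {
            for (QNode k : n.kids) {
                collectAlternatives(k, out);
            }
        } else {
            out.add(n);
        }
    }

    public static void main(String[] args) {
        QNode query = QNode.near(QNode.or(QNode.term("dog"), QNode.term("canin")),
                                 QNode.term("cat"));
        System.out.println(query + " --> " + NearDistributor.rewrite(query));
    }
}
```

Because the step only sees a tree, it does not matter whether the OR came from the user's query or from a synonym expansion that ran earlier.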
[jira] [Commented] (LUCENE-5014) ANTLR Lucene query parser
[ https://issues.apache.org/jira/browse/LUCENE-5014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13667456#comment-13667456 ]

David Smiley commented on LUCENE-5014:
--------------------------------------

Interesting. Just read your description but didn't look at the 2.5MB patch file ;-)

At Lucene Revolution I saw [a cool presentation|http://www.lucenerevolution.org/sites/default/files/Implementing%20a%20Custom%20Search%20Syntax%20using%20Solr%2C%20Lucene%20%26%20Parboiled.pdf] by [~berryman] that showed off using [Parboiled|https://github.com/sirthias/parboiled/wiki], which uses a new innovative approach to making a parser. I was quite impressed by how easy to use it was vs the classic incumbents (specifically ANTLR). I am curious what you think, in relation to your aims in this patch.