[jira] [Updated] (LUCENE-7481) SpanPayloadCheckQuery is missing rewrite method

2016-10-06 Thread Roman Chyla (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Roman Chyla updated LUCENE-7481:

Description: 
If used with a wildcard query, the result is a failure saying: "Rewrite query 
first"

The SpanNearQuery has the rewrite method; however the SpanPayloadCheckQuery 
just returns the query itself. 

This works:

```
spanNear([vectrfield:ebyuugz, SpanMultiTermQueryWrapper(vectrfield:e*), 
SpanMultiTermQueryWrapper(vectrfield:m*), 
SpanMultiTermQueryWrapper(vectrfield:f*)], 0, true)
```

Code to generate the query:

```
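import java.io.UnsupportedEncodingException;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.WildcardQuery;
import org.apache.lucene.search.spans.SpanMultiTermQueryWrapper;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

private Query getSpanQuery(String[] parts, int howMany, boolean truncate)
    throws UnsupportedEncodingException {
  SpanQuery[] clauses = new SpanQuery[howMany + 1];
  clauses[0] = new SpanTermQuery(new Term("vectrfield", parts[0])); // surname
  for (int i = 0; i < howMany; i++) {
    if (truncate) {
      // keep only the first letter and match the rest with a wildcard
      SpanMultiTermQueryWrapper<WildcardQuery> q = new SpanMultiTermQueryWrapper<>(
          new WildcardQuery(new Term("vectrfield", parts[i + 1].substring(0, 1) + "*")));
      clauses[i + 1] = q;
    } else {
      clauses[i + 1] = new SpanTermQuery(new Term("vectrfield", parts[i + 1]));
    }
  }
  SpanNearQuery sq = new SpanNearQuery(clauses, 0, true); // slop 0, match in order
  return sq;
}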
```

And this fails:

```
spanPayCheck(spanNear([vectrfield:ebyuugz, 
SpanMultiTermQueryWrapper(vectrfield:e*), 
SpanMultiTermQueryWrapper(vectrfield:m*), 
SpanMultiTermQueryWrapper(vectrfield:f*)], 1, true), payloadRef: 0;1;2;3;)
```

Each wildcard clause is built like this:

```
new SpanMultiTermQueryWrapper<WildcardQuery>(
    new WildcardQuery(new Term("vectrfield", parts[i+1].substring(0, 1) + "*")));
```

This is a regression; the code worked well in Solr 4.x.
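To make the missing step concrete, here is a minimal sketch of what delegating the rewrite could look like. It uses small stand-in classes (`Q`, `TermQ`, `WildcardQ`, `NearQ`, `PayloadCheckQ`, all hypothetical), not Lucene's own types, and it drops the `IndexReader` argument that Lucene's real `rewrite` takes. The `delegate` flag is there only to contrast the reported behaviour (return the query itself) with a wrapper that rewrites its inner query first.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in types for illustration only; none of these are Lucene classes.
final class RewriteSketch {

    /** Minimal query model: rewrite() returns an executable form of the query. */
    static abstract class Q {
        Q rewrite() { return this; }            // default: nothing to expand
        boolean isExecutable() { return true; }
    }

    static final class TermQ extends Q {
        final String text;
        TermQ(String text) { this.text = text; }
    }

    /** Stands in for a wildcard wrapper: unusable until it has been rewritten. */
    static final class WildcardQ extends Q {
        final String pattern;
        WildcardQ(String pattern) { this.pattern = pattern; }
        @Override Q rewrite() { return new TermQ(pattern + " (expanded)"); }
        @Override boolean isExecutable() { return false; }
    }

    /** Stands in for the near query: rewrites every clause. */
    static final class NearQ extends Q {
        final List<Q> clauses;
        NearQ(List<Q> clauses) { this.clauses = clauses; }
        @Override Q rewrite() {
            List<Q> rewritten = new ArrayList<>();
            boolean changed = false;
            for (Q c : clauses) {
                Q r = c.rewrite();
                changed |= (r != c);
                rewritten.add(r);
            }
            return changed ? new NearQ(rewritten) : this;
        }
        @Override boolean isExecutable() {
            for (Q c : clauses) {
                if (!c.isExecutable()) return false;
            }
            return true;
        }
    }

    /** Stands in for the payload-check wrapper around another query. */
    static final class PayloadCheckQ extends Q {
        final Q match;
        final boolean delegate; // false models the behaviour described in the report
        PayloadCheckQ(Q match, boolean delegate) {
            this.match = match;
            this.delegate = delegate;
        }
        @Override Q rewrite() {
            if (!delegate) return this;          // wildcard clauses stay unexpanded
            Q r = match.rewrite();               // rewrite the wrapped query first
            return r == match ? this : new PayloadCheckQ(r, true);
        }
        @Override boolean isExecutable() { return match.isExecutable(); }
    }
}
```

In this model, a `PayloadCheckQ` built with `delegate = false` around a `NearQ` that contains a `WildcardQ` is still not executable after `rewrite()`, which mirrors the "Rewrite query first" failure; with `delegate = true` it is.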



> SpanPayloadCheckQuery is missing rewrite method
> ---
>
> Key: LUCENE-7481
> URL: https://issues.apache.org/jira/browse/LUCENE-7481
> Project: Lucene - Core
> Issue Type: Bug
> Affects Versions: 6.x
> Reporter: Roman Chyla
>

[jira] [Created] (LUCENE-7481) SpanPayloadCheckQuery is missing rewrite method

2016-10-06 Thread Roman Chyla (JIRA)
Roman Chyla created LUCENE-7481:
---

 Summary: SpanPayloadCheckQuery is missing rewrite method
 Key: LUCENE-7481
 URL: https://issues.apache.org/jira/browse/LUCENE-7481
 Project: Lucene - Core
  Issue Type: Bug
Affects Versions: 6.x
Reporter: Roman Chyla




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-6468) Regression: StopFilterFactory doesn't work properly without enablePositionIncrements="false"

2016-09-22 Thread Roman Chyla (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-6468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15514785#comment-15514785
 ] 

Roman Chyla commented on SOLR-6468:
---

Ha! :-)
I've found my own comment above. Two years later I'm facing this situation 
again; I had completely forgotten about it (and, truth be told, preferred 
running the old Solr 4.x).

This is how the new Solr sees things. The title

A 350-MHz GBT Survey of 50 Faint Fermi γ ray Sources for Radio Millisecond 
Pulsars

is indexed as
```
null_1
1   :350|350mhz
2   :mhz|syn::mhz
3   :acr::gbt|gbt|syn::gbt|syn::green bank telescope
4   :survey|syn::survey
null_1
6   :50
```

The 1st and 5th positions are gaps, so the search for "350-MHz GBT Survey of 50 
Faint" fails: 'of' is a stopword, and the stop filter always increments the 
position (what is the purpose of a stop filter if it leaves gaps?).

Anyway, the solution with CharFilterFactory cannot work for me; I have to do 
this:
 
 1. search for synonyms (they can contain stopwords)
 2. remove stopwords
 3. search for other synonyms (that don't have stopwords)

I'm afraid real life is a little more complex than it seems. Still, there is 
a logic to your choices, Solr devs, and I'm afraid I can agree with you. 
People who understand the *why* will make it work again as it *should*. Others 
will happily keep using the 'simplified' version.
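The effect of those gaps can be shown with a small, self-contained model. This is only a sketch of the position arithmetic, not Solr's analysis chain: `positions` and `phraseMatches` are hypothetical helpers, tokens are assumed to be unique, and synonyms are ignored. It shows that once a removed stopword still consumes a position, a phrase typed without the stopword no longer lines up with the indexed positions.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Simplified model of stopword removal and token positions (not Solr code).
final class StopPositionSketch {

    /** Assigns 1-based positions to the tokens that survive stopword removal. */
    static Map<String, Integer> positions(List<String> tokens, Set<String> stopwords,
                                          boolean keepGaps) {
        Map<String, Integer> out = new LinkedHashMap<>();
        int pos = 0;
        for (String t : tokens) {
            boolean stop = stopwords.contains(t);
            // When gaps are kept, a removed stopword still consumes a position.
            if (!stop || keepGaps) pos++;
            if (!stop) out.put(t, pos);
        }
        return out;
    }

    /** Exact (slop 0) phrase check: each term must sit one position after the previous one. */
    static boolean phraseMatches(Map<String, Integer> indexed, List<String> phrase) {
        Integer prev = null;
        for (String t : phrase) {
            Integer p = indexed.get(t);
            if (p == null) return false;
            if (prev != null && p != prev + 1) return false;
            prev = p;
        }
        return true;
    }
}
```

For the tokens `survey, of, 50, faint` with `of` as the only stopword, keeping gaps yields positions 1, 3, 4 and the phrase `survey 50` does not match; without gaps the positions are 1, 2, 3 and it does.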







[jira] [Commented] (SOLR-6468) Regression: StopFilterFactory doesn't work properly without enablePositionIncrements="false"

2014-11-25 Thread Roman Chyla (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-6468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14225186#comment-14225186
 ] 

Roman Chyla commented on SOLR-6468:
---

I also find this change unfortunate. If this is just developers making 
decisions for users, then it causes problems for users who really know why they 
need that feature: for phrase searches that should ignore stopwords. But if 
the underlying issue is something serious, with the indexer not being able to 
work with positions, then it would be even weirder, and actually very bad 
for many users. I don't really understand the benefits of this change. Any 
chance of returning to the original behaviour?

 Regression: StopFilterFactory doesn't work properly without 
 enablePositionIncrements="false"
 

 Key: SOLR-6468
 URL: https://issues.apache.org/jira/browse/SOLR-6468
 Project: Solr
  Issue Type: Bug
Affects Versions: 4.8.1, 4.9
Reporter: Alexander S.

 Setup:
 * Schema version is 1.5
 * Field config:
 {code}
 <fieldType name="words_ngram" class="solr.TextField" omitNorms="false" 
            autoGeneratePhraseQueries="true">
   <analyzer>
     <tokenizer class="solr.PatternTokenizerFactory" pattern="[^\w]+" />
     <filter class="solr.StopFilterFactory" words="url_stopwords.txt" 
             ignoreCase="true" />
     <filter class="solr.LowerCaseFilterFactory" />
   </analyzer>
 </fieldType>
 {code}
 * Stop words:
 {code}
 http 
 https 
 ftp 
 www
 {code}
 So very simple. In the index I have:
 * twitter.com/testuser
 All these queries do match:
 * twitter.com/testuser
 * com/testuser
 * testuser
 But none of these does:
 * https://twitter.com/testuser
 * https://www.twitter.com/testuser
 * www.twitter.com/testuser
 Debug output shows:
 "parsedquery_toString": "+(url_words_ngram:\"? twitter com testuser\")"
 But we need:
 "parsedquery_toString": "+(url_words_ngram:\"twitter com testuser\")"
 Complete debug outputs:
 * a valid search: 
 http://pastie.org/pastes/9500661/text?key=rgqj5ivlgsbk1jxsudx9za
 * an invalid search: 
 http://pastie.org/pastes/9500662/text?key=b4zlh2oaxtikd8jvo5xaww
 The complete discussion and explanation of the problem is here: 
 http://lucene.472066.n3.nabble.com/Help-with-StopFilterFactory-td4153839.html
 I didn't find a clear explanation of how we can upgrade Solr; there is no 
 replacement or workaround for this, so this is not just a major change but 
 a major disrespect to all existing Solr users who are using this feature.






[jira] [Commented] (LUCENE-5014) ANTLR Lucene query parser

2013-07-03 Thread Roman Chyla (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13698981#comment-13698981
 ] 

Roman Chyla commented on LUCENE-5014:
-

Hi Erik, I'll add a Solr qparser plugin too. Thanks for reminding me. 

 ANTLR Lucene query parser
 -

 Key: LUCENE-5014
 URL: https://issues.apache.org/jira/browse/LUCENE-5014
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/queryparser, modules/queryparser
Affects Versions: 4.3
 Environment: all
Reporter: Roman Chyla
  Labels: antlr, query, queryparser
 Attachments: LUCENE-5014.txt, LUCENE-5014.txt, LUCENE-5014.txt, 
 LUCENE-5014.txt


 I would like to propose a new way of building query parsers for Lucene.  
 Currently, most Lucene parsers are hard to extend because they are either 
 written in Java (e.g. the Solr query parser, or edismax) or the parsing logic 
 is 'married' to the query building logic (i.e. the standard Lucene parser, 
 generated by JavaCC), which makes any extension really hard.
 A few years back, Lucene got the contrib/modern query parser (later renamed to 
 'flexible'), yet that parser didn't become a star (it must be very confusing 
 for many users). However, that parsing framework is very powerful! And it is 
 a real pity that there aren't more parsers already using it, because it 
 allows us to add/extend/change almost any aspect of the query parsing. 
 So, if we combine ANTLR + queryparser.flexible, we can get a very powerful 
 framework for building almost any query language one can think of. And I hope 
 this extension can become useful.
 The details:
  - every new query syntax is written in EBNF, it lives in separate files (and 
 can be tested/developed independently - using 'gunit')
  - ANTLR parser generates parsing code (and it can generate parsers in 
 several languages, the main target is Java, but it can also do Python - which 
 may be interesting for pylucene)
  - the parser generates AST (abstract syntax tree) which is consumed by a  
 'pipeline' of processors, users can easily modify this pipeline to add a 
 desired functionality
  - the new parser contains a few (very important) debugging functions; it can 
 print results of every stage of the build, generate AST's as graphical 
 charts; ant targets help to build/test/debug grammars
  - I've tried to reuse the existing queryparser.flexible components as much 
 as possible, only adding new processors when necessary
 Assumptions about the grammar:
  - every grammar must have one top parse rule called 'mainQ'
  - parsers must generate AST (Abstract Syntax Tree)
 The structure of the AST is left open. Some components make 
 assumptions about the shape of the AST (e.g. that MODIFIER is the parent of a 
 FIELD); however, users are free to choose or write different processors with 
 different assumptions about the AST shape.
 More documentation on how to use the parser can be seen here:
 http://29min.wordpress.com/category/antlrqueryparser/
 The parser was created more than a year ago and is used in production 
 (http://labs.adsabs.harvard.edu/adsabs/). Different dialects of query 
 languages (with proximity operators, functions, special logic, etc.) can be 
 seen here: 
 https://github.com/romanchyla/montysolr/tree/master/contrib/adsabs
 https://github.com/romanchyla/montysolr/tree/master/contrib/invenio
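
The idea of an AST being consumed by a modifiable 'pipeline' of processors can be sketched roughly as follows. This is not the code from the patch and it does not use the queryparser.flexible API: `Node`, `run`, `lowercaseTerms` and `withDefaultField` are hypothetical names chosen for illustration, and only the `mainQ` root label and the FIELD node type are taken from the description above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.function.UnaryOperator;

// Illustrative model of "AST + pipeline of processors"; not the patch's classes.
final class AstPipelineSketch {

    /** Minimal AST node: a label, an optional value, and child nodes. */
    static final class Node {
        final String label;
        final String value;
        final List<Node> children;

        Node(String label, String value, Node... children) {
            this.label = label;
            this.value = value;
            this.children = Arrays.asList(children);
        }

        /** Renders the tree as nested parentheses, e.g. (FIELD:text (TERM:x)). */
        String render() {
            StringBuilder sb = new StringBuilder("(").append(label);
            if (value != null) sb.append(':').append(value);
            for (Node c : children) sb.append(' ').append(c.render());
            return sb.append(')').toString();
        }
    }

    /** Runs each processor in order; each one returns a (possibly new) tree. */
    static Node run(Node root, List<UnaryOperator<Node>> pipeline) {
        Node current = root;
        for (UnaryOperator<Node> processor : pipeline) {
            current = processor.apply(current);
        }
        return current;
    }

    /** Example processor: lower-cases the value of every TERM node. */
    static Node lowercaseTerms(Node n) {
        Node[] kids = n.children.stream()
                .map(AstPipelineSketch::lowercaseTerms)
                .toArray(Node[]::new);
        String v = ("TERM".equals(n.label) && n.value != null)
                ? n.value.toLowerCase(Locale.ROOT) : n.value;
        return new Node(n.label, v, kids);
    }

    /** Example processor: gives FIELD nodes without a name a default field. */
    static Node withDefaultField(Node n, String field) {
        Node[] kids = n.children.stream()
                .map(c -> withDefaultField(c, field))
                .toArray(Node[]::new);
        String v = ("FIELD".equals(n.label) && n.value == null) ? field : n.value;
        return new Node(n.label, v, kids);
    }
}
```

Adding functionality then amounts to inserting another `UnaryOperator<Node>` into the list passed to `run`, which is the kind of extension point the description attributes to the processor pipeline.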

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (LUCENE-5014) ANTLR Lucene query parser

2013-07-03 Thread Roman Chyla (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13698985#comment-13698985
 ] 

Roman Chyla commented on LUCENE-5014:
-

Will it be OK to include the Solr parts in this ticket? Apart from the JIRA 
name, that seems like the best option to me.




[jira] [Commented] (LUCENE-5014) ANTLR Lucene query parser

2013-07-03 Thread Roman Chyla (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13699417#comment-13699417
 ] 

Roman Chyla commented on LUCENE-5014:
-

New addition: a Solr qparser plugin. 

It is unfortunately not as easy as one may think, because of various defaults: 
e.g. the user may want to specify a different defaultField, whether wildcards 
are allowed at the beginning, what the maximum range for proximity values is... 
Some of these should be set only in solrconfig.xml, and some also in query 
params. 

So here is a stab at it. It works, but may require more config options; there 
is also a new unit test. Unfortunately the Ivy mirrors decided not to work just 
now (ughhh), so I could not run the Solr unit tests. I hope it works. Lucene's 
'ant test' went fine. 

If somebody wants to try it in Solr, please make sure you have 
antlr-runtime.jar in your Solr libs; this should go inside solrconfig.xml:

{code}
<queryParser name="lucene2" class="AqpLuceneQParserPlugin">
  <lst name="defaults">
    <str name="defaultField">text</str>
  </lst>
</queryParser>
{code}
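
Once a parser is registered under a name like `lucene2`, a request can normally select it with Solr's `defType` parameter. The sketch below only builds such a request URL; the base URL, the core name `collection1` and the sample query are assumptions for illustration, and `selectUrl` is a hypothetical helper rather than part of the patch.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Builds a /select URL that asks Solr to parse q with a named query parser.
final class ParserSelectionSketch {

    /** baseUrl is assumed to point at a core, e.g. http://localhost:8983/solr/collection1 */
    static String selectUrl(String baseUrl, String parserName, String userQuery) {
        try {
            return baseUrl + "/select"
                    + "?defType=" + URLEncoder.encode(parserName, "UTF-8")
                    + "&q=" + URLEncoder.encode(userQuery, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a compliant JVM.
            throw new IllegalStateException(e);
        }
    }
}
```

For example, `selectUrl("http://localhost:8983/solr/collection1", "lucene2", "title:pulsar*")` produces a URL whose query string is `defType=lucene2&q=title%3Apulsar*`.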





[jira] [Updated] (LUCENE-5014) ANTLR Lucene query parser

2013-07-03 Thread Roman Chyla (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Roman Chyla updated LUCENE-5014:


Attachment: LUCENE-5014.txt

Added the Solr qparser plugin.




[jira] [Updated] (LUCENE-5014) ANTLR Lucene query parser

2013-06-28 Thread Roman Chyla (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Roman Chyla updated LUCENE-5014:


Attachment: LUCENE-5014.txt

The patch that *actually* contains the extended parser with NEAR operator 
support

 ANTLR Lucene query parser
 -

 Key: LUCENE-5014
 URL: https://issues.apache.org/jira/browse/LUCENE-5014
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/queryparser, modules/queryparser
Affects Versions: 4.3
 Environment: all
Reporter: Roman Chyla
  Labels: antlr, query, queryparser
 Attachments: LUCENE-5014.txt, LUCENE-5014.txt, LUCENE-5014.txt, 
 LUCENE-5014.txt






[jira] [Commented] (LUCENE-5014) ANTLR Lucene query parser

2013-06-27 Thread Roman Chyla (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13695149#comment-13695149
 ] 

Roman Chyla commented on LUCENE-5014:
-

Adding an example: the standard Lucene grammar extended with NEAR operators (as
discussed above).

This should illustrate how easy it is to extend/modify/add a new query dialect.
Handling of NEAR operators is not at all trivial, so I hope you will have some
fun realizing it can be done in two lines ;)


{code}
setGrammarName(ExtendedLuceneGrammar);
((AqpQueryTreeBuilder) qp.getQueryBuilder()).setBuilder(AqpNearQueryNode.class, 
new AqpNearQueryNodeBuilder());
{code}

Have a look at TestAqpExtendedLGSimple
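
For readers without the patch at hand, the idea behind registering a builder
per node class can be sketched as below. This is only an illustration of the
pattern; BuilderRegistrySketch, TermNode and NearNode are hypothetical names
and the string output stands in for real Lucene query objects. It is not the
code of AqpQueryTreeBuilder or AqpNearQueryNodeBuilder.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch of a class-keyed builder registry (not the patch's code).
final class BuilderRegistrySketch {

    /** A leaf node holding a single term. */
    static final class TermNode {
        final String text;
        TermNode(String text) { this.text = text; }
    }

    /** A proximity node: two terms that must occur within 'distance' positions. */
    static final class NearNode {
        final TermNode left;
        final TermNode right;
        final int distance;
        NearNode(TermNode left, TermNode right, int distance) {
            this.left = left;
            this.right = right;
            this.distance = distance;
        }
    }

    private final Map<Class<?>, Function<Object, String>> builders = new HashMap<>();

    /** Registers the builder responsible for one node class. */
    <T> void setBuilder(Class<T> nodeType, Function<T, String> builder) {
        builders.put(nodeType, node -> builder.apply(nodeType.cast(node)));
    }

    /** Looks up the builder by the node's runtime class and applies it. */
    String build(Object node) {
        Function<Object, String> builder = builders.get(node.getClass());
        if (builder == null) {
            throw new IllegalStateException(
                    "no builder registered for " + node.getClass().getSimpleName());
        }
        return builder.apply(node);
    }

    /** Builds a small example tree; strings stand in for query objects. */
    static String demo() {
        BuilderRegistrySketch registry = new BuilderRegistrySketch();
        registry.setBuilder(TermNode.class, t -> t.text);
        // Supporting a new operator means registering one more builder.
        registry.setBuilder(NearNode.class, n -> "spanNear(["
                + registry.build(n.left) + ", " + registry.build(n.right)
                + "], " + n.distance + ")");
        return registry.build(new NearNode(new TermNode("dog"), new TermNode("cat"), 5));
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The point of the design is that the grammar and the tree only name the node
type; what query gets built for it is decided in one place, at registration.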

 ANTLR Lucene query parser
 -

 Key: LUCENE-5014
 URL: https://issues.apache.org/jira/browse/LUCENE-5014
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/queryparser, modules/queryparser
Affects Versions: 4.3
 Environment: all
Reporter: Roman Chyla
  Labels: antlr, query, queryparser
 Attachments: LUCENE-5014.txt, LUCENE-5014.txt






[jira] [Updated] (LUCENE-5014) ANTLR Lucene query parser

2013-06-27 Thread Roman Chyla (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Roman Chyla updated LUCENE-5014:


Attachment: LUCENE-5014.txt

The same patch + lucene grammar extended with NEARx operator

 ANTLR Lucene query parser
 -

 Key: LUCENE-5014
 URL: https://issues.apache.org/jira/browse/LUCENE-5014
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/queryparser, modules/queryparser
Affects Versions: 4.3
 Environment: all
Reporter: Roman Chyla
  Labels: antlr, query, queryparser
 Attachments: LUCENE-5014.txt, LUCENE-5014.txt, LUCENE-5014.txt






[jira] [Commented] (LUCENE-5014) ANTLR Lucene query parser

2013-05-27 Thread Roman Chyla (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13667908#comment-13667908
 ] 

Roman Chyla commented on LUCENE-5014:
-

Hi David,
In practical terms ANTLR can do exactly the same things as PEG (i.e. lookahead,
backtracking, memoization); see
http://stackoverflow.com/questions/8816759/ll-versus-peg-parsers-what-is-the-difference

But it is also capable of doing more than PEG (e.g. better error recovery:
a PEG parser needs to parse the whole tree before it discovers an error; error
recovery is then not the same thing).

PEGs can be easier, *especially* because of the first-choice operator; in fact
at times I wished that ANTLR just chose the first available option (well, it
does, but it reports an error and I didn't want to have a grammar with errors).
So, in the CFG/ANTLR world, ambiguity is resolved using syntactic predicates
(lookahead). So far this has been theoretical; here are a few more points:

Clarity
===

I looked at the presentation and the parser does contain the operator
precedence; however, there it is spread across several screens of Java code.
I find the following much more readable:

{code}
mainQ : 
  clauseOr+ EOF
  ;
  
clauseOr
  : clauseAnd (or clauseAnd )*
  ;

clauseAnd
  : clauseNot  (and clauseNot)*
  ; 
{code}
  
It is essentially the same thing, but it is independent of Java and I can
see it in a few lines, and extend it by adding a few more lines. The patch I
wrote makes the handling of the separate grammar and generated code seamless.
So 2 of the 3 advantages of PEG over ANTLR disappear.


Syntax vs semantics (business logic)
===

The example from the presentation needs to be much more involved if it is to be
used in real life. Consider this query:

{noformat}
dog NEAR cat
{noformat}

This is going to work only in the simplest case, where each term is a single 
TermQuery. Yet if there was a synonym expansion (where would it go inside the 
PEG parser, is one question) - the parser needs to *rewrite* the query 

something like:

{noformat}
(dog|canin) NEAR cat -- (dog NEAR cat) OR (canin NEAR cat)
{noformat}

So, there you get the 'spaghetti problem' - in the example presented, the logic 
that rewrites the query must reside in the same place as the query parsing. 
That is not an improvement IMO, it is the same thing as the old Lucene parsers 
written in JavaCC which are very difficult to extend or debug

I think I'll add a new grammar with the proximity operators so that you can see
how easy it is to solve the same situation with ANTLR (but you will need to
read the patch this time ;)). Btw, the patch is big because I included the HTML
with SVG charts of the generated parse trees and one Excel file (that one helps
in writing unit tests for the grammar).

Developer vs user experience
===

I think PEG definitely looks simpler (in the presented example) and its main 
advantage is the first-choice operator. But since ANTLR can do the same and it 
has programming language independent grammar, it can do the same job. The 
difference may be in maturity of the project, tools available (ie debuggers) - 
and of course implementation (see the link above for details)

I can imagine that for PEG you can use your IDE of choice, while with ANTLR
there is this 'pesky' level of abstraction - but there are tools that make life
bearable, such as ANTLRWorks or the Eclipse ANTLR debugger (though I have not
liked that one) and grammar unit tests; I also added ways to debug/view the
grammar. Again, I recommend trying it, e.g.

{code}
ant -f aqp-build.xml gunit
# edit StandardLuceneGrammar and save as 'mytestgrammar'
ant -f aqp-build.xml try-view -Dquery="foo NEAR bar" -Dgrammar=mytestgrammar
{code}


There may be of course more things to consider, but I believe the 3 issues 
above present some interesting vantage points.

 ANTLR Lucene query parser
 -

 Key: LUCENE-5014
 URL: https://issues.apache.org/jira/browse/LUCENE-5014
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/queryparser, modules/queryparser
Affects Versions: 4.3
 Environment: all
Reporter: Roman Chyla
  Labels: antlr, query, queryparser
 Attachments: LUCENE-5014.txt, LUCENE-5014.txt



[jira] [Comment Edited] (LUCENE-5014) ANTLR Lucene query parser

2013-05-27 Thread Roman Chyla (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13667908#comment-13667908
 ] 

Roman Chyla edited comment on LUCENE-5014 at 5/27/13 7:04 PM:
--

Hi David,
In practical terms ANTLR can do exactly the same things as PEG (i.e. lookahead,
backtracking, memoization); see
http://stackoverflow.com/questions/8816759/ll-versus-peg-parsers-what-is-the-difference

But it is also capable of doing more than PEG (e.g. better error recovery:
a PEG parser needs to parse the whole tree before it discovers an error; error
recovery is then not the same thing).

PEGs can be easier, *especially* because of the first-choice operator; in fact
at times I wished that ANTLR just chose the first available option (well, it
does, but it reports an error and I didn't want to have a grammar with errors).
So, in the CFG/ANTLR world, ambiguity is resolved using syntactic predicates
(lookahead). So far this has been theoretical; here are a few more points:

Grammar vs code
===

I looked at the presentation and the parser does contain the operator
precedence; however, there it is spread across several screens of Java code.
I find the following much more readable:

{code}
mainQ : 
  clauseOr+ EOF
  ;
  
clauseOr
  : clauseAnd (or clauseAnd )*
  ;

clauseAnd
  : clauseNot  (and clauseNot)*
  ; 
{code}
  
It is essentially the same thing, but it is independent of Java and I can
see it in a few lines, and extend it by adding a few more lines. The patch I
wrote makes the handling of the separate grammar and generated code seamless.
So 2 of the 3 advantages of PEG over ANTLR disappear.
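
To make the comparison concrete, here is roughly what the three grammar rules
above turn into when written by hand. This is a simplified sketch, not the
generated ANTLR code: it assumes whitespace-separated tokens, the literal
keywords OR/AND/NOT, a single top-level clauseOr (the grammar allows
clauseOr+), and a clauseNot of the form "NOT? term", which is my guess since
that rule is not shown above. PrecedenceSketch is a hypothetical name.

```java
import java.util.Arrays;
import java.util.List;

// Hand-written recursive descent mirroring the grammar layering (sketch only).
final class PrecedenceSketch {
    private final List<String> tokens;
    private int pos = 0;

    PrecedenceSketch(String query) {
        this.tokens = Arrays.asList(query.trim().split("\\s+"));
    }

    private String peek() {
        return pos < tokens.size() ? tokens.get(pos) : null;
    }

    // clauseOr : clauseAnd (or clauseAnd)*   -- lowest precedence
    String clauseOr() {
        String left = clauseAnd();
        while ("OR".equals(peek())) {
            pos++;
            left = "(OR " + left + " " + clauseAnd() + ")";
        }
        return left;
    }

    // clauseAnd : clauseNot (and clauseNot)* -- binds tighter than OR
    String clauseAnd() {
        String left = clauseNot();
        while ("AND".equals(peek())) {
            pos++;
            left = "(AND " + left + " " + clauseNot() + ")";
        }
        return left;
    }

    // clauseNot : NOT? term                  -- assumed shape, binds tightest
    String clauseNot() {
        if ("NOT".equals(peek())) {
            pos++;
            return "(NOT " + term() + ")";
        }
        return term();
    }

    String term() {
        String t = peek();
        if (t == null || t.equals("OR") || t.equals("AND") || t.equals("NOT")) {
            throw new IllegalArgumentException("term expected at token " + pos);
        }
        pos++;
        return t;
    }

    /** mainQ : clauseOr EOF (simplified to a single clauseOr). */
    static String parse(String query) {
        PrecedenceSketch p = new PrecedenceSketch(query);
        String tree = p.clauseOr();
        if (p.peek() != null) {
            throw new IllegalArgumentException("unexpected token: " + p.peek());
        }
        return tree;
    }

    public static void main(String[] args) {
        System.out.println(parse("a OR b AND NOT c"));
    }
}
```

Precedence falls out of which rule calls which: the deeper the rule, the
tighter it binds. In the grammar that is three short rules; in Java it is
already a screenful, which is the readability argument made above.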


Syntax vs semantics (business logic)
===

The example from the presentation needs to be much more involved if it is to be
used in real life. Consider this query:

{noformat}
dog NEAR cat
{noformat}

This is going to work only in the simplest case, where each term is a single 
TermQuery. Yet if there was a synonym expansion (where would it go inside the 
PEG parser, is one question) - the parser needs to *rewrite* the query 

something like:

{noformat}
(dog|canin) NEAR cat -- (dog NEAR cat) OR (canin NEAR cat)
{noformat}

So, there you get the 'spaghetti problem' - in the example presented, the logic 
that rewrites the query must reside in the same place as the query parsing. 
That is not an improvement IMO, it is the same thing as the old Lucene parsers 
written in JavaCC which are very difficult to extend or debug
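
The rewrite itself is small once it lives in its own step, separate from the
parsing. A minimal sketch, assuming the synonym expansion has already produced
a list of alternatives for each side of the operator; NearRewriteSketch and
its string output are hypothetical and only show the distribution of NEAR over
the alternatives, not how the patch implements it:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: distribute NEAR over synonym alternatives (names are hypothetical).
final class NearRewriteSketch {

    /** Pairs every left alternative with every right alternative. */
    static String rewrite(List<String> left, List<String> right) {
        List<String> clauses = new ArrayList<>();
        for (String l : left) {
            for (String r : right) {
                clauses.add("(" + l + " NEAR " + r + ")");
            }
        }
        return String.join(" OR ", clauses);
    }

    public static void main(String[] args) {
        // (dog|canin) NEAR cat
        System.out.println(rewrite(List.of("dog", "canin"), List.of("cat")));
    }
}
```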

I think I'll add a new grammar with the proximity operators so that you can see
how easy it is to solve the same situation with ANTLR (but you will need to
read the patch this time ;)). Btw, the patch is big because I included the HTML
with SVG charts of the generated parse trees and one Excel file (that one helps
in writing unit tests for the grammar).


Developer vs user experience
===

I think PEG definitely looks simpler to developers (in the presented example) 
and its main advantage is the first-choice operator. But since ANTLR can do the 
same and it has programming language independent grammar, it can do the same 
job. The difference may be in maturity of the project, tools available (ie 
debuggers) - and of course implementation (see the link above for details)

I can imagine that for PEG you can use your IDE of choice, while with ANTLR
there is this 'pesky' level of abstraction - but there are tools that make life
bearable, such as ANTLRWorks or the Eclipse ANTLR debugger (though I have not
liked that one) and grammar unit tests; I also added ways to debug/view the
grammar. If you apply the patch, you can try:

{code}
ant -f aqp-build.xml gunit
# edit StandardLuceneGrammar and save as 'mytestgrammar'
ant -f aqp-build.xml try-view -Dquery="foo NEAR bar" -Dgrammar=mytestgrammar
{code}


There may be of course more things to consider, but I believe the 3 issues 
above present some interesting vantage points.


[jira] [Created] (LUCENE-5014) ANTLR Lucene query parser

2013-05-22 Thread Roman Chyla (JIRA)
Roman Chyla created LUCENE-5014:
---

 Summary: ANTLR Lucene query parser
 Key: LUCENE-5014
 URL: https://issues.apache.org/jira/browse/LUCENE-5014
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/queryparser, modules/queryparser
Affects Versions: 4.3
 Environment: all
Reporter: Roman Chyla


I would like to propose a new way of building query parsers for Lucene.
Currently, most Lucene parsers are hard to extend because they are either
written in Java (e.g. the SOLR query parser, or edismax) or the parsing logic is
'married' with the query building logic (e.g. the standard Lucene parser,
generated by JavaCC), which makes any extension really hard.


A few years back, Lucene got the contrib/modern query parser (later renamed to
'flexible'), yet that parser didn't become a star (it must be very confusing
for many users). However, that parsing framework is very powerful! And it is a
real pity that there aren't more parsers already using it, because it allows
us to add/extend/change almost any aspect of the query parsing.

So, if we combine ANTLR + queryparser.flexible, we get a very powerful
framework for building almost any query language one can think of. And I hope
this extension can become useful.

The details:

 - every new query syntax is written in EBNF and lives in separate files (so it
can be tested/developed independently, using 'gunit')
 - ANTLR generates the parsing code (it can generate parsers in several
languages; the main target is Java, but it can also do Python, which may be
interesting for pylucene)
 - the parser generates an AST (abstract syntax tree) which is consumed by a
'pipeline' of processors; users can easily modify this pipeline to add the
desired functionality
 - the new parser contains a few (very important) debugging functions: it can
print the results of every stage of the build and generate ASTs as graphical
charts; ant targets help to build/test/debug grammars
 - I've tried to reuse the existing queryparser.flexible components as much as
possible, only adding new processors when necessary

Assumptions about the grammar:
 - every grammar must have one top parse rule called 'mainQ'
 - parsers must generate AST (Abstract Syntax Tree)

The structure of the AST is left open. Some components make assumptions about
the shape of the AST (e.g. that MODIFIER is the parent of a FIELD); however,
users are free to choose/write different processors with different
assumptions about the AST shape.
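
To give a feel for the 'pipeline of processors' idea without reading the
patch, here is a minimal, self-contained sketch. It is not the
queryparser.flexible API: AstPipelineSketch, Node and the two example stages
are hypothetical names chosen for illustration. Each stage takes a tree and
returns a (possibly rewritten) tree, and the pipeline is just an ordered list
that users can add to, remove from or reorder.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.UnaryOperator;

// Hypothetical illustration only: none of these names come from the patch.
final class AstPipelineSketch {

    /** A minimal AST node: a label plus ordered children. */
    static final class Node {
        final String label;
        final List<Node> children = new ArrayList<>();

        Node(String label, Node... kids) {
            this.label = label;
            for (Node k : kids) {
                children.add(k);
            }
        }

        @Override
        public String toString() {
            if (children.isEmpty()) {
                return label;
            }
            StringBuilder sb = new StringBuilder("(").append(label);
            for (Node c : children) {
                sb.append(' ').append(c);
            }
            return sb.append(')').toString();
        }
    }

    /** Runs the stages in order; each stage receives the previous stage's tree. */
    static Node run(Node ast, List<UnaryOperator<Node>> pipeline) {
        Node current = ast;
        for (UnaryOperator<Node> stage : pipeline) {
            current = stage.apply(current);
        }
        return current;
    }

    /** Example stage: lowercases leaf labels (terms), keeps operator labels. */
    static Node lowercaseLeaves(Node n) {
        if (n.children.isEmpty()) {
            return new Node(n.label.toLowerCase(Locale.ROOT));
        }
        Node copy = new Node(n.label);
        for (Node c : n.children) {
            copy.children.add(lowercaseLeaves(c));
        }
        return copy;
    }

    /** Example stage: replaces an operator that has one child by that child. */
    static Node dropSingleChildOperators(Node n) {
        if (n.children.isEmpty()) {
            return n;
        }
        Node copy = new Node(n.label);
        for (Node c : n.children) {
            copy.children.add(dropSingleChildOperators(c));
        }
        return copy.children.size() == 1 ? copy.children.get(0) : copy;
    }

    /** Builds a small tree and pushes it through a two-stage pipeline. */
    static String demo() {
        Node ast = new Node("OR",
                new Node("AND", new Node("Dog"), new Node("Cat")),
                new Node("OR", new Node("Bird")));
        List<UnaryOperator<Node>> pipeline = List.of(
                AstPipelineSketch::lowercaseLeaves,
                AstPipelineSketch::dropSingleChildOperators);
        return run(ast, pipeline).toString();
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

A processor that assumes a different AST shape is simply a different stage in
the list, which is why the structure of the tree can be left open.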



More documentation on how to use the parser can be seen here:

http://29min.wordpress.com/category/antlrqueryparser/


The parser was created more than a year ago and is used in production
(http://labs.adsabs.harvard.edu/adsabs/). Different dialects of query
languages (with proximity operators, functions, special logic etc.) can be seen
here:

https://github.com/romanchyla/montysolr/tree/master/contrib/adsabs
https://github.com/romanchyla/montysolr/tree/master/contrib/invenio







[jira] [Updated] (LUCENE-5014) ANTLR Lucene query parser

2013-05-22 Thread Roman Chyla (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Roman Chyla updated LUCENE-5014:


Attachment: LUCENE-5014.txt

Patch without binary files (if possible, use the other patch)

 ANTLR Lucene query parser
 -

 Key: LUCENE-5014
 URL: https://issues.apache.org/jira/browse/LUCENE-5014
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/queryparser, modules/queryparser
Affects Versions: 4.3
 Environment: all
Reporter: Roman Chyla
  Labels: antlr, query, queryparser
 Attachments: LUCENE-5014.txt






[jira] [Updated] (LUCENE-5014) ANTLR Lucene query parser

2013-05-22 Thread Roman Chyla (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Roman Chyla updated LUCENE-5014:


Attachment: LUCENE-5014.txt

Includes binary files (ie. one jar and xls)

svn diff --force --diff-cmd /usr/bin/diff -x -au > LUCENE-5014.txt

 ANTLR Lucene query parser
 -

 Key: LUCENE-5014
 URL: https://issues.apache.org/jira/browse/LUCENE-5014
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/queryparser, modules/queryparser
Affects Versions: 4.3
 Environment: all
Reporter: Roman Chyla
  Labels: antlr, query, queryparser
 Attachments: LUCENE-5014.txt, LUCENE-5014.txt


 I would like to propose a new way of building query parsers for Lucene.  
 Currently, most Lucene parsers are hard to extend because they are either 
 written in Java (ie. the SOLR query parser, or edismax) or the parsing logic 
 is 'married' with the query building logic (i.e. the standard lucene parser, 
 generated by JavaCC) - which makes any extension really hard.
 Few years back, Lucene got the contrib/modern query parser (later renamed to 
 'flexible'), yet that parser didn't become a star (it must be very confusing 
 for many users). However, that parsing framework is very powerful! And it is 
 a real pity that there aren't more parsers already using it - because it 
 allows us to add/extend/change almost any aspect of the query parsing. 
 So, if we combine ANTLR + queryparser.flexible, we can get very powerful 
 framework for building almost any query language one can think of. And I hope 
 this extension can become useful.
 The details:
  - every new query syntax is written in EBNF, it lives in separate files (and 
 can be tested/developed independently - using 'gunit')
  - ANTLR parser generates parsing code (and it can generate parsers in 
 several languages, the main target is Java, but it can also do Python - which 
 may be interesting for pylucene)
  - the parser generates AST (abstract syntax tree) which is consumed by a  
 'pipeline' of processors, users can easily modify this pipeline to add a 
 desired functionality
  - the new parser contains a few (very important) debugging functions; it can 
 print results of every stage of the build, generate AST's as graphical 
 charts; ant targets help to build/test/debug grammars
  - I've tried to reuse the existing queryparser.flexible components as much 
 as possible, only adding new processors when necessary
 Assumptions about the grammar:
  - every grammar must have one top parse rule called 'mainQ'
  - parsers must generate AST (Abstract Syntax Tree)
 The structure of the AST is left open. There are components which make
 assumptions about the shape of the AST (e.g. that MODIFIER is the parent of a
 FIELD); however, users are free to choose/write different processors with
 different assumptions about the AST shape.
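
The idea of an AST being passed through a 'pipeline' of processors can be sketched roughly as below. This is only an illustration under stated assumptions: the `Node` class, `runPipeline`, and the two example processors are hypothetical names invented for this sketch; they are not the classes of the patch or of queryparser.flexible.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.UnaryOperator;

public class AstPipelineSketch {

    /** Hypothetical AST node: a label (e.g. MODIFIER, FIELD) plus children. */
    public static final class Node {
        public final String label;
        public final List<Node> children = new ArrayList<>();

        public Node(String label, Node... kids) {
            this.label = label;
            for (Node k : kids) {
                children.add(k);
            }
        }

        @Override
        public String toString() {
            if (children.isEmpty()) {
                return label;
            }
            StringBuilder sb = new StringBuilder("(").append(label);
            for (Node c : children) {
                sb.append(' ').append(c);
            }
            return sb.append(')').toString();
        }
    }

    /** Runs every processor in order; each one receives the previous tree. */
    public static Node runPipeline(Node root, List<UnaryOperator<Node>> processors) {
        Node current = root;
        for (UnaryOperator<Node> p : processors) {
            current = p.apply(current);
        }
        return current;
    }

    /** Example processor: rebuilds the tree with lowercased leaf labels. */
    public static Node lowercaseLeaves(Node n) {
        if (n.children.isEmpty()) {
            return new Node(n.label.toLowerCase(Locale.ROOT));
        }
        Node copy = new Node(n.label);
        for (Node c : n.children) {
            copy.children.add(lowercaseLeaves(c));
        }
        return copy;
    }

    /** Example processor: replaces a MODIFIER that has a single child with that child. */
    public static Node unwrapModifier(Node n) {
        if (n.label.equals("MODIFIER") && n.children.size() == 1) {
            return unwrapModifier(n.children.get(0));
        }
        Node copy = new Node(n.label);
        for (Node c : n.children) {
            copy.children.add(unwrapModifier(c));
        }
        return copy;
    }

    public static void main(String[] args) {
        Node tree = new Node("MODIFIER",
                new Node("FIELD", new Node("Title"), new Node("Hubble")));
        List<UnaryOperator<Node>> pipeline = new ArrayList<>();
        pipeline.add(AstPipelineSketch::lowercaseLeaves);
        pipeline.add(AstPipelineSketch::unwrapModifier);
        System.out.println(runPipeline(tree, pipeline)); // prints (FIELD title hubble)
    }
}
```

Because each processor only sees a tree and returns a tree, a step that assumes a particular shape (here: MODIFIER above FIELD) can be swapped out or removed without touching the others.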
 More documentation on how to use the parser can be seen here:
 http://29min.wordpress.com/category/antlrqueryparser/
 The parser was created more than a year ago and is used in production
 (http://labs.adsabs.harvard.edu/adsabs/). Different dialects of query
 languages (with proximity operators, functions, special logic, etc.) can be
 seen here:
 https://github.com/romanchyla/montysolr/tree/master/contrib/adsabs
 https://github.com/romanchyla/montysolr/tree/master/contrib/invenio

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (LUCENE-4679) LowercaseExpandedTermsQueryNodeProcessor changes regex queries

2013-01-10 Thread Roman Chyla (JIRA)
Roman Chyla created LUCENE-4679:
---

 Summary: LowercaseExpandedTermsQueryNodeProcessor changes regex 
queries
 Key: LUCENE-4679
 URL: https://issues.apache.org/jira/browse/LUCENE-4679
 Project: Lucene - Core
  Issue Type: Wish
Reporter: Roman Chyla
Priority: Trivial


This is really a very silly request, but could the lowercase processor 
'abstain' from changing regex queries? For example, \\W should stay uppercase, 
but it will be lowercased.
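
A small sketch of why lowercasing a regex changes its meaning, and of the requested 'abstain' behaviour. It uses java.util.regex purely for illustration (an assumption: Lucene's own regex syntax may treat these escapes differently), and `normalize` with its `isRegex` flag is a hypothetical helper, not the processor's actual code.

```java
import java.util.Locale;
import java.util.regex.Pattern;

public class RegexLowercaseSketch {

    /**
     * Lowercases ordinary term text but leaves regex text untouched.
     * The isRegex flag is a hypothetical stand-in for "this node is a regex query".
     */
    public static String normalize(String text, boolean isRegex) {
        return isRegex ? text : text.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        String regex = "\\W+";                            // \W+ : one or more non-word characters
        String lowered = regex.toLowerCase(Locale.ROOT);  // \w+ : one or more word characters

        System.out.println(Pattern.matches(regex, "--"));    // prints true
        System.out.println(Pattern.matches(lowered, "--"));  // prints false

        System.out.println(normalize(regex, true));      // regex is kept as-is
        System.out.println(normalize("Hubble", false));  // ordinary term is lowercased
    }
}
```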








[jira] [Updated] (LUCENE-4679) LowercaseExpandedTermsQueryNodeProcessor changes regex queries

2013-01-10 Thread Roman Chyla (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Roman Chyla updated LUCENE-4679:


Attachment: LUCENE-4679.patch

 LowercaseExpandedTermsQueryNodeProcessor changes regex queries
 --

 Key: LUCENE-4679
 URL: https://issues.apache.org/jira/browse/LUCENE-4679
 Project: Lucene - Core
  Issue Type: Wish
Reporter: Roman Chyla
Priority: Trivial
 Attachments: LUCENE-4679.patch


 This is really a very silly request, but could the lowercase processor 
 'abstain' from changing regex queries? For example, \\W should stay 
 uppercase, but it will be lowercased.




[jira] [Updated] (LUCENE-4679) LowercaseExpandedTermsQueryNodeProcessor changes regex queries

2013-01-10 Thread Roman Chyla (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Roman Chyla updated LUCENE-4679:


Description: 
This is really a very silly request, but could the lowercase processor 
'abstain' from changing regex queries? For example, \\W should stay
uppercase, but it will be lowercased.





  was:
This is really a very silly request, but could the lowercase processor 
'abstain' from changing regex queries? For example, \\W should stay uppercase, 
but it will be lowercased.






 LowercaseExpandedTermsQueryNodeProcessor changes regex queries
 --

 Key: LUCENE-4679
 URL: https://issues.apache.org/jira/browse/LUCENE-4679
 Project: Lucene - Core
  Issue Type: Wish
Reporter: Roman Chyla
Priority: Trivial
 Attachments: LUCENE-4679.patch


 This is really a very silly request, but could the lowercase processor 
 'abstain' from changing regex queries? For example, \\W should stay
 uppercase, but it will be lowercased.




[jira] [Updated] (LUCENE-4679) LowercaseExpandedTermsQueryNodeProcessor changes regex queries

2013-01-10 Thread Roman Chyla (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Roman Chyla updated LUCENE-4679:


Description: 
This is really a very silly request, but could the lowercase processor 
'abstain' from changing regex queries? For example, \\W should stay
uppercase, but it is lowercased.





  was:
This is really a very silly request, but could the lowercase processor 
'abstain' from changing regex queries? For example, \\W should stay
uppercase, but it will be lowercased.






 LowercaseExpandedTermsQueryNodeProcessor changes regex queries
 --

 Key: LUCENE-4679
 URL: https://issues.apache.org/jira/browse/LUCENE-4679
 Project: Lucene - Core
  Issue Type: Wish
Reporter: Roman Chyla
Priority: Trivial
 Attachments: LUCENE-4679.patch


 This is really a very silly request, but could the lowercase processor 
 'abstain' from changing regex queries? For example, \\W should stay
 uppercase, but it is lowercased.




[jira] [Updated] (LUCENE-4499) Multi-word synonym filter (synonym expansion)

2012-12-04 Thread Roman Chyla (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Roman Chyla updated LUCENE-4499:


Attachment: LUCENE-4499.patch

A new patch, as the old version was extending the wrong class (which caused web
tests to fail)

 Multi-word synonym filter (synonym expansion)
 -

 Key: LUCENE-4499
 URL: https://issues.apache.org/jira/browse/LUCENE-4499
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/other
Affects Versions: 4.1, 5.0
Reporter: Roman Chyla
Priority: Minor
  Labels: analysis, multi-word, synonyms
 Fix For: 5.0

 Attachments: LUCENE-4499.patch, LUCENE-4499.patch


 I apologize for bringing the multi-token synonym expansion up again. There is 
 an old, unresolved issue at LUCENE-1622 [1]
 While solving the problem for our needs [2], I discovered that the current
 SolrSynonym parser (and the wonderful FST) have almost everything needed to
 handle both query-time and index-time synonym expansion satisfactorily. It
 seems that people often need to use the synonym filter *slightly* differently
 at indexing and query time.
 In our case, we must do different things during indexing and querying.
 Example sentence: Mirrors of the Hubble space telescope pointed at XA5
 This is what we need (comma marks position bump):
 indexing: mirrors,hubble|hubble space 
 telescope|hst,space,telescope,pointed,xa5|astroobject#5
 querying: +mirrors +(hubble space telescope | hst) +pointed
 +(xa5|astroobject#5)
 This translates to the following needs:
   indexing time: 
 single-token synonyms = return only synonyms
 multi-token synonyms = return original tokens *AND* the synonyms
   query time:
 single-token: return only synonyms (but preserve case)
 multi-token: return only synonyms
  
 We need the original tokens for proximity queries: if we indexed 'hubble
 space telescope' as one token, we could not search for 'hubble NEAR
 telescope'.
 You may (not) be surprised, but Lucene already supports ALL of these
 requirements. The patch is an attempt to state the problem differently. I am
 not sure if it is the best option; however, it works perfectly for our needs
 and it seems it could work for the general public too, especially if the
 SynonymFilterFactory had preconfigured sets of SynonymMapBuilders and
 people could just choose the one that fits their situation. Please look at
 the unit test.
 links:
 [1] https://issues.apache.org/jira/browse/LUCENE-1622
 [2] http://labs.adsabs.harvard.edu/trac/ads-invenio/ticket/158
 [3] seems to have similar request: 
 http://lucene.472066.n3.nabble.com/Proposal-Full-support-for-multi-word-synonyms-at-query-time-td4000522.html
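
The index-time and query-time rules listed in the description can be sketched as a single decision function. This is a simplified illustration only: `emit` and its parameters are hypothetical names, plain lists stand in for token streams, and position increments and case handling are left out; it is not the code from the patch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SynonymRulesSketch {

    /**
     * Decides what to output for one synonym match.
     *
     * @param matched   the original tokens that matched a synonym entry
     * @param synonyms  the synonyms configured for that entry
     * @param indexTime true when indexing, false when querying
     */
    public static List<String> emit(List<String> matched, List<String> synonyms, boolean indexTime) {
        List<String> out = new ArrayList<>();
        boolean multiToken = matched.size() > 1;
        if (indexTime && multiToken) {
            // Index time, multi-token match: keep the originals so that
            // proximity queries such as 'hubble NEAR telescope' still work.
            out.addAll(matched);
        }
        // In every other case only the synonyms are returned.
        out.addAll(synonyms);
        return out;
    }

    public static void main(String[] args) {
        List<String> original = Arrays.asList("hubble", "space", "telescope");
        List<String> synonyms = Arrays.asList("hubble space telescope", "hst");

        // prints [hubble, space, telescope, hubble space telescope, hst]
        System.out.println(emit(original, synonyms, true));
        // prints [hubble space telescope, hst]
        System.out.println(emit(original, synonyms, false));
    }
}
```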




[jira] [Commented] (LUCENE-4499) Multi-word synonym filter (synonym expansion)

2012-11-30 Thread Roman Chyla (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13507440#comment-13507440
 ] 

Roman Chyla commented on LUCENE-4499:
-

Hi Nolan, your case seems to confirm the need for some solution. You decided
to make a separate query parser; I have put the expanding logic into a query
parser as well.

See this for the working example:
https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/test/org/apache/solr/analysis/TestAdsabsTypeFulltextParsing.java

And its config
https://github.com/romanchyla/montysolr/blob/master/contrib/examples/adsabs/solr/collection1/conf/schema.xml#L325

I see two added benefits (besides not needing a query parser plugin - in our 
case, it must be plugged into our qparser):

 1. you can use the filter at index/query time inside a standard query parser
 2. special configuration for synonym expansion (for example, we have found it
very useful to be able to search for multi-tokens in a case-insensitive manner,
but recognize single tokens only case-sensitively; or to expand with
multi-token synonyms only for multi-word originals and also output the original
words, otherwise eat them (replace them))

Nice blog post, I wish I could write as instructively :)

 Multi-word synonym filter (synonym expansion)
 -

 Key: LUCENE-4499
 URL: https://issues.apache.org/jira/browse/LUCENE-4499
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/other
Affects Versions: 4.1, 5.0
Reporter: Roman Chyla
Priority: Minor
  Labels: analysis, multi-word, synonyms
 Fix For: 5.0

 Attachments: LUCENE-4499.patch


 I apologize for bringing the multi-token synonym expansion up again. There is 
 an old, unresolved issue at LUCENE-1622 [1]
 While solving the problem for our needs [2], I discovered that the current
 SolrSynonym parser (and the wonderful FST) have almost everything needed to
 handle both query-time and index-time synonym expansion satisfactorily. It
 seems that people often need to use the synonym filter *slightly* differently
 at indexing and query time.
 In our case, we must do different things during indexing and querying.
 Example sentence: Mirrors of the Hubble space telescope pointed at XA5
 This is what we need (comma marks position bump):
 indexing: mirrors,hubble|hubble space 
 telescope|hst,space,telescope,pointed,xa5|astroobject#5
 querying: +mirrors +(hubble space telescope | hst) +pointed
 +(xa5|astroobject#5)
 This translates to the following needs:
   indexing time: 
 single-token synonyms = return only synonyms
 multi-token synonyms = return original tokens *AND* the synonyms
   query time:
 single-token: return only synonyms (but preserve case)
 multi-token: return only synonyms
  
 We need the original tokens for proximity queries: if we indexed 'hubble
 space telescope' as one token, we could not search for 'hubble NEAR
 telescope'.
 You may (not) be surprised, but Lucene already supports ALL of these
 requirements. The patch is an attempt to state the problem differently. I am
 not sure if it is the best option; however, it works perfectly for our needs
 and it seems it could work for the general public too, especially if the
 SynonymFilterFactory had preconfigured sets of SynonymMapBuilders and
 people could just choose the one that fits their situation. Please look at
 the unit test.
 links:
 [1] https://issues.apache.org/jira/browse/LUCENE-1622
 [2] http://labs.adsabs.harvard.edu/trac/ads-invenio/ticket/158
 [3] seems to have similar request: 
 http://lucene.472066.n3.nabble.com/Proposal-Full-support-for-multi-word-synonyms-at-query-time-td4000522.html
