Re: Using Lucene's payload in Solr
: Is it possible to have the copyField strip off the payload while it is
: copying since doing it in the analysis phase is too late? Or should I
: start looking into using UpdateProcessors as Chris had suggested?

nope and yep. I've had an idea in the back of my mind for a while now about
adding more options to the fieldTypes to specify how the *stored* values
should be modified when indexing ... but there's nothing there to do that
yet. You have to make the modifications in an UpdateProcessor (or in a
response writer).

: It seems like it might be simpler to have two new (generic) UpdateProcessors:
: one that can clone fieldA into fieldB, and one that can do regex mutations
: on fieldB ... neither needs to know about payloads at all, but the first
: can make a copy of "2.0|Solr In Action" and the second can strip off the
: "2.0|" from the copy.
:
: then you can write a new NumericPayloadRegexTokenizer that takes in two
: regex expressions -- one that knows how to extract the payload from a
: piece of input, and one that specifies the tokenization.
:
: those three classes seem easier to implement, easier to maintain, and more
: generally reusable than a custom XML request handler for your updates.

-Hoss
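A processor chain along the lines Hoss sketches might look like the following in solrconfig.xml. The two custom processor factory class names here are hypothetical placeholders (no such classes shipped in Solr at the time); only the chain wiring and solr.RunUpdateProcessorFactory are standard:

```xml
<!-- Hypothetical chain: clone title into titleRaw, then strip the
     leading "2.0|"-style payload prefix from the copy. The two custom
     factory class names are placeholders, not shipping Solr classes. -->
<updateRequestProcessorChain name="strip-payload">
  <processor class="com.example.CloneFieldUpdateProcessorFactory">
    <str name="source">title</str>
    <str name="dest">titleRaw</str>
  </processor>
  <processor class="com.example.RegexMutateUpdateProcessorFactory">
    <str name="field">titleRaw</str>
    <str name="pattern">^[0-9.]+\|</str>
    <str name="replacement"></str>
  </processor>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```

As Hoss notes, neither processor needs to know anything about payloads; the regex mutation just happens to remove the payload prefix from the stored copy.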
Re: Using Lucene's payload in Solr
While testing my code I discovered that my copyField with PatternTokenizer
does not do what I want. This is what I am indexing into Solr:

  <field name="title">2.0|Solr In Action</field>

My copyField is simply:

  <copyField source="title" dest="titleRaw"/>

field titleRaw is of type title_raw:

  <fieldType name="title_raw" class="solr.TextField">
    <analyzer type="index">
      <tokenizer class="solr.PatternTokenizerFactory" pattern="[^|]*\|(.*)" group="1"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.KeywordTokenizerFactory"/>
    </analyzer>
  </fieldType>

For my example input, "Solr In Action" is indexed into the titleRaw field
without the payload. But the payload is still stored, so when I retrieve the
field titleRaw I still get back "2.0|Solr In Action" where what I really
want is just "Solr In Action".

Is it possible to have the copyField strip off the payload while it is
copying, since doing it in the analysis phase is too late? Or should I start
looking into using UpdateProcessors as Chris had suggested?

Bill

On Fri, Aug 21, 2009 at 12:04 PM, Bill Au <bill.w...@gmail.com> wrote:

> I ended up not using an XML attribute for the payload since I need to
> return the payload in the query response. So I ended up going with:
>
>   <field name="title">2.0|Solr In Action</field>
>
> My payload is numeric so I can pick a non-numeric delimiter (ie '|').
> Putting the payload in front means I don't have to worry about the
> delimiter appearing in the value. The payload is required in my case so I
> can simply look for the first occurrence of the delimiter and ignore the
> possibility of the delimiter appearing in the value.
>
> I ended up writing a custom Tokenizer and a copy field with a
> PatternTokenizerFactory to filter out the delimiter and payload. That is
> straightforward in terms of implementation. On top of that I can still use
> the CSV loader, which I really like because of its speed.
>
> Bill
On Thu, Aug 20, 2009 at 10:36 PM, Chris Hostetter <hossman_luc...@fucit.org> wrote:

> : of the field are correct but the delimiter and payload are stored so they
> : appear in the response also. Here is an example: ...
> : I am thinking maybe I can do this instead when indexing:
> :
> : XML for indexing:
> :   <field name="title" payload="2.0">Solr In Action</field>
> :
> : This will simplify indexing as I don't have to repeat the payload for each
>
> but now you're into a custom request handler for the updates to deal with
> the custom XML attribute so you can't use DIH, or CSV loading.
>
> It seems like it might be simpler to have two new (generic)
> UpdateProcessors: one that can clone fieldA into fieldB, and one that can
> do regex mutations on fieldB ... neither needs to know about payloads at
> all, but the first can make a copy of "2.0|Solr In Action" and the second
> can strip off the "2.0|" from the copy.
>
> then you can write a new NumericPayloadRegexTokenizer that takes in two
> regex expressions -- one that knows how to extract the payload from a
> piece of input, and one that specifies the tokenization.
>
> those three classes seem easier to implement, easier to maintain, and more
> generally reusable than a custom XML request handler for your updates.
>
> -Hoss
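The payload-first convention Bill describes (numeric payload, then '|', then the value) makes the split trivial with plain string handling. A minimal standalone sketch, with made-up helper names:

```java
// Minimal sketch of splitting a "payload|value" string at the FIRST '|'.
// Because the payload is numeric, '|' can never occur before the delimiter,
// so occurrences of '|' inside the value are harmless.
public class PayloadPrefix {
    public static float payload(String raw) {
        return Float.parseFloat(raw.substring(0, raw.indexOf('|')));
    }

    public static String value(String raw) {
        return raw.substring(raw.indexOf('|') + 1);
    }
}
```

For example, `value("2.0|Solr In Action")` yields "Solr In Action" and `payload("2.0|Solr In Action")` yields 2.0f, even if the title text itself were to contain a '|'.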
Re: Using Lucene's payload in Solr
I ended up not using an XML attribute for the payload since I need to return
the payload in the query response. So I ended up going with:

  <field name="title">2.0|Solr In Action</field>

My payload is numeric so I can pick a non-numeric delimiter (ie '|').
Putting the payload in front means I don't have to worry about the delimiter
appearing in the value. The payload is required in my case so I can simply
look for the first occurrence of the delimiter and ignore the possibility of
the delimiter appearing in the value.

I ended up writing a custom Tokenizer and a copy field with a
PatternTokenizerFactory to filter out the delimiter and payload. That is
straightforward in terms of implementation. On top of that I can still use
the CSV loader, which I really like because of its speed.

Bill.

On Thu, Aug 20, 2009 at 10:36 PM, Chris Hostetter <hossman_luc...@fucit.org> wrote:

> : of the field are correct but the delimiter and payload are stored so they
> : appear in the response also. Here is an example: ...
> : I am thinking maybe I can do this instead when indexing:
> :
> : XML for indexing:
> :   <field name="title" payload="2.0">Solr In Action</field>
> :
> : This will simplify indexing as I don't have to repeat the payload for each
>
> but now you're into a custom request handler for the updates to deal with
> the custom XML attribute so you can't use DIH, or CSV loading.
>
> It seems like it might be simpler to have two new (generic)
> UpdateProcessors: one that can clone fieldA into fieldB, and one that can
> do regex mutations on fieldB ... neither needs to know about payloads at
> all, but the first can make a copy of "2.0|Solr In Action" and the second
> can strip off the "2.0|" from the copy.
>
> then you can write a new NumericPayloadRegexTokenizer that takes in two
> regex expressions -- one that knows how to extract the payload from a
> piece of input, and one that specifies the tokenization.
>
> those three classes seem easier to implement, easier to maintain, and more
> generally reusable than a custom XML request handler for your updates.
>
> -Hoss
Re: Using Lucene's payload in Solr
: of the field are correct but the delimiter and payload are stored so they
: appear in the response also. Here is an example: ...
: I am thinking maybe I can do this instead when indexing:
:
: XML for indexing:
:   <field name="title" payload="2.0">Solr In Action</field>
:
: This will simplify indexing as I don't have to repeat the payload for each

but now you're into a custom request handler for the updates to deal with
the custom XML attribute so you can't use DIH, or CSV loading.

It seems like it might be simpler to have two new (generic) UpdateProcessors:
one that can clone fieldA into fieldB, and one that can do regex mutations
on fieldB ... neither needs to know about payloads at all, but the first can
make a copy of "2.0|Solr In Action" and the second can strip off the "2.0|"
from the copy.

then you can write a new NumericPayloadRegexTokenizer that takes in two
regex expressions -- one that knows how to extract the payload from a piece
of input, and one that specifies the tokenization.

those three classes seem easier to implement, easier to maintain, and more
generally reusable than a custom XML request handler for your updates.

-Hoss
Re: Using Lucene's payload in Solr
Thanks for sharing your code, Ken. It is pretty much the same code that I
have written, except that my custom QueryParser extends Solr's
SolrQueryParser instead of Lucene's QueryParser. I am also using BFTQ
instead of BTQ. I have tested it and do see the payload being used in the
explain output.

Functionally I have got everything working now. I still have to decide how I
want to index the payload (using DelimitedPayloadTokenFilter or my own
custom format/code).

Bill

On Thu, Aug 13, 2009 at 11:31 AM, Ensdorf Ken <ensd...@zoominfo.com> wrote:

> > It looks like things have changed a bit since this subject was last
> > brought up here. I see that there is support in Solr/Lucene for indexing
> > payload data (DelimitedPayloadTokenFilterFactory and
> > DelimitedPayloadTokenFilter). Overriding the Similarity class is
> > straightforward. So the last piece of the puzzle is to use a
> > BoostingTermQuery when searching. Solr's LuceneQParserPlugin uses
> > SolrQueryParser under the cover. I think all I need to do is to write my
> > own query parser plugin that uses a custom query parser, with the only
> > difference being in the getFieldQuery() method where a BoostingTermQuery
> > is used instead of a TermQuery.
> >
> > The BTQ is now deprecated in favor of the BoostingFunctionTermQuery,
> > which gives some more flexibility in terms of how the spans in a single
> > document are scored.
> >
> > Am I on the right track?
>
> Yes.
>
> > Has anyone done something like this already?
>
> I wrote a QParserPlugin that seems to do the trick. This is minimally
> tested - we're not actually using it at the moment, but should get you
> going. Also, as Grant suggested, you may want to sub BFTQ for BTQ below:
>
> package com.zoominfo.solr.analysis;
>
> import org.apache.lucene.analysis.Analyzer;
> import org.apache.lucene.index.Term;
> import org.apache.lucene.queryParser.*;
> import org.apache.lucene.search.*;
> import org.apache.lucene.search.payloads.BoostingTermQuery;
> import org.apache.solr.common.params.*;
> import org.apache.solr.common.util.NamedList;
> import org.apache.solr.request.SolrQueryRequest;
> import org.apache.solr.search.*;
>
> public class BoostingTermQParserPlugin extends QParserPlugin {
>   public static String NAME = "zoom";
>
>   public void init(NamedList args) {
>   }
>
>   public QParser createParser(String qstr, SolrParams localParams,
>                               SolrParams params, SolrQueryRequest req) {
>     System.out.print("BoostingTermQParserPlugin::createParser\n");
>     return new BoostingTermQParser(qstr, localParams, params, req);
>   }
> }
>
> class BoostingTermQueryParser extends QueryParser {
>   public BoostingTermQueryParser(String f, Analyzer a) {
>     super(f, a);
>     System.out.print("BoostingTermQueryParser::BoostingTermQueryParser\n");
>   }
>
>   @Override
>   protected Query newTermQuery(Term term) {
>     System.out.print("BoostingTermQueryParser::newTermQuery\n");
>     return new BoostingTermQuery(term);
>   }
> }
>
> class BoostingTermQParser extends QParser {
>   String sortStr;
>   QueryParser lparser;
>
>   public BoostingTermQParser(String qstr, SolrParams localParams,
>                              SolrParams params, SolrQueryRequest req) {
>     super(qstr, localParams, params, req);
>     System.out.print("BoostingTermQParser::BoostingTermQParser\n");
>   }
>
>   public Query parse() throws ParseException {
>     System.out.print("BoostingTermQParser::parse\n");
>     String qstr = getString();
>
>     String defaultField = getParam(CommonParams.DF);
>     if (defaultField == null) {
>       defaultField = getReq().getSchema().getSolrQueryParser(null).getField();
>     }
>
>     lparser = new BoostingTermQueryParser(defaultField,
>         getReq().getSchema().getQueryAnalyzer());
>
>     // these could either be checked & set here, or in the SolrQueryParser constructor
>     String opParam = getParam(QueryParsing.OP);
>     if (opParam != null) {
>       lparser.setDefaultOperator("AND".equals(opParam)
>           ? QueryParser.Operator.AND : QueryParser.Operator.OR);
>     } else {
>       // try to get default operator from schema
>       lparser.setDefaultOperator(
>           getReq().getSchema().getSolrQueryParser(null).getDefaultOperator());
>     }
>
>     return lparser.parse(qstr);
>   }
>
>   public String[] getDefaultHighlightFields() {
>     return new String[]{lparser.getField()};
>   }
> }
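For completeness, a plugin like the one above gets wired in through solrconfig.xml and selected per request; this is the standard QParserPlugin registration mechanism, though jar/deployment details are omitted here:

```xml
<!-- in solrconfig.xml: register the parser plugin under the name "zoom" -->
<queryParser name="zoom"
             class="com.zoominfo.solr.analysis.BoostingTermQParserPlugin"/>
```

A request such as q=title:solr&defType=zoom would then route query parsing through BoostingTermQParser instead of the default Lucene parser.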
RE: Using Lucene's payload in Solr
> It looks like things have changed a bit since this subject was last
> brought up here. I see that there is support in Solr/Lucene for indexing
> payload data (DelimitedPayloadTokenFilterFactory and
> DelimitedPayloadTokenFilter). Overriding the Similarity class is
> straightforward. So the last piece of the puzzle is to use a
> BoostingTermQuery when searching. Solr's LuceneQParserPlugin uses
> SolrQueryParser under the cover. I think all I need to do is to write my
> own query parser plugin that uses a custom query parser, with the only
> difference being in the getFieldQuery() method where a BoostingTermQuery
> is used instead of a TermQuery.
>
> The BTQ is now deprecated in favor of the BoostingFunctionTermQuery, which
> gives some more flexibility in terms of how the spans in a single document
> are scored.
>
> Am I on the right track?

Yes.

> Has anyone done something like this already?

I wrote a QParserPlugin that seems to do the trick. This is minimally tested
- we're not actually using it at the moment, but should get you going. Also,
as Grant suggested, you may want to sub BFTQ for BTQ below:

package com.zoominfo.solr.analysis;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.*;
import org.apache.lucene.search.*;
import org.apache.lucene.search.payloads.BoostingTermQuery;
import org.apache.solr.common.params.*;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.search.*;

public class BoostingTermQParserPlugin extends QParserPlugin {
  public static String NAME = "zoom";

  public void init(NamedList args) {
  }

  public QParser createParser(String qstr, SolrParams localParams,
                              SolrParams params, SolrQueryRequest req) {
    System.out.print("BoostingTermQParserPlugin::createParser\n");
    return new BoostingTermQParser(qstr, localParams, params, req);
  }
}

class BoostingTermQueryParser extends QueryParser {
  public BoostingTermQueryParser(String f, Analyzer a) {
    super(f, a);
    System.out.print("BoostingTermQueryParser::BoostingTermQueryParser\n");
  }

  @Override
  protected Query newTermQuery(Term term) {
    System.out.print("BoostingTermQueryParser::newTermQuery\n");
    return new BoostingTermQuery(term);
  }
}

class BoostingTermQParser extends QParser {
  String sortStr;
  QueryParser lparser;

  public BoostingTermQParser(String qstr, SolrParams localParams,
                             SolrParams params, SolrQueryRequest req) {
    super(qstr, localParams, params, req);
    System.out.print("BoostingTermQParser::BoostingTermQParser\n");
  }

  public Query parse() throws ParseException {
    System.out.print("BoostingTermQParser::parse\n");
    String qstr = getString();

    String defaultField = getParam(CommonParams.DF);
    if (defaultField == null) {
      defaultField = getReq().getSchema().getSolrQueryParser(null).getField();
    }

    lparser = new BoostingTermQueryParser(defaultField,
        getReq().getSchema().getQueryAnalyzer());

    // these could either be checked & set here, or in the SolrQueryParser constructor
    String opParam = getParam(QueryParsing.OP);
    if (opParam != null) {
      lparser.setDefaultOperator("AND".equals(opParam)
          ? QueryParser.Operator.AND : QueryParser.Operator.OR);
    } else {
      // try to get default operator from schema
      lparser.setDefaultOperator(
          getReq().getSchema().getSolrQueryParser(null).getDefaultOperator());
    }

    return lparser.parse(qstr);
  }

  public String[] getDefaultHighlightFields() {
    return new String[]{lparser.getField()};
  }
}
Re: Using Lucene's payload in Solr
Thanks for the tip on BFTQ. I have been using a nightly build from before
that was committed. I have upgraded to the latest nightly build and will use
that instead of BTQ.

I got DelimitedPayloadTokenFilter to work and see that the terms and payload
of the field are correct, but the delimiter and payload are stored so they
appear in the response also. Here is an example:

XML for indexing:

  <field name="title">Solr|2.0 In|2.0 Action|2.0</field>

XML response:

  <doc>
    <str name="title">Solr|2.0 In|2.0 Action|2.0</str>
  </doc>

I want to set payload on a field that has a variable number of words. So I
guess I can use a copy field with a PatternTokenizerFactory to filter out
the delimiter and payload. I am thinking maybe I can do this instead when
indexing:

XML for indexing:

  <field name="title" payload="2.0">Solr In Action</field>

This will simplify indexing as I don't have to repeat the payload for each
word in the field. I do have to write a payload aware update handler. It
looks like I can use Lucene's NumericPayloadTokenFilter in my custom update
handler to ...

Any thoughts/comments/suggestions?

Bill

On Wed, Aug 12, 2009 at 7:13 AM, Grant Ingersoll <gsing...@apache.org> wrote:

> On Aug 11, 2009, at 5:30 PM, Bill Au wrote:
>
> > It looks like things have changed a bit since this subject was last
> > brought up here. I see that there is support in Solr/Lucene for indexing
> > payload data (DelimitedPayloadTokenFilterFactory and
> > DelimitedPayloadTokenFilter). Overriding the Similarity class is
> > straightforward. So the last piece of the puzzle is to use a
> > BoostingTermQuery when searching. Solr's LuceneQParserPlugin uses
> > SolrQueryParser under the cover. I think all I need to do is to write my
> > own query parser plugin that uses a custom query parser, with the only
> > difference being in the getFieldQuery() method where a BoostingTermQuery
> > is used instead of a TermQuery.
>
> The BTQ is now deprecated in favor of the BoostingFunctionTermQuery, which
> gives some more flexibility in terms of how the spans in a single document
> are scored.
>
> > Am I on the right track?
>
> Yes.
>
> > Has anyone done something like this already?
>
> I intend to, but haven't started.
>
> > Since Solr already has indexing support for payload, I was hoping that
> > query support is already in the works if not available already. If not,
> > I am willing to contribute but will probably need some guidance since my
> > knowledge in Solr query parser is weak.
>
> https://issues.apache.org/jira/browse/SOLR-1337
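As an aside, the whitespace-separated "term|payload" form in the indexing example above can be parsed with plain string handling. This sketch (class and method names made up) only illustrates the input convention that DelimitedPayloadTokenFilter consumes, not that filter's actual implementation:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: split "Solr|2.0 In|2.0 Action|2.0" into ordered
// (term, payload) pairs, mirroring the delimited input convention.
public class DelimitedTokens {
    public static Map<String, Float> parse(String field) {
        Map<String, Float> tokens = new LinkedHashMap<>();
        for (String tok : field.split("\\s+")) {
            int i = tok.lastIndexOf('|');
            tokens.put(tok.substring(0, i),
                       Float.parseFloat(tok.substring(i + 1)));
        }
        return tokens;
    }
}
```

This also makes visible why Bill finds the per-word repetition tedious: the same "2.0" must be attached to every token of the field value.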
Re: Using Lucene's payload in Solr
On Aug 13, 2009, at 11:58 AM, Bill Au wrote:

> Thanks for the tip on BFTQ. I have been using a nightly build from before
> that was committed. I have upgraded to the latest nightly build and will
> use that instead of BTQ.
>
> I got DelimitedPayloadTokenFilter to work and see that the terms and
> payload of the field are correct, but the delimiter and payload are stored
> so they appear in the response also. Here is an example:
>
> XML for indexing:
>
>   <field name="title">Solr|2.0 In|2.0 Action|2.0</field>
>
> XML response:
>
>   <doc>
>     <str name="title">Solr|2.0 In|2.0 Action|2.0</str>
>   </doc>

Correct.

> I want to set payload on a field that has a variable number of words. So I
> guess I can use a copy field with a PatternTokenizerFactory to filter out
> the delimiter and payload. I am thinking maybe I can do this instead when
> indexing:
>
> XML for indexing:
>
>   <field name="title" payload="2.0">Solr In Action</field>

Hmmm, interesting, what's your motivation vs. boosting the field?

> This will simplify indexing as I don't have to repeat the payload for each
> word in the field. I do have to write a payload aware update handler. It
> looks like I can use Lucene's NumericPayloadTokenFilter in my custom
> update handler to ...
>
> Any thoughts/comments/suggestions?
>
> Bill
>
> On Wed, Aug 12, 2009 at 7:13 AM, Grant Ingersoll <gsing...@apache.org> wrote:
>
> > On Aug 11, 2009, at 5:30 PM, Bill Au wrote:
> >
> > > It looks like things have changed a bit since this subject was last
> > > brought up here. I see that there is support in Solr/Lucene for
> > > indexing payload data (DelimitedPayloadTokenFilterFactory and
> > > DelimitedPayloadTokenFilter). Overriding the Similarity class is
> > > straightforward. So the last piece of the puzzle is to use a
> > > BoostingTermQuery when searching. Solr's LuceneQParserPlugin uses
> > > SolrQueryParser under the cover. I think all I need to do is to write
> > > my own query parser plugin that uses a custom query parser, with the
> > > only difference being in the getFieldQuery() method where a
> > > BoostingTermQuery is used instead of a TermQuery.
> >
> > The BTQ is now deprecated in favor of the BoostingFunctionTermQuery,
> > which gives some more flexibility in terms of how the spans in a single
> > document are scored.
> >
> > > Am I on the right track?
> >
> > Yes.
> >
> > > Has anyone done something like this already?
> >
> > I intend to, but haven't started.
> >
> > > Since Solr already has indexing support for payload, I was hoping that
> > > query support is already in the works if not available already. If
> > > not, I am willing to contribute but will probably need some guidance
> > > since my knowledge in Solr query parser is weak.
> >
> > https://issues.apache.org/jira/browse/SOLR-1337

--
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene:
http://www.lucidimagination.com/search
Re: Using Lucene's payload in Solr
I need to boost a field differently according to the content of the field.
Here is an example:

  <doc>
    <field name="name">Solr</field>
    <field name="category" payload="3.0">information retrieval</field>
    <field name="category" payload="2.0">webapp</field>
    <field name="category" payload="2.0">java</field>
    <field name="category" payload="1.0">xml</field>
  </doc>
  <doc>
    <field name="name">Tomcat</field>
    <field name="category" payload="3.0">webapp</field>
    <field name="category" payload="2.0">java</field>
  </doc>
  <doc>
    <field name="name">XMLSpy</field>
    <field name="category" payload="3.0">xml</field>
    <field name="category" payload="2.0">ide</field>
  </doc>

A search on category:webapp should return Tomcat before Solr. A search on
category:xml should return XMLSpy before Solr.

Bill

On Thu, Aug 13, 2009 at 12:13 PM, Grant Ingersoll <gsing...@apache.org> wrote:

> On Aug 13, 2009, at 11:58 AM, Bill Au wrote:
>
> > Thanks for the tip on BFTQ. I have been using a nightly build from
> > before that was committed. I have upgraded to the latest nightly build
> > and will use that instead of BTQ.
> >
> > I got DelimitedPayloadTokenFilter to work and see that the terms and
> > payload of the field are correct, but the delimiter and payload are
> > stored so they appear in the response also. Here is an example:
> >
> > XML for indexing:
> >
> >   <field name="title">Solr|2.0 In|2.0 Action|2.0</field>
> >
> > XML response:
> >
> >   <doc>
> >     <str name="title">Solr|2.0 In|2.0 Action|2.0</str>
> >   </doc>
>
> Correct.
>
> > I want to set payload on a field that has a variable number of words.
> > So I guess I can use a copy field with a PatternTokenizerFactory to
> > filter out the delimiter and payload. I am thinking maybe I can do this
> > instead when indexing:
> >
> > XML for indexing:
> >
> >   <field name="title" payload="2.0">Solr In Action</field>
>
> Hmmm, interesting, what's your motivation vs. boosting the field?
>
> > This will simplify indexing as I don't have to repeat the payload for
> > each word in the field. I do have to write a payload aware update
> > handler. It looks like I can use Lucene's NumericPayloadTokenFilter in
> > my custom update handler to ...
> >
> > Any thoughts/comments/suggestions?
> >
> > Bill
>
> --
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene:
> http://www.lucidimagination.com/search
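Bill's ranking requirement can be restated as: for a category query, each document scores by the payload attached to its matching category value. A toy standalone model of that behavior (the data structures are made up for illustration; no Lucene scoring is involved):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Toy model of payload-based ranking: each doc maps category -> payload,
// and a category query returns matching docs sorted by payload, descending.
public class CategoryRanking {
    public static List<String> search(Map<String, Map<String, Float>> docs,
                                      String category) {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, Map<String, Float>> e : docs.entrySet()) {
            if (e.getValue().containsKey(category)) {
                hits.add(e.getKey());
            }
        }
        hits.sort((a, b) -> Float.compare(docs.get(b).get(category),
                                          docs.get(a).get(category)));
        return hits;
    }
}
```

With the three example documents above, a "webapp" query ranks Tomcat (payload 3.0) ahead of Solr (2.0), and an "xml" query ranks XMLSpy (3.0) ahead of Solr (1.0); in real Solr this per-term weight is what the payload-aware Similarity and BoostingTermQuery contribute to the score.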
Re: Using Lucene's payload in Solr
On Aug 11, 2009, at 5:30 PM, Bill Au wrote:

> It looks like things have changed a bit since this subject was last
> brought up here. I see that there is support in Solr/Lucene for indexing
> payload data (DelimitedPayloadTokenFilterFactory and
> DelimitedPayloadTokenFilter). Overriding the Similarity class is
> straightforward. So the last piece of the puzzle is to use a
> BoostingTermQuery when searching. Solr's LuceneQParserPlugin uses
> SolrQueryParser under the cover. I think all I need to do is to write my
> own query parser plugin that uses a custom query parser, with the only
> difference being in the getFieldQuery() method where a BoostingTermQuery
> is used instead of a TermQuery.

The BTQ is now deprecated in favor of the BoostingFunctionTermQuery, which
gives some more flexibility in terms of how the spans in a single document
are scored.

> Am I on the right track?

Yes.

> Has anyone done something like this already?

I intend to, but haven't started.

> Since Solr already has indexing support for payload, I was hoping that
> query support is already in the works if not available already. If not, I
> am willing to contribute but will probably need some guidance since my
> knowledge in Solr query parser is weak.

https://issues.apache.org/jira/browse/SOLR-1337