Re: why query chinese character with bracket become phrase query by default?
On Sun, May 15, 2011 at 7:44 PM, Mark Miller markrmil...@gmail.com wrote:

>> Could you please revert your commit, until we've reached some consensus on this discussion first?
>
> Let's reach some consensus, but why revert? This has been the behavior - shouldn't the consensus onus be on changing it to begin with? That's how I see it.

To be clear, I'm asking that Yonik revert his commit from yesterday (rev 1103444), where he added text_nwd fieldType and dynamic fields *_nwd to the example schema.xml.

I agree we should reach consensus before changing what's already committed, that's exactly why I'm asking Yonik to revert -- we were in the middle of discussing this, and I had posted a patch on SOLR-2519, when he suddenly committed the text_nwd change, yesterday.

Does anyone disagree that Yonik's commit was inappropriate? This is not how we work at Apache.

> I'm going to need to get back up to speed on this issue before I can comment more helpfully. Better out of the box support for other languages is important - I think it makes sense to discuss this issue again myself.

+1

Solr, out of box, is just awful for non-whitespace languages (eg CJK, and others). And for every user who comes to the list asking for help (thank you cyang2010!), I imagine there are many others who simply gave up and walked away (from Solr) when they tried it on CJK content.

Lucene has made awesome strides in having natural defaults that work well across many languages, thanks to the hard work of Robert and others (StandardAnalyzer now actually follows a standard (UAX #29 -- text segmentation), autophrase off in QP, etc.), and I think we should take advantage of this in Solr, just like ElasticSearch does.

Really, the best solution (I think) would be to have language-specific fieldTypes (text_en, text_zh, etc.), but I suspect there's a good amount of work to reach that so in the meantime I think we should fix the defaults for the text fieldType to work well across many languages.
Mike http://blog.mikemccandless.com
Re: why query chinese character with bracket become phrase query by default?
On May 16, 2011, at 5:30 AM, Michael McCandless wrote:

> Does anyone disagree that Yonik's commit was inappropriate? This is not how we work at Apache.

Ah - dunno yet - I obviously missed part of the conversation here. I thought you were talking about reversing 'autophrase off' as the default, not these 'quick' new field types. Excuse me for a moment while I read...

Yeah - seems a little hasty. Not a fan of 'text_nwd' as a field name either. Didn't seem malicious to me, but it does seem we should probably work together in JIRA/discussion before just shotgunning changes...

Don't know that I care if it's reverted (if we fall back another 10 steps into that BS I quit everything and I'm moving to South America), but we should push on here either way.

- Mark Miller
lucidimagination.com
Lucene/Solr User Conference May 25-26, San Francisco www.lucenerevolution.org
Re: why query chinese character with bracket become phrase query by default?
On Sun, May 15, 2011 at 1:48 PM, Michael McCandless luc...@mikemccandless.com wrote:

> Could you please revert your commit, until we've reached some consensus on this discussion first?

Huh? I thought everyone was in agreement that we needed more field types for different languages? I added my best guess about what a generic type for non-whitespace-delimited might look like. Since it's a new field type, it doesn't affect anything. Hopefully it only improves the situation for someone trying to use one of these languages.

The only negative would seem to be if it's worse than nothing (i.e. a very bad example because it actually doesn't work for non-whitespace-delimited languages).

The issue about changing defaults on TextField and changing what text does in the example schema by default is not dependent on this. They are only related by the fact that if another field is added/changed then _nwd may become redundant and can be removed. For now, it only seems like an improvement?

Anyway... the whole language of revert seems unnecessarily confrontational. Feel free to improve what's there (or delete *_nwd if people really feel it adds no/negative value)

-Yonik
Re: why query chinese character with bracket become phrase query by default?
On Mon, May 16, 2011 at 5:30 AM, Michael McCandless luc...@mikemccandless.com wrote:

> To be clear, I'm asking that Yonik revert his commit from yesterday (rev 1103444), where he added text_nwd fieldType and dynamic fields *_nwd to the example schema.xml.

So... your position is that until the text fieldType is changed to support non-whitespace-delimited languages better, that no other fieldType should be changed/added to better support non-whitespace-delimited languages?

Man, that seems political, not technical.

Whatever... I'll revert.

-Yonik
Re: why query chinese character with bracket become phrase query by default?
On Mon, May 16, 2011 at 3:51 PM, Yonik Seeley yo...@lucidimagination.com wrote:

> On Mon, May 16, 2011 at 5:30 AM, Michael McCandless luc...@mikemccandless.com wrote:
>> To be clear, I'm asking that Yonik revert his commit from yesterday (rev 1103444), where he added text_nwd fieldType and dynamic fields *_nwd to the example schema.xml.
>
> So... your position is that until the text fieldType is changed to support non-whitespace-delimited languages better, that no other fieldType should be changed/added to better support non-whitespace-delimited languages?
>
> Man, that seems political, not technical.

To me it seems like neither. It's rather the process of improving, aligned with outstanding issues. It shouldn't feel wrong.

Simon

> Whatever... I'll revert.
>
> -Yonik
Re: why query chinese character with bracket become phrase query by default?
On Mon, May 16, 2011 at 9:51 AM, Yonik Seeley yo...@lucidimagination.com wrote:

>> To be clear, I'm asking that Yonik revert his commit from yesterday (rev 1103444), where he added text_nwd fieldType and dynamic fields *_nwd to the example schema.xml.
>
> So... your position is that until the text fieldType is changed to support non-whitespace-delimited languages better, that no other fieldType should be changed/added to better support non-whitespace-delimited languages?

No, that's not my position at all.

My position is: please don't suddenly commit changes, with your way, while we're still discussing how to solve the issue. That's not the Apache way.

This applies in general, not just this case (fixing Solr's out-of-the-box behavior with non-whitespace languages).

So, it could very well be, after we iterate on SOLR-2519, that we all agree your baby step is great, in which case let's go forward with that. But we should all come to some consensus about that before you suddenly commit.

> Man, that seems political, not technical.

I'm sorry you feel that way, but it's important to me that we all follow the Apache way here. I feel this will only make our community stronger.

It's also important that any time another committer is uncomfortable with what just got committed, and asks for a revert, that it *not* be a big deal. It's not political, it was just a mistake and the revert is quick and painless.

We are commit-then-review here, and if someone is uncomfortable, they should say so and whoever committed should simply revert it and re-iterate. This should be a simple free tool for all of us to use.

> Whatever... I'll revert.

Thank you.

Mike
Re: why query chinese character with bracket become phrase query by default?
On Mon, May 16, 2011 at 10:06 AM, Michael McCandless luc...@mikemccandless.com wrote:

> On Mon, May 16, 2011 at 9:51 AM, Yonik Seeley yo...@lucidimagination.com wrote:
>>> To be clear, I'm asking that Yonik revert his commit from yesterday (rev 1103444), where he added text_nwd fieldType and dynamic fields *_nwd to the example schema.xml.
>>
>> So... your position is that until the text fieldType is changed to support non-whitespace-delimited languages better, that no other fieldType should be changed/added to better support non-whitespace-delimited languages?
>
> No, that's not my position at all.
>
> My position is: please don't suddenly commit changes, with your way, while we're still discussing how to solve the issue. That's not the Apache way.

Dude... everyone has always agreed we need more fieldtypes to support different languages (as you did earlier in this thread too). There's been a history of just adding stuff like that (half of the commits to the example schema have no associated JIRA issue).

What happens to the default text field will have no bearing on that. We will still need more field types to support more languages.

Would you be against me adding a text_cjk fieldtype too?

My position: it's silly for a lack of consensus on the text field to block progress on any other fieldtype.

-Yonik
Re: why query chinese character with bracket become phrase query by default?
On Mon, May 16, 2011 at 10:22 AM, Yonik Seeley yo...@lucidimagination.com wrote:

>> My position is: please don't suddenly commit changes, with your way, while we're still discussing how to solve the issue. That's not the Apache way.
>
> Dude... everyone has always agreed we need more fieldtypes to support different languages (as you did earlier in this thread too).

+1, and I still agree that'd be best. In that ideal future we would have no more text fieldType, only text_zh, text_en, etc.

> There's been a history of just adding stuff like that (half of the commits to the example schema have no associated JIRA issue).

I wasn't objecting to the lack of a referenced JIRA issue; I was objecting to you suddenly committing 'your way' while we were still discussing what to do.

> What happens to the default text field will have no bearing on that.

That's not really true? I think any changes we make to any default text* fieldTypes are strongly related.

For example, if we fix the text fieldType to have good all-around defaults for all languages (ie, the patch on SOLR-2519) then we don't need separate text_nwd/*_nwd field types. Instead, maybe we could add text_autophrase fieldTypes? Or maybe text_en_autophrase?

> We will still need more field types to support more languages.

Right.

> Would you be against me adding a text_cjk fieldtype too?

text_cjk would be *awesome*, but text_zh, text_ja, text_ko would be even better!

If we fix text fieldType to be generic for all languages (use StandardAnalyzer, turn off autophrase), but then go and add in specific languages over time (say text_en, text_cjk, etc.), I think that's a great way to iterate towards the ideal future where we have text_XX coverage for many languages.

> My position: it's silly for a lack of consensus on the text field to block progress on any other fieldtype.

I disagree; I think changes to text fieldType are very much tied up to what other text_* fieldTypes we want to introduce.

This is a *really* important configuration file in Solr and we should present good defaults with it. People who first use Solr start with the schema.xml as their starting point.

People who first start with ElasticSearch today get StandardAnalyzer and no autophrase as the default, which is the best overall default Lucene has to offer right now. I think Solr should do the same.

So to sum up, I think we should:

1) Fix text fieldType to stop destroying non-whitespace languages, and use the best general defaults we have to offer today (switch from WhitespaceTokenizer -> StandardTokenizer, and turn off autophrase); this is the patch on SOLR-2519.

2) Add in text_XX specific language field types for as many as we can now, iterating over time to add more as we can / people get the itch. We now have a fabulous analysis module (thank you Robert!), so we should take advantage of that and at least make text_XX for all the matching analyzers in there.

Let's continue this on the issue...

Mike
http://blog.mikemccandless.com
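
[Editor's sketch] As a rough illustration of the two steps summed up above, a schema.xml fragment along these lines might apply. This is not the actual patch on SOLR-2519: the filter chain shown for text is an assumption (the message only names the tokenizer switch and turning off autophrase), and the analyzer chosen for text_cjk is likewise assumed.

```xml
<!-- Step 1 (sketch): generic "text" type using StandardTokenizer, with autophrase off.
     The lowercase filter is an assumed, minimal chain, not the real patch. -->
<fieldType name="text" class="solr.TextField" positionIncrementGap="100"
           autoGeneratePhraseQueries="false">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Step 2 (sketch): one language-specific type; text_cjk is the name floated in the
     thread, the CJK tokenizer is an assumed choice of analyzer for it. -->
<fieldType name="text_cjk" class="solr.TextField" positionIncrementGap="100"
           autoGeneratePhraseQueries="false">
  <analyzer>
    <tokenizer class="solr.CJKTokenizerFactory"/>
  </analyzer>
</fieldType>
```

Whether the generic type should also keep stop words or WordDelimiterFilter is exactly the kind of detail the thread leaves to the issue discussion.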
Re: why query chinese character with bracket become phrase query by default?
: Does anyone disagree that Yonik's commit was inappropriate? This is
: not how we work at Apache.

FWIW: I don't see how Yonik's commit was inappropriate at all.

He added some new example configuration to trunk that was unused, and in no way un-did or blocked any other attempts at improving the configs. It had no impact on any existing usage, and only served as an example (which could be iterated forward).

I seriously don't see the problem here.

-Hoss
Re: why query chinese character with bracket become phrase query by default?
On Fri, May 6, 2011 at 8:49 AM, Michael McCandless luc...@mikemccandless.com wrote:

> Shouldn't we have field types in the eg schema for the different languages? Ie, text_zh, text_th, text_en, text_ja, text_nl, etc.

In fact, until we break out dedicated language field types, shouldn't we default autophrase to off in Solr?

I think this is what ElasticSearch does (just inherits Lucene's default for this) -- Shay, or any ElasticSearch users out there... can you confirm?

Leaving autophrase on is catastrophic for non-whitespace languages (CJK and others), and at best iffy for whitespace languages (ie, unexpected that the QueryParser would make a PhraseQuery when user hadn't asked for one, not clear it really helps relevance for whitespace languages, definitely hurts performance), so leaving it is doing far more damage than good, as far as I can tell.

Any objections to turning off autophrase by default in Solr, until we have per-language field types?

Mike
http://blog.mikemccandless.com
Re: why query chinese character with bracket become phrase query by default?
I opened https://issues.apache.org/jira/browse/SOLR-2519 for this.

Mike
http://blog.mikemccandless.com

On Sun, May 15, 2011 at 8:02 AM, Michael McCandless luc...@mikemccandless.com wrote:

> On Fri, May 6, 2011 at 8:49 AM, Michael McCandless luc...@mikemccandless.com wrote:
>> Shouldn't we have field types in the eg schema for the different languages? Ie, text_zh, text_th, text_en, text_ja, text_nl, etc.
>
> In fact, until we break out dedicated language field types, shouldn't we default autophrase to off in Solr?
>
> I think this is what ElasticSearch does (just inherits Lucene's default for this) -- Shay, or any ElasticSearch users out there... can you confirm?
>
> Leaving autophrase on is catastrophic for non-whitespace languages (CJK and others), and at best iffy for whitespace languages (ie, unexpected that the QueryParser would make a PhraseQuery when user hadn't asked for one, not clear it really helps relevance for whitespace languages, definitely hurts performance), so leaving it is doing far more damage than good, as far as I can tell.
>
> Any objections to turning off autophrase by default in Solr, until we have per-language field types?
>
> Mike
> http://blog.mikemccandless.com
Re: why query chinese character with bracket become phrase query by default?
On Sun, May 15, 2011 at 8:02 AM, Michael McCandless luc...@mikemccandless.com wrote:

> On Fri, May 6, 2011 at 8:49 AM, Michael McCandless luc...@mikemccandless.com wrote:
>> Shouldn't we have field types in the eg schema for the different languages? Ie, text_zh, text_th, text_en, text_ja, text_nl, etc.
>
> In fact, until we break out dedicated language field types, shouldn't we default autophrase to off in Solr?

I've taken a crack at adding a generic text field for non-whitespace-delimited languages to the example schema:

    <!-- A general unstemmed text field that is better for non whitespace delimited languages (nwd)
         due to autoGeneratePhraseQueries=false -->
    <fieldType name="text_nwd" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="false">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

    <dynamicField name="*_nwd" type="text_nwd" indexed="true" stored="true"/>

You can try it out on trunk with a query like:

    http://localhost:8983/solr/select?q=name_nwd:F-11&debugQuery=true

And verify it generates an OR:

    <str name="querystring">name_nwd:F-11</str>
    <str name="parsedquery">name_nwd:f name_nwd:11</str>

Can someone verify that the WDF params are OK? (I.e., I didn't catenate since that wouldn't make sense if the word parts were actually whole words in a non-whitespace-delimited language.) Does that make sense?

As far as Solr defaults... perhaps way way back text should have been named text_en. But any changes now should be comprehensive (we need to consider impacts to the example data, the example schema, the solr tutorial which relies on some of the current behavior, and a ton of documentation on the wiki related to both analysis components (multi-word synonyms, WDF, etc) and other quickstart guides).

Anyway, changes to the example schema (or the behavior of the example schema) can have a large impact. I personally think that adding a new field is much easier and less disruptive, and given the potential impact we should hear what others have to say about it too (I'm out the rest of today, and I know a lot of other people aren't around this weekend either).

-Yonik
Re: why query chinese character with bracket become phrase query by default?
Yonik,

Could you please revert your commit, until we've reached some consensus on this discussion first? Maybe, post alternative patches on the issue (SOLR-2519), and we can iterate there?

Adding a new example field type (text_nwd) is one way to go, and I agree is least risk/effort, a quick fix, but I don't think we should use a quick fix here.

I think it's important for Solr to have good out-of-the-box defaults for all languages, like ElasticSearch, even if that means we have to do some extra work now (ie, fixing up the wiki/tutorials) to make that change.

More below:

On Sun, May 15, 2011 at 12:20 PM, Yonik Seeley yo...@lucidimagination.com wrote:

> As far as Solr defaults... perhaps way way back text should have been named text_en. But any changes now should be comprehensive (we need to consider impacts to the example data, the example schema, the solr tutorial which relies on some of the current behavior, and a ton of documentation on the wiki related to both analysis components (multi-word synonyms, WDF, etc) and other quickstart guides). Anyway, changes to the example schema (or the behavior of the example schema) can have a large impact.

I agree: we need to fix the wiki pages/examples that rely on auto-phrase.

But, really, how much work is this? Can you point to an example or two in the wiki/tutorial that advertise/rely on auto phrase? This would help me get a sense of how much additional work I'm signing up for ;) I just went through the tutorial and didn't see one...

(Also, we should add some CJK docs and queries to the tutorial... a simple pair is the test case in my patch on SOLR-2519.)

We shouldn't avoid/fear good changes to our defaults just because fixing it will be more work, especially if someone (me!) is signing up to do that work.

> I personally think that adding a new field is much easier and less disruptive, and given the potential impact

I agree the quick fix is somewhat easier than doing it right, but I think in this case we should do it right.

Solr really should just work well out-of-the-box on all (including non-whitespace) languages.

> we should hear what others have to say about it too

+1

Mike
http://blog.mikemccandless.com
Re: why query chinese character with bracket become phrase query by default?
On May 15, 2011, at 1:48 PM, Michael McCandless wrote:

> Could you please revert your commit, until we've reached some consensus on this discussion first?

Let's reach some consensus, but why revert? This has been the behavior - shouldn't the consensus onus be on changing it to begin with? That's how I see it.

I'm going to need to get back up to speed on this issue before I can comment more helpfully. Better out of the box support for other languages is important - I think it makes sense to discuss this issue again myself.

- Mark Miller
lucidimagination.com
Lucene/Solr User Conference May 25-26, San Francisco www.lucenerevolution.org
Re: why query chinese character with bracket become phrase query by default?
On Thu, May 5, 2011 at 10:00 AM, Yonik Seeley yo...@lucidimagination.com wrote:

> 2011/5/5 Michael McCandless luc...@mikemccandless.com:
>> The very first thing every non-whitespace language Solr app should do is turn off autoGeneratePhraseQueries!
>
> Luckily, this is configurable per FieldType... so if it doesn't exist yet, we should come up with a good CJK fieldtype to add to the example schema.

+1

Shouldn't we have field types in the eg schema for the different languages? Ie, text_zh, text_th, text_en, text_ja, text_nl, etc.

Mike
http://blog.mikemccandless.com
Re: why query chinese character with bracket become phrase query by default?
Unfortunately, the current out-of-the-box defaults (example config) for Solr are a disaster for non-whitespace languages (CJK, Thai, etc.), ie, exactly what you've hit.

This is because Lucene's QueryParser can unexpectedly, dangerously, create PhraseQuery even when the user did not ask for it (auto phrase). Not only does this mean no results for non-whitespace languages, but it also means worse search performance (PhraseQuery is usually more costly than TermQuerys).

Lucene leaves this auto phrase behavior off by default, but Solr defaults it to on.

Robert's email gives a good description of how you can turn it off.

The very first thing every non-whitespace language Solr app should do is turn off autoGeneratePhraseQueries!

Mike
http://blog.mikemccandless.com

On Wed, May 4, 2011 at 8:21 PM, cyang2010 ysxsu...@hotmail.com wrote:

> Hi,
>
> In the Solr admin full query interface page, the following query with English becomes a term query according to the debug output:
>
> title_en_US: (blood red)
>
> <lst name="debug">
> <str name="rawquerystring">title_en_US: (blood red)</str>
> <str name="querystring">title_en_US: (blood red)</str>
> <str name="parsedquery">title_en_US:blood title_en_US:red</str>
> <str name="parsedquery_toString">title_en_US:blood title_en_US:red</str>
>
> However, using the same syntax with two Chinese terms, the query results in a phrase query:
>
> title_zh_CN: (我活)
>
> <lst name="debug">
> <str name="rawquerystring">title_zh_CN: (我活)</str>
> <str name="querystring">title_zh_CN: (我活)</str>
> <str name="parsedquery">PhraseQuery(title_zh_CN:"我 活")</str>
> <str name="parsedquery_toString">title_zh_CN:"我 活"</str>
>
> I do have different tokenizers/filters for those two fields. title_en_US uses the common English-specific tokenizers, while title_zh_CN uses solr.ChineseTokenizerFactory. I don't think those tokenizers determine whether things within brackets become term queries or phrase queries.
>
> I really need to blindly pass user-input text to a Solr field without doing any parsing, and hope it does a term query for each term contained in the search text. How do I achieve that?
>
> Thanks,
>
> cy
>
> --
> View this message in context: http://lucene.472066.n3.nabble.com/why-query-chinese-character-with-bracket-become-phrase-query-by-default-tp2901542p2901542.html
> Sent from the Solr - User mailing list archive at Nabble.com.
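
[Editor's sketch] The advice above (turn off autoGeneratePhraseQueries on the field type) might look roughly like the following for the Chinese field in the question. The field type name text_zh_terms and the field definition are hypothetical, chosen only to match the title_zh_CN field name in the question, and the attribute may not be available in older releases, in which case the PositionFilter approach mentioned elsewhere in this thread is the alternative.

```xml
<!-- Hypothetical type name; autoGeneratePhraseQueries is the per-fieldType
     switch discussed in this thread. -->
<fieldType name="text_zh_terms" class="solr.TextField" positionIncrementGap="100"
           autoGeneratePhraseQueries="false">
  <analyzer>
    <!-- Same tokenizer the question says it already uses for title_zh_CN -->
    <tokenizer class="solr.ChineseTokenizerFactory"/>
  </analyzer>
</fieldType>

<!-- Assumed field definition; adjust indexed/stored to the real schema. -->
<field name="title_zh_CN" type="text_zh_terms" indexed="true" stored="true"/>
```

With a configuration of this kind, running the same query with debugQuery enabled would be expected to show separate term clauses in parsedquery rather than a PhraseQuery.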
Re: why query chinese character with bracket become phrase query by default?
2011/5/5 Michael McCandless luc...@mikemccandless.com:

> The very first thing every non-whitespace language Solr app should do is turn off autoGeneratePhraseQueries!

Luckily, this is configurable per FieldType... so if it doesn't exist yet, we should come up with a good CJK fieldtype to add to the example schema.

-Yonik
http://www.lucenerevolution.org -- Lucene/Solr User Conference, May 25-26, San Francisco
Re: why query chinese character with bracket become phrase query by default?
Nice, it works like a charm. I am using Solr 1.4.1. Here is my configuration for the Chinese field:

    <fieldType name="text_ch" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.ChineseTokenizerFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.ChineseTokenizerFactory"/>
        <filter class="solr.PositionFilterFactory"/>
      </analyzer>
    </fieldType>

Now I get the expected hassle-free parsing on the Solr side:

    <lst name="debug">
    <str name="rawquerystring">title_zh_CN:(我活)</str>
    <str name="querystring">title_zh_CN:(我活)</str>
    <str name="parsedquery">title_zh_CN:我 title_zh_CN:活</str>
    <str name="parsedquery_toString">title_zh_CN:我 title_zh_CN:活</str>

--
View this message in context: http://lucene.472066.n3.nabble.com/why-query-chinese-character-with-bracket-become-phrase-query-by-default-tp2901542p2905784.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: why query chinese character with bracket become phrase query by default?
Please see Robert's two solutions (autoGeneratePhraseQueries or PositionFilter): http://search-lucene.com/m/imED32mqqyp1/

--- On Thu, 5/5/11, cyang2010 ysxsu...@hotmail.com wrote:

From: cyang2010 ysxsu...@hotmail.com
Subject: why query chinese character with bracket become phrase query by default?
To: solr-user@lucene.apache.org
Date: Thursday, May 5, 2011, 3:21 AM

> Hi,
>
> In the Solr admin full query interface page, the following query with English becomes a term query according to the debug output:
>
> title_en_US: (blood red)
>
> <lst name="debug">
> <str name="rawquerystring">title_en_US: (blood red)</str>
> <str name="querystring">title_en_US: (blood red)</str>
> <str name="parsedquery">title_en_US:blood title_en_US:red</str>
> <str name="parsedquery_toString">title_en_US:blood title_en_US:red</str>
>
> However, using the same syntax with two Chinese terms, the query results in a phrase query:
>
> title_zh_CN: (我活)
>
> <lst name="debug">
> <str name="rawquerystring">title_zh_CN: (我活)</str>
> <str name="querystring">title_zh_CN: (我活)</str>
> <str name="parsedquery">PhraseQuery(title_zh_CN:"我 活")</str>
> <str name="parsedquery_toString">title_zh_CN:"我 活"</str>
>
> I do have different tokenizers/filters for those two fields. title_en_US uses the common English-specific tokenizers, while title_zh_CN uses solr.ChineseTokenizerFactory. I don't think those tokenizers determine whether things within brackets become term queries or phrase queries.
>
> I really need to blindly pass user-input text to a Solr field without doing any parsing, and hope it does a term query for each term contained in the search text. How do I achieve that?
>
> Thanks,
>
> cy
>
> --
> View this message in context: http://lucene.472066.n3.nabble.com/why-query-chinese-character-with-bracket-become-phrase-query-by-default-tp2901542p2901542.html
> Sent from the Solr - User mailing list archive at Nabble.com.