On Sun, May 15, 2011 at 7:44 PM, Mark Miller <markrmil...@gmail.com> wrote:
>> Could you please revert your commit, until we've reached some
>> consensus on this discussion first?
>
> Let's reach some consensus, but why revert? This has been the behavior -
> shouldn't the consensus onus be on changing it to begin with? That's how I
> see it.

To be clear, I'm asking that Yonik revert his commit from yesterday
(rev 1103444), where he added "text_nwd" fieldType and dynamic fields
*_nwd to the example schema.xml.

I agree we should reach consensus before changing what's already
committed, that's exactly why I'm asking Yonik to revert -- we were in
the middle of discussing this, and I had posted a patch on SOLR-2519,
when he suddenly committed the text_nwd change, yesterday.

Does anyone disagree that Yonik's commit was inappropriate? This is
not how we work at Apache.

> I'm going to need to get back up to speed on this issue before I can comment
> more helpfully. Better out of the box support for other languages is
> important - I think it makes sense to discuss this issue again myself.

+1

Solr, out of box, is just awful for non-whitespace languages (eg CJK,
and others). And for every user who comes to the list asking for help
(thank you cyang2010!), I imagine there are many others who simply
gave up and walked away (from Solr) when they tried it on CJK content.

Lucene has made awesome strides in having natural defaults that work
well across many languages, thanks to the hard work of Robert and
others (StandardAnalyzer now actually follows a standard (UAX #29 --
text segmentation), autophrase off in QP, etc.), and I think we should
take advantage of this in Solr, just like ElasticSearch does.

Really, the best solution (I think) would be to have language-specific
fieldTypes (text_en, text_zh, etc.), but I suspect there's a good
amount of work to reach that so in the meantime I think we should fix
the defaults for the "text" fieldType to work well across many
languages.

Mike

http://blog.mikemccandless.com