On Mon, May 16, 2011 at 10:22 AM, Yonik Seeley
<yo...@lucidimagination.com> wrote:

>> My position is: please don't suddenly commit changes, with "your way",
>> while we're still discussing how to solve the issue.  That's not the
>> Apache way.
>
> Dude... everyone has always agreed we need more fieldtypes to support
> different languages (as you did earlier in this thread too).

+1, and I still agree that'd be best.  In that ideal future we would
have no more "text" fieldType, only text_zh, text_en, etc.

> There's been a
> history of just adding stuff like that (half of the commits to the example
> schema have no associated JIRA issue).

I wasn't objecting to the lack of a referenced JIRA issue; I was
objecting to you suddenly committing "your way" while we were still
discussing what to do.

> What happens to the default "text" field will have no bearing on that.

That's not really true?  I think any changes we make to any default
"text*" fieldTypes are strongly related.

For example, if we fix the "text" fieldType to have good all-around
defaults for all languages (i.e., the patch on SOLR-2519) then we don't
need separate text_nwd/*_nwd field types.  Instead, maybe we could add
text_autophrase fieldTypes?  Or maybe text_en_autophrase?

> We will still need more field types to support more languages.

Right.

> Would you be against me adding a text_cjk fieldtype too?

text_cjk would be *awesome*, but text_zh, text_ja, text_ko would be
even better!

If we fix the "text" fieldType to be generic for all languages (use
StandardAnalyzer, turn off autophrase), and then go and add specific
languages over time (say text_en, text_cjk, etc.), I think that's a
great way to iterate towards the ideal future where we have text_XX
coverage for many languages.

> My position: it's silly for a lack of consensus on the "text" field to
> block progress on any other fieldtype.

I disagree; I think changes to the "text" fieldType are very much tied
up with what other "text_*" fieldTypes we want to introduce.

This is a *really* important configuration file in Solr and we should
present good defaults with it.  People who first use Solr start with
the schema.xml as their starting point.

People who first start with ElasticSearch today get StandardAnalyzer
and no autophrase as the default, which is the best overall default
Lucene has to offer right now.  I think Solr should do the same.

So to sum up, I think we should:

  1) Fix "text" fieldType to stop destroying non-whitespace languages,
     and use the best "general" defaults we have to offer today
     (switch from WhitespaceTokenizer -> StandardTokenizer, and turn
     off autophrase); this is the patch on SOLR-2519.

  2) Add in text_XX specific language field types for as many as we
     can now, iterating over time to add more as we can / people get
     the itch.  We now have a fabulous analysis module (thank you
     Robert!), so we should take advantage of that and at least make
     text_XX for all the matching analyzers in there.
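
As a rough sketch of what 1) and 2) might look like in the example
schema.xml: this is not the actual patch on SOLR-2519. The filter chain
and the text_cjk definition below are assumptions for illustration, and
factory availability depends on the Solr version in use.

```xml
<!-- Sketch only: NOT the actual SOLR-2519 patch. -->
<!-- 1) General-purpose "text": StandardTokenizer, autophrase off. -->
<fieldType name="text" class="solr.TextField" positionIncrementGap="100"
           autoGeneratePhraseQueries="false">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Assumed minimal filter chain; the real default may include more. -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- 2) One language-specific type; the contents of text_cjk are assumed. -->
<fieldType name="text_cjk" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.CJKTokenizerFactory"/>
  </analyzer>
</fieldType>
```

The point of the split is that "text" stays a safe default for any
language, while each text_XX type can carry language-specific analysis
without affecting the others.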

Let's continue this on the issue...

Mike

http://blog.mikemccandless.com
