[jira] [Commented] (SOLR-1979) Create LanguageIdentifierUpdateProcessor

2011-10-10 Thread Jan Høydahl (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13124366#comment-13124366
 ] 

Jan Høydahl commented on SOLR-1979:
---

Fixed overview.html in branch_3x.

> Create LanguageIdentifierUpdateProcessor
> 
>
> Key: SOLR-1979
> URL: https://issues.apache.org/jira/browse/SOLR-1979
> Project: Solr
> Issue Type: New Feature
> Components: contrib - LangId, update
> Reporter: Jan Høydahl
> Assignee: Jan Høydahl
> Priority: Minor
> Labels: UpdateProcessor
> Fix For: 3.5, 4.0
>
> Attachments: SOLR-1979-branch_3x.patch, SOLR-1979.patch, 
> SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, 
> SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, 
> SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, 
> SOLR-1979.patch
>
>
> Language identification from document fields, and mapping of field names to 
> language-specific fields based on detected language.
> Wrap the Tika LanguageIdentifier in an UpdateProcessor.
> See user documentation at http://wiki.apache.org/solr/LanguageDetection

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-1979) Create LanguageIdentifierUpdateProcessor

2011-10-10 Thread T Jake Luciani (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13124343#comment-13124343
 ] 

T Jake Luciani commented on SOLR-1979:
--

The build on the 3x branch is still failing because 
solr/contrib/langid/src/java/overview.html was only committed to trunk. This 
file needs to be added to branch_3x as well.




[jira] [Commented] (SOLR-1979) Create LanguageIdentifierUpdateProcessor

2011-10-05 Thread Mark Miller (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13121569#comment-13121569
 ] 

Mark Miller commented on SOLR-1979:
---

Nice! Great feature to get in - thanks guys.




[jira] [Commented] (SOLR-1979) Create LanguageIdentifierUpdateProcessor

2011-09-19 Thread Jan Høydahl (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13107867#comment-13107867
 ] 

Jan Høydahl commented on SOLR-1979:
---

Question: Since I plan to commit this for both 3.x and 4.x, I will add the 
CHANGES entry under the 3.5 section, on trunk as well. I know there has been 
some discussion about where to log changes, but as long as 4.0 is not released 
before 3.5, it will always be true that the feature was released in 3.5 and 
exists in all later versions, right?

> Create LanguageIdentifierUpdateProcessor
> 
>
> Key: SOLR-1979
> URL: https://issues.apache.org/jira/browse/SOLR-1979
> Project: Solr
> Issue Type: New Feature
> Components: contrib - LangId, update
> Reporter: Jan Høydahl
> Assignee: Jan Høydahl
> Priority: Minor
> Labels: UpdateProcessor
> Fix For: 3.5, 4.0
>
> Attachments: SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, 
> SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, 
> SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch
>
>
> Language identification from document fields, and mapping of field names to 
> language-specific fields based on detected language.
> Wrap the Tika LanguageIdentifier in an UpdateProcessor.
> See user documentation at http://wiki.apache.org/solr/LanguageDetection




[jira] [Commented] (SOLR-1979) Create LanguageIdentifierUpdateProcessor

2011-09-12 Thread Jan Høydahl (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13102723#comment-13102723
 ] 

Jan Høydahl commented on SOLR-1979:
---

Any changes you'd like before committing this? Lance, what config param changes 
did you have in mind?

> Create LanguageIdentifierUpdateProcessor
> 
>
> Key: SOLR-1979
> URL: https://issues.apache.org/jira/browse/SOLR-1979
> Project: Solr
> Issue Type: New Feature
> Components: update
> Reporter: Jan Høydahl
> Assignee: Jan Høydahl
> Priority: Minor
> Labels: UpdateProcessor
> Fix For: 3.5
>
> Attachments: SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, 
> SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, 
> SOLR-1979.patch, SOLR-1979.patch
>
>
> Language identification from document fields, and mapping of field names to 
> language-specific fields based on detected language.
> Wrap the Tika LanguageIdentifier in an UpdateProcessor.




[jira] [Commented] (SOLR-1979) Create LanguageIdentifierUpdateProcessor

2011-09-12 Thread Jan Høydahl (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13102646#comment-13102646
 ] 

Jan Høydahl commented on SOLR-1979:
---

Yep, it will skip detection if the field defined in langid.langField is not 
empty and langid.overwrite==false.
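
The skip rule described here can be sketched roughly as follows. This is only 
an illustration, assuming a document is modeled as a plain map of field names 
to values; the class LangIdSkipSketch and the method shouldSkipDetection are 
hypothetical names and not the processor's actual code.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative sketch only; not the actual Solr processor implementation. */
final class LangIdSkipSketch {

    /**
     * Returns true when detection should be skipped: the configured language
     * field already holds a non-empty value and overwriting is disabled.
     * The parameters mirror langid.langField and langid.overwrite.
     */
    static boolean shouldSkipDetection(Map<String, String> doc,
                                       String langField,
                                       boolean overwrite) {
        String existing = doc.get(langField);
        boolean hasLanguage = existing != null && !existing.trim().isEmpty();
        return hasLanguage && !overwrite;
    }

    public static void main(String[] args) {
        // A document whose language was already identified upstream.
        Map<String, String> doc = new HashMap<>();
        doc.put("language_s", "nl");

        System.out.println(shouldSkipDetection(doc, "language_s", false));
        System.out.println(shouldSkipDetection(doc, "language_s", true));
    }
}
```

With a pre-filled language field and overwrite disabled, the costly detection 
step is bypassed while any later mapping step can still use the existing value.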

> Create LanguageIdentifierUpdateProcessor
> 
>
> Key: SOLR-1979
> URL: https://issues.apache.org/jira/browse/SOLR-1979
> Project: Solr
> Issue Type: New Feature
> Components: update
> Reporter: Jan Høydahl
> Assignee: Jan Høydahl
> Priority: Minor
> Labels: UpdateProcessor
> Fix For: 3.5
>
> Attachments: SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, 
> SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch
>
>
> Language identification from document fields, and mapping of field names to 
> language-specific fields based on detected language.
> Wrap the Tika LanguageIdentifier in an UpdateProcessor.




[jira] [Commented] (SOLR-1979) Create LanguageIdentifierUpdateProcessor

2011-09-12 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13102578#comment-13102578
 ] 

Markus Jelsma commented on SOLR-1979:
-

Hi. This is not what I understood from reading the wiki doc. Will the update 
processor skip detection with these settings? It's rather costly on many docs.

Anyway, this is great work already, thanks!




[jira] [Commented] (SOLR-1979) Create LanguageIdentifierUpdateProcessor

2011-09-12 Thread Jan Høydahl (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13102573#comment-13102573
 ] 

Jan Høydahl commented on SOLR-1979:
---

@Markus: Sure. If you put your pre-known language code in the same field 
configured in langid.langField and use langid.overwrite=false, you will obtain 
that behavior.




[jira] [Commented] (SOLR-1979) Create LanguageIdentifierUpdateProcessor

2011-09-12 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13102520#comment-13102520
 ] 

Markus Jelsma commented on SOLR-1979:
-

Hi Jan,

Can we also use the mapping feature without detection? Our detection is done in 
a Nutch cluster so we already identified many millions of docs.

Thanks




[jira] [Commented] (SOLR-1979) Create LanguageIdentifierUpdateProcessor

2011-09-11 Thread Jan Høydahl (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13102374#comment-13102374
 ] 

Jan Høydahl commented on SOLR-1979:
---

Updated documentation for the processor is now at 
http://wiki.apache.org/solr/LanguageDetection

@Lance: What params were on your mind as candidates for keyword instead of 
true/false, and for what potential future reasons?




[jira] [Commented] (SOLR-1979) Create LanguageIdentifierUpdateProcessor

2011-09-09 Thread Lance Norskog (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13101612#comment-13101612
 ] 

Lance Norskog commented on SOLR-1979:
-

I'm impressed! This is a lot of work and empirical testing for a difficult 
problem.

Comments:
There are a few parameters that are true/false, but in the future you might 
want a third answer. It might be worth making the decision via a keyword so you 
can add new keywords later.

About the multiple languages in one field problem: you can't solve everything 
at once. The other document analysis components like UIMA should be able to 
identify parts of documents, and then you use this on one part at a time. This 
is the point of a modular toolkit: you combine the tools to solve advanced 
problems.




[jira] [Commented] (SOLR-1979) Create LanguageIdentifierUpdateProcessor

2011-08-02 Thread Jan Høydahl (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13076259#comment-13076259
 ] 

Jan Høydahl commented on SOLR-1979:
---

This has been tested on a real dataset of several hundred thousand docs, 
including HTML, office docs and multiple other formats, and it works well.

I'd like some more pairs of eyes on this, however.

One thing which is less than perfect is that the threshold conversion from Tika 
currently parses out the (internal) distance value from a String, for lack of a 
getDistance() method (TIKA-568). This is a bit of a hack, but I argue it's a 
beneficial one, since we can now configure langid.threshold to something 
meaningful for our own data instead of the preset binary isReasonablyCertain(). 
As we also normalize to a value between 0 and 1, we abstract away the Tika 
implementation detail and are free to use any improved distance measures from 
Tika in the future, e.g. as a result of TIKA-369, or even plug in a non-Tika 
identifier or a hybrid solution.
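
To illustrate the idea, here is a rough sketch of parsing a distance out of a 
string and normalizing it to a certainty between 0 and 1. It assumes the 
identifier's string form looks like "en (0.02)" with the distance in 
parentheses, and the linear mapping 1 - 5 * distance is only a placeholder 
chosen for this example. The class and method names are hypothetical, and no 
Tika classes are used.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative sketch only; string format and formula are assumptions. */
final class CertaintySketch {

    // Assumed input format: "<lang> (<distance>)", e.g. "en (0.02)".
    private static final Pattern DISTANCE = Pattern.compile("\\(([^)]+)\\)");

    /** Extracts the numeric distance from the parenthesized part. */
    static double parseDistance(String description) {
        Matcher m = DISTANCE.matcher(description);
        if (!m.find()) {
            throw new IllegalArgumentException("No distance in: " + description);
        }
        return Double.parseDouble(m.group(1).trim());
    }

    /** Maps a distance to a certainty clamped to the range [0, 1]. */
    static double certainty(double distance) {
        double c = 1.0 - 5.0 * distance; // placeholder linear mapping
        return Math.max(0.0, Math.min(1.0, c));
    }

    /** Accepts a detection when its certainty reaches the threshold. */
    static boolean accept(String description, double threshold) {
        return certainty(parseDistance(description)) >= threshold;
    }

    public static void main(String[] args) {
        System.out.println(accept("en (0.02)", 0.5));
        System.out.println(accept("en (0.3)", 0.5));
    }
}
```

Because callers only see the normalized certainty, the way the raw distance is 
obtained can change later without touching the threshold configuration.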

> Create LanguageIdentifierUpdateProcessor
> 
>
> Key: SOLR-1979
> URL: https://issues.apache.org/jira/browse/SOLR-1979
> Project: Solr
> Issue Type: New Feature
> Components: update
> Reporter: Jan Høydahl
> Assignee: Jan Høydahl
> Priority: Minor
> Labels: UpdateProcessor
> Fix For: 3.4
>
> Attachments: SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, 
> SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch
>
>
> Language identification from document fields, and mapping of field names to 
> language-specific fields based on detected language.
> Wrap the Tika LanguageIdentifier in an UpdateProcessor.




[jira] [Commented] (SOLR-1979) Create LanguageIdentifierUpdateProcessor

2011-06-22 Thread Jan Høydahl (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13053227#comment-13053227
 ] 

Jan Høydahl commented on SOLR-1979:
---

One question regarding the JUnit test: I now use
{code}
assertU(commit());
{code}
How can I add update request params to this commit? To select another update 
chain from different tests, I'd like to add update params on the fly, e.g.:
{code}
assertU(commit(), "update.chain=langid2");
{code}

> Create LanguageIdentifierUpdateProcessor
> 
>
> Key: SOLR-1979
> URL: https://issues.apache.org/jira/browse/SOLR-1979
> Project: Solr
> Issue Type: New Feature
> Components: update
> Reporter: Jan Høydahl
> Assignee: Jan Høydahl
> Priority: Minor
> Attachments: SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, 
> SOLR-1979.patch, SOLR-1979.patch
>
>
> Language identification from document fields, and mapping of field names to 
> language-specific fields based on detected language.
> Wrap the Tika LanguageIdentifier in an UpdateProcessor.




[jira] [Commented] (SOLR-1979) Create LanguageIdentifierUpdateProcessor

2011-06-03 Thread Jan Høydahl (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13043448#comment-13043448
 ] 

Jan Høydahl commented on SOLR-1979:
---

Continuing work on this, implementing the ideas above...

> Create LanguageIdentifierUpdateProcessor
> 
>
> Key: SOLR-1979
> URL: https://issues.apache.org/jira/browse/SOLR-1979
> Project: Solr
> Issue Type: New Feature
> Components: update
> Reporter: Jan Høydahl
> Assignee: Jan Høydahl
> Priority: Minor
> Attachments: SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, 
> SOLR-1979.patch
>
>
> We need the ability to detect language of some random text in order to act 
> upon it, such as indexing the content into language aware fields. Another 
> usecase is to be able to filter/facet on language on random unstructured 
> content.
> To do this, we wrap the Tika LanguageIdentifier in an UpdateProcessor. The 
> processor is configurable like this:
> {code:xml} 
>class="org.apache.solr.update.processor.LanguageIdentifierUpdateProcessorFactory">
> name,subject
> language_s
> id
> en
>   
> {code} 
> It will then read the text from inputFields name and subject, perform 
> language identification and output the ISO code for the detected language in 
> the outputField. If no language was detected, fallback language is used.




[jira] Commented: (SOLR-1979) Create LanguageIdentifierUpdateProcessor

2010-12-14 Thread Tommaso Teofili (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12971400#action_12971400
 ] 

Tommaso Teofili commented on SOLR-1979:
---

bq. Keep it basic in first version. Allow for per-document and per-field 
detection. Make field-mapping configurable and optional (default off), allowing 
people to chain in their own mapper downstream if they choose.

I agree, this sounds good for a basic implementation.

> Create LanguageIdentifierUpdateProcessor
> 
>
> Key: SOLR-1979
> URL: https://issues.apache.org/jira/browse/SOLR-1979
> Project: Solr
> Issue Type: New Feature
> Components: update
> Reporter: Jan Høydahl
> Assignee: Grant Ingersoll
> Priority: Minor
> Attachments: SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, 
> SOLR-1979.patch
>
>
> We need the ability to detect language of some random text in order to act 
> upon it, such as indexing the content into language aware fields. Another 
> usecase is to be able to filter/facet on language on random unstructured 
> content.
> To do this, we wrap the Tika LanguageIdentifier in an UpdateProcessor. The 
> processor is configurable like this:
> {code:xml} 
>class="org.apache.solr.update.processor.LanguageIdentifierUpdateProcessorFactory">
> name,subject
> language_s
> id
> en
>   
> {code} 
> It will then read the text from inputFields name and subject, perform 
> language identification and output the ISO code for the detected language in 
> the outputField. If no language was detected, fallback language is used.




[jira] Commented: (SOLR-1979) Create LanguageIdentifierUpdateProcessor

2010-12-14 Thread Jan Høydahl (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12971338#action_12971338
 ] 

Jan Høydahl commented on SOLR-1979:
---

{quote}
Jan, do you have any updates to the patch? I'd like to move forward with the 
basic functionality at least, but I still think we need the field mapping 
stuff, or we should punt all field mapping stuff to another processor. WDYT?
{quote}

I don't have any updates.

Keep it basic in first version. Allow for per-document and per-field detection.

Make field-mapping configurable and optional (default off), allowing people to 
chain in their own mapper downstream if they choose.
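
As a rough illustration of optional field mapping, the sketch below renames 
selected fields by appending the detected language code, and leaves the 
document untouched when mapping is switched off. The "_<lang>" suffix 
convention, the class FieldMappingSketch and the method mapFields are 
assumptions made for this example only.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/** Illustrative sketch only; the naming convention is an assumption. */
final class FieldMappingSketch {

    /**
     * Returns a copy of the document in which each field listed in
     * fieldsToMap is renamed to "<field>_<language>". When mappingEnabled
     * is false (the suggested default), field names are left unchanged.
     */
    static Map<String, String> mapFields(Map<String, String> doc,
                                         Set<String> fieldsToMap,
                                         String language,
                                         boolean mappingEnabled) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : doc.entrySet()) {
            String name = e.getKey();
            if (mappingEnabled && fieldsToMap.contains(name)) {
                name = name + "_" + language;
            }
            out.put(name, e.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("id", "1");
        doc.put("title", "Hei verden");

        Set<String> fields = Collections.singleton("title");
        System.out.println(mapFields(doc, fields, "no", true));
        System.out.println(mapFields(doc, fields, "no", false));
    }
}
```

Keeping the mapping behind a switch means a site with its own downstream 
mapper can use detection alone and ignore this step.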

Mixed-language per field is a different beast and should be dealt with 
later. It probably requires analysis changes as well if we want analyzers to 
pick up language from payloads or something.

My 2 cents




[jira] Commented: (SOLR-1979) Create LanguageIdentifierUpdateProcessor

2010-12-14 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12971322#action_12971322
 ] 

Grant Ingersoll commented on SOLR-1979:
---

bq. What about leveraging payloads (we can output term|payload strings to the 
payload field type) for associating languages with fields? 

Yeah, that could be used with mixed language text (or a marker token).  

Jan, do you have any updates to the patch?  I'd like to move forward with the 
basic functionality at least, but I still think we need the field mapping 
stuff, or we should punt all field mapping stuff to another processor.  WDYT?




[jira] Commented: (SOLR-1979) Create LanguageIdentifierUpdateProcessor

2010-12-08 Thread Erik Hatcher (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12969404#action_12969404
 ] 

Erik Hatcher commented on SOLR-1979:


What about leveraging payloads (we can output term|payload strings to the 
payload field type) for associating languages with fields?




[jira] Commented: (SOLR-1979) Create LanguageIdentifierUpdateProcessor

2010-12-07 Thread Lance Norskog (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12969140#action_12969140
 ] 

Lance Norskog commented on SOLR-1979:
-

About Thai: there is a lot of South and East Asian language text out there 
written in phonetic USASCII, especially older pre-Unicode. Samples of these 
texts from different languages have ngram profiles just as distinct as the 
European languages.







[jira] Commented: (SOLR-1979) Create LanguageIdentifierUpdateProcessor

2010-12-07 Thread Lance Norskog (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12969138#action_12969138
 ] 

Lance Norskog commented on SOLR-1979:
-

A use case for multi-language fields: PDFs with different languages in 
different columns. 




[jira] Commented: (SOLR-1979) Create LanguageIdentifierUpdateProcessor

2010-12-07 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12968827#action_12968827
 ] 

Robert Muir commented on SOLR-1979:
---

bq. We also need to detect whether a language is part of a macro language, and 
add both to languages multivalue field, because it should be possible to filter 
on Norwegian (no) without specifying both nn and nb, and also for sami (smi) 
without specifying all of the specific languages.

macrolangs: http://www.sil.org/iso639-3/iso-639-3-macrolanguages_20100128.tab
collections: http://www.loc.gov/standards/iso639-5/iso639-5.tab.txt





[jira] Commented: (SOLR-1979) Create LanguageIdentifierUpdateProcessor

2010-12-07 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12968820#action_12968820
 ] 

Jan Høydahl commented on SOLR-1979:
---

>>I have a plan to add profiles for the Norwegian and Sami languages when time 
>>allows: TIKA-491 TIKA-492
>Did you plan to also upgrade tika from 639-1 for the Sami languages? the only 
>639-1 code i see is "se" but this seems to be appropriate only for North Sami.

Exactly. That's one example which will need a wider range of codes. I was 
planning to use 639-2 for those that do not have a 2-letter code, but it will 
be BCP47 now (although the end result may be more or less the same).

We also need to detect whether a language is part of a macrolanguage, and add 
both to the multivalued languages field, because it should be possible to 
filter on Norwegian (no) without specifying both nn and nb, and also on Sami 
(smi) without specifying all of the specific languages.
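
A minimal sketch of that expansion, assuming a small hand-written lookup 
table. The table below is an illustrative subset only; a real one would be 
generated from the ISO 639-3 macrolanguage and ISO 639-5 collection tables, 
and the class and method names are hypothetical.

```java
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

public class MacroLanguageSketch {
    // Illustrative subset only, not a complete or authoritative table.
    static final Map<String, String> PARENT = Map.of(
        "nb", "no",   // Norwegian Bokmaal -> Norwegian
        "nn", "no",   // Norwegian Nynorsk -> Norwegian
        "se", "smi",  // Northern Sami -> Sami languages
        "sma", "smi", // Southern Sami -> Sami languages
        "smj", "smi"  // Lule Sami -> Sami languages
    );

    /** Returns the detected code plus its parent code, if it has one. */
    static Set<String> expand(String detected) {
        Set<String> codes = new LinkedHashSet<>();
        codes.add(detected);
        String parent = PARENT.get(detected);
        if (parent != null) {
            codes.add(parent);
        }
        return codes;
    }

    public static void main(String[] args) {
        System.out.println(expand("nn")); // [nn, no]
        System.out.println(expand("en")); // [en]
    }
}
```

Writing all returned codes to the multivalued field lets a filter on "no" 
match documents detected as either nn or nb.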




[jira] Commented: (SOLR-1979) Create LanguageIdentifierUpdateProcessor

2010-12-07 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12968813#action_12968813
 ] 

Robert Muir commented on SOLR-1979:
---

bq. I have a plan to add profiles for the Norwegian and Sami languages when 
time allows: TIKA-491 TIKA-492

Did you plan to also upgrade Tika from 639-1 for the Sami languages? The only 
639-1 code I see is "se", but this seems to be appropriate only for North Sami.





[jira] Commented: (SOLR-1979) Create LanguageIdentifierUpdateProcessor

2010-12-07 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12968806#action_12968806
 ] 

Jan Høydahl commented on SOLR-1979:
---

Discussion on the process for adding language profiles to TIKA should be 
continued in TIKA-546

I have a plan to add profiles for the Norwegian and Sami languages when time 
allows: TIKA-491 TIKA-492




[jira] Commented: (SOLR-1979) Create LanguageIdentifierUpdateProcessor

2010-12-07 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12968786#action_12968786
 ] 

Robert Muir commented on SOLR-1979:
---

bq. Kind of random that Thai is thrown in there!

I agree, I tend to detect Thai by the characters being between U+0E00 and 
U+0E7F.

Anyway, if we add more languages it would be good if one of us could document 
the process, because many important ones are missing.
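
The range check mentioned above can be sketched as follows. This is only an 
illustration of the idea: the class and method names are hypothetical, and 
using the fraction of Thai letters as a signal is an assumption of this 
sketch rather than something from the patch.

```java
public class ThaiScriptSketch {
    /** True if the code point lies in the Thai block, U+0E00 to U+0E7F. */
    static boolean isThai(int codePoint) {
        return codePoint >= 0x0E00 && codePoint <= 0x0E7F;
    }

    /** Fraction of the letters in the text that fall in the Thai block. */
    static double thaiRatio(String text) {
        long letters = text.codePoints().filter(Character::isLetter).count();
        if (letters == 0) {
            return 0.0;
        }
        long thai = text.codePoints()
                        .filter(Character::isLetter)
                        .filter(ThaiScriptSketch::isThai)
                        .count();
        return (double) thai / letters;
    }

    public static void main(String[] args) {
        // Two Thai consonants written as Unicode escapes.
        System.out.println(thaiRatio("\u0E01\u0E02"));
        System.out.println(thaiRatio("plain ASCII text"));
    }
}
```

A script check like this cannot recognise Thai written in romanized ASCII, 
which would still need a profile-based approach.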





[jira] Commented: (SOLR-1979) Create LanguageIdentifierUpdateProcessor

2010-12-07 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12968777#action_12968777
 ] 

Grant Ingersoll commented on SOLR-1979:
---

Sorry, you are right.  See 
http://svn.apache.org/repos/asf/tika/trunk/tika-core/src/main/resources/org/apache/tika/language/tika.language.properties
 

{quote}
name.da=Danish
name.de=German
name.et=Estonian
name.el=Greek
name.en=English
name.es=Spanish
name.fi=Finnish
name.fr=French
name.hu=Hungarian
name.is=Icelandic
name.it=Italian
name.nl=Dutch
name.no=Norwegian
name.pl=Polish
name.pt=Portuguese
name.ru=Russian
name.sv=Swedish
name.th=Thai
{quote}

Kind of random that Thai is thrown in there!




[jira] Commented: (SOLR-1979) Create LanguageIdentifierUpdateProcessor

2010-12-07 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12968760#action_12968760
 ] 

Robert Muir commented on SOLR-1979:
---

bq. Have a look at http://tika.apache.org/0.8/detection.html

That page does not have a list of languages.





[jira] Commented: (SOLR-1979) Create LanguageIdentifierUpdateProcessor

2010-12-07 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12968757#action_12968757
 ] 

Grant Ingersoll commented on SOLR-1979:
---

Have a look at http://tika.apache.org/0.8/detection.html

Really, though, you need to dig into the Tika class: LanguageIdentifier.  
Adding languages, AFAICT, involves building the model accordingly and then 
letting Tika know about it via a properties file.




[jira] Commented: (SOLR-1979) Create LanguageIdentifierUpdateProcessor

2010-12-07 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12968753#action_12968753
 ] 

Robert Muir commented on SOLR-1979:
---

bq. I also think we need to get together and add a bunch more languages to Tika 
b/c it is pretty unacceptable to not have, at a minimum, support for the big 
Asian languages of CJK.

What languages does Tika support in its identifier? I couldn't find an actual 
list, only a ref to Europarl (http://www.statmt.org/europarl/). Is it just 
those languages?

Also, are there docs on what's necessary (legally and technically) to 
contribute a new profile? Is just recording n-grams from Creative Commons text 
acceptable?





[jira] Commented: (SOLR-1979) Create LanguageIdentifierUpdateProcessor

2010-12-07 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12968748#action_12968748
 ] 

Grant Ingersoll commented on SOLR-1979:
---

I'm going to be out of pocket for the next week.  If someone can put the field 
mapping stuff up, then I think we will have the basis for a good first pass at 
this, which we can then iterate on.  I also think we need to get together and 
add a bunch more languages to Tika b/c it is pretty unacceptable to not have, 
at a minimum, support for the big Asian languages of CJK.




[jira] Commented: (SOLR-1979) Create LanguageIdentifierUpdateProcessor

2010-12-07 Thread Tommaso Teofili (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12968633#action_12968633
 ] 

Tommaso Teofili commented on SOLR-1979:
---

bq. However, have you considered extending the document model to allow metadata 
per field? Then @language would be a valid field metadata, mostly as a means 
for later processing to pick up and act on. This can be a valuable mechanism 
for other inter processor communication as well as to pass info between 
document centric processing and Analysis.

I've also thought about this option and it sounds reasonable, but I think it 
would be a very large change to the API. So from one point of view I like the 
idea, but from another I think it could lead to a proliferation of @metadata.
So in the end I don't have a strong opinion on that, but I have to say that 
I've seen such customizations in a production environment to leverage per-field 
metadata.

Regarding per-field and per-document language fields, I think that a document 
language field could be handled with two fixed strategies/policies (that can 
also be extended):
# restrictive strategy - if different languages would be mapped into the 
document language field, then say that the document language is, for example, 
"x-unspecified"
# simple strategy - map all the detected languages (per field) into the 
document language field as different values (so multivalued="true")
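
The two strategies can be sketched like this. It is a minimal sketch, not 
part of any patch: the class and method names are hypothetical, and treating 
an empty input as "x-unspecified" in the restrictive case is an assumption 
of this example.

```java
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class DocLanguageStrategySketch {
    static final String UNSPECIFIED = "x-unspecified";

    /**
     * Restrictive: a single value. If the fields disagree (or nothing was
     * detected), the marker value is returned instead.
     */
    static List<String> restrictive(Collection<String> perFieldLanguages) {
        Set<String> distinct = new LinkedHashSet<>(perFieldLanguages);
        if (distinct.size() == 1) {
            return List.copyOf(distinct);
        }
        return List.of(UNSPECIFIED);
    }

    /** Simple: every distinct per-field language becomes one value. */
    static List<String> simple(Collection<String> perFieldLanguages) {
        return List.copyOf(new LinkedHashSet<>(perFieldLanguages));
    }

    public static void main(String[] args) {
        List<String> detected = List.of("en", "fr", "en");
        System.out.println(restrictive(detected)); // [x-unspecified]
        System.out.println(simple(detected));      // [en, fr]
    }
}
```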







[jira] Commented: (SOLR-1979) Create LanguageIdentifierUpdateProcessor

2010-12-07 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12968627#action_12968627
 ] 

Jan Høydahl commented on SOLR-1979:
---

Allow for both a "language" field and a "languages" (multivalued) field.
If fields are mapped, the new name reflects the language, so I don't know if we 
need a field->lang mapping.
However, have you considered extending the document model to allow metadata per 
field? Then @language would be a valid field metadata, mostly as a means for 
later processing to pick up and act on. This can be a valuable mechanism for 
other inter processor communication as well as to pass info between document 
centric processing and Analysis.




[jira] Commented: (SOLR-1979) Create LanguageIdentifierUpdateProcessor

2010-12-06 Thread Erik Hatcher (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12968582#action_12968582
 ] 

Erik Hatcher commented on SOLR-1979:


Oh, and don't get me wrong, I get the multivalued language per document need 
too, here.  Anyway, it'll be easy enough to add support for this to be 
controlled through configuration.  In single language per doc mode, basically 
concatenate all of the fields specified, detect on that, and map into a 
single-valued language field.  Language-per-field I get too, of course... just 
depends on the domain being modeled and in my experience I've seen apps 
designed both ways.  Neither way is the one true way, it just depends. 

And of course Muir is smirking and saying "heck, you have multiple languages 
within a field often too, so we need to account for that somehow too".  But 
probably not here, yet.




[jira] Commented: (SOLR-1979) Create LanguageIdentifierUpdateProcessor

2010-12-06 Thread Erik Hatcher (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12968576#action_12968576
 ] 

Erik Hatcher commented on SOLR-1979:


If a list of fields (by name) is mapped into a corresponding parallel 
identified language code field, do we leave it up to search clients to also 
know the list of field names in order to match a field (say title) with its 
identified language?

A language field shouldn't have to be multivalued - it just doesn't match the 
domain model of many search applications where there will only ever be one 
language per document.




[jira] Commented: (SOLR-1979) Create LanguageIdentifierUpdateProcessor

2010-12-06 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12968528#action_12968528
 ] 

Grant Ingersoll commented on SOLR-1979:
---

bq. So for all unmapped languages, you may want to map to a single generic 
field, or not map at all (leave field as is).

It currently leaves it in the original field.

bq. Also, if there are multiple input fields, the current patch would create 
multiple language field values requiring that field to be multi-valued. Is the 
goal here to identify a single language for a document? Or a separate language 
value for each of the input fields (which seems odd to me)?

Current patch requires multivalued language field.  I figure the main thing you 
want the lang. field for is faceting and filtering, but it can be changed.  As 
for the broader goal, I think it makes sense to detect languages per field and 
not per document.  In other words, you can have multiple languages in a single 
document.
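
As a rough sketch of per-field detection feeding one multi-valued language field (Python for illustration only; detect_per_field is a hypothetical name, and dropping repeated codes is an assumption, not necessarily what the patch does):

```python
from typing import Callable, Dict, List, Optional


def detect_per_field(
    doc: Dict[str, object],
    input_fields: List[str],
    lang_field: str,
    detect: Callable[[str], Optional[str]],
) -> Dict[str, object]:
    """Detect a language for each input field and collect the codes in lang_field."""
    langs: List[str] = []
    for name in input_fields:
        value = doc.get(name)
        if not isinstance(value, str) or not value:
            continue  # skip missing or empty fields
        lang = detect(value)
        # Assumption: each distinct code is stored once, in first-seen order.
        if lang is not None and lang not in langs:
            langs.append(lang)
    result = dict(doc)
    result[lang_field] = langs  # a list, so the schema field must be multi-valued
    return result
```

A document with an English name and a French subject would then carry both codes in the language field, which is what makes faceting and filtering on it possible.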

> Create LanguageIdentifierUpdateProcessor
> 
>
> Key: SOLR-1979
> URL: https://issues.apache.org/jira/browse/SOLR-1979
> Project: Solr
>  Issue Type: New Feature
>  Components: update
>Reporter: Jan Høydahl
>Assignee: Grant Ingersoll
>Priority: Minor
> Attachments: SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, 
> SOLR-1979.patch
>
>



[jira] Commented: (SOLR-1979) Create LanguageIdentifierUpdateProcessor

2010-12-06 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12968445#action_12968445
 ] 

Yonik Seeley commented on SOLR-1979:


bq. In skimming the current patch, it looks like fields get mapped no matter 
what. What if I just want the language detected and added as another field, but 
no field mapping desired?

Yeah, that's sort of in line with my:
bq. And just because you can detect a language doesn't mean you know how to 
handle it differently... so also have an optional catchall that handles all 
languages not specifically mapped.

So for all unmapped languages, you may want to map to a single generic field, 
or not map at all (leave field as is).
I guess it also depends on the general strategy... if you are detecting 
language on the "body" field, are we using a copyField type approach and only 
storing the body field while indexing as body_enText, or are we moving the 
field from "body" to "body_enText"?

bq. Also, if there are multiple input fields, the current patch would create 
multiple language field values requiring that field to be multi-valued. Is the 
goal here to identify a single language for a document?

I could see both making sense.
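
The options above (map to a language-specific field, map to one generic field, or leave the field alone, and either copy or move the value) could be sketched like this. It is a Python illustration with hypothetical names (map_field, generic_field, keep_original), not code from the patch.

```python
from typing import Dict, Optional, Set


def map_field(
    doc: Dict[str, str],
    field: str,
    lang: str,
    mapped_langs: Set[str],
    generic_field: Optional[str] = None,
    keep_original: bool = False,
) -> Dict[str, str]:
    """Return a copy of doc with `field` mapped according to the detected language."""
    result = dict(doc)
    if lang in mapped_langs:
        target = f"{field}_{lang}Text"   # e.g. body -> body_enText
    elif generic_field is not None:
        target = generic_field           # single field for all unmapped languages
    else:
        return result                    # not mapped at all: leave field as is
    result[target] = doc[field]
    if not keep_original:
        del result[field]                # "move" rather than copyField-style "copy"
    return result
```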

> Create LanguageIdentifierUpdateProcessor
> 
>
> Key: SOLR-1979
> URL: https://issues.apache.org/jira/browse/SOLR-1979
> Project: Solr
>  Issue Type: New Feature
>  Components: update
>Reporter: Jan Høydahl
>Assignee: Grant Ingersoll
>Priority: Minor
> Attachments: SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, 
> SOLR-1979.patch
>
>



[jira] Commented: (SOLR-1979) Create LanguageIdentifierUpdateProcessor

2010-12-06 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12967214#action_12967214
 ] 

Grant Ingersoll commented on SOLR-1979:
---

bq. There should be a way to output the language for the whole document to some 
field as some applications need to filter on language.

There is.  It's the langField.

bq. Can't we validate the output mapping (and log it!) at initialization time?

To some extent, but users can also pass it in.  

bq. We should not be using 639-1 codes in any APIs!!!

I'll update the patch.

> Create LanguageIdentifierUpdateProcessor
> 
>
> Key: SOLR-1979
> URL: https://issues.apache.org/jira/browse/SOLR-1979
> Project: Solr
>  Issue Type: New Feature
>  Components: update
>Reporter: Jan Høydahl
>Assignee: Grant Ingersoll
>Priority: Minor
> Attachments: SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch
>
>



[jira] Commented: (SOLR-1979) Create LanguageIdentifierUpdateProcessor

2010-12-06 Thread Jan Høydahl (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12967211#action_12967211
 ] 

Jan Høydahl commented on SOLR-1979:
---

@Grant: "I dropped the outputField setting and a number of other settings"

There should be a way to output the language for the whole document to some 
field as some applications need to filter on language.

I like making most things configurable, but with good defaults that fit most 
needs. The default could be to detect a document-wide language from all input 
fields and output this to a "language_s" field, unless you specify params 
docLangInputFields=f1,f2.. and docLangOutputField=nn. Likewise, make it easy to 
disable field renaming.

> Create LanguageIdentifierUpdateProcessor
> 
>
> Key: SOLR-1979
> URL: https://issues.apache.org/jira/browse/SOLR-1979
> Project: Solr
>  Issue Type: New Feature
>  Components: update
>Reporter: Jan Høydahl
>Assignee: Grant Ingersoll
>Priority: Minor
> Attachments: SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch
>
>



[jira] Commented: (SOLR-1979) Create LanguageIdentifierUpdateProcessor

2010-12-06 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12967204#action_12967204
 ] 

Yonik Seeley commented on SOLR-1979:


bq. Yonik, I wasn't planning on relying on dynamic fields necessarily. It may 
make sense to have users either predeclare the variations.

Sure, but the problem was the ease with which a generated field of 
originalname_${langcode} could clash with existing fields (regardless of whether 
they are dynamic fields) due to there being many different language codes.

If we use regex naming as Jan suggests (or another configurable mechanism) then 
the issue comes down to what we configure by default or by example.
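
To make the clash concrete, here is a small Python sketch that checks a generated name against dynamic field patterns. The pattern list is an illustrative assumption (suffixes of the kind found in example schemas), and colliding_patterns is a hypothetical helper, not part of Solr.

```python
from fnmatch import fnmatchcase
from typing import List, Sequence

# Assumed dynamic field patterns, for illustration only.
DYNAMIC_FIELD_PATTERNS = ["*_s", "*_i", "*_t", "*_tl", "*_ti", "*_is"]


def colliding_patterns(
    field: str,
    lang_code: str,
    patterns: Sequence[str] = DYNAMIC_FIELD_PATTERNS,
) -> List[str]:
    """Return the dynamic field patterns matched by originalname_langcode."""
    generated = f"{field}_{lang_code}"
    return [p for p in patterns if fnmatchcase(generated, p)]
```

Under these assumed patterns, the codes tl (Tagalog), ti (Tigrinya) and is (Icelandic) each produce a name that an existing pattern already claims, while en does not.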

> Create LanguageIdentifierUpdateProcessor
> 
>
> Key: SOLR-1979
> URL: https://issues.apache.org/jira/browse/SOLR-1979
> Project: Solr
>  Issue Type: New Feature
>  Components: update
>Reporter: Jan Høydahl
>Assignee: Grant Ingersoll
>Priority: Minor
> Attachments: SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch
>
>



[jira] Commented: (SOLR-1979) Create LanguageIdentifierUpdateProcessor

2010-12-06 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12967201#action_12967201
 ] 

Robert Muir commented on SOLR-1979:
---

bq. Both also rely on those fields existing.

I don't think this check should be at "runtime" either.

What if you are indexing lots of documents and suddenly you encounter a Thai 
document (or one mis-detected as Thai!) and the whole thing fails?

Can't we validate the output mapping (and log it!) at initialization time?
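
A minimal sketch of what validating at initialization time might look like, in Python for illustration. The function validate_mapping and its arguments are hypothetical; the idea is only that every field the mapping can produce is checked against the schema once, up front, and the result is logged.

```python
import logging
from typing import Dict, Iterable, List, Set

log = logging.getLogger("langid")


def validate_mapping(
    source_fields: Iterable[str],
    lang_to_suffix: Dict[str, str],
    schema_fields: Set[str],
) -> List[str]:
    """Fail fast if any field the mapping can produce is missing from the schema."""
    targets = sorted(
        {f"{src}_{suffix}" for src in source_fields for suffix in lang_to_suffix.values()}
    )
    missing = [t for t in targets if t not in schema_fields]
    if missing:
        raise ValueError("mapped fields missing from schema: " + ", ".join(missing))
    log.info("language field mapping targets: %s", ", ".join(targets))
    return targets
```

With this kind of check, a mapping that can emit body_th against a schema without that field is rejected at startup instead of failing in the middle of an indexing run.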


> Create LanguageIdentifierUpdateProcessor
> 
>
> Key: SOLR-1979
> URL: https://issues.apache.org/jira/browse/SOLR-1979
> Project: Solr
>  Issue Type: New Feature
>  Components: update
>Reporter: Jan Høydahl
>Assignee: Grant Ingersoll
>Priority: Minor
> Attachments: SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch
>
>



[jira] Commented: (SOLR-1979) Create LanguageIdentifierUpdateProcessor

2010-12-06 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12967191#action_12967191
 ] 

Robert Muir commented on SOLR-1979:
---

bq. Agreed. The only thing we are doing now is using the language that the 
language detector returns as part of the field name. Both of these steps are 
easily overridable. Both also rely on those fields existing.

"Easily overridable" does not solve the problem!

Please don't commit this; it's so easy to just change the code, variable names, 
and documentation here to say these interfaces are BCP47 language ids.

We should not be using 639-1 codes in any APIs!!!

> Create LanguageIdentifierUpdateProcessor
> 
>
> Key: SOLR-1979
> URL: https://issues.apache.org/jira/browse/SOLR-1979
> Project: Solr
>  Issue Type: New Feature
>  Components: update
>Reporter: Jan Høydahl
>Assignee: Grant Ingersoll
>Priority: Minor
> Attachments: SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch
>
>



[jira] Commented: (SOLR-1979) Create LanguageIdentifierUpdateProcessor

2010-12-06 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12967186#action_12967186
 ] 

Grant Ingersoll commented on SOLR-1979:
---

bq.  but in solr, when designing up front, i was just saying we shouldn't limit 
any abstract portion to 639-1 when another implementation might support 3066 or 
BCP47... we should make sure we allow that.

Agreed. The only thing we are doing now is using the language that the language 
detector returns as part of the field name.  Both of these steps are easily 
overridable.  Both also rely on those fields existing.

bq. This could be problematic given a large set of language codes since they 
could collide with existing dynamic field definitions.

Yonik, I wasn't planning on relying on dynamic fields necessarily.  It may make 
sense to have users either predeclare the variations.

All in all, I would like to see Solr have better support for languages in both 
the schema and the config.  In my experience, in apps that have to support a 
lot of languages, there is a lot of redundancy in both the schema and the 
config.

> Create LanguageIdentifierUpdateProcessor
> 
>
> Key: SOLR-1979
> URL: https://issues.apache.org/jira/browse/SOLR-1979
> Project: Solr
>  Issue Type: New Feature
>  Components: update
>Reporter: Jan Høydahl
>Assignee: Grant Ingersoll
>Priority: Minor
> Attachments: SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch
>
>



[jira] Commented: (SOLR-1979) Create LanguageIdentifierUpdateProcessor

2010-12-05 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12967076#action_12967076
 ] 

Robert Muir commented on SOLR-1979:
---

{quote}
It makes sense to allow for detecting languages outside 639-1, and I believe 
RFC3066 and BCP47 are both re-using the 639 codes, so that if there is a 
2-letter code for a language it will be used. 639-1 is what "everyone" already 
knows.

In general, improvements should be done in Tika space, then use those in Solr, 
thus building one strong language detection library.
{quote}

Yes, they do: the 639-1 codes that Tika outputs are also valid BCP47 codes :)

But in Solr, when designing up front, I was just saying we shouldn't limit any 
abstract portion to 639-1 when another implementation might support 3066 or 
BCP47... we should make sure we allow that.
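
To illustrate why two-letter 639-1 codes already pass as BCP47 tags while the broader format leaves room for more, here is a Python sketch of a deliberately simplified tag check. It covers only language, optional script and optional region, which is an assumption made for brevity; the full BCP 47 grammar also has variants, extensions and private-use subtags. The name is_simple_langtag is hypothetical.

```python
import re

# Simplified subset of the BCP 47 grammar: language[-Script][-REGION].
# Variants, extensions, private-use and grandfathered tags are ignored here.
SIMPLE_LANGTAG = re.compile(
    r"[A-Za-z]{2,3}"                   # primary language subtag (ISO 639 code)
    r"(?:-[A-Za-z]{4})?"               # optional script, e.g. Hant
    r"(?:-(?:[A-Za-z]{2}|[0-9]{3}))?"  # optional region, e.g. CN or 419
)


def is_simple_langtag(tag: str) -> bool:
    """True if tag fits the simplified language[-Script][-REGION] shape."""
    return SIMPLE_LANGTAG.fullmatch(tag) is not None
```

Codes such as en or no fit this shape unchanged, and so do tags like zh-Hant or und (undetermined) that a 639-1-only interface could not express.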


> Create LanguageIdentifierUpdateProcessor
> 
>
> Key: SOLR-1979
> URL: https://issues.apache.org/jira/browse/SOLR-1979
> Project: Solr
>  Issue Type: New Feature
>  Components: update
>Reporter: Jan Høydahl
>Assignee: Grant Ingersoll
>Priority: Minor
> Attachments: SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch
>
>



[jira] Commented: (SOLR-1979) Create LanguageIdentifierUpdateProcessor

2010-12-05 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12967048#action_12967048
 ] 

Grant Ingersoll commented on SOLR-1979:
---

Note, the patch still needs more tests and needs to check headers, etc. as well 
as the better field mapping and the proper language support that Robert is 
talking about.

> Create LanguageIdentifierUpdateProcessor
> 
>
> Key: SOLR-1979
> URL: https://issues.apache.org/jira/browse/SOLR-1979
> Project: Solr
>  Issue Type: New Feature
>  Components: update
>Reporter: Jan Høydahl
>Assignee: Grant Ingersoll
>Priority: Minor
> Attachments: SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch
>
>



[jira] Commented: (SOLR-1979) Create LanguageIdentifierUpdateProcessor

2010-12-05 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12967046#action_12967046
 ] 

Grant Ingersoll commented on SOLR-1979:
---

bq. @Grant: I actually planned to do the regEx based field name mapping in a 
separate UpdateProcessor, to make things more flexible

I don't really see that it makes it any more flexible.  If it was a general 
purpose mapper, maybe, but since it is tied to the language field, why not just 
put it in the language processor?  I've already made the method that chooses the 
output field protected.  With that, one would merely need to extend the class to 
provide an alternative to the method you have proposed.
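
The extension point described here might look roughly like the following, shown in Python for illustration. The class and method names are hypothetical and do not come from the patch; the leading underscore plays the role of a protected method.

```python
from typing import Dict


class LanguageFieldMapper:
    """Moves a field to a language-specific name chosen by an overridable hook."""

    def _output_field(self, field: str, lang: str) -> str:
        # Default naming: original field name, "_", then the language code.
        return f"{field}_{lang}"

    def process(self, doc: Dict[str, str], field: str, lang: str) -> Dict[str, str]:
        result = dict(doc)
        result[self._output_field(field, lang)] = result.pop(field)
        return result


class TextSuffixMapper(LanguageFieldMapper):
    """Alternate naming scheme, supplied by overriding only the hook."""

    def _output_field(self, field: str, lang: str) -> str:
        return f"{field}_{lang}Text"
```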

> Create LanguageIdentifierUpdateProcessor
> 
>
> Key: SOLR-1979
> URL: https://issues.apache.org/jira/browse/SOLR-1979
> Project: Solr
>  Issue Type: New Feature
>  Components: update
>Reporter: Jan Høydahl
>Assignee: Grant Ingersoll
>Priority: Minor
> Attachments: SOLR-1979.patch, SOLR-1979.patch
>
>



[jira] Commented: (SOLR-1979) Create LanguageIdentifierUpdateProcessor

2010-12-05 Thread Jan Høydahl (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12967032#action_12967032
 ] 

Jan Høydahl commented on SOLR-1979:
---

@Robert: Yes, there must be a way to tell whether or not the language even has 
a profile, through some well defined method. It's not important HOW we improve 
detection certainty, but comparing the top n distances could help. I'm also a 
fan of including other metrics than profile similarity if that can help, 
however for unique scripts that will automatically be covered by profile 
similarity. Detailed solution discussions should continue in TIKA-369.

Macro languages: See TIKA-493

It makes sense to allow for detecting languages outside 639-1, and I believe 
RFC3066 and BCP47 are both re-using the 639 codes, so that if there is a 
2-letter code for a language it will be used. 639-1 is what "everyone" already 
knows.

In general, improvements should be done in Tika space, then use those in Solr, 
thus building one strong language detection library.

@Grant: I actually planned to do the regEx based field name mapping in a 
separate UpdateProcessor, to make things more flexible. Example:
{code:xml} 
  
language
(.*?)_lang
$1_$lang
$1_t
de,en,fr,it,es,nl
  
{code} 

Your idea of detecting language for individual fields in one go is 
also interesting. I'd love to see metadata support in SolrInputDocument, so 
that one processor could annotate a @language on the fields analyzed. Then the next 
processor could act on the metadata to rename the field...

@Yonik: By allowing regex naming of field names, we give users a generic tool 
to avoid field name clashes by picking the pattern. Mapping multiple 
languages to the same suffix also makes sense.
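
The regex-based renaming idea can be sketched as below, using the values from the example above ((.*?)_lang, $1_$lang, $1_t and the language list). This is Python for illustration: the parameter names are hypothetical, and the replacement syntax is Python's \g<1> rather than $1.

```python
import re
from typing import Set


def rename_field(
    name: str,
    lang: str,
    supported: Set[str],
    pattern: str = r"(.*?)_lang",
    mapped: str = r"\g<1>_{lang}",
    unsupported: str = r"\g<1>_t",
) -> str:
    """Rename a field matching `pattern` according to the detected language."""
    match = re.fullmatch(pattern, name)
    if match is None:
        return name  # field does not take part in language mapping
    template = mapped if lang in supported else unsupported
    return match.expand(template.replace("{lang}", lang))
```

Because the user picks both the source pattern and the target template, the generated names can be kept clear of whatever field names the schema already uses.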


> Create LanguageIdentifierUpdateProcessor
> 
>
> Key: SOLR-1979
> URL: https://issues.apache.org/jira/browse/SOLR-1979
> Project: Solr
>  Issue Type: New Feature
>  Components: update
>Reporter: Jan Høydahl
>Assignee: Grant Ingersoll
>Priority: Minor
> Attachments: SOLR-1979.patch, SOLR-1979.patch
>
>



[jira] Commented: (SOLR-1979) Create LanguageIdentifierUpdateProcessor

2010-12-05 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12967019#action_12967019
 ] 

Robert Muir commented on SOLR-1979:
---

bq. Yeah, that makes sense, however, I believe Tika returns 639.

Right, but 639 is just a subset of 3066 etc. 

So, ignore what Tika does. Its 639 identifiers are also valid 3066.

Our API should at least be 3066. Java7/ICU already support BCP47 locale 
identifiers etc., so you get the normalization there for free.

{quote}
It would probably also be nice to be able to map a number of languages to a 
single field: say you have a single analyzer that can handle CJK, then you 
may want that whole collection of languages mapped to a single _cjk field.

And just because you can detect a language doesn't mean you know how to handle 
it differently... so also have an optional catchall that handles all languages 
not specifically mapped.
{quote}

Both of these are good reasons why we must avoid 639-1.
We should be able to use things like macrolanguages and undetermined language.





> Create LanguageIdentifierUpdateProcessor
> 
>
> Key: SOLR-1979
> URL: https://issues.apache.org/jira/browse/SOLR-1979
> Project: Solr
>  Issue Type: New Feature
>  Components: update
>Reporter: Jan Høydahl
>Assignee: Grant Ingersoll
>Priority: Minor
> Attachments: SOLR-1979.patch, SOLR-1979.patch
>
>



[jira] Commented: (SOLR-1979) Create LanguageIdentifierUpdateProcessor

2010-12-05 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12967016#action_12967016
 ] 

Yonik Seeley commented on SOLR-1979:


bq. The new field is made by concatenating the original field name with "_" + 
the ISO 639 code. 

This could be problematic given a large set of language codes since they could 
collide with existing dynamic field definitions.
Perhaps something with "text" in the name also?

Perhaps fieldName_${langCode}Text

Examples:
name_enText
name_frText

It would probably also be nice to be able to map a number of languages to a 
single field: say you have a single analyzer that can handle CJK, then you 
may want that whole collection of languages mapped to a single _cjk field.

And just because you can detect a language doesn't mean you know how to handle 
it differently... so also have an optional catchall that handles all languages 
not specifically mapped.
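
A sketch of the naming scheme with language grouping and a catchall, in Python for illustration. The group table, the catchall code "generic" and the function name target_field are assumptions; only the fieldName_${langCode}Text shape comes from the comment above.

```python
from string import Template
from typing import Dict

# Naming scheme from the comment: fieldName_${langCode}Text
FIELD_TEMPLATE = Template("${field}_${langCode}Text")

# Assumed grouping: several detected languages may share one target code.
LANG_GROUPS: Dict[str, str] = {
    "en": "en",
    "fr": "fr",
    "zh": "cjk",
    "ja": "cjk",
    "ko": "cjk",
}


def target_field(field: str, lang: str, catchall: str = "generic") -> str:
    """Pick the target field; languages without a mapping share the catchall."""
    code = LANG_GROUPS.get(lang, catchall)
    return FIELD_TEMPLATE.substitute(field=field, langCode=code)
```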




> Create LanguageIdentifierUpdateProcessor
> 
>
> Key: SOLR-1979
> URL: https://issues.apache.org/jira/browse/SOLR-1979
> Project: Solr
>  Issue Type: New Feature
>  Components: update
>Reporter: Jan Høydahl
>Assignee: Grant Ingersoll
>Priority: Minor
> Attachments: SOLR-1979.patch, SOLR-1979.patch
>
>



[jira] Commented: (SOLR-1979) Create LanguageIdentifierUpdateProcessor

2010-12-05 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12967011#action_12967011
 ] 

Grant Ingersoll commented on SOLR-1979:
---

Another thought here is that, over time, this class becomes a base class and 
it becomes easy to replace the language detection piece. That way one gets all 
the infrastructure of this class but can plug in their own detection.  In fact, 
I'm going to do that right now.
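
The base-class idea could be sketched like this, in Python for illustration. The class names are hypothetical, and the script-range detector is a toy stand-in rather than anything from Tika or the patch; the point is only that the detection step is the single piece a subclass has to supply.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Optional


class LanguageIdentifierBase(ABC):
    """Shared infrastructure; subclasses supply only the detection step."""

    def __init__(self, input_fields: List[str], lang_field: str, fallback: str) -> None:
        self.input_fields = input_fields
        self.lang_field = lang_field
        self.fallback = fallback

    @abstractmethod
    def detect(self, text: str) -> Optional[str]:
        """Return a language code, or None if nothing was detected."""

    def process(self, doc: Dict[str, str]) -> Dict[str, str]:
        text = " ".join(doc[f] for f in self.input_fields if doc.get(f))
        lang = self.detect(text) if text else None
        result = dict(doc)
        result[self.lang_field] = lang if lang is not None else self.fallback
        return result


class ScriptRangeIdentifier(LanguageIdentifierBase):
    """Toy detector: guesses from Unicode blocks (Thai and Greek only)."""

    def detect(self, text: str) -> Optional[str]:
        if any("\u0e00" <= ch <= "\u0e7f" for ch in text):
            return "th"
        if any("\u0370" <= ch <= "\u03ff" for ch in text):
            return "el"
        return None
```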




[jira] Commented: (SOLR-1979) Create LanguageIdentifierUpdateProcessor

2010-12-05 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12967010#action_12967010
 ] 

Grant Ingersoll commented on SOLR-1979:
---

bq. I would like to see RFC 3066 instead

Yeah, that makes sense; however, I believe Tika returns ISO 639 codes (and 
Tika doesn't recognize Chinese at all yet).  One approach is that we could 
normalize, I suppose.  Another is to fix Tika.  I'd really like to see Tika 
support more languages, too.

Longer term, I'd like to not do the fieldName_LangCode thing at all and instead 
let the user supply a string that could have variable substitution if they 
want, something like fieldName_${langCode}, or it could be 
${langCode}_fieldName or it could just be another literal.
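
The substitution idea could look roughly like this. FieldNameTemplate is a hypothetical name used for illustration, and only the ${langCode} variable mentioned above is handled.

```java
/** Hypothetical helper: expands ${langCode} in a user-supplied field-name pattern. */
class FieldNameTemplate {
    private static final String PLACEHOLDER = "${langCode}";
    private final String pattern;

    FieldNameTemplate(String pattern) {
        this.pattern = pattern;
    }

    /** Returns the pattern with every ${langCode} replaced; literals pass through unchanged. */
    String resolve(String langCode) {
        // String.replace treats both arguments as literal text, so "$" and "{" need no escaping
        return pattern.replace(PLACEHOLDER, langCode);
    }
}
```

The same class covers all three cases from the comment: a suffix pattern ("fieldName_${langCode}"), a prefix pattern ("${langCode}_fieldName"), and a plain literal that contains no variable at all.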




[jira] Commented: (SOLR-1979) Create LanguageIdentifierUpdateProcessor

2010-12-05 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12966978#action_12966978
 ] 

Robert Muir commented on SOLR-1979:
---

We really need to not be using ISO 639-1 here. 

For example, it's not expressive enough: it does not differentiate between 
Simplified and Traditional Chinese, yet SmartChineseAnalyzer only works on 
Simplified.

I would like to see RFC 3066 instead.
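
To illustrate the difference in expressiveness, here is a small sketch using java.util.Locale, which parses BCP 47 tags (the successor of RFC 3066). The class and method names are made up for the example, and nothing here is taken from Tika or from the patch.

```java
import java.util.Locale;

/** Illustrative helpers around java.util.Locale language tags. */
class LanguageTagDemo {
    /** Returns the script subtag of a language tag, or "" if the tag carries none. */
    static String scriptOf(String languageTag) {
        return Locale.forLanguageTag(languageTag).getScript();
    }

    /** Returns only the primary language subtag, which is all a bare two-letter code can express. */
    static String languageOf(String languageTag) {
        return Locale.forLanguageTag(languageTag).getLanguage();
    }
}
```

Both "zh-Hans" and "zh-Hant" reduce to the language "zh", but scriptOf() returns "Hans" and "Hant" respectively, which is the distinction needed to route only Simplified Chinese to SmartChineseAnalyzer.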




[jira] Commented: (SOLR-1979) Create LanguageIdentifierUpdateProcessor

2010-12-05 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12966972#action_12966972
 ] 

Robert Muir commented on SOLR-1979:
---

bq. cause that distance measure is kind of an internal value, not very 
normalized and is bound to change in future versions of TIKA.

bq. we can make a new isReasonablyCertain() implementation taking into account 
the relative distance between first and second candidate languages...

I don't follow the logic: if it's not very normalized, then it seems like this 
approach doesn't tell you anything. Language 1 could be uncertain and 
language 2 completely uncertain, but that tells you nothing. Isn't it like 
trying to determine whether a good Lucene search result score is "certainly a 
hit", and not really the right way to go?

For example, consider the case where the language isn't supported at all by 
Tika (I don't see a list of supported languages anywhere, by the way!).
It would be good for us to know that the detection is uncertain at all; how 
relatively uncertain it is with regard to the next language is not very 
important.

I think it's also important that we be able to get this uncertainty measure in 
a way that is agnostic of the implementation.
For example, we should be able to somehow think of chaining detectors... 

It's really important to "cheat" and not use heuristics for languages that 
don't need them.
For example, disregarding some strange theoretical/historical cases, you can 
simply look at the Unicode properties in the document to determine that it's 
in the Greek language, as it's basically the only modern language using the 
Greek alphabet.
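
As a sketch of that shortcut, assuming a hypothetical ScriptShortcut class and an arbitrary 80% threshold: count the letters whose Unicode script is Greek and answer "el" directly, returning null otherwise so that a chained statistical detector can take over.

```java
/** Hypothetical first link in a detector chain: decides by Unicode script alone. */
class ScriptShortcut {
    /** Returns "el" if most letters are in the Greek script, else null to defer to the next detector. */
    static String detectByScript(String text) {
        int letters = 0;
        int greek = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (!Character.isLetter(cp)) {
                continue; // digits, punctuation and whitespace carry no script evidence
            }
            letters++;
            if (Character.UnicodeScript.of(cp) == Character.UnicodeScript.GREEK) {
                greek++;
            }
        }
        // The 0.8 ratio is an arbitrary illustrative threshold, not a tuned value
        if (letters > 0 && greek >= 0.8 * letters) {
            return "el";
        }
        return null;
    }
}
```

A chain would call detectByScript() first and only fall back to profile-based detection when it returns null.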


> Create LanguageIdentifierUpdateProcessor
> 
>
> Key: SOLR-1979
> URL: https://issues.apache.org/jira/browse/SOLR-1979
> Project: Solr
>  Issue Type: New Feature
>  Components: update
>Reporter: Jan Høydahl
>Assignee: Grant Ingersoll
>Priority: Minor
> Attachments: SOLR-1979.patch
>
>
> We need the ability to detect language of some random text in order to act 
> upon it, such as indexing the content into language aware fields. Another 
> usecase is to be able to filter/facet on language on random unstructured 
> content.
> To do this, we wrap the Tika LanguageIdentifier in an UpdateProcessor. The 
> processor is configurable like this:
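> {code:xml}
> <processor class="org.apache.solr.update.processor.LanguageIdentifierUpdateProcessorFactory">
>   <str name="inputFields">name,subject</str>
>   <str name="outputField">language_s</str>
>   <str name="idField">id</str>
>   <str name="fallback">en</str>
> </processor>
> {code}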
> It will then read the text from inputFields name and subject, perform 
> language identification and output the ISO code for the detected language in 
> the outputField. If no language was detected, fallback language is used.




[jira] Commented: (SOLR-1979) Create LanguageIdentifierUpdateProcessor

2010-12-05 Thread Jan Høydahl (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12966970#action_12966970
 ] 

Jan Høydahl commented on SOLR-1979:
---

The idField input parameter is just used for decent logging if detection fails. 
It would be more elegant to get the id field name automatically through 
SolrCore...




[jira] Commented: (SOLR-1979) Create LanguageIdentifierUpdateProcessor

2010-12-05 Thread Jan Høydahl (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12966964#action_12966964
 ] 

Jan Høydahl commented on SOLR-1979:
---

Simply allowing the threshold for isReasonablyCertain() to be set is probably 
not enough to get robust detection, because the distance measure is very 
sensitive to the length of the profiles in use. It is therefore a bit 
dangerous to expose getDistance() as in TIKA-568, because that distance 
measure is an internal value, not very normalized, and bound to change in 
future versions of Tika.

See TIKA-369 and TIKA-496.

I think the right way to go is to solve these two issues first. Once 
getDistance() is no longer biased towards profile length, we can make a new 
isReasonablyCertain() implementation that takes into account the relative 
distance between the first and second candidate languages...
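
One possible reading of that relative-distance check, as a sketch only: the RelativeCertainty class, its signature and the minMargin parameter are assumptions made for illustration and are not Tika's API. It presumes a per-language distance map is available and that smaller distances mean better matches.

```java
import java.util.Map;

/** Hypothetical certainty check based on the gap between the two best candidates. */
class RelativeCertainty {
    /**
     * Returns true when the best (smallest) distance beats the runner-up by at least
     * minMargin, measured relative to the runner-up distance.
     */
    static boolean isReasonablyCertain(Map<String, Double> distances, double minMargin) {
        double best = Double.POSITIVE_INFINITY;
        double second = Double.POSITIVE_INFINITY;
        for (double d : distances.values()) {
            if (d < best) {
                second = best;
                best = d;
            } else if (d < second) {
                second = d;
            }
        }
        // Fewer than two candidates, or a runner-up distance of zero: no usable margin
        if (Double.isInfinite(second) || second == 0.0) {
            return false;
        }
        return (second - best) / second >= minMargin;
    }
}
```

With distances of 0.10 and 0.30 and a minMargin of 0.5 the check passes, while 0.28 against 0.30 does not. A margin like this says nothing about whether the best candidate is itself a good match, so it would not catch text in a language that has no profile at all.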




[jira] Commented: (SOLR-1979) Create LanguageIdentifierUpdateProcessor

2010-12-05 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12966955#action_12966955
 ] 

Grant Ingersoll commented on SOLR-1979:
---

See http://wiki.apache.org/solr/LanguageDetection for the start of 
documentation.

bq. isReasonablyCertain() always returns false

See TIKA-568.




[jira] Commented: (SOLR-1979) Create LanguageIdentifierUpdateProcessor

2010-08-17 Thread Jan Høydahl (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12899568#action_12899568
 ] 

Jan Høydahl commented on SOLR-1979:
---

I have implemented a first shot patch using the Tika LanguageIdentifier. It is 
unfortunately quite limited in features, and for short text segments, 
isReasonablyCertain() always returns false :( Also, the number of languages 
supported is still quite low. But it works as a start, and then we can focus on 
improving the Tika code in future releases.

I plan on putting the patch in contrib/extraction, since it depends on Tika. If 
I put it relative to main, Solr will not compile unless you put the Tika jar in 
lib. Agree?

> Create LanguageIdentifierUpdateProcessor
> 
>
> Key: SOLR-1979
> URL: https://issues.apache.org/jira/browse/SOLR-1979
> Project: Solr
>  Issue Type: New Feature
>  Components: update
>Reporter: Jan Høydahl
>Priority: Minor
>
> We need the ability to detect language of some random text in order to act 
> upon it, such as indexing the content into language aware fields. Another 
> usecase is to be able to filter/facet on language on random unstructured 
> content.
> To do this, we should wrap the [Nutch 
> LanguageIdentifier|http://nutch.apache.org/apidocs-1.1/org/apache/nutch/analysis/lang/LanguageIdentifier.html]
>  in an UpdateProcessor. The processor should be configured like this:
> {code:xml} 
>class="org.apache.solr.update.processor.LanguageIdentifierUpdateProcessorFactory">
> title,teaser,body
> language
> language_display
> 
> {code} 




[jira] Commented: (SOLR-1979) Create LanguageIdentifierUpdateProcessor

2010-06-30 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884070#action_12884070
 ] 

Chris A. Mattmann commented on SOLR-1979:
-

I would look at the Language Identifier in Tika (which is based on the Nutch 
work) as it is likely to be the one that is more maintained going forward 
IMHO...
