subject:"\"\\\[jira\\\] \\\[Updated\\\] \\\(SOLR\\\-1979\\\) Create LanguageIdentifierUpdateProcessor\""

[jira] [Updated] (SOLR-1979) Create LanguageIdentifierUpdateProcessor

2011-10-05 Thread Updated


 [ 
https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jan Høydahl updated SOLR-1979:
--

Attachment: SOLR-1979.patch
SOLR-1979-branch_3x.patch

Added final patches which will be committed now.

> Create LanguageIdentifierUpdateProcessor
> 
>
> Key: SOLR-1979
> URL: https://issues.apache.org/jira/browse/SOLR-1979
> Project: Solr
>  Issue Type: New Feature
>  Components: contrib - LangId, update
>Reporter: Jan Høydahl
>Assignee: Jan Høydahl
>Priority: Minor
>  Labels: UpdateProcessor
> Fix For: 3.5, 4.0
>
> Attachments: SOLR-1979-branch_3x.patch, SOLR-1979.patch, 
> SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, 
> SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, 
> SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, 
> SOLR-1979.patch
>
>
> Language identification from document fields, and mapping of field names to 
> language-specific fields based on detected language.
> Wrap the Tika LanguageIdentifier in an UpdateProcessor.
> See user documentation at http://wiki.apache.org/solr/LanguageDetection

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Updated] (SOLR-1979) Create LanguageIdentifierUpdateProcessor

2011-09-29 Thread Updated


 [ 
https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jan Høydahl updated SOLR-1979:
--

Attachment: SOLR-1979.patch

New patch:
* Added contrib folders to eclipse dot.classpath
* Added javadoc entries to build.xml
* Fixed Javadoc errors
* Upgraded test case to use schema v1.4

> Create LanguageIdentifierUpdateProcessor
> 
>
> Key: SOLR-1979
> URL: https://issues.apache.org/jira/browse/SOLR-1979
> Project: Solr
>  Issue Type: New Feature
>  Components: contrib - LangId, update
>Reporter: Jan Høydahl
>Assignee: Jan Høydahl
>Priority: Minor
>  Labels: UpdateProcessor
> Fix For: 3.5, 4.0
>
> Attachments: SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, 
> SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, 
> SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, 
> SOLR-1979.patch, SOLR-1979.patch
>
>
> Language identification from document fields, and mapping of field names to 
> language-specific fields based on detected language.
> Wrap the Tika LanguageIdentifier in an UpdateProcessor.
> See user documentation at http://wiki.apache.org/solr/LanguageDetection

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Updated] (SOLR-1979) Create LanguageIdentifierUpdateProcessor

2011-09-21 Thread JIRA


 [ 
https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jan Høydahl updated SOLR-1979:
--

Attachment: SOLR-1979.patch

Fixed java.lang.IndexOutOfBoundsException bug in resolveLanguage() when no 
languages detected. Added more corner case tests.

> Create LanguageIdentifierUpdateProcessor
> 
>
> Key: SOLR-1979
> URL: https://issues.apache.org/jira/browse/SOLR-1979
> Project: Solr
>  Issue Type: New Feature
>  Components: contrib - LangId, update
>Reporter: Jan Høydahl
>Assignee: Jan Høydahl
>Priority: Minor
>  Labels: UpdateProcessor
> Fix For: 3.5, 4.0
>
> Attachments: SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, 
> SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, 
> SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, 
> SOLR-1979.patch
>
>
> Language identification from document fields, and mapping of field names to 
> language-specific fields based on detected language.
> Wrap the Tika LanguageIdentifier in an UpdateProcessor.
> See user documentation at http://wiki.apache.org/solr/LanguageDetection

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Updated] (SOLR-1979) Create LanguageIdentifierUpdateProcessor

2011-09-19 Thread JIRA


 [ 
https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jan Høydahl updated SOLR-1979:
--

Attachment: SOLR-1979.patch

Added link to Wiki in example update chain in solrconfig

> Create LanguageIdentifierUpdateProcessor
> 
>
> Key: SOLR-1979
> URL: https://issues.apache.org/jira/browse/SOLR-1979
> Project: Solr
>  Issue Type: New Feature
>  Components: contrib - LangId, update
>Reporter: Jan Høydahl
>Assignee: Jan Høydahl
>Priority: Minor
>  Labels: UpdateProcessor
> Fix For: 3.5, 4.0
>
> Attachments: SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, 
> SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, 
> SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch
>
>
> Language identification from document fields, and mapping of field names to 
> language-specific fields based on detected language.
> Wrap the Tika LanguageIdentifier in an UpdateProcessor.
> See user documentation at http://wiki.apache.org/solr/LanguageDetection

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Updated] (SOLR-1979) Create LanguageIdentifierUpdateProcessor

2011-09-19 Thread JIRA


 [ 
https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jan Høydahl updated SOLR-1979:
--

Attachment: SOLR-1979.patch

Some further improvements:
* Default fallback language if none set is now "" to avoid nullpointer exception
* All individually detected languages are now added to "langsField" array
* More tests

> Create LanguageIdentifierUpdateProcessor
> 
>
> Key: SOLR-1979
> URL: https://issues.apache.org/jira/browse/SOLR-1979
> Project: Solr
>  Issue Type: New Feature
>  Components: contrib - LangId, update
>Reporter: Jan Høydahl
>Assignee: Jan Høydahl
>Priority: Minor
>  Labels: UpdateProcessor
> Fix For: 3.5, 4.0
>
> Attachments: SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, 
> SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, 
> SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch
>
>
> Language identification from document fields, and mapping of field names to 
> language-specific fields based on detected language.
> Wrap the Tika LanguageIdentifier in an UpdateProcessor.
> See user documentation at http://wiki.apache.org/solr/LanguageDetection

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Updated] (SOLR-1979) Create LanguageIdentifierUpdateProcessor

2011-09-12 Thread JIRA


 [ 
https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jan Høydahl updated SOLR-1979:
--

Component/s: contrib - LangId

> Create LanguageIdentifierUpdateProcessor
> 
>
> Key: SOLR-1979
> URL: https://issues.apache.org/jira/browse/SOLR-1979
> Project: Solr
>  Issue Type: New Feature
>  Components: contrib - LangId, update
>Reporter: Jan Høydahl
>Assignee: Jan Høydahl
>Priority: Minor
>  Labels: UpdateProcessor
> Fix For: 3.5, 4.0
>
> Attachments: SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, 
> SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, 
> SOLR-1979.patch, SOLR-1979.patch
>
>
> Language identification from document fields, and mapping of field names to 
> language-specific fields based on detected language.
> Wrap the Tika LanguageIdentifier in an UpdateProcessor.
> See user documentation at http://wiki.apache.org/solr/LanguageDetection

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Updated] (SOLR-1979) Create LanguageIdentifierUpdateProcessor

2011-09-12 Thread JIRA


 [ 
https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jan Høydahl updated SOLR-1979:
--

  Description: 
Language identification from document fields, and mapping of field names to 
language-specific fields based on detected language.

Wrap the Tika LanguageIdentifier in an UpdateProcessor.

See user documentation at http://wiki.apache.org/solr/LanguageDetection

  was:
Language identification from document fields, and mapping of field names to 
language-specific fields based on detected language.

Wrap the Tika LanguageIdentifier in an UpdateProcessor.

Fix Version/s: 4.0

> Create LanguageIdentifierUpdateProcessor
> 
>
> Key: SOLR-1979
> URL: https://issues.apache.org/jira/browse/SOLR-1979
> Project: Solr
>  Issue Type: New Feature
>  Components: update
>Reporter: Jan Høydahl
>Assignee: Jan Høydahl
>Priority: Minor
>  Labels: UpdateProcessor
> Fix For: 3.5, 4.0
>
> Attachments: SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, 
> SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, 
> SOLR-1979.patch, SOLR-1979.patch
>
>
> Language identification from document fields, and mapping of field names to 
> language-specific fields based on detected language.
> Wrap the Tika LanguageIdentifier in an UpdateProcessor.
> See user documentation at http://wiki.apache.org/solr/LanguageDetection

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Updated] (SOLR-1979) Create LanguageIdentifierUpdateProcessor

2011-09-12 Thread JIRA


 [ 
https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jan Høydahl updated SOLR-1979:
--

Attachment: SOLR-1979.patch

New patch with these improvements:

* Now also allows config at first level, without 
* Added langid to example schema (commented out), so it is really easy to 
demonstrate

> Create LanguageIdentifierUpdateProcessor
> 
>
> Key: SOLR-1979
> URL: https://issues.apache.org/jira/browse/SOLR-1979
> Project: Solr
>  Issue Type: New Feature
>  Components: update
>Reporter: Jan Høydahl
>Assignee: Jan Høydahl
>Priority: Minor
>  Labels: UpdateProcessor
> Fix For: 3.5
>
> Attachments: SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, 
> SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, 
> SOLR-1979.patch, SOLR-1979.patch
>
>
> Language identification from document fields, and mapping of field names to 
> language-specific fields based on detected language.
> Wrap the Tika LanguageIdentifier in an UpdateProcessor.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Updated] (SOLR-1979) Create LanguageIdentifierUpdateProcessor

2011-09-12 Thread JIRA


 [ 
https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jan Høydahl updated SOLR-1979:
--

Attachment: SOLR-1979.patch

Patch updated to fit new directory structure, updated comments to point to Wiki 
doc.

Also optimized regex, now pre-compiling patterns instead of using 
String.replace directly.

> Create LanguageIdentifierUpdateProcessor
> 
>
> Key: SOLR-1979
> URL: https://issues.apache.org/jira/browse/SOLR-1979
> Project: Solr
>  Issue Type: New Feature
>  Components: update
>Reporter: Jan Høydahl
>Assignee: Jan Høydahl
>Priority: Minor
>  Labels: UpdateProcessor
> Fix For: 3.5
>
> Attachments: SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, 
> SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, 
> SOLR-1979.patch
>
>
> Language identification from document fields, and mapping of field names to 
> language-specific fields based on detected language.
> Wrap the Tika LanguageIdentifier in an UpdateProcessor.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Updated] (SOLR-1979) Create LanguageIdentifierUpdateProcessor

2011-09-09 Thread JIRA


 [ 
https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jan Høydahl updated SOLR-1979:
--

Fix Version/s: (was: 3.4)
   3.5

Moving to 3.5

> Create LanguageIdentifierUpdateProcessor
> 
>
> Key: SOLR-1979
> URL: https://issues.apache.org/jira/browse/SOLR-1979
> Project: Solr
>  Issue Type: New Feature
>  Components: update
>Reporter: Jan Høydahl
>Assignee: Jan Høydahl
>Priority: Minor
>  Labels: UpdateProcessor
> Fix For: 3.5
>
> Attachments: SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, 
> SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch
>
>
> Language identification from document fields, and mapping of field names to 
> language-specific fields based on detected language.
> Wrap the Tika LanguageIdentifier in an UpdateProcessor.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Updated] (SOLR-1979) Create LanguageIdentifierUpdateProcessor

2011-08-03 Thread JIRA


 [ 
https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jan Høydahl updated SOLR-1979:
--

Attachment: SOLR-1979.patch

Updated to latest trunk, simplified build file, added clean target

> Create LanguageIdentifierUpdateProcessor
> 
>
> Key: SOLR-1979
> URL: https://issues.apache.org/jira/browse/SOLR-1979
> Project: Solr
>  Issue Type: New Feature
>  Components: update
>Reporter: Jan Høydahl
>Assignee: Jan Høydahl
>Priority: Minor
>  Labels: UpdateProcessor
> Fix For: 3.4
>
> Attachments: SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, 
> SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch
>
>
> Language identification from document fields, and mapping of field names to 
> language-specific fields based on detected language.
> Wrap the Tika LanguageIdentifier in an UpdateProcessor.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Updated] (SOLR-1979) Create LanguageIdentifierUpdateProcessor

2011-06-26 Thread JIRA


 [ 
https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jan Høydahl updated SOLR-1979:
--

Attachment: SOLR-1979.patch

Fixed threshold so that Tika distance 0.1 gives certainty 0.5 and distance 0.02 
gives certainty 0.9. The default threshold of 0.5 now works pretty well, at 
least for the tests...

*New parameters:*
Field name mapping is now configurable to user defined pattern, so to map 
ABC_title to title_, you set:
{code}
&langid.map.pattern=ABC_(.*)
&langid.map.replace=$1_{lang}
{code}
A parameter to map multiple detected languages to same field regex. I.e. to map 
both Japanese, Korean and Chinese texts to a field *_cjk, do:
{code}langid.map.lcmap=jp:cjk zh:cjk ko:cjk{code}
Turn off validation of field names against schema (useful if you want to rename 
or delete fields later in the UpdateChain):
{code}&langid.enforceSchema=false{code}

*Other changes*
Removed default on langField, i.e. if langField is not specified, the detected 
language will not be written anywhere. A typical minimal config for only 
detecting language and writing to a field is now:
{code}

   
 title,subject,text,keywords
 language_s
   

{code}

Also added multiple other languages to the tests.

> Create LanguageIdentifierUpdateProcessor
> 
>
> Key: SOLR-1979
> URL: https://issues.apache.org/jira/browse/SOLR-1979
> Project: Solr
>  Issue Type: New Feature
>  Components: update
>Reporter: Jan Høydahl
>Assignee: Jan Høydahl
>Priority: Minor
>  Labels: UpdateProcessor
> Fix For: 3.4
>
> Attachments: SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, 
> SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch
>
>
> Language identification from document fields, and mapping of field names to 
> language-specific fields based on detected language.
> Wrap the Tika LanguageIdentifier in an UpdateProcessor.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Updated] (SOLR-1979) Create LanguageIdentifierUpdateProcessor

2011-06-26 Thread JIRA


 [ 
https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jan Høydahl updated SOLR-1979:
--

Fix Version/s: 3.4
   Labels: UpdateProcessor  (was: )

> Create LanguageIdentifierUpdateProcessor
> 
>
> Key: SOLR-1979
> URL: https://issues.apache.org/jira/browse/SOLR-1979
> Project: Solr
>  Issue Type: New Feature
>  Components: update
>Reporter: Jan Høydahl
>Assignee: Jan Høydahl
>Priority: Minor
>  Labels: UpdateProcessor
> Fix For: 3.4
>
> Attachments: SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, 
> SOLR-1979.patch, SOLR-1979.patch
>
>
> Language identification from document fields, and mapping of field names to 
> language-specific fields based on detected language.
> Wrap the Tika LanguageIdentifier in an UpdateProcessor.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Updated] (SOLR-1979) Create LanguageIdentifierUpdateProcessor

2011-06-22 Thread JIRA


 [ 
https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jan Høydahl updated SOLR-1979:
--

Description: 
Language identification from document fields, and mapping of field names to 
language-specific fields based on detected language.

Wrap the Tika LanguageIdentifier in an UpdateProcessor.

  was:
We need the ability to detect language of some random text in order to act upon 
it, such as indexing the content into language aware fields. Another usecase is 
to be able to filter/facet on language on random unstructured content.

To do this, we wrap the Tika LanguageIdentifier in an UpdateProcessor. The 
processor is configurable like this:

{code:xml} 
  
name,subject
language_s
id
en
  
{code} 

It will then read the text from inputFields name and subject, perform language 
identification and output the ISO code for the detected language in the 
outputField. If no language was detected, fallback language is used.


> Create LanguageIdentifierUpdateProcessor
> 
>
> Key: SOLR-1979
> URL: https://issues.apache.org/jira/browse/SOLR-1979
> Project: Solr
>  Issue Type: New Feature
>  Components: update
>Reporter: Jan Høydahl
>Assignee: Jan Høydahl
>Priority: Minor
> Attachments: SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, 
> SOLR-1979.patch, SOLR-1979.patch
>
>
> Language identification from document fields, and mapping of field names to 
> language-specific fields based on detected language.
> Wrap the Tika LanguageIdentifier in an UpdateProcessor.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Updated] (SOLR-1979) Create LanguageIdentifierUpdateProcessor

2011-06-22 Thread JIRA


 [ 
https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jan Høydahl updated SOLR-1979:
--

Attachment: SOLR-1979.patch

New version. Example of accepted params:

{code}
 
   
 true
 title,subject,text,keywords
 language_s
 languages
 false
 0.5
 no,en,es,dk
 true
 title,text
 false
 false
 false
 
 meta_content_language,lang
 en
   
 
{code}

The only mandatory parameter is langid.fl
To enable field name mapping, set langid.map=true. It will then map field names 
for all fields in langid.fl. If the set of fields to map is different from 
langid.fl, supply langid.map.fl. Those fields will then be renamed with a 
language suffix equal to the language detected from the langid.fl fields.

If you require detecting languages separately for each field, supply 
langid.map.individual=true. The supplied fields will then be renamed according 
to detected language on an individual basis. If the set of fields to detect 
individually is different from the already supplied langid.fl or langid.map.fl, 
supply langid.map.individual.fl. The fields listed in langid.map.individual.fl 
will then be detected individually, while the rest of the mapping fields will 
be mapped according to global document language.

> Create LanguageIdentifierUpdateProcessor
> 
>
> Key: SOLR-1979
> URL: https://issues.apache.org/jira/browse/SOLR-1979
> Project: Solr
>  Issue Type: New Feature
>  Components: update
>Reporter: Jan Høydahl
>Assignee: Jan Høydahl
>Priority: Minor
> Attachments: SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, 
> SOLR-1979.patch, SOLR-1979.patch
>
>
> We need the ability to detect language of some random text in order to act 
> upon it, such as indexing the content into language aware fields. Another 
> usecase is to be able to filter/facet on language on random unstructured 
> content.
> To do this, we wrap the Tika LanguageIdentifier in an UpdateProcessor. The 
> processor is configurable like this:
> {code:xml} 
>class="org.apache.solr.update.processor.LanguageIdentifierUpdateProcessorFactory">
> name,subject
> language_s
> id
> en
>   
> {code} 
> It will then read the text from inputFields name and subject, perform 
> language identification and output the ISO code for the detected language in 
> the outputField. If no language was detected, fallback language is used.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] Updated: (SOLR-1979) Create LanguageIdentifierUpdateProcessor

2010-12-06 Thread Grant Ingersoll (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll updated SOLR-1979:
--

Attachment: SOLR-1979.patch

Removes mentions of ISO 639.

> Create LanguageIdentifierUpdateProcessor
> 
>
> Key: SOLR-1979
> URL: https://issues.apache.org/jira/browse/SOLR-1979
> Project: Solr
>  Issue Type: New Feature
>  Components: update
>Reporter: Jan Høydahl
>Assignee: Grant Ingersoll
>Priority: Minor
> Attachments: SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, 
> SOLR-1979.patch
>
>
> We need the ability to detect language of some random text in order to act 
> upon it, such as indexing the content into language aware fields. Another 
> usecase is to be able to filter/facet on language on random unstructured 
> content.
> To do this, we wrap the Tika LanguageIdentifier in an UpdateProcessor. The 
> processor is configurable like this:
> {code:xml} 
>class="org.apache.solr.update.processor.LanguageIdentifierUpdateProcessorFactory">
> name,subject
> language_s
> id
> en
>   
> {code} 
> It will then read the text from inputFields name and subject, perform 
> language identification and output the ISO code for the detected language in 
> the outputField. If no language was detected, fallback language is used.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] Updated: (SOLR-1979) Create LanguageIdentifierUpdateProcessor

2010-12-05 Thread Grant Ingersoll (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll updated SOLR-1979:
--

Attachment: SOLR-1979.patch

Here's a patch that passes the tests.  Note, I modified the Solr base test case 
to have some new methods to properly call update handlers and then validate the 
results.

> Create LanguageIdentifierUpdateProcessor
> 
>
> Key: SOLR-1979
> URL: https://issues.apache.org/jira/browse/SOLR-1979
> Project: Solr
>  Issue Type: New Feature
>  Components: update
>Reporter: Jan Høydahl
>Assignee: Grant Ingersoll
>Priority: Minor
> Attachments: SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch
>
>
> We need the ability to detect language of some random text in order to act 
> upon it, such as indexing the content into language aware fields. Another 
> usecase is to be able to filter/facet on language on random unstructured 
> content.
> To do this, we wrap the Tika LanguageIdentifier in an UpdateProcessor. The 
> processor is configurable like this:
> {code:xml} 
>class="org.apache.solr.update.processor.LanguageIdentifierUpdateProcessorFactory">
> name,subject
> language_s
> id
> en
>   
> {code} 
> It will then read the text from inputFields name and subject, perform 
> language identification and output the ISO code for the detected language in 
> the outputField. If no language was detected, fallback language is used.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] Updated: (SOLR-1979) Create LanguageIdentifierUpdateProcessor

2010-12-05 Thread Grant Ingersoll (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll updated SOLR-1979:
--

Attachment: SOLR-1979.patch

I took Jan's and Tommaso's patches and reworked them a bit.  It seems to me 
that there isn't much point in merely identifying the language if you aren't 
going to do something about it.  So, this patch builds on what Jan and Tommaso 
did and then will remap the input fields to new per language fields (note, we 
could make this optional).  I also tried to standardize the input parameters a 
bit.  I dropped the outputField setting and a number of other settings and I 
made the language detection to be per input field.  The basic gist of it is 
that if you input two fields: name, subject, it will detect the language of 
each field and then attempt to map them to a new field.  The new field is made 
by concatenating the original field name with "_" + the ISO 639 code.  For 
example, if en is the detected language, then the new field for name would be 
name_en.  If that field doesn't exist, it will fall back to the original field 
(i.e. name).

Left to do:
# Fix the tests.  I don't like how we currently tests UpdateProcessorChains.  
It should not require writing your own little piece of update mechanism.  You 
should be able to simply setup the appropriate configuration, hook it into an 
update handler and then hit that update handler.  
# Need to check the license headers, builds, etc.

> Create LanguageIdentifierUpdateProcessor
> 
>
> Key: SOLR-1979
> URL: https://issues.apache.org/jira/browse/SOLR-1979
> Project: Solr
>  Issue Type: New Feature
>  Components: update
>Reporter: Jan Høydahl
>Assignee: Grant Ingersoll
>Priority: Minor
> Attachments: SOLR-1979.patch, SOLR-1979.patch
>
>
> We need the ability to detect language of some random text in order to act 
> upon it, such as indexing the content into language aware fields. Another 
> usecase is to be able to filter/facet on language on random unstructured 
> content.
> To do this, we wrap the Tika LanguageIdentifier in an UpdateProcessor. The 
> processor is configurable like this:
> {code:xml} 
>class="org.apache.solr.update.processor.LanguageIdentifierUpdateProcessorFactory">
> name,subject
> language_s
> id
> en
>   
> {code} 
> It will then read the text from inputFields name and subject, perform 
> language identification and output the ISO code for the detected language in 
> the outputField. If no language was detected, fallback language is used.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] Updated: (SOLR-1979) Create LanguageIdentifierUpdateProcessor

2010-12-03 Thread JIRA


 [ 
https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jan Høydahl updated SOLR-1979:
--

Description: 
We need the ability to detect language of some random text in order to act upon 
it, such as indexing the content into language aware fields. Another usecase is 
to be able to filter/facet on language on random unstructured content.

To do this, we wrap the Tika LanguageIdentifier in an UpdateProcessor. The 
processor is configurable like this:

{code:xml} 
  
name,subject
language_s
id
en
  
{code} 

It will then read the text from inputFields name and subject, perform language 
identification and output the ISO code for the detected language in the 
outputField. If no language was detected, fallback language is used.

  was:
We need the ability to detect language of some random text in order to act upon 
it, such as indexing the content into language aware fields. Another usecase is 
to be able to filter/facet on language on random unstructured content.

To do this, we should wrap the [Nutch 
LanguageIdentifier|http://nutch.apache.org/apidocs-1.1/org/apache/nutch/analysis/lang/LanguageIdentifier.html";]
 in an UpdateProcessor. The processor should be configured like this:

{code:xml} 
  
title,teaser,body
language
language_display

{code} 


> Create LanguageIdentifierUpdateProcessor
> 
>
> Key: SOLR-1979
> URL: https://issues.apache.org/jira/browse/SOLR-1979
> Project: Solr
>  Issue Type: New Feature
>  Components: update
>Reporter: Jan Høydahl
>Priority: Minor
> Attachments: SOLR-1979.patch
>
>
> We need the ability to detect language of some random text in order to act 
> upon it, such as indexing the content into language aware fields. Another 
> usecase is to be able to filter/facet on language on random unstructured 
> content.
> To do this, we wrap the Tika LanguageIdentifier in an UpdateProcessor. The 
> processor is configurable like this:
> {code:xml} 
>class="org.apache.solr.update.processor.LanguageIdentifierUpdateProcessorFactory">
> name,subject
> language_s
> id
> en
>   
> {code} 
> It will then read the text from inputFields name and subject, perform 
> language identification and output the ISO code for the detected language in 
> the outputField. If no language was detected, fallback language is used.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] Updated: (SOLR-1979) Create LanguageIdentifierUpdateProcessor

2010-12-03 Thread JIRA


 [ 
https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jan Høydahl updated SOLR-1979:
--

Attachment: SOLR-1979.patch

First raw patch implementing language identification.

> Create LanguageIdentifierUpdateProcessor
> 
>
> Key: SOLR-1979
> URL: https://issues.apache.org/jira/browse/SOLR-1979
> Project: Solr
>  Issue Type: New Feature
>  Components: update
>Reporter: Jan Høydahl
>Priority: Minor
> Attachments: SOLR-1979.patch
>
>
> We need the ability to detect language of some random text in order to act 
> upon it, such as indexing the content into language aware fields. Another 
> usecase is to be able to filter/facet on language on random unstructured 
> content.
> To do this, we should wrap the [Nutch 
> LanguageIdentifier|http://nutch.apache.org/apidocs-1.1/org/apache/nutch/analysis/lang/LanguageIdentifier.html";]
>  in an UpdateProcessor. The processor should be configured like this:
> {code:xml} 
>class="org.apache.solr.update.processor.LanguageIdentifierUpdateProcessorFactory">
> title,teaser,body
> language
> language_display
> 
> {code} 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] Updated: (SOLR-1979) Create LanguageIdentifierUpdateProcessor

2010-06-30 Thread JIRA


 [ 
https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jan Høydahl updated SOLR-1979:
--

Description: 
We need the ability to detect language of some random text in order to act upon 
it, such as indexing the content into language aware fields. Another usecase is 
to be able to filter/facet on language on random unstructured content.

To do this, we should wrap the [Nutch 
LanguageIdentifier|http://nutch.apache.org/apidocs-1.1/org/apache/nutch/analysis/lang/LanguageIdentifier.html";]
 in an UpdateProcessor. The processor should be configured like this:

{code:xml} 
  
title,teaser,body
language
language_display

{code} 

  was:
We need the ability to detect language of some random text in order to act upon 
it, such as indexing the content into language aware fields. Another usecase is 
to be able to filter/facet on language on random unstructured content.

To do this, we should wrap the [Nutch 
LanguageIdentifier|http://nutch.apache.org/apidocs-1.1/org/apache/nutch/analysis/lang/LanguageIdentifier.html";]
 in an UpdateProcessor. The processor should be configured like this:

{{monospaced}}
  
title,teaser,body
language
language_display

{{monospaced}}


> Create LanguageIdentifierUpdateProcessor
> 
>
> Key: SOLR-1979
> URL: https://issues.apache.org/jira/browse/SOLR-1979
> Project: Solr
>  Issue Type: New Feature
>  Components: update
>Reporter: Jan Høydahl
>Priority: Minor
>
> We need the ability to detect language of some random text in order to act 
> upon it, such as indexing the content into language aware fields. Another 
> usecase is to be able to filter/facet on language on random unstructured 
> content.
> To do this, we should wrap the [Nutch 
> LanguageIdentifier|http://nutch.apache.org/apidocs-1.1/org/apache/nutch/analysis/lang/LanguageIdentifier.html";]
>  in an UpdateProcessor. The processor should be configured like this:
> {code:xml} 
>class="org.apache.solr.update.processor.LanguageIdentifierUpdateProcessorFactory">
> title,teaser,body
> language
> language_display
> 
> {code} 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Updated] (SOLR-1979) Create LanguageIdentifierUpdateProcessor

[jira] [Updated] (SOLR-1979) Create LanguageIdentifierUpdateProcessor

[jira] [Updated] (SOLR-1979) Create LanguageIdentifierUpdateProcessor

[jira] [Updated] (SOLR-1979) Create LanguageIdentifierUpdateProcessor

[jira] [Updated] (SOLR-1979) Create LanguageIdentifierUpdateProcessor

[jira] [Updated] (SOLR-1979) Create LanguageIdentifierUpdateProcessor

[jira] [Updated] (SOLR-1979) Create LanguageIdentifierUpdateProcessor

[jira] [Updated] (SOLR-1979) Create LanguageIdentifierUpdateProcessor

[jira] [Updated] (SOLR-1979) Create LanguageIdentifierUpdateProcessor

[jira] [Updated] (SOLR-1979) Create LanguageIdentifierUpdateProcessor

[jira] [Updated] (SOLR-1979) Create LanguageIdentifierUpdateProcessor

[jira] [Updated] (SOLR-1979) Create LanguageIdentifierUpdateProcessor

[jira] [Updated] (SOLR-1979) Create LanguageIdentifierUpdateProcessor

[jira] [Updated] (SOLR-1979) Create LanguageIdentifierUpdateProcessor

[jira] [Updated] (SOLR-1979) Create LanguageIdentifierUpdateProcessor

[jira] Updated: (SOLR-1979) Create LanguageIdentifierUpdateProcessor

[jira] Updated: (SOLR-1979) Create LanguageIdentifierUpdateProcessor

[jira] Updated: (SOLR-1979) Create LanguageIdentifierUpdateProcessor

[jira] Updated: (SOLR-1979) Create LanguageIdentifierUpdateProcessor

[jira] Updated: (SOLR-1979) Create LanguageIdentifierUpdateProcessor

[jira] Updated: (SOLR-1979) Create LanguageIdentifierUpdateProcessor

21 matches

Site Navigation

Mail list logo

Footer information