[jira] [Commented] (NUTCH-2414) Allow LanguageIndexingFilter to actually filter documents by language.
[ https://issues.apache.org/jira/browse/NUTCH-2414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16144357#comment-16144357 ]

Jorge Luis Betancourt Gonzalez commented on NUTCH-2414:
---

[~yossi] I think that [~markus.jel...@openindex.io] is suggesting a generic {{IndexingFilter}} that supports JEXL expressions. That way we don't need to modify every possible {{IndexingFilter}}; it will be easier to maintain in the long run and provides a better separation.

> Allow LanguageIndexingFilter to actually filter documents by language.
> --
>
> Key: NUTCH-2414
> URL: https://issues.apache.org/jira/browse/NUTCH-2414
> Project: Nutch
> Issue Type: Improvement
> Components: plugin
> Affects Versions: 1.13
> Reporter: Yossi Tamari
> Priority: Minor
>
> It is often useful to only index pages in select languages (e.g. only those
> languages that we intend to search in). At first glance it seems that this is
> done by LanguageIndexingFilter, but currently all the filter does is add the
> language as a field to the index.
> We can add a configuration property to LanguageIndexingFilter that will allow
> it to only index languages specified in this property.

--
This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (NUTCH-2414) Allow LanguageIndexingFilter to actually filter documents by language.
[ https://issues.apache.org/jira/browse/NUTCH-2414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16144330#comment-16144330 ]

Yossi Tamari commented on NUTCH-2414:
-

Markus, if I understand correctly, there are two ways to implement what you suggest:
1. Add the functionality to every indexer plugin (after all the IndexingFilters are run).
2. Write an additional IndexingFilter plugin that returns null if the JEXL expression is false. It will have to be configured to run after the other plugins that enrich the data.

Which one are you suggesting?
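
For illustration only, option 2 might look roughly like the sketch below. It is not Nutch code: {{ExpressionDocFilter}} is a hypothetical name, a plain {{Map}} stands in for the Nutch document, and a {{java.util.function.Predicate}} stands in for a compiled JEXL expression so that the example runs without extra dependencies. The one behaviour taken from the comment above is that the filter returns null when the expression is false.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical sketch: a filter meant to run last in the chain, after the
// other filters have enriched the document with fields such as "lang".
final class ExpressionDocFilter {

    // Stand-in for a compiled JEXL expression (assumption, not the JEXL API).
    private final Predicate<Map<String, Object>> expression;

    ExpressionDocFilter(Predicate<Map<String, Object>> expression) {
        this.expression = expression;
    }

    // Returns the document unchanged when the expression is true,
    // or null to signal that the document should be discarded.
    Map<String, Object> filter(Map<String, Object> doc) {
        return expression.test(doc) ? doc : null;
    }

    public static void main(String[] args) {
        ExpressionDocFilter onlyEnglish =
            new ExpressionDocFilter(d -> "en".equals(d.get("lang")));

        Map<String, Object> doc = new HashMap<>();
        doc.put("lang", "de");

        // The German document does not satisfy the expression.
        System.out.println(onlyEnglish.filter(doc) == null ? "discarded" : "kept");
    }
}
```

Because the decision lives in one place, the same filter could express a language rule, a MIME-type rule, or a combination of both by changing only the expression.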
Re: [jira] [Commented] (NUTCH-2414) Allow LanguageIndexingFilter to actually filter documents by language.
+1 This way one could have a very focused crawl/search.

On Mon, Aug 28, 2017 at 10:08 PM, Jorge Luis Betancourt Gonzalez (JIRA) <j...@apache.org> wrote:

> [ https://issues.apache.org/jira/browse/NUTCH-2414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16144264#comment-16144264 ]
>
> Jorge Luis Betancourt Gonzalez commented on NUTCH-2414:
> ---
>
> +1 This would also help to deprecate the {{mimetype-filter}} plugin
> and avoid having the responsibility of indexing/allowing/blocking documents
> (from being indexed) scattered across several plugins.
[jira] [Commented] (NUTCH-2414) Allow LanguageIndexingFilter to actually filter documents by language.
[ https://issues.apache.org/jira/browse/NUTCH-2414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16144264#comment-16144264 ]

Jorge Luis Betancourt Gonzalez commented on NUTCH-2414:
---

+1 This would also help to deprecate the {{mimetype-filter}} plugin and avoid having the responsibility of indexing/allowing/blocking documents (from being indexed) scattered across several plugins.
[jira] [Commented] (NUTCH-2414) Allow LanguageIndexingFilter to actually filter documents by language.
[ https://issues.apache.org/jira/browse/NUTCH-2414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16144237#comment-16144237 ]

Markus Jelsma commented on NUTCH-2414:
--

Although filtering on the lang field is a good idea, I think we can take this a step further if we provide a filter that supports a whole variety of JEXL expressions. If an expression evaluates to true, pass the document; otherwise discard it. This would solve your problem and give anyone a flexible means of discarding documents any way they want.
[jira] [Commented] (NUTCH-2414) Allow LanguageIndexingFilter to actually filter documents by language.
[ https://issues.apache.org/jira/browse/NUTCH-2414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16143737#comment-16143737 ]

ASF GitHub Bot commented on NUTCH-2414:
---

pipldev opened a new pull request #217: NUTCH-2414 - Allow LanguageIndexingFilter to actually filter documents by language
URL: https://github.com/apache/nutch/pull/217

Added property lang.index.languages. If it exists and is not empty, it is treated as a comma-separated list of languages to index. A document in any other language will not be indexed. "unknown" is a valid language code.

This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
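
The behaviour described for lang.index.languages can be sketched as follows. This is a minimal illustration and not the code from pull request #217: the class name {{LanguageAllowList}} is hypothetical, and trimming whitespace, lower-casing codes, and treating a missing detected language as "unknown" are assumptions that the description above does not state.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical helper: decides whether a document's language is in the
// comma-separated list taken from the lang.index.languages property.
final class LanguageAllowList {

    private final Set<String> allowed;

    LanguageAllowList(String propertyValue) {
        if (propertyValue == null || propertyValue.trim().isEmpty()) {
            // Missing or empty property: no filtering at all.
            this.allowed = Collections.emptySet();
        } else {
            // Assumption: entries are trimmed and compared case-insensitively.
            this.allowed = Arrays.stream(propertyValue.split(","))
                .map(String::trim)
                .filter(s -> !s.isEmpty())
                .map(s -> s.toLowerCase(Locale.ROOT))
                .collect(Collectors.toSet());
        }
    }

    // True when the document should be indexed.
    boolean shouldIndex(String detectedLanguage) {
        if (allowed.isEmpty()) {
            return true; // nothing configured, index everything
        }
        // Assumption: a missing detection result is treated as "unknown",
        // which the description lists as a valid language code.
        String lang = detectedLanguage == null
            ? "unknown"
            : detectedLanguage.toLowerCase(Locale.ROOT);
        return allowed.contains(lang);
    }

    public static void main(String[] args) {
        LanguageAllowList list = new LanguageAllowList("en, fr, unknown");
        System.out.println(list.shouldIndex("fr")); // fr is in the list
        System.out.println(list.shouldIndex("de")); // de is not in the list
    }
}
```

An indexing filter built on this check would return null for a document whose language fails {{shouldIndex}}, so that the document is not indexed.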
[jira] [Created] (NUTCH-2414) Allow LanguageIndexingFilter to actually filter documents by language.
Yossi Tamari created NUTCH-2414:
---

Summary: Allow LanguageIndexingFilter to actually filter documents by language.
Key: NUTCH-2414
URL: https://issues.apache.org/jira/browse/NUTCH-2414
Project: Nutch
Issue Type: Improvement
Components: plugin
Affects Versions: 1.13
Reporter: Yossi Tamari
Priority: Minor

It is often useful to only index pages in select languages (e.g. only those languages that we intend to search in). At first glance it seems that this is done by LanguageIndexingFilter, but currently all the filter does is add the language as a field to the index.
We can add a configuration property to LanguageIndexingFilter that will allow it to only index languages specified in this property.