[jira] [Commented] (NUTCH-2414) Allow LanguageIndexingFilter to actually filter documents by language.

2017-08-28 Thread Jorge Luis Betancourt Gonzalez (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16144357#comment-16144357
 ] 

Jorge Luis Betancourt Gonzalez commented on NUTCH-2414:
---

[~yossi] I think that [~markus.jel...@openindex.io] is suggesting implementing 
a generic {{IndexingFilter}} that supports JEXL expressions. This way we don't 
need to modify every possible {{IndexingFilter}}, which will be easier to 
maintain in the long run and provides a better separation of concerns.

> Allow LanguageIndexingFilter to actually filter documents by language.
> --
>
> Key: NUTCH-2414
> URL: https://issues.apache.org/jira/browse/NUTCH-2414
> Project: Nutch
>  Issue Type: Improvement
>  Components: plugin
>Affects Versions: 1.13
>Reporter: Yossi Tamari
>Priority: Minor
>
> It is often useful to only index pages in select languages (e.g. only those 
> languages that we intend to search in). At first glance it seems that this is 
> done by LanguageIndexingFilter, but currently all the filter does is add the 
> language as a field to the index.
> We can add a configuration property to LanguageIndexingFilter that will allow 
> it to only index languages specified in this property.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (NUTCH-2414) Allow LanguageIndexingFilter to actually filter documents by language.

2017-08-28 Thread Yossi Tamari (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16144330#comment-16144330
 ] 

Yossi Tamari commented on NUTCH-2414:
-

Markus, if I understand correctly, there are two ways to implement what you 
suggest:
1. Add the functionality to every indexer plugin (after all the IndexingFilters 
are run)
2. Write an additional IndexingFilter plugin that returns null if the JEXL 
expression is false. It will have to be configured to run after the other 
plugins that enrich the data.
Which one are you suggesting?
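
A minimal sketch of why ordering matters for option 2. This is not Nutch code: 
DocFilter, FilterChain and ChainDemo are hypothetical stand-ins, a document is 
modelled as a plain Map, and a java.util.function.Predicate takes the place of 
the JEXL expression. The only behaviour taken from the discussion is that a 
filter returning null discards the document.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

/** Hypothetical stand-in for an indexing filter: returns the document, or null to discard it. */
interface DocFilter {
    Map<String, Object> filter(Map<String, Object> doc);
}

/** Runs filters in the order they were added and stops at the first null. */
final class FilterChain {
    private final List<DocFilter> filters = new ArrayList<>();

    FilterChain add(DocFilter f) {
        filters.add(f);
        return this;
    }

    Map<String, Object> run(Map<String, Object> doc) {
        Map<String, Object> current = doc;
        for (DocFilter f : filters) {
            current = f.filter(current);
            if (current == null) {
                return null; // discarded: later filters never see the document
            }
        }
        return current;
    }
}

final class ChainDemo {
    /** An "enriching" filter: adds one field to a copy of the document. */
    static DocFilter addField(String name, Object value) {
        return doc -> {
            Map<String, Object> copy = new HashMap<>(doc);
            copy.put(name, value);
            return copy;
        };
    }

    /** A "gatekeeping" filter: keeps the document only if the condition holds. */
    static DocFilter keepIf(Predicate<Map<String, Object>> condition) {
        return doc -> condition.test(doc) ? doc : null;
    }

    public static void main(String[] args) {
        Predicate<Map<String, Object>> isEnglish = d -> "en".equals(d.get("lang"));

        // Enrichment first: the condition can see the "lang" field.
        FilterChain enrichThenCheck = new FilterChain()
            .add(addField("lang", "en"))
            .add(keepIf(isEnglish));

        // Condition first: "lang" is not set yet, so the document is dropped.
        FilterChain checkThenEnrich = new FilterChain()
            .add(keepIf(isEnglish))
            .add(addField("lang", "en"));

        System.out.println(enrichThenCheck.run(new HashMap<>()));
        System.out.println(checkThenEnrich.run(new HashMap<>()));
    }
}
```

In the second chain the condition runs before the field it tests has been 
added, which is the reason such a plugin would have to be configured to run 
last.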



Re: [jira] [Commented] (NUTCH-2414) Allow LanguageIndexingFilter to actually filter documents by language.

2017-08-28 Thread BlackIce
+1 This way one could have a very focused crawl/search

On Mon, Aug 28, 2017 at 10:08 PM, Jorge Luis Betancourt Gonzalez (JIRA) <
j...@apache.org> wrote:

>
> [ https://issues.apache.org/jira/browse/NUTCH-2414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16144264#comment-16144264 ]
>
> Jorge Luis Betancourt Gonzalez commented on NUTCH-2414:
> ---
>
> +1 This would also help to deprecate the {{mimetype-filter}} plugin
> and avoid having the responsibility of indexing/allowing/blocking documents
> (from being indexed) scattered across several plugins


[jira] [Commented] (NUTCH-2414) Allow LanguageIndexingFilter to actually filter documents by language.

2017-08-28 Thread Jorge Luis Betancourt Gonzalez (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16144264#comment-16144264
 ] 

Jorge Luis Betancourt Gonzalez commented on NUTCH-2414:
---

+1 This would also help to deprecate the {{mimetype-filter}} plugin and 
avoid having the responsibility of indexing/allowing/blocking documents (from 
being indexed) scattered across several plugins.



[jira] [Commented] (NUTCH-2414) Allow LanguageIndexingFilter to actually filter documents by language.

2017-08-28 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16144237#comment-16144237
 ] 

Markus Jelsma commented on NUTCH-2414:
--

Although filtering on the lang field is a good idea, I think we can take this a 
step further if we provide a filter that supports a whole variety of JEXL 
expressions. If an expression evaluates to true, pass the document; otherwise 
discard it.

This would solve your problem, and give anyone a flexible means of 
discarding documents any way they want.



[jira] [Commented] (NUTCH-2414) Allow LanguageIndexingFilter to actually filter documents by language.

2017-08-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16143737#comment-16143737
 ] 

ASF GitHub Bot commented on NUTCH-2414:
---

pipldev opened a new pull request #217: NUTCH-2414 - Allow 
LanguageIndexingFilter to actually filter documents by language
URL: https://github.com/apache/nutch/pull/217
 
 
   Added property lang.index.languages. If it exists and is not empty, it is 
treated as a comma-separated list of languages to index. A document in any other 
language will not be indexed.
   "unknown" is a valid language code.
 
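
A minimal sketch of the behaviour described above, not the code from the pull 
request. LanguageAllowList and its methods are hypothetical names; trimming 
and lower-casing the entries, and mapping a missing language to "unknown", are 
assumptions made for the example.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

/** Hypothetical helper that interprets a lang.index.languages style value. */
final class LanguageAllowList {
    // An empty set means "no restriction": every document is indexed.
    private final Set<String> allowed;

    LanguageAllowList(String propertyValue) {
        if (propertyValue == null || propertyValue.trim().isEmpty()) {
            this.allowed = Collections.emptySet();
        } else {
            // Split on commas, normalise each entry, drop empty entries.
            this.allowed = Arrays.stream(propertyValue.split(","))
                .map(s -> s.trim().toLowerCase(Locale.ROOT))
                .filter(s -> !s.isEmpty())
                .collect(Collectors.toSet());
        }
    }

    /** Returns true if a document with the given detected language should be indexed. */
    boolean shouldIndex(String detectedLanguage) {
        if (allowed.isEmpty()) {
            return true;
        }
        // Assumption: a missing language is treated as the code "unknown".
        String lang = detectedLanguage == null
            ? "unknown"
            : detectedLanguage.trim().toLowerCase(Locale.ROOT);
        return allowed.contains(lang);
    }

    public static void main(String[] args) {
        LanguageAllowList list = new LanguageAllowList("en, de, unknown");
        for (String lang : new String[] {"en", "fr", null}) {
            System.out.println(lang + " -> " + list.shouldIndex(lang));
        }
    }
}
```

Listing "unknown" in the property keeps documents whose language could not be 
detected; leaving it out drops them along with every other unlisted language.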

This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Created] (NUTCH-2414) Allow LanguageIndexingFilter to actually filter documents by language.

2017-08-28 Thread Yossi Tamari (JIRA)
Yossi Tamari created NUTCH-2414:
---

 Summary: Allow LanguageIndexingFilter to actually filter documents by language.
 Key: NUTCH-2414
 URL: https://issues.apache.org/jira/browse/NUTCH-2414
 Project: Nutch
  Issue Type: Improvement
  Components: plugin
Affects Versions: 1.13
Reporter: Yossi Tamari
Priority: Minor


It is often useful to only index pages in select languages (e.g. only those 
languages that we intend to search in). At first glance it seems that this is 
done by LanguageIndexingFilter, but currently all the filter does is add the 
language as a field to the index.
We can add a configuration property to LanguageIndexingFilter that will allow 
it to only index languages specified in this property.


