[ https://issues.apache.org/jira/browse/NUTCH-1980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14609671#comment-14609671 ]

Markus Jelsma commented on NUTCH-1980:
--------------------------------------

Committed to trunk in revision 1688569.

> Jexl expressions for CrawlDbReader
> ----------------------------------
>
>                 Key: NUTCH-1980
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1980
>             Project: Nutch
>          Issue Type: New Feature
>          Components: crawldb
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.11
>
>         Attachments: NUTCH-1980-1.9.patch, NUTCH-1980-1.9.patch, 
> NUTCH-1980-1.9.patch, NUTCH-1980.patch
>
>
> We already use Jexl expressions to filter records from HostDb dumps. This is 
> really helpful when your CrawlDb is full of metadata generated by parser 
> filters, in our case mostly scores generated by classification plugins that 
> run on text or structure.
> The HostDb operates on hosts only, so it is easy to collect a set of sites 
> that mostly host a specific language, pornographic content, or topics that 
> your classifiers are trained for.
> Adding the same expression support to the CrawlDbReader lets you get lists 
> of the actual records that match what you are looking for.
> Most of the work is already in the HostDb patch, so it is easy to translate 
> to individual records. Patch tomorrow, probably...
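
The idea in the description above, keeping only the CrawlDb records whose status and metadata scores satisfy an expression, can be sketched roughly as follows. This is a minimal illustration, not the code from the patch: it uses a plain Java predicate in place of a real Jexl engine, and the class `CrawlDbFilterSketch`, the `CrawlEntry` type, the `lang_score` metadata key and the sample URLs are all hypothetical names chosen for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

public class CrawlDbFilterSketch {

    /** Hypothetical flattened view of one record: its URL plus named fields. */
    public static final class CrawlEntry {
        public final String url;
        public final Map<String, Object> fields;

        public CrawlEntry(String url, Map<String, Object> fields) {
            this.url = url;
            this.fields = fields;
        }
    }

    /** Keeps only the entries whose fields satisfy the given expression. */
    public static List<CrawlEntry> filter(List<CrawlEntry> entries,
                                          Predicate<Map<String, Object>> expr) {
        List<CrawlEntry> kept = new ArrayList<>();
        for (CrawlEntry e : entries) {
            if (expr.test(e.fields)) {
                kept.add(e);
            }
        }
        return kept;
    }

    /** Reads a numeric field; a missing or non-numeric value counts as 0.0. */
    public static double number(Map<String, Object> fields, String key) {
        Object v = fields.get(key);
        return (v instanceof Number) ? ((Number) v).doubleValue() : 0.0;
    }

    /** Builds one sample entry with a status and an assumed classifier score. */
    private static CrawlEntry entry(String url, String status, double langScore) {
        Map<String, Object> fields = new HashMap<>();
        fields.put("status", status);
        fields.put("lang_score", langScore); // hypothetical metadata key
        return new CrawlEntry(url, fields);
    }

    /** Made-up sample data standing in for a CrawlDb dump. */
    public static List<CrawlEntry> sample() {
        List<CrawlEntry> entries = new ArrayList<>();
        entries.add(entry("http://example.org/a", "db_fetched", 0.93));
        entries.add(entry("http://example.org/b", "db_fetched", 0.12));
        entries.add(entry("http://example.org/c", "db_unfetched", 0.95));
        return entries;
    }

    public static void main(String[] args) {
        // Stand-in for an expression such as:
        //   status == 'db_fetched' && lang_score > 0.8
        Predicate<Map<String, Object>> expr = f ->
                "db_fetched".equals(f.get("status")) && number(f, "lang_score") > 0.8;

        for (CrawlEntry e : filter(sample(), expr)) {
            System.out.println(e.url);
        }
    }
}
```

In a real setup the predicate would be replaced by an expression string evaluated against each record's fields, which is what makes the filter usable from the command line without recompiling.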



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
