[ https://issues.apache.org/jira/browse/NUTCH-1980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14609671#comment-14609671 ]
Markus Jelsma commented on NUTCH-1980:
--------------------------------------

Committed to trunk in revision 1688569.

> Jexl expressions for CrawlDbReader
> ----------------------------------
>
>                 Key: NUTCH-1980
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1980
>             Project: Nutch
>          Issue Type: New Feature
>          Components: crawldb
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.11
>
>         Attachments: NUTCH-1980-1.9.patch, NUTCH-1980-1.9.patch, NUTCH-1980-1.9.patch, NUTCH-1980.patch
>
>
> We are already using Jexl expressions to filter records from HostDb dumps and
> it is really helpful when your CrawlDb is stuffed with metadata generated by
> parser filters, in our case mostly scores generated by classification plugins
> that run on text or structure.
> In the case of the HostDb, it operates on hosts only, so it is easy to
> collect a set of sites that host mostly a specific language, pornographic
> content, or just host topics that your classifiers are trained for.
> By adding this magic to the CrawlDbReader, you can get lists of actual
> records that contain the stuff you are looking for.
> Most work is already in the HostDb patch so it is easy to translate to
> individual records. Patch tomorrow, probably...

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)