[jira] [Commented] (NUTCH-2439) Upgrade to Apache Tika 1.16
[ https://issues.apache.org/jira/browse/NUTCH-2439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16208332#comment-16208332 ]

Markus Jelsma commented on NUTCH-2439:
--------------------------------------

No idea, but probably someone on Tika's user list will, so I opened a thread there.

> Upgrade to Apache Tika 1.16
> ---------------------------
>
>                 Key: NUTCH-2439
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2439
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 1.13
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.14
>
>         Attachments: NUTCH-2439.patch, NUTCH-2439.patch

--
This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (NUTCH-2439) Upgrade to Apache Tika 1.16
[ https://issues.apache.org/jira/browse/NUTCH-2439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16208309#comment-16208309 ]

Sebastian Nagel commented on NUTCH-2439:
----------------------------------------

+1

Tika-core 1.16 has already slipped in as a dependency of crawler-commons 0.8.

The Tika warnings to stderr are annoying. It looks like they cannot be suppressed via Nutch's log4j.properties. Or is there a way?

> Upgrade to Apache Tika 1.16
> ---------------------------
>
>                 Key: NUTCH-2439
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2439
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 1.13
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.14
>
>         Attachments: NUTCH-2439.patch, NUTCH-2439.patch
[jira] [Commented] (NUTCH-2443) Extract links from the video tag with the parse-html plugin
[ https://issues.apache.org/jira/browse/NUTCH-2443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16208251#comment-16208251 ]

Sebastian Nagel commented on NUTCH-2443:
----------------------------------------

+1

Good catch. There are actually a few more links missed, esp. in HTML5, cf. [this list of URL-value attributes|https://stackoverflow.com/questions/2725156/complete-list-of-html-tag-attributes-which-have-a-url-value]. Nevertheless +1!

> Extract links from the video tag with the parse-html plugin
> -----------------------------------------------------------
>
>                 Key: NUTCH-2443
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2443
>             Project: Nutch
>          Issue Type: Improvement
>          Components: parser, plugin
>    Affects Versions: 1.13
>            Reporter: Jorge Luis Betancourt Gonzalez
>            Assignee: Jorge Luis Betancourt Gonzalez
>            Priority: Minor
>             Fix For: 1.14
>
>
> At the moment the {{parse-html}} extracts links from the tags {{a, area,
> form}} (configurable){{, frame, iframe, script, link, img}}. Since we allow
> extracting links to binary files (images) extracting links also from the
> {{video}} tag should be supported.
[jira] [Updated] (NUTCH-2411) Index-metadata to support indexing multiple values for a field
[ https://issues.apache.org/jira/browse/NUTCH-2411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-2411:
---------------------------------
    Attachment: NUTCH-2411.patch

Don't add empty fields.

> Index-metadata to support indexing multiple values for a field
> --------------------------------------------------------------
>
>                 Key: NUTCH-2411
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2411
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 1.13
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Minor
>             Fix For: 1.14
>
>         Attachments: NUTCH-2411-1.13.patch, NUTCH-2411-1.13.patch,
> NUTCH-2411.patch, NUTCH-2411.patch
>
>
> {code}
> <property>
>   <name>index.metadata.separator</name>
>   <value></value>
>   <description>
>    Separator to use if you want to index multiple values for a given field. Leave empty to
>    treat each value as a single value.
>   </description>
> </property>
> <property>
>   <name>index.metadata.multivalued.fields</name>
>   <value></value>
>   <description>
>     Comma separated list of fields that are multi valued.
>   </description>
> </property>
> {code}
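To make the two properties above concrete, here is a minimal Java sketch of how a separator and a list of multi-valued fields could be applied to one metadata value. It is not the code from NUTCH-2411.patch: the class name MultiValueSplitter, the method valuesFor, and the trimming of each part are assumptions made purely for illustration. The only behaviour taken from the thread is that an empty separator leaves the value whole and that empty values are not added.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

/** Illustrative only; hypothetical helper, not part of Nutch or of the attached patch. */
public final class MultiValueSplitter {

  private MultiValueSplitter() {}

  /**
   * Returns the values to index for one field.
   *
   * @param field             name of the index field
   * @param rawValue          metadata value as stored, may be null
   * @param separator         value of index.metadata.separator; empty means "do not split"
   * @param multiValuedFields parsed value of index.metadata.multivalued.fields
   */
  public static List<String> valuesFor(String field, String rawValue,
      String separator, Set<String> multiValuedFields) {
    List<String> values = new ArrayList<>();
    if (rawValue == null) {
      return values;
    }
    // Split only when a separator is configured and the field is declared multi-valued.
    boolean split = !separator.isEmpty() && multiValuedFields.contains(field);
    String[] parts = split
        ? rawValue.split(Pattern.quote(separator)) // quote: the separator is a literal, not a regex
        : new String[] { rawValue };
    for (String part : parts) {
      String trimmed = part.trim(); // assumption: surrounding whitespace is not significant
      if (!trimmed.isEmpty()) {     // mirrors "Don't add empty fields."
        values.add(trimmed);
      }
    }
    return values;
  }

  public static void main(String[] args) {
    Set<String> multi = Set.of("keywords");
    // "keywords" is multi-valued, so it is split and the empty part is skipped.
    System.out.println(valuesFor("keywords", "nutch, tika,,solr", ",", multi));
    // "title" is not listed as multi-valued, so the value stays whole.
    System.out.println(valuesFor("title", "Hello, world", ",", multi));
  }
}
```

Quoting the separator keeps characters such as `|` or `.` from being interpreted as regular-expression syntax, which seems the safer default for a user-supplied configuration value.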
[jira] [Commented] (NUTCH-2443) Extract links from the video tag with the parse-html plugin
[ https://issues.apache.org/jira/browse/NUTCH-2443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16207778#comment-16207778 ]

ASF GitHub Bot commented on NUTCH-2443:
---------------------------------------

jorgelbg opened a new pull request #230: NUTCH-2443 add source tag to the parse-html and parse-tika outlink ex…
URL: https://github.com/apache/nutch/pull/230

Add support for the `video`/`source` tag in the outlink extractor of the `parse-html` and `parse-tika` plugin.

This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at: us...@infra.apache.org

> Extract links from the video tag with the parse-html plugin
> -----------------------------------------------------------
>
>                 Key: NUTCH-2443
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2443
>             Project: Nutch
>          Issue Type: Improvement
>          Components: parser, plugin
>    Affects Versions: 1.13
>            Reporter: Jorge Luis Betancourt Gonzalez
>            Assignee: Jorge Luis Betancourt Gonzalez
>            Priority: Minor
>             Fix For: 1.14
>
>
> At the moment the {{parse-html}} extracts links from the tags {{a, area,
> form}} (configurable){{, frame, iframe, script, link, img}}. Since we allow
> extracting links to binary files (images) extracting links also from the
> {{video}} tag should be supported.
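As a rough illustration of what extracting links from `video` and `source` tags involves, the following Java sketch walks a DOM tree and collects the `src` attribute of those two elements. It is not the code from pull request #230 and does not use any Nutch class: the class name MediaLinkSketch, the method extractSrcLinks, and the choice of the JDK XML parser (which only accepts well-formed markup) are all assumptions made for the example.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

/** Illustrative only; hypothetical class, not the parse-html or parse-tika outlink extractor. */
public final class MediaLinkSketch {

  // Tags whose "src" attribute is treated as a link in this sketch.
  private static final Set<String> MEDIA_TAGS = Set.of("video", "source");

  private MediaLinkSketch() {}

  /** Parses well-formed (X)HTML and returns the src values of video/source tags in document order. */
  public static List<String> extractSrcLinks(String xhtml) throws Exception {
    Document doc = DocumentBuilderFactory.newInstance()
        .newDocumentBuilder()
        .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
    List<String> links = new ArrayList<>();
    collect(doc.getDocumentElement(), links);
    return links;
  }

  // Depth-first walk: check the node itself, then recurse into its children.
  private static void collect(Node node, List<String> links) {
    if (node.getNodeType() == Node.ELEMENT_NODE) {
      Element element = (Element) node;
      String tag = element.getTagName().toLowerCase(Locale.ROOT);
      if (MEDIA_TAGS.contains(tag) && element.hasAttribute("src")) {
        String src = element.getAttribute("src").trim();
        if (!src.isEmpty()) {
          links.add(src);
        }
      }
    }
    NodeList children = node.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      collect(children.item(i), links);
    }
  }

  public static void main(String[] args) throws Exception {
    String page = "<html><body><video src=\"a.mp4\">"
        + "<source src=\"b.webm\"/></video><img src=\"c.png\"/></body></html>";
    // Only the video and source links are collected; img is ignored by this sketch.
    System.out.println(extractSrcLinks(page));
  }
}
```

A real crawler would additionally resolve each value against the page's base URL and would need a lenient HTML parser, since pages in the wild are rarely well-formed XML.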
[jira] [Created] (NUTCH-2443) Extract links from the video tag with the parse-html plugin
Jorge Luis Betancourt Gonzalez created NUTCH-2443:
-----------------------------------------------------

             Summary: Extract links from the video tag with the parse-html plugin
                 Key: NUTCH-2443
                 URL: https://issues.apache.org/jira/browse/NUTCH-2443
             Project: Nutch
          Issue Type: Improvement
          Components: parser, plugin
    Affects Versions: 1.13
            Reporter: Jorge Luis Betancourt Gonzalez
            Assignee: Jorge Luis Betancourt Gonzalez
            Priority: Minor
             Fix For: 1.14


At the moment the {{parse-html}} extracts links from the tags {{a, area, form}} (configurable){{, frame, iframe, script, link, img}}. Since we allow extracting links to binary files (images) extracting links also from the {{video}} tag should be supported.
[jira] [Commented] (NUTCH-2442) Injector to stop if job fails to avoid loss of CrawlDb
[ https://issues.apache.org/jira/browse/NUTCH-2442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16207219#comment-16207219 ]

Sebastian Nagel commented on NUTCH-2442:
----------------------------------------

Actually, it's a couple of jobs based on the new MapReduce API which do not check the return value of {{job.waitForCompletion(true)}} or call {{job.isSuccessful()}}. Cf. also the discussion in [NUTCH-2375|https://issues.apache.org/jira/browse/NUTCH-2375?focusedCommentId=16184721=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16184721].

I'm working on a fix for the existing jobs (Injector, SitemapProcessor, ReadHostDb, and 3 classes in o.a.nutch.util).

> Injector to stop if job fails to avoid loss of CrawlDb
> ------------------------------------------------------
>
>                 Key: NUTCH-2442
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2442
>             Project: Nutch
>          Issue Type: Bug
>          Components: injector
>    Affects Versions: 1.13
>            Reporter: Sebastian Nagel
>            Priority: Critical
>             Fix For: 1.14
>
>
> Injector does not check whether the MapReduce job is successful. Even if the
> job fails, it
> - installs the CrawlDb
> -- move current/ to old/
> -- replace current/ with an empty or potentially incomplete version
> - exits with code 0 so that scripts running the crawl workflow cannot detect
> the failure
> -- if Injector is run a second time the CrawlDb is lost (both current/ and
> old/ are empty or corrupted)
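The failure mode described in this issue can be sketched in plain Java as follows. This is not the Injector code and not the eventual fix: RunnableJob is a hypothetical stand-in for Hadoop's Job (only the boolean result of waitForCompletion(true) is modelled), and the class InstallAfterJobSketch, the method runAndInstall, and the use of the local filesystem instead of HDFS are assumptions for illustration. The point it shows is that current/ is rotated to old/ only after the job has reported success.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Illustrative only; hypothetical class, not part of Nutch. */
public final class InstallAfterJobSketch {

  /** Hypothetical stand-in for a MapReduce job; mirrors the boolean result of waitForCompletion. */
  public interface RunnableJob {
    boolean waitForCompletion(boolean verbose) throws Exception;
  }

  private InstallAfterJobSketch() {}

  /** Runs the job and installs its output as crawlDb/current only if the job succeeded. */
  public static void runAndInstall(RunnableJob job, Path crawlDb, Path tempOutput)
      throws Exception {
    boolean success = job.waitForCompletion(true);
    if (!success) {
      // Clean up the partial output and leave current/ and old/ untouched.
      deleteRecursively(tempOutput);
      throw new IOException("Job failed, CrawlDb left unchanged: " + crawlDb);
    }
    Path current = crawlDb.resolve("current");
    Path old = crawlDb.resolve("old");
    deleteRecursively(old);
    if (Files.exists(current)) {
      Files.move(current, old);     // move current/ to old/
    }
    Files.move(tempOutput, current); // install the complete job output as current/
  }

  // Deletes a directory tree, children before parents; does nothing if the path is absent.
  private static void deleteRecursively(Path root) throws IOException {
    if (!Files.exists(root)) {
      return;
    }
    List<Path> paths;
    try (Stream<Path> walk = Files.walk(root)) {
      paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
    }
    for (Path path : paths) {
      Files.delete(path);
    }
  }
}
```

Throwing on failure also gives the calling tool a chance to exit with a non-zero code, so that scripts running the crawl workflow can detect the problem.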
[jira] [Created] (NUTCH-2442) Injector to stop if job fails to avoid loss of CrawlDb
Sebastian Nagel created NUTCH-2442:
--------------------------------------

             Summary: Injector to stop if job fails to avoid loss of CrawlDb
                 Key: NUTCH-2442
                 URL: https://issues.apache.org/jira/browse/NUTCH-2442
             Project: Nutch
          Issue Type: Bug
          Components: injector
    Affects Versions: 1.13
            Reporter: Sebastian Nagel
            Priority: Critical
             Fix For: 1.14


Injector does not check whether the MapReduce job is successful. Even if the job fails, it
- installs the CrawlDb
-- move current/ to old/
-- replace current/ with an empty or potentially incomplete version
- exits with code 0 so that scripts running the crawl workflow cannot detect the failure
-- if Injector is run a second time the CrawlDb is lost (both current/ and old/ are empty or corrupted)