[jira] [Commented] (NUTCH-2439) Upgrade to Apache Tika 1.16

2017-10-17 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16208332#comment-16208332
 ] 

Markus Jelsma commented on NUTCH-2439:
--

No idea, but someone on Tika's user list probably will, so I opened a thread 
there. 

> Upgrade to Apache Tika 1.16
> ---
>
> Key: NUTCH-2439
> URL: https://issues.apache.org/jira/browse/NUTCH-2439
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.13
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.14
>
> Attachments: NUTCH-2439.patch, NUTCH-2439.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (NUTCH-2439) Upgrade to Apache Tika 1.16

2017-10-17 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16208309#comment-16208309
 ] 

Sebastian Nagel commented on NUTCH-2439:


+1. Tika-core 1.16 has already slipped in as a dependency of crawler-commons 0.8.

The Tika warnings to stderr are annoying. It looks like they cannot be suppressed 
via Nutch's log4j.properties. Or is there a way?







[jira] [Commented] (NUTCH-2443) Extract links from the video tag with the parse-html plugin

2017-10-17 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16208251#comment-16208251
 ] 

Sebastian Nagel commented on NUTCH-2443:


+1 Good catch. A few more links are actually missed, especially in HTML5, cf. 
[this list of URL-value 
attributes|https://stackoverflow.com/questions/2725156/complete-list-of-html-tag-attributes-which-have-a-url-value].
Nevertheless +1!

> Extract links from the video tag with the parse-html plugin
> ---
>
> Key: NUTCH-2443
> URL: https://issues.apache.org/jira/browse/NUTCH-2443
> Project: Nutch
>  Issue Type: Improvement
>  Components: parser, plugin
>Affects Versions: 1.13
>Reporter: Jorge Luis Betancourt Gonzalez
>Assignee: Jorge Luis Betancourt Gonzalez
>Priority: Minor
> Fix For: 1.14
>
>
> At the moment the {{parse-html}} plugin extracts links from the tags {{a, area, 
> form}} (configurable){{, frame, iframe, script, link, img}}. Since we already 
> allow extracting links to binary files (images), extracting links from the 
> {{video}} tag should also be supported.





[jira] [Updated] (NUTCH-2411) Index-metadata to support indexing multiple values for a field

2017-10-17 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-2411:
-
Attachment: NUTCH-2411.patch

Don't add empty fields.

> Index-metadata to support indexing multiple values for a field 
> ---
>
> Key: NUTCH-2411
> URL: https://issues.apache.org/jira/browse/NUTCH-2411
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.13
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.14
>
> Attachments: NUTCH-2411-1.13.patch, NUTCH-2411-1.13.patch, 
> NUTCH-2411.patch, NUTCH-2411.patch
>
>
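> {code}
> <property>
>   <name>index.metadata.separator</name>
>   <value></value>
>   <description>
>    Separator to use if you want to index multiple values for a given field. 
>    Leave empty to treat each value as a single value.
>   </description>
> </property>
> <property>
>   <name>index.metadata.multivalued.fields</name>
>   <value></value>
>   <description>
>     Comma separated list of fields that are multi valued.
>   </description>
> </property>
> {code}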
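
To illustrate how the two properties above could interact, here is a minimal Java sketch. It is not the code from the attached patch: the class name {{MultiValueSketch}} and the method {{valuesFor}} are hypothetical, and treating the separator as a literal string (rather than a regular expression) is an assumption. It splits a raw metadata value only when a separator is configured and the field is listed as multi-valued, and it skips empty values.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical sketch, not the index-metadata plugin itself.
class MultiValueSketch {

  /**
   * Returns the values to index for one field.
   *
   * @param field             name of the index field
   * @param raw               raw metadata value, possibly holding several values
   * @param separator         value of index.metadata.separator (empty or null: no splitting)
   * @param multiValuedFields fields listed in index.metadata.multivalued.fields
   */
  public static List<String> valuesFor(String field, String raw, String separator,
      Set<String> multiValuedFields) {
    List<String> values = new ArrayList<>();
    if (raw == null) {
      return values;
    }
    boolean split = separator != null && !separator.isEmpty()
        && multiValuedFields.contains(field);
    // Assumption: the separator is a literal string, so quote it for split().
    String[] parts = split ? raw.split(Pattern.quote(separator)) : new String[] { raw };
    for (String part : parts) {
      String trimmed = part.trim();
      // Do not add empty fields.
      if (!trimmed.isEmpty()) {
        values.add(trimmed);
      }
    }
    return values;
  }
}
```

With {{keywords}} configured as multi-valued and {{,}} as separator, the value {{nutch, tika,, solr}} would yield three values; a field that is not listed keeps its value as a single entry.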





[jira] [Commented] (NUTCH-2443) Extract links from the video tag with the parse-html plugin

2017-10-17 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16207778#comment-16207778
 ] 

ASF GitHub Bot commented on NUTCH-2443:
---

jorgelbg opened a new pull request #230: NUTCH-2443 add source tag to the 
parse-html and parse-tika outlink ex…
URL: https://github.com/apache/nutch/pull/230
 
 
   Add support for the `video`/`source` tags in the outlink extractors of the 
`parse-html` and `parse-tika` plugins. 
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org







[jira] [Created] (NUTCH-2443) Extract links from the video tag with the parse-html plugin

2017-10-17 Thread Jorge Luis Betancourt Gonzalez (JIRA)
Jorge Luis Betancourt Gonzalez created NUTCH-2443:
-

 Summary: Extract links from the video tag with the parse-html 
plugin
 Key: NUTCH-2443
 URL: https://issues.apache.org/jira/browse/NUTCH-2443
 Project: Nutch
  Issue Type: Improvement
  Components: parser, plugin
Affects Versions: 1.13
Reporter: Jorge Luis Betancourt Gonzalez
Assignee: Jorge Luis Betancourt Gonzalez
Priority: Minor
 Fix For: 1.14


At the moment the {{parse-html}} plugin extracts links from the tags {{a, area, 
form}} (configurable){{, frame, iframe, script, link, img}}. Since we already 
allow extracting links to binary files (images), extracting links from the 
{{video}} tag should also be supported.
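
The idea can be sketched as a lookup table from tag name to the attribute that carries the URL, extended with {{video}} and {{source}}. The Java below is only an illustration under stated assumptions: the class {{OutlinkSketch}} and its method are hypothetical, it uses the JDK's XML parser and therefore only handles well-formed XHTML, and it does not reproduce the plugin's actual extraction code or its configurable handling of {{form}}.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical sketch, not the parse-html outlink extractor.
class OutlinkSketch {

  // Tag name -> attribute holding the link target.
  private static final Map<String, String> LINK_ATTRS = new HashMap<>();
  static {
    LINK_ATTRS.put("a", "href");
    LINK_ATTRS.put("area", "href");
    LINK_ATTRS.put("form", "action");
    LINK_ATTRS.put("frame", "src");
    LINK_ATTRS.put("iframe", "src");
    LINK_ATTRS.put("script", "src");
    LINK_ATTRS.put("link", "href");
    LINK_ATTRS.put("img", "src");
    // The additions discussed in this issue.
    LINK_ATTRS.put("video", "src");
    LINK_ATTRS.put("source", "src");
  }

  /** Collects link targets in document order from a well-formed XHTML string. */
  public static List<String> extractLinks(String xhtml) {
    try {
      Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new InputSource(new StringReader(xhtml)));
      NodeList all = doc.getElementsByTagName("*");
      List<String> links = new ArrayList<>();
      for (int i = 0; i < all.getLength(); i++) {
        Element el = (Element) all.item(i);
        String attr = LINK_ATTRS.get(el.getTagName().toLowerCase(Locale.ROOT));
        if (attr != null && el.hasAttribute(attr)) {
          links.add(el.getAttribute(attr));
        }
      }
      return links;
    } catch (Exception e) {
      throw new IllegalArgumentException("Could not parse input as XHTML", e);
    }
  }
}
```

A table-driven approach keeps the change small: supporting another tag is one more map entry rather than new traversal logic.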





[jira] [Commented] (NUTCH-2442) Injector to stop if job fails to avoid loss of CrawlDb

2017-10-17 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16207219#comment-16207219
 ] 

Sebastian Nagel commented on NUTCH-2442:


Actually, several jobs based on the new MapReduce API neither check the return 
value of {{job.waitForCompletion(true)}} nor call {{job.isSuccessful()}}. Cf. 
also the discussion in 
[NUTCH-2375|https://issues.apache.org/jira/browse/NUTCH-2375?focusedCommentId=16184721=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16184721].
I'm working on a fix for the existing jobs (Injector, SitemapProcessor, 
ReadHostDb, and 3 classes in o.a.nutch.util).
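
The missing check could look roughly like the Java sketch below. To keep it self-contained it does not depend on Hadoop: the nested interface {{MapReduceJob}} is a hypothetical stand-in for {{org.apache.hadoop.mapreduce.Job}}, whose {{waitForCompletion(boolean)}} returns {{true}} only if the job succeeded. The class and method names are illustrative and not taken from the eventual fix.

```java
// Hypothetical sketch of checking a job's result; not the actual Nutch fix.
class JobResultCheck {

  /** Stand-in for org.apache.hadoop.mapreduce.Job (assumption for this sketch). */
  public interface MapReduceJob {
    // The real method throws IOException, InterruptedException and
    // ClassNotFoundException; a single Exception keeps the sketch short.
    boolean waitForCompletion(boolean verbose) throws Exception;
  }

  /**
   * Runs the job and maps its outcome to a process exit code:
   * 0 on success, 1 if the job failed or threw.
   */
  public static int runAndReport(MapReduceJob job) {
    try {
      boolean success = job.waitForCompletion(true);
      if (!success) {
        System.err.println("Job did not succeed, skipping installation of its output");
        return 1;
      }
      return 0;
    } catch (Exception e) {
      if (e instanceof InterruptedException) {
        // Preserve the interrupt status for callers.
        Thread.currentThread().interrupt();
      }
      System.err.println("Job failed with exception: " + e);
      return 1;
    }
  }
}
```

A tool's {{main}} method could then pass the result to {{System.exit(...)}}, so that scripts driving the crawl can detect the failure.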

> Injector to stop if job fails to avoid loss of CrawlDb
> --
>
> Key: NUTCH-2442
> URL: https://issues.apache.org/jira/browse/NUTCH-2442
> Project: Nutch
>  Issue Type: Bug
>  Components: injector
>Affects Versions: 1.13
>Reporter: Sebastian Nagel
>Priority: Critical
> Fix For: 1.14
>
>
> Injector does not check whether the MapReduce job is successful. Even if the 
> job fails, it
> - installs the CrawlDb
> -- moves current/ to old/
> -- replaces current/ with an empty or potentially incomplete version
> - exits with code 0, so scripts running the crawl workflow cannot detect 
> the failure. If Injector is then run a second time, the CrawlDb is lost (both 
> current/ and old/ are empty or corrupted).





[jira] [Created] (NUTCH-2442) Injector to stop if job fails to avoid loss of CrawlDb

2017-10-17 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-2442:
--

 Summary: Injector to stop if job fails to avoid loss of CrawlDb
 Key: NUTCH-2442
 URL: https://issues.apache.org/jira/browse/NUTCH-2442
 Project: Nutch
  Issue Type: Bug
  Components: injector
Affects Versions: 1.13
Reporter: Sebastian Nagel
Priority: Critical
 Fix For: 1.14


Injector does not check whether the MapReduce job is successful. Even if the 
job fails, it
- installs the CrawlDb
-- moves current/ to old/
-- replaces current/ with an empty or potentially incomplete version
- exits with code 0, so scripts running the crawl workflow cannot detect 
the failure. If Injector is then run a second time, the CrawlDb is lost (both 
current/ and old/ are empty or corrupted).
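
The intended behaviour, rotating current/ to old/ only after a successful job, can be sketched as follows. This is an illustration under assumptions, not Injector's code: it uses {{java.nio.file}} on a local file system instead of the Hadoop {{FileSystem}} API, and the class {{CrawlDbInstallSketch}} and its method are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

// Hypothetical sketch of a guarded CrawlDb install; not the actual Nutch code.
class CrawlDbInstallSketch {

  /**
   * Installs newDb as crawlDb/current only if the job succeeded.
   * On failure the job output is discarded and current/ and old/ stay untouched.
   *
   * @return true if the new version was installed
   */
  public static boolean installIfSuccessful(boolean jobSucceeded, Path crawlDb, Path newDb) {
    try {
      if (!jobSucceeded) {
        deleteRecursively(newDb);
        return false;
      }
      Path current = crawlDb.resolve("current");
      Path old = crawlDb.resolve("old");
      Files.createDirectories(crawlDb);
      if (Files.exists(current)) {
        deleteRecursively(old);
        Files.move(current, old);
      }
      // Assumes newDb lives on the same file store, so the move is a rename.
      Files.move(newDb, current);
      return true;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  private static void deleteRecursively(Path dir) throws IOException {
    if (!Files.exists(dir)) {
      return;
    }
    try (Stream<Path> walk = Files.walk(dir)) {
      // Delete children before their parent directories.
      walk.sorted(Comparator.reverseOrder()).forEach(p -> {
        try {
          Files.delete(p);
        } catch (IOException e) {
          throw new UncheckedIOException(e);
        }
      });
    }
  }
}
```

A caller would exit with a non-zero code when this returns {{false}}, so that a failed run can neither replace current/ nor, on a later run, push a broken version into old/.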



