[jira] [Commented] (NUTCH-2439) Upgrade to Apache Tika 1.16

2017-10-17 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16208332#comment-16208332
 ] 

Markus Jelsma commented on NUTCH-2439:
--

No idea, but someone on Tika's user list probably knows, so I opened a thread 
there. 

> Upgrade to Apache Tika 1.16
> ---
>
> Key: NUTCH-2439
> URL: https://issues.apache.org/jira/browse/NUTCH-2439
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.13
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.14
>
> Attachments: NUTCH-2439.patch, NUTCH-2439.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (NUTCH-2439) Upgrade to Apache Tika 1.16

2017-10-17 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16208309#comment-16208309
 ] 

Sebastian Nagel commented on NUTCH-2439:


+1. Tika-core 1.16 has already slipped in as a dependency of crawler-commons 0.8.

The Tika warnings to stderr are annoying. It looks like they cannot be suppressed 
via Nutch's log4j.properties. Or is there a way?







[jira] [Commented] (NUTCH-2439) Upgrade to Apache Tika 1.16

2017-10-11 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16200495#comment-16200495
 ] 

Markus Jelsma commented on NUTCH-2439:
--

Ah, I removed slf4j-api from plugin.xml and it works. But warnings are 
logged:
{code}
fetching: https://www.sitesearch.io/
robots.txt whitelist not configured.
Oct 11, 2017 5:50:50 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: JBIG2ImageReader not loaded. jbig2 files will be ignored
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.
TIFFImageWriter not loaded. tiff files will not be processed
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.
J2KImageReader not loaded. JPEG2000 files will not be processed.
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.

Oct 11, 2017 5:50:50 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: org.xerial's sqlite-jdbc is not loaded.
Please provide the jar on your classpath to parse sqlite files.
See tika-parsers/pom.xml for the correct version.
parsing: https://www.sitesearch.io/
{code}
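
The header lines in the output above (timestamp, class, method, then "WARNING:") match the default format of java.util.logging's SimpleFormatter. If these messages really do go through java.util.logging, that would explain why log4j.properties has no effect on them. A minimal sketch of raising the level programmatically follows. It assumes the warnings are emitted on loggers named under "org.apache.tika", which is not confirmed in this thread, and the class and method names are hypothetical.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical helper, not part of Nutch or Tika.
final class QuietTikaWarnings {

    // Keep a strong reference: java.util.logging holds loggers weakly, so a
    // logger configured and then dropped may be garbage-collected and lose
    // its level.
    private static final Logger TIKA_LOGGER = Logger.getLogger("org.apache.tika");

    private QuietTikaWarnings() {
    }

    // Assumption: the warnings use loggers below "org.apache.tika". Child
    // loggers without an explicit level inherit this one, so WARNING and
    // lower are dropped while SEVERE still gets through.
    static void silence() {
        TIKA_LOGGER.setLevel(Level.SEVERE);
    }
}
```

Calling QuietTikaWarnings.silence() once before the first parse would be enough under that assumption. The same level could instead be set in a logging.properties file (org.apache.tika.level = SEVERE) passed with -Djava.util.logging.config.file, which avoids a code change.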

> Upgrade to Apache Tika 1.16
> ---
>
> Key: NUTCH-2439
> URL: https://issues.apache.org/jira/browse/NUTCH-2439
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.13
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.14
>
> Attachments: NUTCH-2439.patch
>
>



