Re: Nutch: Tika Parser error while parsing an image

2016-04-08 Thread Karanjeet Singh
Hi Sebastian,

Thanks for all your help. I have managed to solve the problem.

I was referring to the protocol-selenium code (which was originally
modified from protocol-http) and using it in the HtmlUnit plugin. It seems
there were a few cases that weren't handled properly. That is why, if you
switch the plugin to protocol-selenium and run parsechecker, you will
get the same issue.

I will update the HtmlUnit PR and raise an issue for Selenium as well.

Once again, thanks for all your help.

Regards,
Karanjeet Singh
C.S. Graduate Student
University of Southern California

On Fri, Apr 8, 2016 at 3:18 AM, Karanjeet Singh wrote:

> Hi Sebastian,
>
> Thanks for your response on this.
>
> I am developing a new plugin, protocol-htmlunit [0], for Nutch, and that is
> where I am facing this issue. Sorry I didn't mention this in my previous
> email, but I wonder how this has affected Tika content type detection.
>
> The plugin has not yet been merged into Nutch, but you can pick up the
> updates and enable the plugin on your local system to test.
>
> The image parsing error occurs for all images fetched with protocol-htmlunit,
> and yes, it does not occur when using protocol-http.
>
> Any ideas what I am doing wrong? Appreciate your help.
>
> [0]: https://github.com/apache/nutch/pull/100
>
>
> P.S.: I am also developing interactive htmlunit handlers [1] (just like
> Selenium) in case you are interested in having a look.
>
> [1]:
> https://github.com/karanjeets/FocusedCrawl-Weapons/tree/master/src/main/java/edu/usc/cs/ir/htmlunit/handler
>
>
> Thanks & Regards,
> Karanjeet Singh
> C.S. Graduate Student
> University of Southern California
>
>
> On Thu, Mar 31, 2016 at 2:19 PM, Sebastian Nagel <
> wastl.na...@googlemail.com> wrote:
>
>> Hi,
>>
>> I'm not able to reproduce the problem, at least,
>> not with recent master (1.12 snapshot) and the default configuration:
>>
>> % bin/nutch parsechecker
>> '
>> https://urldefense.proofpoint.com/v2/url?u=http-3A__www.sturmgewehr.com_forums_uploads_monthly-5F2016-5F01_412098676.jpg.41e2d3562701152834b1c10b068388e3.thumb.jpg.fe9b6fad3ae9d371830b52db8c271189.jpg=CwIDaQ=clK7kQUTWtAVEOVIgvi0NU5BOUHhpN0H8p7CSfnc_gI=u7neGGUaVmQKNSLUqJ9zpA=5vneww7ikrGBSNwAf7cR-hFaE74g2SZBfLFQv5HlefM=iiJqYT5J3YOPt_OhEY5uPQLXIloaw87EPBVFbQlZAOE=
>> '
>> fetching:
>> https://urldefense.proofpoint.com/v2/url?u=http-3A__www.sturmgewehr.com_=CwIDaQ=clK7kQUTWtAVEOVIgvi0NU5BOUHhpN0H8p7CSfnc_gI=u7neGGUaVmQKNSLUqJ9zpA=5vneww7ikrGBSNwAf7cR-hFaE74g2SZBfLFQv5HlefM=tcwDNixM_kDuk3n2rhA-viTbYxhEhyeSauPhPY5kg7w=
>> ...
>> ...
>> parsing:
>> https://urldefense.proofpoint.com/v2/url?u=http-3A__www.sturmgewehr.com_=CwIDaQ=clK7kQUTWtAVEOVIgvi0NU5BOUHhpN0H8p7CSfnc_gI=u7neGGUaVmQKNSLUqJ9zpA=5vneww7ikrGBSNwAf7cR-hFaE74g2SZBfLFQv5HlefM=tcwDNixM_kDuk3n2rhA-viTbYxhEhyeSauPhPY5kg7w=
>> ...
>> contentType: image/jpeg
>> ...
>> Parse Metadata: X-Parsed-By=org.apache.tika.parser.jpeg.JpegParser
>> Resolution Units=none File
>> Modified Date=Thu Mar 31 23:04:11 CEST 2016 Comments=CREATOR: gd-jpeg
>> v1.0 (using IJG JPEG v80),
>> quality = 75
>>  Compression Type=Baseline Data Precision=8 bits Number of Components=3
>> tiff:ImageLength=240
>> Component 2=Cb component: Quantization table 1, Sampling factors 1
>> horiz/1 vert w:comments=CREATOR:
>> gd-jpeg v1.0 (using IJG JPEG v80), quality = 75
>>  Component 1=Y component: Quantization table 0, Sampling factors 2
>> horiz/2 vert Image Height=240
>> pixels X Resolution=1 dot Image Width=240 pixels File Size=10351 bytes
>> Component 3=Cr component:
>> Quantization table 1, Sampling factors 1 horiz/1 vert comment=CREATOR:
>> gd-jpeg v1.0 (using IJG JPEG
>> v80), quality = 75
>>  JPEG Comment=CREATOR: gd-jpeg v1.0 (using IJG JPEG v80), quality = 75
>> File
>> Name=apache-tika-8877046173076964154.tmp tiff:BitsPerSample=8
>> tiff:ImageWidth=240
>> Content-Type=image/jpeg Y Resolution=1 dot
>>
>> Is the error reproducible with parsechecker and the same config?
>>
>> The stack trace may indicate a version conflict of the commons-compress
>> library.
>> But the mime type is already not properly recognized.
>> Which plugins are activated in nutch-site.xml?
>>
>> Sebastian
>>
>> On 03/31/2016 11:40 AM, Karanjeet Singh wrote:
>> > Hello,
>> >
>> > I am getting below error *[0]* while parsing an image. It seems Tika is
>> detecting the URL
>> > (
>> https://urldefense.proofpoint.com/v2/url?u=http-3A__www.sturmgewehr.com_forums_uploads_monthly-5F2016-5F01_412098676.jpg.41e2d3562701152834b1c10b068388e3.thumb.jpg.fe9b6fad3ae9d371830b52db8c271189.jpg=CwIDaQ=clK7kQUTWtAVEOVIgvi0NU5BOUHhpN0H8p7CSfnc_gI=u7neGGUaVmQKNSLUqJ9zpA=5vneww7ikrGBSNwAf7cR-hFaE74g2SZBfLFQv5HlefM=iiJqYT5J3YOPt_OhEY5uPQLXIloaw87EPBVFbQlZAOE=
>> )
>> > as application/gzip instead of an image/jpg.
>> >
>> > Can anyone shed some light on this? Or please confirm if it is a bug.
>> Meanwhile, I will be looking
>> > into the code to see what is going wrong. I am working on the latest
>> build.
>> >
>> > *[0]*:
>> >
>> > 2016-03-31 

[jira] [Updated] (NUTCH-2191) Add protocol-htmlunit

2016-04-08 Thread Chris A. Mattmann (JIRA)

 [ https://issues.apache.org/jira/browse/NUTCH-2191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated NUTCH-2191:
------------------------------------
Labels: memex  (was: )

> Add protocol-htmlunit
> ---------------------
>
>                 Key: NUTCH-2191
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2191
>             Project: Nutch
>          Issue Type: New Feature
>          Components: protocol
>    Affects Versions: 1.11
>            Reporter: Markus Jelsma
>            Assignee: Chris A. Mattmann
>              Labels: memex
>             Fix For: 1.12
>
>         Attachments: NUTCH-2191.patch, NUTCH-2191.patch, NUTCH-2191.patch, NUTCH-2191.patch
>
>
> HtmlUnit is, as opposed to other JavaScript-enabled headless browsers, a
> portable library and should therefore be better suited for very large scale
> crawls. This issue is an attempt to implement protocol-htmlunit.





Re: Nutch: Tika Parser error while parsing an image

2016-04-08 Thread Karanjeet Singh
Hi Sebastian,

Thanks for your response on this.

I am developing a new plugin, protocol-htmlunit [0], for Nutch, and that is
where I am facing this issue. Sorry I didn't mention this in my previous
email, but I wonder how this has affected Tika content type detection.
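
On how a protocol plugin can influence content type detection at all: detectors such as Tika's typically look at the leading "magic" bytes of the fetched content. A gzip stream starts with 1F 8B and a JPEG with FF D8 FF, so a body that reaches the parser still gzip-compressed would look like application/gzip whatever the URL ends in. The sketch below is only a diagnostic under that assumption. The thread does not confirm that this was the cause, and ContentSniffer and its methods are hypothetical names, not part of Nutch or Tika.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical diagnostic helper; not part of Nutch or Tika. */
final class ContentSniffer {

  /** Guesses a type from the leading magic bytes only (gzip and JPEG). */
  static String sniff(byte[] content) {
    if (content.length >= 2
        && (content[0] & 0xFF) == 0x1F && (content[1] & 0xFF) == 0x8B) {
      return "application/gzip";
    }
    if (content.length >= 3
        && (content[0] & 0xFF) == 0xFF && (content[1] & 0xFF) == 0xD8
        && (content[2] & 0xFF) == 0xFF) {
      return "image/jpeg";
    }
    return "application/octet-stream";
  }

  /** Decompresses a gzip stream held fully in memory. */
  static byte[] gunzip(byte[] compressed) {
    try (GZIPInputStream in =
             new GZIPInputStream(new ByteArrayInputStream(compressed));
         ByteArrayOutputStream out = new ByteArrayOutputStream()) {
      byte[] buf = new byte[4096];
      int n;
      while ((n = in.read(buf)) != -1) {
        out.write(buf, 0, n);
      }
      return out.toByteArray();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Compresses bytes with gzip; used here only to build a test input. */
  static byte[] gzip(byte[] raw) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
      gz.write(raw);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return out.toByteArray();
  }

  /** Sniffs the content, looking inside one layer of gzip if present. */
  static String sniffThroughGzip(byte[] content) {
    String type = sniff(content);
    return "application/gzip".equals(type) ? sniff(gunzip(content)) : type;
  }

  public static void main(String[] args) {
    // First bytes of a JPEG file, standing in for a real image.
    byte[] jpegStart = {(byte) 0xFF, (byte) 0xD8, (byte) 0xFF, (byte) 0xE0};
    byte[] stillCompressed = gzip(jpegStart);
    System.out.println(sniff(jpegStart));                  // image/jpeg
    System.out.println(sniff(stillCompressed));            // application/gzip
    System.out.println(sniffThroughGzip(stillCompressed)); // image/jpeg
  }
}
```

Running the same check on the bytes a protocol plugin actually returns for an image URL would show whether the parser is being handed the image itself or a compressed wrapper around it.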

The plugin has not yet been merged into Nutch, but you can pick up the
updates and enable the plugin on your local system to test.

The image parsing error occurs for all images fetched with protocol-htmlunit,
and yes, it does not occur when using protocol-http.

Any ideas what I am doing wrong? Appreciate your help.

[0]: https://github.com/apache/nutch/pull/100


P.S.: I am also developing interactive htmlunit handlers [1] (just like
Selenium) in case you are interested in having a look.

[1]:
https://github.com/karanjeets/FocusedCrawl-Weapons/tree/master/src/main/java/edu/usc/cs/ir/htmlunit/handler


Thanks & Regards,
Karanjeet Singh
C.S. Graduate Student
University of Southern California


On Thu, Mar 31, 2016 at 2:19 PM, Sebastian Nagel wrote:

> Hi,
>
> I'm not able to reproduce the problem, at least,
> not with recent master (1.12 snapshot) and the default configuration:
>
> % bin/nutch parsechecker
> '
> https://urldefense.proofpoint.com/v2/url?u=http-3A__www.sturmgewehr.com_forums_uploads_monthly-5F2016-5F01_412098676.jpg.41e2d3562701152834b1c10b068388e3.thumb.jpg.fe9b6fad3ae9d371830b52db8c271189.jpg=CwIDaQ=clK7kQUTWtAVEOVIgvi0NU5BOUHhpN0H8p7CSfnc_gI=u7neGGUaVmQKNSLUqJ9zpA=5vneww7ikrGBSNwAf7cR-hFaE74g2SZBfLFQv5HlefM=iiJqYT5J3YOPt_OhEY5uPQLXIloaw87EPBVFbQlZAOE=
> '
> fetching:
> https://urldefense.proofpoint.com/v2/url?u=http-3A__www.sturmgewehr.com_=CwIDaQ=clK7kQUTWtAVEOVIgvi0NU5BOUHhpN0H8p7CSfnc_gI=u7neGGUaVmQKNSLUqJ9zpA=5vneww7ikrGBSNwAf7cR-hFaE74g2SZBfLFQv5HlefM=tcwDNixM_kDuk3n2rhA-viTbYxhEhyeSauPhPY5kg7w=
> ...
> ...
> parsing:
> https://urldefense.proofpoint.com/v2/url?u=http-3A__www.sturmgewehr.com_=CwIDaQ=clK7kQUTWtAVEOVIgvi0NU5BOUHhpN0H8p7CSfnc_gI=u7neGGUaVmQKNSLUqJ9zpA=5vneww7ikrGBSNwAf7cR-hFaE74g2SZBfLFQv5HlefM=tcwDNixM_kDuk3n2rhA-viTbYxhEhyeSauPhPY5kg7w=
> ...
> contentType: image/jpeg
> ...
> Parse Metadata: X-Parsed-By=org.apache.tika.parser.jpeg.JpegParser
> Resolution Units=none File
> Modified Date=Thu Mar 31 23:04:11 CEST 2016 Comments=CREATOR: gd-jpeg v1.0
> (using IJG JPEG v80),
> quality = 75
>  Compression Type=Baseline Data Precision=8 bits Number of Components=3
> tiff:ImageLength=240
> Component 2=Cb component: Quantization table 1, Sampling factors 1 horiz/1
> vert w:comments=CREATOR:
> gd-jpeg v1.0 (using IJG JPEG v80), quality = 75
>  Component 1=Y component: Quantization table 0, Sampling factors 2 horiz/2
> vert Image Height=240
> pixels X Resolution=1 dot Image Width=240 pixels File Size=10351 bytes
> Component 3=Cr component:
> Quantization table 1, Sampling factors 1 horiz/1 vert comment=CREATOR:
> gd-jpeg v1.0 (using IJG JPEG
> v80), quality = 75
>  JPEG Comment=CREATOR: gd-jpeg v1.0 (using IJG JPEG v80), quality = 75 File
> Name=apache-tika-8877046173076964154.tmp tiff:BitsPerSample=8
> tiff:ImageWidth=240
> Content-Type=image/jpeg Y Resolution=1 dot
>
> Is the error reproducible with parsechecker and the same config?
>
> The stack trace may indicate a version conflict of the commons-compress
> library.
> But the mime type is already not properly recognized.
> Which plugins are activated in nutch-site.xml?
>
> Sebastian
>
> On 03/31/2016 11:40 AM, Karanjeet Singh wrote:
> > Hello,
> >
> > I am getting below error *[0]* while parsing an image. It seems Tika is
> detecting the URL
> > (
> https://urldefense.proofpoint.com/v2/url?u=http-3A__www.sturmgewehr.com_forums_uploads_monthly-5F2016-5F01_412098676.jpg.41e2d3562701152834b1c10b068388e3.thumb.jpg.fe9b6fad3ae9d371830b52db8c271189.jpg=CwIDaQ=clK7kQUTWtAVEOVIgvi0NU5BOUHhpN0H8p7CSfnc_gI=u7neGGUaVmQKNSLUqJ9zpA=5vneww7ikrGBSNwAf7cR-hFaE74g2SZBfLFQv5HlefM=iiJqYT5J3YOPt_OhEY5uPQLXIloaw87EPBVFbQlZAOE=
> )
> > as application/gzip instead of an image/jpg.
> >
> > Can anyone shed some light on this? Or please confirm if it is a bug.
> Meanwhile, I will be looking
> > into the code to see what is going wrong. I am working on the latest
> build.
> >
> > *[0]*:
> >
> > 2016-03-31 02:20:29,980 WARN  parse.ParseUtil - Error parsing
> >
> https://urldefense.proofpoint.com/v2/url?u=http-3A__www.sturmgewehr.com_forums_uploads_monthly-5F2016-5F01_412098676.jpg.41e2d3562701152834b1c10b068388e3.thumb.jpg.fe9b6fad3ae9d371830b52db8c271189.jpg=CwIDaQ=clK7kQUTWtAVEOVIgvi0NU5BOUHhpN0H8p7CSfnc_gI=u7neGGUaVmQKNSLUqJ9zpA=5vneww7ikrGBSNwAf7cR-hFaE74g2SZBfLFQv5HlefM=iiJqYT5J3YOPt_OhEY5uPQLXIloaw87EPBVFbQlZAOE=
> > with org.apache.nutch.parse.tika.TikaParser@48c56835
> >
> > java.util.concurrent.ExecutionException: java.lang.NoSuchMethodError:
> >
> org.apache.commons.compress.compressors.CompressorStreamFactory.<init>(Z)V
> >
> > at java.util.concurrent.FutureTask.report(FutureTask.java:122)
> >
> > at java.util.concurrent.FutureTask.get(FutureTask.java:202)
> >
> > at