Hi,

I'm not able to reproduce the problem, at least not with
recent master (1.12 snapshot) and the default configuration:

% bin/nutch parsechecker 'http://www.sturmgewehr.com/forums/uploads/monthly_2016_01/412098676.jpg.41e2d3562701152834b1c10b068388e3.thumb.jpg.fe9b6fad3ae9d371830b52db8c271189.jpg'
fetching: http://www.sturmgewehr.com/...
...
parsing: http://www.sturmgewehr.com/...
contentType: image/jpeg
...
Parse Metadata:
  X-Parsed-By=org.apache.tika.parser.jpeg.JpegParser
  Resolution Units=none
  File Modified Date=Thu Mar 31 23:04:11 CEST 2016
  Comments=CREATOR: gd-jpeg v1.0 (using IJG JPEG v80), quality = 75
  Compression Type=Baseline
  Data Precision=8 bits
  Number of Components=3
  tiff:ImageLength=240
  Component 2=Cb component: Quantization table 1, Sampling factors 1 horiz/1 vert
  w:comments=CREATOR: gd-jpeg v1.0 (using IJG JPEG v80), quality = 75
  Component 1=Y component: Quantization table 0, Sampling factors 2 horiz/2 vert
  Image Height=240 pixels
  X Resolution=1 dot
  Image Width=240 pixels
  File Size=10351 bytes
  Component 3=Cr component: Quantization table 1, Sampling factors 1 horiz/1 vert
  comment=CREATOR: gd-jpeg v1.0 (using IJG JPEG v80), quality = 75
  JPEG Comment=CREATOR: gd-jpeg v1.0 (using IJG JPEG v80), quality = 75
  File Name=apache-tika-8877046173076964154.tmp
  tiff:BitsPerSample=8
  tiff:ImageWidth=240
  Content-Type=image/jpeg
  Y Resolution=1 dot

Is the error reproducible with parsechecker and the same config?

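The NoSuchMethodError in the stack trace suggests a version conflict in the
commons-compress library: Tika's CompressorParser calls a
CompressorStreamFactory constructor taking a boolean, which the
commons-compress version on your classpath does not seem to provide.
But the problem starts earlier: the MIME type is already detected incorrectly
(application/gzip instead of image/jpeg).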
Which plugins are activated in nutch-site.xml?
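
To narrow down the suspected commons-compress conflict, something like the
following rough sketch could help. It only uses JDK reflection; the class name
ClasspathCheck and its two helper methods are made up for this mail and are
not part of Nutch or Tika. It prints which jar CompressorStreamFactory is
loaded from and whether it has the constructor with a single boolean
parameter, which is what <init>(Z)V in the stack trace refers to. Note the
assumption: it only inspects the classpath it is started with, and Nutch
plugins use their own class loaders, so it has to be run with the jars of the
parse-tika plugin on the classpath to be meaningful.

```java
import java.security.CodeSource;

// Hypothetical helper, not part of Nutch or Tika.
public class ClasspathCheck {

    /** Returns the location (usually a jar) a class was loaded from, or null if unknown. */
    public static String locationOf(Class<?> cls) {
        CodeSource src = cls.getProtectionDomain().getCodeSource();
        return (src == null || src.getLocation() == null)
                ? null
                : src.getLocation().toString();
    }

    /** True if the class can be loaded and declares a constructor with exactly these parameter types. */
    public static boolean hasConstructor(String className, Class<?>... paramTypes) {
        try {
            // Load without running static initializers; we only want to inspect the class.
            Class<?> cls = Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            cls.getDeclaredConstructor(paramTypes);
            return true;
        } catch (ClassNotFoundException | NoSuchMethodException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String name = "org.apache.commons.compress.compressors.CompressorStreamFactory";
        try {
            Class<?> cls = Class.forName(name, false, ClasspathCheck.class.getClassLoader());
            System.out.println("loaded from: " + locationOf(cls));
            // <init>(Z)V in the stack trace is a constructor with one boolean parameter.
            System.out.println("has constructor(boolean): " + hasConstructor(name, boolean.class));
        } catch (ClassNotFoundException e) {
            System.out.println(name + " is not on this classpath");
        }
    }
}
```

If the reported jar is not the one shipped with the parse-tika plugin, or the
boolean constructor is missing, an older commons-compress is probably
shadowing the one Tika expects.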

Sebastian

On 03/31/2016 11:40 AM, Karanjeet Singh wrote:
> Hello,
> 
> I am getting the error below *[0]* while parsing an image. It seems Tika is
> detecting the URL
> (http://www.sturmgewehr.com/forums/uploads/monthly_2016_01/412098676.jpg.41e2d3562701152834b1c10b068388e3.thumb.jpg.fe9b6fad3ae9d371830b52db8c271189.jpg)
> as application/gzip instead of image/jpeg.
> 
> Can anyone shed some light on this? Or please confirm if it is a bug. 
> Meanwhile, I will be looking
> into the code to see what is going wrong. I am working on the latest build.
> 
> *[0]*:
> 
> 2016-03-31 02:20:29,980 WARN  parse.ParseUtil - Error parsing
> http://www.sturmgewehr.com/forums/uploads/monthly_2016_01/412098676.jpg.41e2d3562701152834b1c10b068388e3.thumb.jpg.fe9b6fad3ae9d371830b52db8c271189.jpg
> with org.apache.nutch.parse.tika.TikaParser@48c56835
> 
> java.util.concurrent.ExecutionException: java.lang.NoSuchMethodError:
> org.apache.commons.compress.compressors.CompressorStreamFactory.<init>(Z)V
> 
> at java.util.concurrent.FutureTask.report(FutureTask.java:122)
> 
> at java.util.concurrent.FutureTask.get(FutureTask.java:202)
> 
> at org.apache.nutch.parse.ParseUtil.runParser(ParseUtil.java:171)
> 
> at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:95)
> 
> at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:104)
> 
> at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:45)
> 
> at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
> 
> at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:430)
> 
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
> 
> at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:243)
> 
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
> 
> at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> 
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> 
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> 
> at java.lang.Thread.run(Thread.java:745)
> 
> Caused by: java.lang.NoSuchMethodError:
> org.apache.commons.compress.compressors.CompressorStreamFactory.<init>(Z)V
> 
> at org.apache.tika.parser.pkg.CompressorParser.parse(CompressorParser.java:120)
> 
> at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:132)
> 
> at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:35)
> 
> at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:24)
> 
> ... 4 more
> 
> 2016-03-31 02:20:29,980 WARN  parse.ParseUtil - Unable to successfully parse 
> content
> http://www.sturmgewehr.com/forums/uploads/monthly_2016_01/412098676.jpg.41e2d3562701152834b1c10b068388e3.thumb.jpg.fe9b6fad3ae9d371830b52db8c271189.jpg
> of type application/gzip
> 
> 2016-03-31 02:20:29,980 WARN  parse.ParseSegment - Error parsing:
> http://www.sturmgewehr.com/forums/uploads/monthly_2016_01/412098676.jpg.41e2d3562701152834b1c10b068388e3.thumb.jpg.fe9b6fad3ae9d371830b52db8c271189.jpg:
> failed(2,200): org.apache.nutch.parse.ParseException: Unable to successfully 
> parse content
> 
> 2016-03-31 02:20:29,981 INFO  cosine.CosineSimilarity - Setting score of
> http://www.sturmgewehr.com/forums/uploads/monthly_2016_01/412098676.jpg.41e2d3562701152834b1c10b068388e3.thumb.jpg.fe9b6fad3ae9d371830b52db8c271189.jpg
> to 0.0
> 
> 2016-03-31 02:20:29,981 INFO  parse.ParseSegment - Parsed
> (19ms):http://www.sturmgewehr.com/forums/uploads/monthly_2016_01/412098676.jpg.41e2d3562701152834b1c10b068388e3.thumb.jpg.fe9b6fad3ae9d371830b52db8c271189.jpg
> 
> 
> Thanks & Regards,
> Karanjeet Singh
> CS Graduate Student
> University of Southern California
> karan...@usc.edu <mailto:karan...@usc.edu> | +1-213-675-9583 
> <tel:%2B1-213-675-9583>
