Hi, I'm not able to reproduce the problem, at least, not with recent master (1.12 snapshot) and the default configuration:

% bin/nutch parsechecker 'http://www.sturmgewehr.com/forums/uploads/monthly_2016_01/412098676.jpg.41e2d3562701152834b1c10b068388e3.thumb.jpg.fe9b6fad3ae9d371830b52db8c271189.jpg'
fetching: http://www.sturmgewehr.com/...
...
parsing: http://www.sturmgewehr.com/...
contentType: image/jpeg
...
Parse Metadata:
  X-Parsed-By=org.apache.tika.parser.jpeg.JpegParser
  Resolution Units=none
  File Modified Date=Thu Mar 31 23:04:11 CEST 2016
  Comments=CREATOR: gd-jpeg v1.0 (using IJG JPEG v80), quality = 75
  Compression Type=Baseline
  Data Precision=8 bits
  Number of Components=3
  tiff:ImageLength=240
  Component 2=Cb component: Quantization table 1, Sampling factors 1 horiz/1 vert
  w:comments=CREATOR: gd-jpeg v1.0 (using IJG JPEG v80), quality = 75
  Component 1=Y component: Quantization table 0, Sampling factors 2 horiz/2 vert
  Image Height=240 pixels
  X Resolution=1 dot
  Image Width=240 pixels
  File Size=10351 bytes
  Component 3=Cr component: Quantization table 1, Sampling factors 1 horiz/1 vert
  comment=CREATOR: gd-jpeg v1.0 (using IJG JPEG v80), quality = 75
  JPEG Comment=CREATOR: gd-jpeg v1.0 (using IJG JPEG v80), quality = 75
  File Name=apache-tika-8877046173076964154.tmp
  tiff:BitsPerSample=8
  tiff:ImageWidth=240
  Content-Type=image/jpeg
  Y Resolution=1 dot

Is the error reproducible with parsechecker and the same config?

The stack trace may indicate a version conflict of the commons-compress library.
But the mime type is already not properly recognized. Which plugins are activated
in nutch-site.xml?

Sebastian

On 03/31/2016 11:40 AM, Karanjeet Singh wrote:
> Hello,
>
> I am getting below error *[0]* while parsing an image. It seems Tika is
> detecting the URL
> (http://www.sturmgewehr.com/forums/uploads/monthly_2016_01/412098676.jpg.41e2d3562701152834b1c10b068388e3.thumb.jpg.fe9b6fad3ae9d371830b52db8c271189.jpg)
> as application/gzip instead of an image/jpg.
>
> Can anyone shed some light on this? Or please confirm if it is a bug.
> Meanwhile, I will be looking
> into the code to see what is going wrong. I am working on the latest build.
>
> *[0]*:
>
> 2016-03-31 02:20:29,980 WARN parse.ParseUtil - Error parsing
> http://www.sturmgewehr.com/forums/uploads/monthly_2016_01/412098676.jpg.41e2d3562701152834b1c10b068388e3.thumb.jpg.fe9b6fad3ae9d371830b52db8c271189.jpg
> with org.apache.nutch.parse.tika.TikaParser@48c56835
>
> java.util.concurrent.ExecutionException: java.lang.NoSuchMethodError:
> org.apache.commons.compress.compressors.CompressorStreamFactory.<init>(Z)V
>     at java.util.concurrent.FutureTask.report(FutureTask.java:122)
>     at java.util.concurrent.FutureTask.get(FutureTask.java:202)
>     at org.apache.nutch.parse.ParseUtil.runParser(ParseUtil.java:171)
>     at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:95)
>     at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:104)
>     at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:45)
>     at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
>     at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:430)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
>     at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:243)
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>     at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.NoSuchMethodError:
> org.apache.commons.compress.compressors.CompressorStreamFactory.<init>(Z)V
>     at org.apache.tika.parser.pkg.CompressorParser.parse(CompressorParser.java:120)
>     at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:132)
>     at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:35)
>     at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:24)
>     ... 4 more
>
> 2016-03-31 02:20:29,980 WARN parse.ParseUtil - Unable to successfully parse
> content
> http://www.sturmgewehr.com/forums/uploads/monthly_2016_01/412098676.jpg.41e2d3562701152834b1c10b068388e3.thumb.jpg.fe9b6fad3ae9d371830b52db8c271189.jpg
> of type application/gzip
>
> 2016-03-31 02:20:29,980 WARN parse.ParseSegment - Error parsing:
> http://www.sturmgewehr.com/forums/uploads/monthly_2016_01/412098676.jpg.41e2d3562701152834b1c10b068388e3.thumb.jpg.fe9b6fad3ae9d371830b52db8c271189.jpg:
> failed(2,200): org.apache.nutch.parse.ParseException: Unable to successfully
> parse content
>
> 2016-03-31 02:20:29,981 INFO cosine.CosineSimilarity - Setting score of
> http://www.sturmgewehr.com/forums/uploads/monthly_2016_01/412098676.jpg.41e2d3562701152834b1c10b068388e3.thumb.jpg.fe9b6fad3ae9d371830b52db8c271189.jpg
> to 0.0
>
> 2016-03-31 02:20:29,981 INFO parse.ParseSegment - Parsed
> (19ms):http://www.sturmgewehr.com/forums/uploads/monthly_2016_01/412098676.jpg.41e2d3562701152834b1c10b068388e3.thumb.jpg.fe9b6fad3ae9d371830b52db8c271189.jpg
>
>
> Thanks & Regards,
> Karanjeet Singh
> CS Graduate Student
> University of Southern California
> karan...@usc.edu | +1-213-675-9583
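
One way to test the suspected commons-compress version conflict: the `<init>(Z)V` in the NoSuchMethodError is the JVM descriptor for a constructor taking a single boolean, so a small reflection check can report which jar `CompressorStreamFactory` is loaded from and whether that constructor exists there. This is only a sketch. The class name `ClasspathCheck` and its helper methods are made up for illustration, and it assumes it is run with the same jars on the classpath that the failing job sees. Nutch plugins are loaded by their own class loaders, so the result from a plain classpath may differ from what parse-tika gets at runtime.

```java
import java.security.CodeSource;

// Hypothetical helper, not part of Nutch or Tika.
final class ClasspathCheck {

    /** True if the class can be loaded and has a public constructor with exactly these parameter types. */
    static boolean hasConstructor(String className, Class<?>... paramTypes) {
        try {
            Class<?> cls = Class.forName(className);
            cls.getConstructor(paramTypes);
            return true;
        } catch (ClassNotFoundException | NoSuchMethodException | LinkageError e) {
            return false;
        }
    }

    /** Location (usually a jar URL) the class was loaded from, or a placeholder if it cannot be determined. */
    static String loadedFrom(String className) {
        try {
            Class<?> cls = Class.forName(className);
            CodeSource src = cls.getProtectionDomain().getCodeSource();
            return src == null ? "(bootstrap or unknown)" : String.valueOf(src.getLocation());
        } catch (ClassNotFoundException | LinkageError e) {
            return "(not on classpath)";
        }
    }

    public static void main(String[] args) {
        // Class named in the stack trace of the quoted message.
        String name = "org.apache.commons.compress.compressors.CompressorStreamFactory";
        System.out.println("loaded from: " + loadedFrom(name));
        System.out.println("has (boolean) constructor: " + hasConstructor(name, boolean.class));
    }
}
```

If the constructor is reported missing, the jar printed in the first line is probably older than the commons-compress version that the bundled Tika was built against.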