[ https://issues.apache.org/jira/browse/NUTCH-1206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13161520#comment-13161520 ]
dibyendu ghosh commented on NUTCH-1206:
---------------------------------------

Output of my original test with 1.4:
=======================
bash-2.00$ java TestParse direct.pdf
Converting direct.pdf to html.
All parsing attempts failed
bash-2.00$ cat hadoop.log
2011-12-02 15:39:15,356 INFO plugin.PluginRepository - Plugins: looking in: /space/dibyendu/nutch/1.4/runtime/local/plugins
2011-12-02 15:39:15,611 INFO plugin.PluginRepository - Plugin Auto-activation mode: [true]
2011-12-02 15:39:15,611 INFO plugin.PluginRepository - Registered Plugins:
2011-12-02 15:39:15,612 INFO plugin.PluginRepository - the nutch core extension points (nutch-extensionpoints)
2011-12-02 15:39:15,612 INFO plugin.PluginRepository - Basic URL Normalizer (urlnormalizer-basic)
2011-12-02 15:39:15,612 INFO plugin.PluginRepository - Html Parse Plug-in (parse-html)
2011-12-02 15:39:15,612 INFO plugin.PluginRepository - Basic Indexing Filter (index-basic)
2011-12-02 15:39:15,612 INFO plugin.PluginRepository - Http / Https Protocol Plug-in (protocol-httpclient)
2011-12-02 15:39:15,612 INFO plugin.PluginRepository - HTTP Framework (lib-http)
2011-12-02 15:39:15,612 INFO plugin.PluginRepository - Regex URL Filter (urlfilter-regex)
2011-12-02 15:39:15,612 INFO plugin.PluginRepository - Pass-through URL Normalizer (urlnormalizer-pass)
2011-12-02 15:39:15,613 INFO plugin.PluginRepository - Regex URL Normalizer (urlnormalizer-regex)
2011-12-02 15:39:15,613 INFO plugin.PluginRepository - Tika Parser Plug-in (parse-tika)
2011-12-02 15:39:15,613 INFO plugin.PluginRepository - OPIC Scoring Plug-in (scoring-opic)
2011-12-02 15:39:15,613 INFO plugin.PluginRepository - CyberNeko HTML Parser (lib-nekohtml)
2011-12-02 15:39:15,613 INFO plugin.PluginRepository - Anchor Indexing Filter (index-anchor)
2011-12-02 15:39:15,613 INFO plugin.PluginRepository - Regex URL Filter Framework (lib-regex-filter)
2011-12-02 15:39:15,613 INFO plugin.PluginRepository - Registered Extension-Points:
2011-12-02 15:39:15,613 INFO plugin.PluginRepository - Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
2011-12-02 15:39:15,613 INFO plugin.PluginRepository - Nutch Protocol (org.apache.nutch.protocol.Protocol)
2011-12-02 15:39:15,614 INFO plugin.PluginRepository - Nutch Segment Merge Filter (org.apache.nutch.segment.SegmentMergeFilter)
2011-12-02 15:39:15,614 INFO plugin.PluginRepository - Nutch URL Filter (org.apache.nutch.net.URLFilter)
2011-12-02 15:39:15,614 INFO plugin.PluginRepository - Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2011-12-02 15:39:15,614 INFO plugin.PluginRepository - HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2011-12-02 15:39:15,614 INFO plugin.PluginRepository - Nutch Content Parser (org.apache.nutch.parse.Parser)
2011-12-02 15:39:15,614 INFO plugin.PluginRepository - Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
2011-12-02 15:39:16,794 WARN parse.ParseUtil - Unable to successfully parse content file:direct.pdf of type application/pdf
2011-12-02 15:39:16,885 WARN parse.ParseResult - file:direct.pdf is not parsed successfully, filtering
bash-2.00$ echo $CLASSPATH
conf:lib/nutch-1.4.jar:lib/log4j-1.2.15.jar:lib/commons-logging-1.1.1.jar:lib/hadoop-core-0.20.2.jar:lib/oro-2.0.8.jar:lib/tika-core-0.10.jar:lib/slf4j-api-1.6.1.jar:lib/slf4j-log4j12-1.6.1.jar:.
=======================

> tika parser of nutch 1.3 is failing to prcess pdfs
> --------------------------------------------------
>
>                 Key: NUTCH-1206
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1206
>             Project: Nutch
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.3
>         Environment: Solaris/Linux/Windows
>            Reporter: dibyendu ghosh
>            Assignee: Chris A. Mattmann
>         Attachments: direct.pdf
>
>
> Please refer to this message:
> http://www.mail-archive.com/user%40nutch.apache.org/msg04315.html.
> The old parse-pdf parser seems able to parse older PDFs (checked with Nutch 1.2),
> though it cannot parse PDFs in the Acrobat 9.0 version. Nutch 1.3 does not have
> the parse-pdf plugin, and it cannot parse even the older PDFs.
> my code (TestParse.java):
> ----------------------------
> bash-2.00$ cat TestParse.java
> import java.io.File;
> import java.io.FileInputStream;
> import java.io.FileOutputStream;
> import java.io.PrintStream;
> import java.util.Iterator;
> import java.util.Map;
> import java.util.Map.Entry;
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.io.Text;
> import org.apache.nutch.metadata.Metadata;
> import org.apache.nutch.parse.ParseResult;
> import org.apache.nutch.parse.Parse;
> import org.apache.nutch.parse.ParseStatus;
> import org.apache.nutch.parse.ParseUtil;
> import org.apache.nutch.parse.ParseData;
> import org.apache.nutch.protocol.Content;
> import org.apache.nutch.util.NutchConfiguration;
> public class TestParse {
>     private static Configuration conf = NutchConfiguration.create();
>     public TestParse() {
>     }
>     public static void main(String[] args) {
>         String filename = args[0];
>         convert(filename);
>     }
>     public static String convert(String fileName) {
>         String newName = "abc.html";
>         try {
>             System.out.println("Converting " + fileName + " to html.");
>             if (convertToHtml(fileName, newName))
>                 return newName;
>         } catch (Exception e) {
>             (new File(newName)).delete();
>             System.out.println("General exception " + e.getMessage());
>         }
>         return null;
>     }
>     private static boolean convertToHtml(String fileName, String newName)
>             throws Exception {
>         // Read the file
>         FileInputStream in = new FileInputStream(fileName);
>         byte[] buf = new byte[in.available()];
>         in.read(buf);
>         in.close();
>         // Parse the file
>         Content content = new Content("file:" + fileName, "file:" + fileName,
>                 buf, "", new Metadata(), conf);
>         ParseResult parseResult = new ParseUtil(conf).parse(content);
>         parseResult.filter();
>         if (parseResult.isEmpty()) {
>             System.out.println("All parsing attempts failed");
>             return false;
>         }
>         Iterator<Map.Entry<Text,Parse>> iterator = parseResult.iterator();
>         if (iterator == null) {
>             System.out.println("Cannot iterate over successful parse results");
>             return false;
>         }
>         Parse parse = null;
>         ParseData parseData = null;
>         while (iterator.hasNext()) {
>             parse = parseResult.get((Text)iterator.next().getKey());
>             parseData = parse.getData();
>             ParseStatus status = parseData.getStatus();
>             // If Parse failed then bail
>             if (!ParseStatus.STATUS_SUCCESS.equals(status)) {
>                 System.out.println("Could not parse " + fileName + ". " +
>                         status.getMessage());
>                 return false;
>             }
>         }
>         // Start writing to newName
>         FileOutputStream fout = new FileOutputStream(newName);
>         PrintStream out = new PrintStream(fout, true, "UTF-8");
>         // Start Document
>         out.println("<html>");
>         // Start Header
>         out.println("<head>");
>         // Write Title
>         String title = parseData.getTitle();
>         if (title != null && title.trim().length() > 0) {
>             out.println("<title>" + parseData.getTitle() + "</title>");
>         }
>         // Write out Meta tags
>         Metadata metaData = parseData.getContentMeta();
>         String[] names = metaData.names();
>         for (String name : names) {
>             String[] subvalues = metaData.getValues(name);
>             // Start from an empty string so the result does not begin with "null"
>             String values = "";
>             for (String subvalue : subvalues) {
>                 values += subvalue;
>             }
>             if (values.length() > 0)
>                 out.printf("<meta name=\"%s\" content=\"%s\"/>\n",
>                         name, values);
>         }
>         out.println("<meta http-equiv=\"Content-Type\" content=\"text/html;charset=UTF-8\"/>");
>         // End Meta tags
>         out.println("</head>"); // End Header
>         // Start Body
>         out.println("<body>");
>         out.print(parse.getText());
>         out.println("</body>"); // End Body
>         out.println("</html>"); // End Document
>         out.close(); // Close the file
>         return true;
>     }
> }
> ----------------------------
> command:
> ======
> bash-2.00$ java -classpath
> conf:runtime/local/lib/nutch-1.3.jar:runtime/local/lib/hadoop-core-0.20.2.jar:runtime/local/lib/commons-logging-api-1.0.4.jar:runtime/local/lib/tika-core-0.9.jar:runtime/local/lib/log4j-1.2.15.jar:runtime/local/lib/oro-2.0.8.jar:.
> TestParse direct.pdf
> ======
> output:
> _____
> Converting direct.pdf to html.
> Oct 19, 2011 5:05:19 PM org.apache.hadoop.conf.Configuration getConfResourceAsInputStream
> INFO: found resource tika-mimetypes.xml at file:/path/to/nutch/1.3/conf/tika-mimetypes.xml
> Oct 19, 2011 5:05:20 PM org.apache.nutch.plugin.PluginManifestParser parsePluginFolder
> INFO: Plugins: looking in: /path/to/nutch/1.3/runtime/local/plugins
> Oct 19, 2011 5:05:20 PM org.apache.nutch.plugin.PluginRepository displayStatus
> INFO: Plugin Auto-activation mode: [true]
> Oct 19, 2011 5:05:20 PM org.apache.nutch.plugin.PluginRepository displayStatus
> INFO: Registered Plugins:
> Oct 19, 2011 5:05:20 PM org.apache.nutch.plugin.PluginRepository displayStatus
> INFO: the nutch core extension points (nutch-extensionpoints)
> Oct 19, 2011 5:05:20 PM org.apache.nutch.plugin.PluginRepository displayStatus
> INFO: Tika Parser Plug-in (parse-tika)
> Oct 19, 2011 5:05:20 PM org.apache.nutch.plugin.PluginRepository displayStatus
> INFO: Registered Extension-Points:
> Oct 19, 2011 5:05:20 PM org.apache.nutch.plugin.PluginRepository displayStatus
> INFO: Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
> Oct 19, 2011 5:05:20 PM org.apache.nutch.plugin.PluginRepository displayStatus
> INFO: Nutch Protocol (org.apache.nutch.protocol.Protocol)
> Oct 19, 2011 5:05:20 PM org.apache.nutch.plugin.PluginRepository displayStatus
> INFO: Nutch Segment Merge Filter (org.apache.nutch.segment.SegmentMergeFilter)
> Oct 19, 2011 5:05:20 PM org.apache.nutch.plugin.PluginRepository displayStatus
> INFO: Nutch URL Filter (org.apache.nutch.net.URLFilter)
> Oct 19, 2011 5:05:20 PM org.apache.nutch.plugin.PluginRepository displayStatus
> INFO: Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
> Oct 19, 2011 5:05:20 PM org.apache.nutch.plugin.PluginRepository displayStatus
> INFO: HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
> Oct 19, 2011 5:05:20 PM org.apache.nutch.plugin.PluginRepository displayStatus
> INFO: Nutch Content Parser (org.apache.nutch.parse.Parser)
> Oct 19, 2011 5:05:20 PM org.apache.nutch.plugin.PluginRepository displayStatus
> INFO: Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
> Oct 19, 2011 5:05:20 PM org.apache.hadoop.conf.Configuration getConfResourceAsInputStream
> INFO: found resource parse-plugins.xml at file:/path/to/nutch/1.3/conf/parse-plugins.xml
> Oct 19, 2011 5:05:20 PM org.apache.nutch.parse.ParserFactory matchExtensions
> INFO: The parsing plugins: [org.apache.nutch.parse.tika.TikaParser] are
> enabled via the plugin.includes system property, and all claim to support
> the content type application/pdf, but they are not mapped to it in the
> parse-plugins.xml file
> Oct 19, 2011 5:05:21 PM org.apache.nutch.parse.ParseUtil parse
> WARNING: Unable to successfully parse content file:direct.pdf of type application/pdf
> Oct 19, 2011 5:05:21 PM org.apache.nutch.parse.ParseResult filter
> WARNING: file:direct.pdf is not parsed successfully, filtering
> All parsing attempts failed
> _____
> my customized nutch-site.xml:
> ~~~~~~~~~~~~~~~~~~~~
> bash-2.00$ cat conf/nutch-site.xml
> <?xml version="1.0"?>
> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
> <configuration>
> <property>
>   <name>plugin.folders</name>
>   <value>runtime/local/plugins</value>
>   <description>Directories where nutch plugins are located. Each
>   element may be a relative or absolute path. If absolute, it is used
>   as is. If relative, it is searched for on the classpath.</description>
> </property>
> <property>
>   <name>plugin.includes</name>
>   <value>parse-tika</value>
>   <description>Regular expression naming plugin directory names to
>   include.
>   Any plugin not matching this expression is excluded.
>   </description>
> </property>
> </configuration>
> ~~~~~~~~~~~~~~~~~~~~
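
A side note on the quoted TestParse.java: it sizes its buffer with in.available() and then calls in.read(buf) once. The InputStream contract does not guarantee that available() returns the full file length, or that a single read() fills the buffer. The sketch below shows one way to rule that out when re-running the test. It is not a claim about the cause of the parse failure reported here. The class name FileBytes and the method readAll are hypothetical and are not part of Nutch or Tika.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

// Hypothetical helper (not part of Nutch or Tika): reads a stream to its end.
final class FileBytes {

    private FileBytes() {
    }

    // Loops until read() returns -1, because a single read() call may
    // return fewer bytes than the buffer can hold.
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            out.write(chunk, 0, n);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // In-memory stand-in for a file, larger than one chunk.
        byte[] expected = new byte[20000];
        Arrays.fill(expected, (byte) 7);
        InputStream in = new ByteArrayInputStream(expected);
        try {
            byte[] actual = readAll(in);
            // Reports whether every byte was read back.
            System.out.println(Arrays.equals(expected, actual));
        } finally {
            in.close();
        }
    }
}
```

In TestParse, the buffer could then be filled with byte[] buf = FileBytes.readAll(in); in place of the available()/read() pair, assuming the helper is on the classpath.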