[jira] [Commented] (NUTCH-1206) tika parser of nutch 1.3 is failing to prcess pdfs
[ https://issues.apache.org/jira/browse/NUTCH-1206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13204397#comment-13204397 ]

dibyendu ghosh commented on NUTCH-1206:
---------------------------------------

I could not find any issue with my settings. Most probably, the protocol-httpclient plugin is completely independent of the tika parser. For my purpose, I need to be able to parse local files, which I could do earlier using parse-pdf (except for Acrobat 9.0+ files). I cannot access the files via http, so protocol-httpclient is not useful for my purpose. Is there any other plugin that I can use to parse local pdf files?

tika parser of nutch 1.3 is failing to prcess pdfs
--------------------------------------------------

Key: NUTCH-1206
URL: https://issues.apache.org/jira/browse/NUTCH-1206
Project: Nutch
Issue Type: Bug
Components: parser
Affects Versions: 1.3
Environment: Solaris/Linux/Windows
Reporter: dibyendu ghosh
Assignee: Chris A. Mattmann
Attachments: direct.pdf

Please refer to this message: http://www.mail-archive.com/user%40nutch.apache.org/msg04315.html.
The old parse-pdf parser seems to be able to parse old pdfs (checked with nutch 1.2), though it is not able to parse Acrobat 9.0 pdfs. nutch 1.3 does not have the parse-pdf plugin, and it is not able to parse even older pdfs.
my code (TestParse.java):

bash-2.00$ cat TestParse.java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.PrintStream;
import java.util.Iterator;
import java.util.Map;
import java.util.Map.Entry;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.metadata.Metadata;
import org.apache.nutch.parse.ParseResult;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.ParseStatus;
import org.apache.nutch.parse.ParseUtil;
import org.apache.nutch.parse.ParseData;
import org.apache.nutch.protocol.Content;
import org.apache.nutch.util.NutchConfiguration;

public class TestParse {

    private static Configuration conf = NutchConfiguration.create();

    public TestParse() {
    }

    public static void main(String[] args) {
        String filename = args[0];
        convert(filename);
    }

    public static String convert(String fileName) {
        String newName = "abc.html";
        try {
            System.out.println("Converting " + fileName + " to html.");
            if (convertToHtml(fileName, newName))
                return newName;
        } catch (Exception e) {
            (new File(newName)).delete();
            System.out.println("General exception " + e.getMessage());
        }
        return null;
    }

    private static boolean convertToHtml(String fileName, String newName) throws Exception {
        // Read the file
        FileInputStream in = new FileInputStream(fileName);
        byte[] buf = new byte[in.available()];
        in.read(buf);
        in.close();

        // Parse the file
        Content content = new Content("file:" + fileName, "file:" + fileName, buf, "", new Metadata(), conf);
        ParseResult parseResult = new ParseUtil(conf).parse(content);
        parseResult.filter();
        if (parseResult.isEmpty()) {
            System.out.println("All parsing attempts failed");
            return false;
        }
        Iterator<Map.Entry<Text,Parse>> iterator = parseResult.iterator();
        if (iterator == null) {
            System.out.println("Cannot iterate over successful parse results");
            return false;
        }
        Parse parse = null;
        ParseData parseData = null;
        while (iterator.hasNext()) {
            parse = parseResult.get((Text)iterator.next().getKey());
            parseData = parse.getData();
            ParseStatus status = parseData.getStatus();

            // If Parse failed then bail
            if (!ParseStatus.STATUS_SUCCESS.equals(status)) {
                System.out.println("Could not parse " + fileName + ". " + status.getMessage());
                return false;
            }
        }

        // Start writing to newName
        FileOutputStream fout = new FileOutputStream(newName);
        PrintStream out = new PrintStream(fout, true, "UTF-8");

        // Start Document
        out.println("<html>");

        // Start Header
        out.println("<head>");

        // Write Title
        String title = parseData.getTitle();
        if (title != null && title.trim().length() > 0) {
            out.println("<title>" + parseData.getTitle() + "</title>");
        }

        // Write out Meta tags
        Metadata metaData = parseData.getContentMeta();
        String[] names = metaData.names();
        for (String name : names) {
            String[] subvalues = metaData.getValues(name);
            String values = null;
            for (String subvalue : subvalues) {
                values += subvalue;
            }
            if (values.length() > 0)
                out.printf("<meta
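One detail of the test program above that may be worth ruling out: it sizes the buffer from `in.available()` and calls `read` once. Neither call is guaranteed to cover the whole file, so the parser could in principle be handed a truncated PDF. Below is a minimal sketch of a complete read, assuming Java 7 or later for `java.nio.file.Files`. The class and method names are made up for illustration, and nothing here is claimed to be the cause of the reported failure.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of Nutch: reads a local file completely.
final class ReadWholeFile {

    // Files.readAllBytes keeps reading until end of file, unlike a single
    // InputStream.read(buf) call on a buffer sized from available().
    static byte[] readAll(String fileName) throws IOException {
        Path path = Paths.get(fileName);
        return Files.readAllBytes(path);
    }

    public static void main(String[] args) throws IOException {
        byte[] buf = readAll(args[0]);
        System.out.println("Read " + buf.length + " bytes from " + args[0]);
    }
}
```

The returned array could then be passed to the `Content` constructor in place of `buf`; comparing its length with the file size on disk is a quick sanity check.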
[jira] [Commented] (NUTCH-1206) tika parser of nutch 1.3 is failing to prcess pdfs
[ https://issues.apache.org/jira/browse/NUTCH-1206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13161501#comment-13161501 ]

Markus Jelsma commented on NUTCH-1206:
--------------------------------------

fetching: https://issues.apache.org/jira/secure/attachment/12505323/direct.pdf
Can't fetch URL successfully

This is obviously not a parser problem: the output tells you it is a fetcher problem. Also, can you fetch HTTPS URLs at all with the protocol plugin you use?
[jira] [Commented] (NUTCH-1206) tika parser of nutch 1.3 is failing to prcess pdfs
[ https://issues.apache.org/jira/browse/NUTCH-1206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13161508#comment-13161508 ]

dibyendu ghosh commented on NUTCH-1206:
---------------------------------------

Well, my original problem still stands with 1.4. The above output came from the alternative test that Julien suggested. That test doesn't help for my purpose, but if it worked, it would probably show that something is wrong with my settings. I tried the settings Julien suggested for the test. Can someone please try with 1.4 and check whether the parsechecker test works for you? Julien tried it with trunk, not 1.4.
[jira] [Commented] (NUTCH-1206) tika parser of nutch 1.3 is failing to prcess pdfs
[ https://issues.apache.org/jira/browse/NUTCH-1206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13161516#comment-13161516 ]

dibyendu ghosh commented on NUTCH-1206:
---------------------------------------

This is my nutch-site.xml in the test:
===
bash-2.00$ cat conf/nutch-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
</configuration>
===
This is exactly the same as Julien's config.
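The `plugin.includes` value is a regular expression over plugin ids. The following is a rough sketch for checking which ids such an expression admits, assuming the whole id has to match the expression (which is how the Nutch plugin loader is understood to apply it; plugins required as dependencies may still be loaded separately). The class name and the list of candidate ids are illustrative only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch.
final class PluginIncludesCheck {

    // The plugin.includes value from the nutch-site.xml shown in this thread.
    static final String INCLUDES =
        "protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)"
        + "|scoring-opic|urlnormalizer-(pass|regex|basic)";

    // Returns the ids that match the expression in full.
    static List<String> included(String includesRegex, List<String> pluginIds) {
        Pattern pattern = Pattern.compile(includesRegex);
        List<String> result = new ArrayList<String>();
        for (String id : pluginIds) {
            if (pattern.matcher(id).matches()) {
                result.add(id);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> ids = Arrays.asList(
            "protocol-httpclient", "protocol-file", "parse-tika", "parse-pdf", "index-anchor");
        // Prints the subset of ids admitted by INCLUDES.
        System.out.println(included(INCLUDES, ids));
    }
}
```

Under that assumption, `protocol-file` and `parse-pdf` are not admitted by this value, while `protocol-httpclient`, `parse-tika` and `index-anchor` are.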
[jira] [Commented] (NUTCH-1206) tika parser of nutch 1.3 is failing to prcess pdfs
[ https://issues.apache.org/jira/browse/NUTCH-1206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13161521#comment-13161521 ]

Markus Jelsma commented on NUTCH-1206:
--------------------------------------

I see. Check your logs for anything peculiar. I can fetch and parse this file with Nutch 1.4 using protocol-httpclient.
[jira] [Commented] (NUTCH-1206) tika parser of nutch 1.3 is failing to prcess pdfs
[ https://issues.apache.org/jira/browse/NUTCH-1206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13161520#comment-13161520 ]

dibyendu ghosh commented on NUTCH-1206:
---------------------------------------

Output of my original test with 1.4:
===
bash-2.00$ java TestParse direct.pdf
Converting direct.pdf to html.
All parsing attempts failed
bash-2.00$ cat hadoop.log
2011-12-02 15:39:15,356 INFO plugin.PluginRepository - Plugins: looking in: /space/dibyendu/nutch/1.4/runtime/local/plugins
2011-12-02 15:39:15,611 INFO plugin.PluginRepository - Plugin Auto-activation mode: [true]
2011-12-02 15:39:15,611 INFO plugin.PluginRepository - Registered Plugins:
2011-12-02 15:39:15,612 INFO plugin.PluginRepository - the nutch core extension points (nutch-extensionpoints)
2011-12-02 15:39:15,612 INFO plugin.PluginRepository - Basic URL Normalizer (urlnormalizer-basic)
2011-12-02 15:39:15,612 INFO plugin.PluginRepository - Html Parse Plug-in (parse-html)
2011-12-02 15:39:15,612 INFO plugin.PluginRepository - Basic Indexing Filter (index-basic)
2011-12-02 15:39:15,612 INFO plugin.PluginRepository - Http / Https Protocol Plug-in (protocol-httpclient)
2011-12-02 15:39:15,612 INFO plugin.PluginRepository - HTTP Framework (lib-http)
2011-12-02 15:39:15,612 INFO plugin.PluginRepository - Regex URL Filter (urlfilter-regex)
2011-12-02 15:39:15,612 INFO plugin.PluginRepository - Pass-through URL Normalizer (urlnormalizer-pass)
2011-12-02 15:39:15,613 INFO plugin.PluginRepository - Regex URL Normalizer (urlnormalizer-regex)
2011-12-02 15:39:15,613 INFO plugin.PluginRepository - Tika Parser Plug-in (parse-tika)
2011-12-02 15:39:15,613 INFO plugin.PluginRepository - OPIC Scoring Plug-in (scoring-opic)
2011-12-02 15:39:15,613 INFO plugin.PluginRepository - CyberNeko HTML Parser (lib-nekohtml)
2011-12-02 15:39:15,613 INFO plugin.PluginRepository - Anchor Indexing Filter (index-anchor)
2011-12-02 15:39:15,613 INFO plugin.PluginRepository - Regex URL Filter Framework (lib-regex-filter)
2011-12-02 15:39:15,613 INFO plugin.PluginRepository - Registered Extension-Points:
2011-12-02 15:39:15,613 INFO plugin.PluginRepository - Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
2011-12-02 15:39:15,613 INFO plugin.PluginRepository - Nutch Protocol (org.apache.nutch.protocol.Protocol)
2011-12-02 15:39:15,614 INFO plugin.PluginRepository - Nutch Segment Merge Filter (org.apache.nutch.segment.SegmentMergeFilter)
2011-12-02 15:39:15,614 INFO plugin.PluginRepository - Nutch URL Filter (org.apache.nutch.net.URLFilter)
2011-12-02 15:39:15,614 INFO plugin.PluginRepository - Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2011-12-02 15:39:15,614 INFO plugin.PluginRepository - HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2011-12-02 15:39:15,614 INFO plugin.PluginRepository - Nutch Content Parser (org.apache.nutch.parse.Parser)
2011-12-02 15:39:15,614 INFO plugin.PluginRepository - Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
2011-12-02 15:39:16,794 WARN parse.ParseUtil - Unable to successfully parse content file:direct.pdf of type application/pdf
2011-12-02 15:39:16,885 WARN parse.ParseResult - file:direct.pdf is not parsed successfully, filtering
bash-2.00$ echo $CLASSPATH
conf:lib/nutch-1.4.jar:lib/log4j-1.2.15.jar:lib/commons-logging-1.1.1.jar:lib/hadoop-core-0.20.2.jar:lib/oro-2.0.8.jar:lib/tika-core-0.10.jar:lib/slf4j-api-1.6.1.jar:lib/slf4j-log4j12-1.6.1.jar:.
===
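When a hadoop.log grows longer than the excerpt above, it can help to pull out only the warnings and errors. The sketch below assumes the line layout seen in this log (date, time with milliseconds, level, then logger and message). The class name is hypothetical and the code uses only the JDK, not any Nutch or log4j API.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: keeps only WARN, ERROR and FATAL lines of a log file.
final class LogLevelFilter {

    // Assumed layout: "2011-12-02 15:39:16,794 WARN parse.ParseUtil - message"
    private static final Pattern LINE = Pattern.compile(
        "^\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2},\\d{3}\\s+(\\w+)\\s+(.*)$");

    static List<String> atLeastWarn(List<String> lines) {
        List<String> result = new ArrayList<String>();
        for (String line : lines) {
            Matcher m = LINE.matcher(line);
            if (!m.matches()) {
                continue; // not in the assumed layout, e.g. a stack trace line
            }
            String level = m.group(1);
            if (level.equals("WARN") || level.equals("ERROR") || level.equals("FATAL")) {
                result.add(line);
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the path to the log file, for example hadoop.log
        List<String> lines = Files.readAllLines(Paths.get(args[0]), StandardCharsets.UTF_8);
        for (String line : atLeastWarn(lines)) {
            System.out.println(line);
        }
    }
}
```

Run against the log above, this would keep only the two `WARN` lines from `parse.ParseUtil` and `parse.ParseResult`. Lines that do not match the assumed layout are dropped, so stack traces would need separate handling.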
[jira] [Commented] (NUTCH-1206) tika parser of nutch 1.3 is failing to prcess pdfs
[ https://issues.apache.org/jira/browse/NUTCH-1206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13161536#comment-13161536 ]

dibyendu ghosh commented on NUTCH-1206:
---------------------------------------

Thanks. I had to set proxy-host and url in nutch-site.xml. So, parsechecker works with nutch 1.4. I'll work on fixing my conf settings (at this point I have no idea what is wrong, or what else is needed); any suggestions on that would be most welcome.
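To separate a fetch problem from a parse problem, it can be useful to confirm outside Nutch that the attachment URL is reachable through the proxy at all. The sketch below uses only the JDK. The class name is hypothetical, the proxy host and port are whatever applies on the local network, and the Nutch-side settings (presumably the `http.proxy.host` and `http.proxy.port` properties) are not touched by it.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.URL;

// Hypothetical helper, not part of Nutch: asks for a URL through an HTTP proxy.
final class ProxyFetchCheck {

    static Proxy httpProxy(String host, int port) {
        return new Proxy(Proxy.Type.HTTP, new InetSocketAddress(host, port));
    }

    // Sends a HEAD request through the proxy and returns the HTTP status code.
    static int statusVia(Proxy proxy, String url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection(proxy);
        conn.setRequestMethod("HEAD");
        conn.setConnectTimeout(10000);
        conn.setReadTimeout(10000);
        try {
            return conn.getResponseCode();
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        // args: proxy host, proxy port, URL to test
        Proxy proxy = httpProxy(args[0], Integer.parseInt(args[1]));
        System.out.println("HTTP status: " + statusVia(proxy, args[2]));
    }
}
```

A successful status here only shows that the network path works; it says nothing about whether the Tika parser accepts the file.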
[jira] [Commented] (NUTCH-1206) tika parser of nutch 1.3 is failing to prcess pdfs
[ https://issues.apache.org/jira/browse/NUTCH-1206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13161463#comment-13161463 ]

dibyendu ghosh commented on NUTCH-1206:
---------------------------------------

Tried with 1.4. It's still not working. 1.3 did not have the parsechecker option in the nutch script. 1.4 shows the following output:
===
bash-2.00$ bin/nutch parsechecker -dumpText https://issues.apache.org/jira/secure/attachment/12505323/direct.pdf
fetching: https://issues.apache.org/jira/secure/attachment/12505323/direct.pdf
Can't fetch URL successfully
===
This is with the above-mentioned conf setting in nutch-site.xml.
[jira] [Commented] (NUTCH-1206) tika parser of nutch 1.3 is failing to prcess pdfs
[ https://issues.apache.org/jira/browse/NUTCH-1206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13158366#comment-13158366 ]

dibyendu ghosh commented on NUTCH-1206:
---------------------------------------

Hi Chris,

I have attached the direct.pdf file. You can also test with any simple PDF, for example one exported from an OpenOffice document. The results are the same.

I noticed that Nutch 1.4 was released on 24th Nov. I will update after testing with that.

Thanks,
Dibyendu
[jira] [Commented] (NUTCH-1206) tika parser of nutch 1.3 is failing to prcess pdfs
[ https://issues.apache.org/jira/browse/NUTCH-1206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13157654#comment-13157654 ]

Chris A. Mattmann commented on NUTCH-1206:
------------------------------------------

Hi Dibyendu,

Can you please post direct.pdf? Or send me the URL for it? You can use the bin/nutch org.apache.nutch.parse.ParserChecker program to evaluate whether or not Nutch will parse your content.

You could also try upgrading to 1.4 and see if that helps.
[jira] [Commented] (NUTCH-1206) tika parser of nutch 1.3 is failing to prcess pdfs
[ https://issues.apache.org/jira/browse/NUTCH-1206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13156508#comment-13156508 ]

Chris A. Mattmann commented on NUTCH-1206:
------------------------------------------

OK, I might be seeing this too with 1.3, and even 1.4. I'm going to look more into this but just wanted to assign it to myself since I'm interested in fixing this now.
[jira] [Commented] (NUTCH-1206) tika parser of nutch 1.3 is failing to prcess pdfs
[ https://issues.apache.org/jira/browse/NUTCH-1206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13154180#comment-13154180 ]

Markus Jelsma commented on NUTCH-1206:
--------------------------------------

Have you tried the Nutch trunk or the most recent Tika as suggested?

tika parser of nutch 1.3 is failing to prcess pdfs
--------------------------------------------------

                Key: NUTCH-1206
                URL: https://issues.apache.org/jira/browse/NUTCH-1206
            Project: Nutch
         Issue Type: Bug
         Components: parser
   Affects Versions: 1.3
        Environment: Solaris/Linux/Windows
           Reporter: dibyendu ghosh

Please refer to this message: http://www.mail-archive.com/user%40nutch.apache.org/msg04315.html.
Old parse-pdf parser seems to be able to parse old pdfs (checked with nutch 1.2) though it is not able to parse acrobat 9.0 version of pdfs. nutch 1.3 does not have parse-pdf plugin and it is not able to parse even older pdfs.

my code (TestParse.java):

bash-2.00$ cat TestParse.java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.PrintStream;
import java.util.Iterator;
import java.util.Map;
import java.util.Map.Entry;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.metadata.Metadata;
import org.apache.nutch.parse.ParseResult;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.ParseStatus;
import org.apache.nutch.parse.ParseUtil;
import org.apache.nutch.parse.ParseData;
import org.apache.nutch.protocol.Content;
import org.apache.nutch.util.NutchConfiguration;

public class TestParse {

    private static Configuration conf = NutchConfiguration.create();

    public TestParse() {
    }

    public static void main(String[] args) {
        String filename = args[0];
        convert(filename);
    }

    public static String convert(String fileName) {
        String newName = "abc.html";
        try {
            System.out.println("Converting " + fileName + " to html.");
            if (convertToHtml(fileName, newName))
                return newName;
        } catch (Exception e) {
            (new File(newName)).delete();
            System.out.println("General exception " + e.getMessage());
        }
        return null;
    }

    private static boolean convertToHtml(String fileName, String newName) throws Exception {
        // Read the file
        FileInputStream in = new FileInputStream(fileName);
        byte[] buf = new byte[in.available()];
        in.read(buf);
        in.close();

        // Parse the file
        Content content = new Content("file:" + fileName, "file:" + fileName, buf, "", new Metadata(), conf);
        ParseResult parseResult = new ParseUtil(conf).parse(content);
        parseResult.filter();
        if (parseResult.isEmpty()) {
            System.out.println("All parsing attempts failed");
            return false;
        }
        Iterator<Map.Entry<Text,Parse>> iterator = parseResult.iterator();
        if (iterator == null) {
            System.out.println("Cannot iterate over successful parse results");
            return false;
        }
        Parse parse = null;
        ParseData parseData = null;
        while (iterator.hasNext()) {
            parse = parseResult.get((Text)iterator.next().getKey());
            parseData = parse.getData();
            ParseStatus status = parseData.getStatus();

            // If Parse failed then bail
            if (!ParseStatus.STATUS_SUCCESS.equals(status)) {
                System.out.println("Could not parse " + fileName + ". " + status.getMessage());
                return false;
            }
        }

        // Start writing to newName
        FileOutputStream fout = new FileOutputStream(newName);
        PrintStream out = new PrintStream(fout, true, "UTF-8");

        // Start Document
        out.println("<html>");

        // Start Header
        out.println("<head>");

        // Write Title
        String title = parseData.getTitle();
        if (title != null && title.trim().length() > 0) {
            out.println("<title>" + parseData.getTitle() + "</title>");
        }

        // Write out Meta tags
        Metadata metaData = parseData.getContentMeta();
        String[] names = metaData.names();
        for (String name : names) {
            String[] subvalues = metaData.getValues(name);
            String values = null;
            for (String subvalue : subvalues) {
                values += subvalue;
            }
            if (values.length() > 0)
                out.printf("<meta name=\"%s\" content=\"%s\"/>\n", name, values);
        }
        out.println("<meta http-equiv=\"Content-Type\"
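A side note on the TestParse.java listing above, separate from the parser question itself: two spots in it can misbehave with plain Java, whatever Nutch does. Sizing the buffer with in.available() and calling read() once is not guaranteed to load the whole file, and starting the concatenation from a null String makes every meta value begin with the text "null". The following is only a minimal sketch using the standard library; the class and method names (ParseInputSketch, readAll, joinValues) are made up for illustration and are not part of Nutch, and nothing here is claimed to be the cause of the reported parse failure.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper class for illustration only; not part of Nutch.
public class ParseInputSketch {

    // Reads the complete file into memory. Unlike new byte[in.available()]
    // followed by a single read(), this does not depend on available()
    // reporting the full file size or on read() filling the buffer in one call.
    public static byte[] readAll(Path file) throws IOException {
        return Files.readAllBytes(file);
    }

    // Concatenates the values starting from an empty builder, so the result
    // never begins with the literal text "null".
    public static String joinValues(String[] subvalues) {
        StringBuilder joined = new StringBuilder();
        for (String subvalue : subvalues) {
            joined.append(subvalue);
        }
        return joined.toString();
    }

    public static void main(String[] args) throws IOException {
        // Write a small temporary file and read it back in full.
        Path tmp = Files.createTempFile("sample", ".txt");
        Files.write(tmp, "hello".getBytes(StandardCharsets.UTF_8));
        System.out.println(readAll(tmp).length);
        Files.delete(tmp);

        // Join two metadata-style values.
        System.out.println(joinValues(new String[] {"a", "b"}));
    }
}
```

In the listing, the byte array returned by such a read would take the place of buf in the Content constructor call, and the joined string would take the place of values in the meta tag loop.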