[jira] [Commented] (NUTCH-1206) tika parser of nutch 1.3 is failing to prcess pdfs

2012-02-09 Thread dibyendu ghosh (Commented) (JIRA)

[ https://issues.apache.org/jira/browse/NUTCH-1206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13204397#comment-13204397 ]

dibyendu ghosh commented on NUTCH-1206:
---

I could not find any issue with my settings. Most probably, the plugin
protocol-httpclient is totally independent of the tika parser. For my purpose,
I need to be able to parse local files, which I could do earlier using
parse-pdf (except for Acrobat 9.0+ files). I cannot access the files via HTTP,
so protocol-httpclient is not useful for my purpose. Is there any other plugin
that I can use to parse local PDF files?

 tika parser of nutch 1.3 is failing to prcess pdfs
 --

                 Key: NUTCH-1206
                 URL: https://issues.apache.org/jira/browse/NUTCH-1206
             Project: Nutch
          Issue Type: Bug
          Components: parser
    Affects Versions: 1.3
         Environment: Solaris/Linux/Windows
            Reporter: dibyendu ghosh
            Assignee: Chris A. Mattmann
         Attachments: direct.pdf


 Please refer to this message:
 http://www.mail-archive.com/user%40nutch.apache.org/msg04315.html. The old
 parse-pdf parser seems to be able to parse old PDFs (checked with Nutch 1.2),
 though it cannot parse Acrobat 9.0 PDFs. Nutch 1.3 does not have the
 parse-pdf plugin, and it cannot parse even older PDFs.
 My code (TestParse.java):
 
 bash-2.00$ cat TestParse.java
 import java.io.File;
 import java.io.FileInputStream;
 import java.io.FileOutputStream;
 import java.io.PrintStream;
 import java.util.Iterator;
 import java.util.Map;
 import java.util.Map.Entry;
 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.io.Text;
 import org.apache.nutch.metadata.Metadata;
 import org.apache.nutch.parse.ParseResult;
 import org.apache.nutch.parse.Parse;
 import org.apache.nutch.parse.ParseStatus;
 import org.apache.nutch.parse.ParseUtil;
 import org.apache.nutch.parse.ParseData;
 import org.apache.nutch.protocol.Content;
 import org.apache.nutch.util.NutchConfiguration;

 public class TestParse {
     private static Configuration conf = NutchConfiguration.create();

     public TestParse() {
     }

     public static void main(String[] args) {
         String filename = args[0];
         convert(filename);
     }

     public static String convert(String fileName) {
         String newName = "abc.html";
         try {
             System.out.println("Converting " + fileName + " to html.");
             if (convertToHtml(fileName, newName))
                 return newName;
         } catch (Exception e) {
             (new File(newName)).delete();
             System.out.println("General exception " + e.getMessage());
         }
         return null;
     }

     private static boolean convertToHtml(String fileName, String newName)
             throws Exception {
         // Read the file
         FileInputStream in = new FileInputStream(fileName);
         byte[] buf = new byte[in.available()];
         in.read(buf);
         in.close();
         // Parse the file
         Content content = new Content("file:" + fileName, "file:" + fileName,
                 buf, "", new Metadata(), conf);
         ParseResult parseResult = new ParseUtil(conf).parse(content);
         parseResult.filter();
         if (parseResult.isEmpty()) {
             System.out.println("All parsing attempts failed");
             return false;
         }
         Iterator<Map.Entry<Text, Parse>> iterator = parseResult.iterator();
         if (iterator == null) {
             System.out.println("Cannot iterate over successful parse results");
             return false;
         }
         Parse parse = null;
         ParseData parseData = null;
         while (iterator.hasNext()) {
             parse = parseResult.get((Text) iterator.next().getKey());
             parseData = parse.getData();
             ParseStatus status = parseData.getStatus();
             // If Parse failed then bail
             if (!ParseStatus.STATUS_SUCCESS.equals(status)) {
                 System.out.println("Could not parse " + fileName + ". " +
                         status.getMessage());
                 return false;
             }
         }
         // Start writing to newName
         FileOutputStream fout = new FileOutputStream(newName);
         PrintStream out = new PrintStream(fout, true, "UTF-8");
         // Start Document
         out.println("<html>");
         // Start Header
         out.println("<head>");
         // Write Title
         String title = parseData.getTitle();
         if (title != null && title.trim().length() > 0) {
             out.println("<title>" + parseData.getTitle() + "</title>");
         }
         // Write out Meta tags
         Metadata metaData = parseData.getContentMeta();
         String[] names = metaData.names();
         for (String name : names) {
             String[] subvalues = metaData.getValues(name);
             String values = null;
             for (String subvalue : subvalues) {
                 values += subvalue;
             }
             if (values.length() > 0)
                 out.printf("<meta 
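
A side note on the "Read the file" step in TestParse.java: it sizes the buffer with `in.available()` and calls `read` once. The Java documentation describes `available()` as an estimate and allows `read` to return fewer bytes than requested, so the buffer is not guaranteed to hold the whole file. This is probably not the cause of the failure reported here, but the step can be made exact. A minimal sketch, assuming Java 7 or later; `LocalFileReader` and `readFully` are hypothetical names, not part of Nutch or of the original test:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

/** Hypothetical helper; not part of Nutch or of the original TestParse.java. */
final class LocalFileReader {
    private LocalFileReader() {
    }

    /** Reads the complete file into memory, or throws IOException. */
    static byte[] readFully(String fileName) throws IOException {
        // Files.readAllBytes keeps reading until end of file, so the
        // returned array always has exactly the file's length.
        return Files.readAllBytes(Paths.get(fileName));
    }
}
```

In the test above, `byte[] buf = LocalFileReader.readFully(fileName);` would stand in for the four lines that open, size, read and close the `FileInputStream`.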
 

[jira] [Commented] (NUTCH-1206) tika parser of nutch 1.3 is failing to prcess pdfs

2011-12-02 Thread Markus Jelsma (Commented) (JIRA)

[ https://issues.apache.org/jira/browse/NUTCH-1206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13161501#comment-13161501 ]

Markus Jelsma commented on NUTCH-1206:
--

fetching: https://issues.apache.org/jira/secure/attachment/12505323/direct.pdf
Can't fetch URL successfully

This is clearly a fetcher problem rather than a parser problem, as the output 
itself tells you. Also, can you fetch HTTPS URLs at all with the protocol 
plugin you use?



[jira] [Commented] (NUTCH-1206) tika parser of nutch 1.3 is failing to prcess pdfs

2011-12-02 Thread dibyendu ghosh (Commented) (JIRA)

[ https://issues.apache.org/jira/browse/NUTCH-1206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13161508#comment-13161508 ]

dibyendu ghosh commented on NUTCH-1206:
---

Well, the original problem I am getting still stands with 1.4. The above output 
came from the alternative test that Julien suggested. For my purpose, this test 
doesn't help, but if it worked it would probably show that something is wrong 
with my settings. I tried the settings Julien suggested for the test. Can 
someone please try with 1.4 and check whether the parsechecker test works for 
you? Julien tried it with trunk, not 1.4.


[jira] [Commented] (NUTCH-1206) tika parser of nutch 1.3 is failing to prcess pdfs

2011-12-02 Thread dibyendu ghosh (Commented) (JIRA)

[ https://issues.apache.org/jira/browse/NUTCH-1206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13161516#comment-13161516 ]

dibyendu ghosh commented on NUTCH-1206:
---

This is my nutch-site.xml in the test:
===
bash-2.00$ cat conf/nutch-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
<property>
<name>plugin.includes</name>
<value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
</configuration>
===

This is exactly the same as Julien's config.
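
To see which plugin ids this `plugin.includes` value selects, the sketch below treats the value as a Java regular expression that must match a whole plugin id. That matching rule is an assumption about how Nutch evaluates the property, not something checked against the 1.4 source, and `PluginIncludesCheck` is a hypothetical name. Plugins that Nutch activates automatically as dependencies of the selected ones are outside what this sketch covers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical helper; illustrates the assumed matching rule only. */
final class PluginIncludesCheck {
    // Value copied from the nutch-site.xml above.
    static final String PLUGIN_INCLUDES =
            "protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)"
            + "|scoring-opic|urlnormalizer-(pass|regex|basic)";

    private PluginIncludesCheck() {
    }

    /** Returns the ids that the expression matches in full, in input order. */
    static List<String> included(String regex, List<String> pluginIds) {
        Pattern includes = Pattern.compile(regex);
        List<String> result = new ArrayList<>();
        for (String id : pluginIds) {
            // matches() requires the whole id to match, not just a prefix.
            if (includes.matcher(id).matches()) {
                result.add(id);
            }
        }
        return result;
    }
}
```

Under that assumption, `parse-tika` and `protocol-httpclient` are selected, while `parse-pdf` and `protocol-file` are not.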


[jira] [Commented] (NUTCH-1206) tika parser of nutch 1.3 is failing to prcess pdfs

2011-12-02 Thread Markus Jelsma (Commented) (JIRA)

[ https://issues.apache.org/jira/browse/NUTCH-1206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13161521#comment-13161521 ]

Markus Jelsma commented on NUTCH-1206:
--

I see. Check your logs for anything peculiar. I can fetch and parse this file 
with Nutch 1.4 using protocol-httpclient. 


[jira] [Commented] (NUTCH-1206) tika parser of nutch 1.3 is failing to prcess pdfs

2011-12-02 Thread dibyendu ghosh (Commented) (JIRA)

[ https://issues.apache.org/jira/browse/NUTCH-1206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13161520#comment-13161520 ]

dibyendu ghosh commented on NUTCH-1206:
---

Output of my original test with 1.4:
===
bash-2.00$ java TestParse direct.pdf
Converting direct.pdf to html.
All parsing attempts failed
bash-2.00$ cat hadoop.log
2011-12-02 15:39:15,356 INFO  plugin.PluginRepository - Plugins: looking in: /space/dibyendu/nutch/1.4/runtime/local/plugins
2011-12-02 15:39:15,611 INFO  plugin.PluginRepository - Plugin Auto-activation mode: [true]
2011-12-02 15:39:15,611 INFO  plugin.PluginRepository - Registered Plugins:
2011-12-02 15:39:15,612 INFO  plugin.PluginRepository - the nutch core extension points (nutch-extensionpoints)
2011-12-02 15:39:15,612 INFO  plugin.PluginRepository - Basic URL Normalizer (urlnormalizer-basic)
2011-12-02 15:39:15,612 INFO  plugin.PluginRepository - Html Parse Plug-in (parse-html)
2011-12-02 15:39:15,612 INFO  plugin.PluginRepository - Basic Indexing Filter (index-basic)
2011-12-02 15:39:15,612 INFO  plugin.PluginRepository - Http / Https Protocol Plug-in (protocol-httpclient)
2011-12-02 15:39:15,612 INFO  plugin.PluginRepository - HTTP Framework (lib-http)
2011-12-02 15:39:15,612 INFO  plugin.PluginRepository - Regex URL Filter (urlfilter-regex)
2011-12-02 15:39:15,612 INFO  plugin.PluginRepository - Pass-through URL Normalizer (urlnormalizer-pass)
2011-12-02 15:39:15,613 INFO  plugin.PluginRepository - Regex URL Normalizer (urlnormalizer-regex)
2011-12-02 15:39:15,613 INFO  plugin.PluginRepository - Tika Parser Plug-in (parse-tika)
2011-12-02 15:39:15,613 INFO  plugin.PluginRepository - OPIC Scoring Plug-in (scoring-opic)
2011-12-02 15:39:15,613 INFO  plugin.PluginRepository - CyberNeko HTML Parser (lib-nekohtml)
2011-12-02 15:39:15,613 INFO  plugin.PluginRepository - Anchor Indexing Filter (index-anchor)
2011-12-02 15:39:15,613 INFO  plugin.PluginRepository - Regex URL Filter Framework (lib-regex-filter)
2011-12-02 15:39:15,613 INFO  plugin.PluginRepository - Registered Extension-Points:
2011-12-02 15:39:15,613 INFO  plugin.PluginRepository - Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
2011-12-02 15:39:15,613 INFO  plugin.PluginRepository - Nutch Protocol (org.apache.nutch.protocol.Protocol)
2011-12-02 15:39:15,614 INFO  plugin.PluginRepository - Nutch Segment Merge Filter (org.apache.nutch.segment.SegmentMergeFilter)
2011-12-02 15:39:15,614 INFO  plugin.PluginRepository - Nutch URL Filter (org.apache.nutch.net.URLFilter)
2011-12-02 15:39:15,614 INFO  plugin.PluginRepository - Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2011-12-02 15:39:15,614 INFO  plugin.PluginRepository - HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2011-12-02 15:39:15,614 INFO  plugin.PluginRepository - Nutch Content Parser (org.apache.nutch.parse.Parser)
2011-12-02 15:39:15,614 INFO  plugin.PluginRepository - Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
2011-12-02 15:39:16,794 WARN  parse.ParseUtil - Unable to successfully parse content file:direct.pdf of type application/pdf
2011-12-02 15:39:16,885 WARN  parse.ParseResult - file:direct.pdf is not parsed successfully, filtering
bash-2.00$ echo $CLASSPATH
conf:lib/nutch-1.4.jar:lib/log4j-1.2.15.jar:lib/commons-logging-1.1.1.jar:lib/hadoop-core-0.20.2.jar:lib/oro-2.0.8.jar:lib/tika-core-0.10.jar:lib/slf4j-api-1.6.1.jar:lib/slf4j-log4j12-1.6.1.jar:.
===
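
Most of the hadoop.log above is plugin registration at INFO level; the two WARN entries are the ones that describe the failure. A small sketch for pulling entries of one level out of such a log, assuming the layout shown here (date, time, level as the first three whitespace-separated fields, one entry per line); `LogLevelFilter` is a hypothetical name:

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper; assumes the "date time LEVEL logger - message" layout. */
final class LogLevelFilter {
    private LogLevelFilter() {
    }

    /** Returns the lines whose third whitespace-separated field equals level. */
    static List<String> withLevel(List<String> lines, String level) {
        List<String> out = new ArrayList<>();
        for (String line : lines) {
            // Split into at most four fields: date, time, level, remainder.
            String[] fields = line.trim().split("\\s+", 4);
            if (fields.length >= 3 && fields[2].equals(level)) {
                out.add(line);
            }
        }
        return out;
    }
}
```

Calling `LogLevelFilter.withLevel(lines, "WARN")` on the lines of the log would keep only the `parse.ParseUtil` and `parse.ParseResult` entries.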


[jira] [Commented] (NUTCH-1206) tika parser of nutch 1.3 is failing to prcess pdfs

2011-12-02 Thread dibyendu ghosh (Commented) (JIRA)

[ https://issues.apache.org/jira/browse/NUTCH-1206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13161536#comment-13161536 ]

dibyendu ghosh commented on NUTCH-1206:
---

Thanks. I had to set the proxy host and URL in nutch-site.xml, and with that 
parsechecker works with Nutch 1.4. I'll work on fixing my conf settings (at 
this point I have no idea what is wrong, or what else is needed); any 
suggestions on that will be most welcome.


[jira] [Commented] (NUTCH-1206) tika parser of nutch 1.3 is failing to prcess pdfs

2011-12-01 Thread dibyendu ghosh (Commented) (JIRA)

[ https://issues.apache.org/jira/browse/NUTCH-1206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13161463#comment-13161463 ]

dibyendu ghosh commented on NUTCH-1206:
---

Tried with 1.4. It's still not working. 1.3 did not have the parsechecker 
option for the nutch script. 1.4 shows the following output:
===
bash-2.00$ bin/nutch parsechecker -dumpText https://issues.apache.org/jira/secure/attachment/12505323/direct.pdf
fetching: https://issues.apache.org/jira/secure/attachment/12505323/direct.pdf
Can't fetch URL successfully
===
This is after putting the above-mentioned conf setting in nutch-site.xml.

[jira] [Commented] (NUTCH-1206) tika parser of nutch 1.3 is failing to prcess pdfs

2011-11-28 Thread dibyendu ghosh (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13158366#comment-13158366
 ] 

dibyendu ghosh commented on NUTCH-1206:
---

Hi Chris,
I have attached the direct.pdf file. You can also test with any simple PDF, for 
example one exported from an OpenOffice document. The results are the same.

I noticed that Nutch 1.4 was released on 24th Nov. I will update after testing 
with it.

Thanks,
Dibyendu

[jira] [Commented] (NUTCH-1206) tika parser of nutch 1.3 is failing to prcess pdfs

2011-11-26 Thread Chris A. Mattmann (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13157654#comment-13157654
 ] 

Chris A. Mattmann commented on NUTCH-1206:
--

Hi Dibyendu,

Can you please post direct.pdf? Or send me the URL for it? You can use the 
bin/nutch org.apache.nutch.parse.ParserChecker program to evaluate whether or 
not Nutch will parse your content. You could also try upgrading to 1.4 and see 
if that helps.


[jira] [Commented] (NUTCH-1206) tika parser of nutch 1.3 is failing to prcess pdfs

2011-11-23 Thread Chris A. Mattmann (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13156508#comment-13156508
 ] 

Chris A. Mattmann commented on NUTCH-1206:
--

OK, I might be seeing this too with 1.3, and even 1.4. I'm going to look more 
into this but just wanted to assign it to myself since I'm interested in fixing 
this now.

[jira] [Commented] (NUTCH-1206) tika parser of nutch 1.3 is failing to prcess pdfs

2011-11-21 Thread Markus Jelsma (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13154180#comment-13154180
 ] 

Markus Jelsma commented on NUTCH-1206:
--

Have you tried the Nutch trunk or the most recent Tika as suggested? 

 tika parser of nutch 1.3 is failing to prcess pdfs
 --

 Key: NUTCH-1206
 URL: https://issues.apache.org/jira/browse/NUTCH-1206
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 1.3
 Environment: Solaris/Linux/Windows
Reporter: dibyendu ghosh

 Please refer to this message: 
 http://www.mail-archive.com/user%40nutch.apache.org/msg04315.html. Old 
 parse-pdf parser seems to be able to parse old pdfs (checked with nutch 1.2) 
 though it is not able to parse acrobat 9.0 version of pdfs. nutch 1.3 does 
 not have parse-pdf plugin and it is not able to parse even older pdfs.
 my code (TestParse.java):
 
 bash-2.00$ cat TestParse.java
 import java.io.File;
 import java.io.FileInputStream;
 import java.io.FileOutputStream;
 import java.io.PrintStream;
 import java.util.Iterator;
 import java.util.Map;
 import java.util.Map.Entry;
 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.io.Text;
 import org.apache.nutch.metadata.Metadata;
 import org.apache.nutch.parse.ParseResult;
 import org.apache.nutch.parse.Parse;
 import org.apache.nutch.parse.ParseStatus;
 import org.apache.nutch.parse.ParseUtil;
 import org.apache.nutch.parse.ParseData;
 import org.apache.nutch.protocol.Content;
 import org.apache.nutch.util.NutchConfiguration;
 public class TestParse {
     private static Configuration conf = NutchConfiguration.create();
     public TestParse() {
     }
     public static void main(String[] args) {
         String filename = args[0];
         convert(filename);
     }
     public static String convert(String fileName) {
         String newName = "abc.html";
         try {
             System.out.println("Converting " + fileName + " to html.");
             if (convertToHtml(fileName, newName))
                 return newName;
         } catch (Exception e) {
             (new File(newName)).delete();
             System.out.println("General exception " + e.getMessage());
         }
         return null;
     }
     private static boolean convertToHtml(String fileName, String newName)
             throws Exception {
         // Read the file
         FileInputStream in = new FileInputStream(fileName);
         byte[] buf = new byte[in.available()];
         in.read(buf);
         in.close();
         // Parse the file
         Content content = new Content("file:" + fileName, "file:" + fileName,
                 buf, "", new Metadata(), conf);
         ParseResult parseResult = new ParseUtil(conf).parse(content);
         parseResult.filter();
         if (parseResult.isEmpty()) {
             System.out.println("All parsing attempts failed");
             return false;
         }
         Iterator<Map.Entry<Text,Parse>> iterator = parseResult.iterator();
         if (iterator == null) {
             System.out.println("Cannot iterate over successful parse results");
             return false;
         }
         Parse parse = null;
         ParseData parseData = null;
         while (iterator.hasNext()) {
             parse = parseResult.get((Text)iterator.next().getKey());
             parseData = parse.getData();
             ParseStatus status = parseData.getStatus();
             // If Parse failed then bail
             if (!ParseStatus.STATUS_SUCCESS.equals(status)) {
                 System.out.println("Could not parse " + fileName + ". " +
                         status.getMessage());
                 return false;
             }
         }
         // Start writing to newName
         FileOutputStream fout = new FileOutputStream(newName);
         PrintStream out = new PrintStream(fout, true, "UTF-8");
         // Start Document
         out.println("<html>");
         // Start Header
         out.println("<head>");
         // Write Title
         String title = parseData.getTitle();
         if (title != null && title.trim().length() > 0) {
             out.println("<title>" + parseData.getTitle() + "</title>");
         }
         // Write out Meta tags
         Metadata metaData = parseData.getContentMeta();
         String[] names = metaData.names();
         for (String name : names) {
             String[] subvalues = metaData.getValues(name);
             String values = null;
             for (String subvalue : subvalues) {
                 values += subvalue;
             }
             if (values.length() > 0)
                 out.printf("<meta name=\"%s\" content=\"%s\"/>\n",
                         name, values);
         }
         out.println("<meta http-equiv=\"Content-Type\"
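
A side note on two steps in the quoted TestParse.java, separate from the parsing failure itself. In Java, an accumulator that starts as null and is extended with += yields a string beginning with the literal text "null". A single read() into a buffer sized by available() is also not guaranteed to return the whole file. The following is a minimal standalone sketch of both steps using only the JDK. The class TestParseHelpers and its methods readAll and joinValues are hypothetical names for illustration and are not part of Nutch or of the reporter's code. It assumes Java 7 or later for java.nio.file.Files.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper class for illustration only; not part of Nutch.
public class TestParseHelpers {

    // Reads the whole file into memory. Unlike a single read() into a buffer
    // sized by available(), this returns only after every byte has been read.
    public static byte[] readAll(String fileName) throws IOException {
        return Files.readAllBytes(Paths.get(fileName));
    }

    // Concatenates metadata values. A StringBuilder starts out empty, so the
    // result never begins with the text "null".
    public static String joinValues(String[] subvalues) {
        StringBuilder values = new StringBuilder();
        for (String subvalue : subvalues) {
            values.append(subvalue);
        }
        return values.toString();
    }

    public static void main(String[] args) throws IOException {
        // Join two sample values (made-up data, not real parser output).
        System.out.println(joinValues(new String[] {"application/", "pdf"}));
        // Optionally report the size of a file given on the command line.
        if (args.length > 0) {
            System.out.println(readAll(args[0]).length + " bytes read");
        }
    }
}
```

The byte array returned by readAll could be passed to the Content constructor in place of buf, and joinValues could replace the inner loop over subvalues. Neither change is claimed to affect whether Tika parses the PDF.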