tika parser of nutch 1.3 is failing to process pdfs
---------------------------------------------------

                 Key: NUTCH-1206
                 URL: https://issues.apache.org/jira/browse/NUTCH-1206
             Project: Nutch
          Issue Type: Bug
          Components: parser
    Affects Versions: 1.3
         Environment: Solaris/Linux/Windows
            Reporter: dibyendu ghosh


Please refer to this message:
http://www.mail-archive.com/user%40nutch.apache.org/msg04315.html. The old
parse-pdf parser appears able to parse older PDFs (checked with Nutch 1.2),
although it cannot parse PDFs created with Acrobat 9.0. Nutch 1.3 no longer
has the parse-pdf plugin, and it fails to parse even the older PDFs.
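
Since the behaviour differs between older PDFs and Acrobat 9.0 PDFs, it can help to record which PDF version each test file declares. The following is a minimal sketch that reads the `%PDF-x.y` header from the start of a file. It uses only the JDK; the class name `PdfVersionCheck` and its method are hypothetical and are not part of Nutch or Tika.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

public class PdfVersionCheck {

    // Returns the version from a leading "%PDF-x.y" header, or null if the
    // bytes do not start with such a header. (Hypothetical helper.)
    public static String pdfVersion(byte[] head) {
        String s = new String(head, StandardCharsets.ISO_8859_1);
        if (!s.startsWith("%PDF-")) {
            return null;
        }
        int end = 5;
        while (end < s.length()
                && (Character.isDigit(s.charAt(end)) || s.charAt(end) == '.')) {
            end++;
        }
        return end > 5 ? s.substring(5, end) : null;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java PdfVersionCheck <file.pdf>");
            return;
        }
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            byte[] head = new byte[16];
            int n = Math.max(in.read(head), 0); // read() returns -1 on an empty file
            String version = pdfVersion(Arrays.copyOf(head, n));
            System.out.println(args[0] + ": "
                    + (version == null ? "no PDF header found" : "PDF " + version));
        }
    }
}
```

Running it against both an older PDF and an Acrobat 9.0 PDF would show whether the failing files share a declared version.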

my code (TestParse.java):
----------------------------
bash-2.00$ cat TestParse.java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.PrintStream;
import java.util.Iterator;
import java.util.Map;
import java.util.Map.Entry;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.metadata.Metadata;
import org.apache.nutch.parse.ParseResult;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.ParseStatus;
import org.apache.nutch.parse.ParseUtil;
import org.apache.nutch.parse.ParseData;
import org.apache.nutch.protocol.Content;
import org.apache.nutch.util.NutchConfiguration;

public class TestParse {

    private static Configuration conf = NutchConfiguration.create();

    public TestParse() {
    }

    public static void main(String[] args) {
        String filename = args[0];
        convert(filename);
    }

    public static String convert(String fileName) {
        String newName = "abc.html";

        try {
            System.out.println("Converting " + fileName + " to html.");
            if (convertToHtml(fileName, newName))
                return newName;
        } catch (Exception e) {
            (new File(newName)).delete();
            System.out.println("General exception " + e.getMessage());
        }
        return null;
    }

    private static boolean convertToHtml(String fileName, String newName)
        throws Exception {
        // Read the file
        FileInputStream in = new FileInputStream(fileName);
        byte[] buf = new byte[in.available()];
        in.read(buf);
        in.close();

        // Parse the file
        Content content = new Content("file:" + fileName, "file:" + fileName,
                                      buf, "", new Metadata(), conf);
        ParseResult parseResult = new ParseUtil(conf).parse(content);
        parseResult.filter();
        if (parseResult.isEmpty()) {
            System.out.println("All parsing attempts failed");
            return false;
        }
        Iterator<Map.Entry<Text,Parse>> iterator = parseResult.iterator();
        if (iterator == null) {
            System.out.println("Cannot iterate over successful parse results");
            return false;
        }
        Parse parse = null;
        ParseData parseData = null;
        while (iterator.hasNext()) {
            parse = parseResult.get((Text)iterator.next().getKey());
            parseData = parse.getData();
            ParseStatus status = parseData.getStatus();

            // If Parse failed then bail
            if (!ParseStatus.STATUS_SUCCESS.equals(status)) {
                System.out.println("Could not parse " + fileName + ". " +
                            status.getMessage());
                return false;
            }
        }

        // Start writing to newName
        FileOutputStream fout = new FileOutputStream(newName);
        PrintStream out = new PrintStream(fout, true, "UTF-8");

        // Start Document
        out.println("<html>");

        // Start Header
        out.println("<head>");

        // Write Title
        String title = parseData.getTitle();
        if (title != null && title.trim().length() > 0) {
            out.println("<title>" + parseData.getTitle() + "</title>");
        }

        // Write out Meta tags
        Metadata metaData = parseData.getContentMeta();
        String[] names = metaData.names();
        for (String name : names) {
            String[] subvalues = metaData.getValues(name);
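            // Start from an empty buffer: concatenating onto null would
            // prefix the output with the literal text "null"
            StringBuilder values = new StringBuilder();
            for (String subvalue : subvalues) {
                values.append(subvalue);
            }
            if (values.length() > 0)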
                out.printf("<meta name=\"%s\" content=\"%s\"/>\n",
                           name, values);
        }
        out.println("<meta http-equiv=\"Content-Type\" content=\"text/html;charset=UTF-8\"/>");
        // End Meta tags

        out.println("</head>"); // End Header

        // Start Body
        out.println("<body>");
        out.print(parse.getText());
        out.println("</body>"); // End Body

        out.println("</html>"); // End Document

        out.close(); // Close the file

        return true;
    }
}
----------------------------
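
One detail of the test program that may be worth ruling out: `convertToHtml` sizes its buffer with `in.available()` and calls `read` once, and neither call is guaranteed to cover the whole file. A minimal sketch of a loop that reads until end of stream is shown below. The class name `ReadFully` is hypothetical, and this is only a way to rule out a truncated buffer, not a claimed cause of the parse failure.

```java
import java.io.ByteArrayOutputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class ReadFully {

    // Reads every byte from the stream until end of stream is reached.
    public static byte[] readFully(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            out.write(chunk, 0, n);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java ReadFully <file>");
            return;
        }
        InputStream in = new FileInputStream(args[0]);
        try {
            byte[] buf = readFully(in);
            System.out.println("Read " + buf.length + " bytes from " + args[0]);
        } finally {
            in.close();
        }
    }
}
```

The returned array could be passed to the `Content` constructor in place of `buf`.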

command:
======
bash-2.00$ java -classpath conf:runtime/local/lib/nutch-1.3.jar:runtime/local/lib/hadoop-core-0.20.2.jar:runtime/local/lib/commons-logging-api-1.0.4.jar:runtime/local/lib/tika-core-0.9.jar:runtime/local/lib/log4j-1.2.15.jar:runtime/local/lib/oro-2.0.8.jar:. TestParse direct.pdf
======

output:
_____
Converting direct.pdf to html.
Oct 19, 2011 5:05:19 PM org.apache.hadoop.conf.Configuration getConfResourceAsInputStream
INFO: found resource tika-mimetypes.xml at file:/path/to/nutch/1.3/conf/tika-mimetypes.xml
Oct 19, 2011 5:05:20 PM org.apache.nutch.plugin.PluginManifestParser parsePluginFolder
INFO: Plugins: looking in: /path/to/nutch/1.3/runtime/local/plugins
Oct 19, 2011 5:05:20 PM org.apache.nutch.plugin.PluginRepository displayStatus
INFO: Plugin Auto-activation mode: [true]
Oct 19, 2011 5:05:20 PM org.apache.nutch.plugin.PluginRepository displayStatus
INFO: Registered Plugins:
Oct 19, 2011 5:05:20 PM org.apache.nutch.plugin.PluginRepository displayStatus
INFO:   the nutch core extension points (nutch-extensionpoints)
Oct 19, 2011 5:05:20 PM org.apache.nutch.plugin.PluginRepository displayStatus
INFO:   Tika Parser Plug-in (parse-tika)
Oct 19, 2011 5:05:20 PM org.apache.nutch.plugin.PluginRepository displayStatus
INFO: Registered Extension-Points:
Oct 19, 2011 5:05:20 PM org.apache.nutch.plugin.PluginRepository displayStatus
INFO:   Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
Oct 19, 2011 5:05:20 PM org.apache.nutch.plugin.PluginRepository displayStatus
INFO:   Nutch Protocol (org.apache.nutch.protocol.Protocol)
Oct 19, 2011 5:05:20 PM org.apache.nutch.plugin.PluginRepository displayStatus
INFO:   Nutch Segment Merge Filter (org.apache.nutch.segment.SegmentMergeFilter)
Oct 19, 2011 5:05:20 PM org.apache.nutch.plugin.PluginRepository displayStatus
INFO:   Nutch URL Filter (org.apache.nutch.net.URLFilter)
Oct 19, 2011 5:05:20 PM org.apache.nutch.plugin.PluginRepository displayStatus
INFO:   Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
Oct 19, 2011 5:05:20 PM org.apache.nutch.plugin.PluginRepository displayStatus
INFO:   HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
Oct 19, 2011 5:05:20 PM org.apache.nutch.plugin.PluginRepository displayStatus
INFO:   Nutch Content Parser (org.apache.nutch.parse.Parser)
Oct 19, 2011 5:05:20 PM org.apache.nutch.plugin.PluginRepository displayStatus
INFO:   Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
Oct 19, 2011 5:05:20 PM org.apache.hadoop.conf.Configuration getConfResourceAsInputStream
INFO: found resource parse-plugins.xml at file:/path/to/nutch/1.3/conf/parse-plugins.xml
Oct 19, 2011 5:05:20 PM org.apache.nutch.parse.ParserFactory matchExtensions
INFO: The parsing plugins: [org.apache.nutch.parse.tika.TikaParser] are enabled via the plugin.includes system property, and all claim to support the content type application/pdf, but they are not mapped to it  in the parse-plugins.xml file
Oct 19, 2011 5:05:21 PM org.apache.nutch.parse.ParseUtil parse
WARNING: Unable to successfully parse content file:direct.pdf of type application/pdf
Oct 19, 2011 5:05:21 PM org.apache.nutch.parse.ParseResult filter
WARNING: file:direct.pdf is not parsed successfully, filtering
All parsing attempts failed
_____

my customized nutch-site.xml:
~~~~~~~~~~~~~~~~~~~~
bash-2.00$ cat conf/nutch-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>



<configuration>
  <property>
    <name>plugin.folders</name>
    <value>runtime/local/plugins</value>
    <description>Directories where nutch plugins are located.  Each
    element may be a relative or absolute path.  If absolute, it is used
    as is.  If relative, it is searched for on the classpath.</description>
  </property>
  <property>
    <name>plugin.includes</name>
    <value>parse-tika</value>
    <description>Regular expression naming plugin directory names to
    include. Any plugin not matching this expression is excluded.
    </description>
  </property>
</configuration>
~~~~~~~~~~~~~~~~~~~~
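
The plugin.includes description above says the value is a regular expression applied to plugin directory names. The following is a minimal sketch of what that means for the value `parse-tika`, assuming whole-name matching with `java.util.regex`. The class `PluginIncludesDemo` and the method `isIncluded` are hypothetical illustrations, not Nutch's own code, and the plugin names other than `parse-tika` are used only as sample inputs.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class PluginIncludesDemo {

    // Assumption: a plugin is included only when the whole directory name
    // matches the plugin.includes expression.
    public static boolean isIncluded(String includesRegex, String pluginId) {
        return Pattern.compile(includesRegex).matcher(pluginId).matches();
    }

    public static void main(String[] args) {
        String includes = "parse-tika"; // value from the nutch-site.xml above
        List<String> ids = Arrays.asList("parse-tika", "parse-html", "protocol-file");
        for (String id : ids) {
            System.out.println(id + " included: " + isIncluded(includes, id));
        }
    }
}
```

Under that assumption only `parse-tika` is selected, which is consistent with the "Registered Plugins" lines in the output listing only the Tika parser alongside the core extension points.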


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
