[ https://issues.apache.org/jira/browse/NUTCH-472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12495229 ]
Sami Siren commented on NUTCH-472:
----------------------------------
> Not sure how to turn source code in description into a patch file, but the
> fixed "extractText" method was included earlier.
You can follow the instructions for creating patches on the Nutch wiki:
http://wiki.apache.org/nutch/Becoming_A_Nutch_Developer
> NullPointerException in ZipTextExtractor if no MIME type for zipped file
> ------------------------------------------------------------------------
>
> Key: NUTCH-472
> URL: https://issues.apache.org/jira/browse/NUTCH-472
> Project: Nutch
> Issue Type: Bug
> Components: indexer
> Affects Versions: 0.9.0
> Environment: Any
> Reporter: Antony Bowesman
>
> extractText throws a NullPointerException at
>   String contentType = MIME.getMimeType(fname).getName();
> if a file in the zip has no configured MIME type, which breaks the parsing
> of the whole zip.
> The code should instead do:
> public String extractText(InputStream input, String url, List outLinksList)
>     throws IOException {
>   String resultText = "";
>   byte temp;
>
>   ZipInputStream zin = new ZipInputStream(input);
>
>   ZipEntry entry;
>
>   while ((entry = zin.getNextEntry()) != null) {
>
>     if (!entry.isDirectory()) {
>       int size = (int) entry.getSize();
>       byte[] b = new byte[size];
>       for (int x = 0; x < size; x++) {
>         int err = zin.read();
>         if (err != -1) {
>           b[x] = (byte) err;
>         }
>       }
>       String newurl = url + "/";
>       String fname = entry.getName();
>       newurl += fname;
>       URL aURL = new URL(newurl);
>       String base = aURL.toString();
>       int i = fname.lastIndexOf('.');
>       if (i != -1) {
>         // Trying to resolve the Mime-Type
>         MimeType mt = MIME.getMimeType(fname);
>         if (mt != null) {
>           String contentType = mt.getName();
>           try {
>             Metadata metadata = new Metadata();
>             metadata.set(Response.CONTENT_LENGTH,
>                 Long.toString(entry.getSize()));
>             metadata.set(Response.CONTENT_TYPE, contentType);
>             Content content = new Content(newurl, base, b, contentType,
>                 metadata, this.conf);
>             Parse parse = new ParseUtil(this.conf).parse(content);
>             ParseData theParseData = parse.getData();
>             Outlink[] theOutlinks = theParseData.getOutlinks();
>
>             for (int count = 0; count < theOutlinks.length; count++) {
>               outLinksList.add(new Outlink(theOutlinks[count].getToUrl(),
>                   theOutlinks[count].getAnchor(), this.conf));
>             }
>
>             resultText += entry.getName() + " " + parse.getText() + " ";
>           } catch (ParseException e) {
>             if (LOG.isInfoEnabled()) {
>               LOG.info("fetch okay, but can't parse " + fname + ", reason: "
>                   + e.getMessage());
>             }
>           }
>         } else {
>           resultText += entry.getName();
>         }
>       }
>     }
>   }
>
>   return resultText;
> }
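
The null guard at the heart of the quoted fix can be illustrated in isolation. The following is only a minimal sketch, not Nutch code: the class MimeGuardSketch, the TYPES map, and the lookupType and describe methods are all hypothetical stand-ins for the MIME registry used by ZipTextExtractor. It assumes the lookup returns null for an unconfigured type, as the issue describes, and simplifies the fallback to always returning the bare entry name.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class MimeGuardSketch {

    // Hypothetical stand-in for a MIME registry: file extension -> type name.
    private static final Map<String, String> TYPES = new HashMap<>();
    static {
        TYPES.put("txt", "text/plain");
        TYPES.put("html", "text/html");
    }

    /** Returns the type name for fname, or null when no type is configured. */
    public static String lookupType(String fname) {
        int i = fname.lastIndexOf('.');
        if (i == -1) {
            return null; // no extension, so nothing to look up
        }
        return TYPES.get(fname.substring(i + 1).toLowerCase(Locale.ROOT));
    }

    /**
     * Checks the lookup result before dereferencing it. When no type
     * resolves, falls back to the bare entry name instead of throwing.
     */
    public static String describe(String fname) {
        String type = lookupType(fname);
        if (type != null) {
            return fname + " [" + type + "]";
        }
        return fname;
    }

    public static void main(String[] args) {
        System.out.println(describe("readme.txt"));
        System.out.println(describe("data.xyz"));
    }
}
```

Calling a method directly on the lookup result, as in the line reported in the issue, is what fails when the result is null; storing the result and testing it first lets the remaining entries in the zip still be processed.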
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.