NullPointerException in ZipTextExtractor if no MIME type for zipped file
------------------------------------------------------------------------
Key: NUTCH-472
URL: https://issues.apache.org/jira/browse/NUTCH-472
Project: Nutch
Issue Type: Bug
Components: indexer
Affects Versions: 0.9.0
Environment: Any
Reporter: Antony Bowesman
extractText throws a NullPointerException at
    String contentType = MIME.getMimeType(fname).getName();
when a file in the zip has no configured MIME type, because getMimeType
returns null for it. This breaks the parsing of the zip.
The code should check for null before calling getName():
public String extractText(InputStream input, String url, List outLinksList)
    throws IOException {
  String resultText = "";
  byte temp;
  ZipInputStream zin = new ZipInputStream(input);
  ZipEntry entry;
  while ((entry = zin.getNextEntry()) != null) {
    if (!entry.isDirectory()) {
      int size = (int) entry.getSize();
      byte[] b = new byte[size];
      for (int x = 0; x < size; x++) {
        int err = zin.read();
        if (err != -1) {
          b[x] = (byte) err;
        }
      }
      String newurl = url + "/";
      String fname = entry.getName();
      newurl += fname;
      URL aURL = new URL(newurl);
      String base = aURL.toString();
      int i = fname.lastIndexOf('.');
      if (i != -1) {
        // Trying to resolve the Mime-Type
        MimeType mt = MIME.getMimeType(fname);
        if (mt != null) {
          String contentType = mt.getName();
          try {
            Metadata metadata = new Metadata();
            metadata.set(Response.CONTENT_LENGTH,
                Long.toString(entry.getSize()));
            metadata.set(Response.CONTENT_TYPE, contentType);
            Content content = new Content(newurl, base, b, contentType,
                metadata, this.conf);
            Parse parse = new ParseUtil(this.conf).parse(content);
            ParseData theParseData = parse.getData();
            Outlink[] theOutlinks = theParseData.getOutlinks();
            for (int count = 0; count < theOutlinks.length; count++) {
              outLinksList.add(new Outlink(theOutlinks[count].getToUrl(),
                  theOutlinks[count].getAnchor(), this.conf));
            }
            resultText += entry.getName() + " " + parse.getText() + " ";
          } catch (ParseException e) {
            if (LOG.isInfoEnabled()) {
              LOG.info("fetch okay, but can't parse " + fname + ", reason: " +
                  e.getMessage());
            }
          }
        } else {
          resultText += entry.getName();
        }
      }
    }
  }
  return resultText;
}
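
The null check at the heart of the proposed fix can be tried in isolation. The following is only a minimal sketch and does not use the Nutch API: the class MimeGuardSketch, its TYPES map and the methods lookupType and describe are hypothetical stand-ins for MIME.getMimeType and the surrounding logic, assuming only that the lookup returns null when no type is configured.

```java
import java.util.HashMap;
import java.util.Map;

public class MimeGuardSketch {
    // Hypothetical stand-in for a MIME registry: file extension -> type name.
    private static final Map<String, String> TYPES = new HashMap<>();
    static {
        TYPES.put("txt", "text/plain");
        TYPES.put("html", "text/html");
    }

    // Returns the configured type name, or null when the file has no
    // extension or the extension is not configured.
    public static String lookupType(String fname) {
        int i = fname.lastIndexOf('.');
        if (i == -1) {
            return null;
        }
        return TYPES.get(fname.substring(i + 1));
    }

    // Guarded use of the lookup: check for null before using the result,
    // and fall back to the bare file name when no type is known.
    public static String describe(String fname) {
        String type = lookupType(fname);
        if (type == null) {
            return fname;
        }
        return fname + " [" + type + "]";
    }

    public static void main(String[] args) {
        System.out.println(describe("readme.txt")); // prints "readme.txt [text/plain]"
        System.out.println(describe("data.xyz"));   // prints "data.xyz"
    }
}
```

Calling a method directly on the lookup result, for example lookupType("data.xyz").length(), would throw a NullPointerException in the same way as the original chained call.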