ExcelExtractor performance bad due to String concatenation
----------------------------------------------------------
Key: NUTCH-473
URL: https://issues.apache.org/jira/browse/NUTCH-473
Project: Nutch
Issue Type: Improvement
Components: indexer
Affects Versions: 0.9.0
Environment: Tested under Windows, Java 1.5 and 1.6
Reporter: Antony Bowesman

Using the 0.9 version, ExcelExtractor was still running after 4 hours at 100%
CPU while trying to extract the text from a 3MB Excel file containing 26 sheets,
half with a matrix of approx 1100 rows x P columns and the others with approx
1000 rows x E columns.

After changing ExcelExtractor to use StringBuffer, the same extraction process
took 3 seconds under Java 1.5. Code changes below. The example uses a 4K buffer
per sheet; this was a completely arbitrary choice, but it keeps the number of
StringBuffer expansions low for large files without using too much space for
small files.

protected String extractText(InputStream input) throws Exception {
  String resultText = "";
  HSSFWorkbook wb = new HSSFWorkbook(input);
  if (wb == null) {
    return resultText;
  }
  HSSFSheet sheet;
  HSSFRow row;
  HSSFCell cell;
  int sNum = 0;
  int rNum = 0;
  int cNum = 0;
  sNum = wb.getNumberOfSheets();
  // Allow 4K per sheet - seems a reasonable start
  StringBuffer sb = new StringBuffer(4096 * sNum);
  for (int i=0; i<sNum; i++) {
    if ((sheet = wb.getSheetAt(i)) == null) {
      continue;
    }
    rNum = sheet.getLastRowNum();
    for (int j=0; j<=rNum; j++) {
      if ((row = sheet.getRow(j)) == null){
        continue;
      }
      cNum = row.getLastCellNum();
      for (int k=0; k<cNum; k++) {
        if ((cell = row.getCell((short) k)) != null) {
          /*if(HSSFDateUtil.isCellDateFormatted(cell) == true) {
              resultText += cell.getDateCellValue().toString() + " ";
            } else
          */
          if (cell.getCellType() == HSSFCell.CELL_TYPE_STRING) {
            sb.append(cell.getStringCellValue());
            sb.append(' ');
            // resultText += cell.getStringCellValue() + " ";
          } else if (cell.getCellType() == HSSFCell.CELL_TYPE_NUMERIC) {
            Double d = new Double(cell.getNumericCellValue());
            sb.append(d.toString());
            sb.append(' ');
            // resultText += d.toString() + " ";
          }
          /* else if(cell.getCellType() == HSSFCell.CELL_TYPE_FORMULA){
               resultText += cell.getCellFormula() + " ";
             }
          */
        }
      }
    }
  }
  return sb.toString();
}
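
For anyone who wants to see the effect outside of Nutch, here is a minimal standalone sketch of the two approaches. It does not use POI: the class name ConcatSketch, both method names, and the generated cell values are made up for illustration, and the 4096 initial capacity simply mirrors the arbitrary 4K choice above. With +=, every iteration creates a new String and copies everything accumulated so far, so the total work grows roughly with the square of the output size. With StringBuffer, the appends go into one growable buffer.

```java
import java.util.ArrayList;
import java.util.List;

public class ConcatSketch {

    // Repeated += : each pass allocates a new String and copies all
    // previously accumulated text, so the cost grows quadratically.
    static String joinWithConcat(List<String> cells) {
        String result = "";
        for (String cell : cells) {
            result += cell + " ";
        }
        return result;
    }

    // StringBuffer: appends write into one buffer that only reallocates
    // when its capacity is exceeded.
    static String joinWithBuffer(List<String> cells, int initialCapacity) {
        StringBuffer sb = new StringBuffer(initialCapacity);
        for (String cell : cells) {
            sb.append(cell);
            sb.append(' ');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical input standing in for spreadsheet cell values.
        List<String> cells = new ArrayList<String>();
        for (int i = 0; i < 20000; i++) {
            cells.add("cell" + i);
        }

        long t0 = System.nanoTime();
        String a = joinWithConcat(cells);
        long t1 = System.nanoTime();
        String b = joinWithBuffer(cells, 4096);
        long t2 = System.nanoTime();

        System.out.println("same output: " + a.equals(b));
        System.out.println("concat ms: " + (t1 - t0) / 1000000);
        System.out.println("buffer ms: " + (t2 - t1) / 1000000);
    }
}
```

Both methods return the same text, so only the running time differs. Actual timings will depend on the JVM and the input size. On Java 1.5 and later, StringBuilder offers the same append calls without synchronization and could be swapped in where the buffer is not shared between threads.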