I have a 3MB xls, with 26 sheets. Half have a matrix of approx 1100xP and the
others have approx 1000xE.
Using the v0.9 ExcelExtractor, I left it extracting text on a reasonably
powerful machine @ 100% CPU (Java 1.6). Just over 4 hours later it was still
going!!
I finally gave up waiting and stopped it.
Having changed the extractor to use StringBuffer, it takes 3 seconds to extract
the 1088233 characters of text. Changes to extractText() below if wanted.
Antony
protected String extractText(InputStream input) throws Exception {
String resultText = "";
HSSFWorkbook wb = new HSSFWorkbook(input);
if (wb == null) {
return resultText;
}
HSSFSheet sheet;
HSSFRow row;
HSSFCell cell;
int sNum = 0;
int rNum = 0;
int cNum = 0;
sNum = wb.getNumberOfSheets();
// Allow 4K per sheet - seems a reasonable start
StringBuffer sb = new StringBuffer(4096 * sNum);
for (int i=0; i<sNum; i++) {
if ((sheet = wb.getSheetAt(i)) == null) {
continue;
}
rNum = sheet.getLastRowNum();
for (int j=0; j<=rNum; j++) {
if ((row = sheet.getRow(j)) == null){
continue;
}
cNum = row.getLastCellNum();
for (int k=0; k<cNum; k++) {
if ((cell = row.getCell((short) k)) != null) {
/*if(HSSFDateUtil.isCellDateFormatted(cell) == true) {
resultText += cell.getDateCellValue().toString() + " ";
} else
*/
if (cell.getCellType() == HSSFCell.CELL_TYPE_STRING) {
sb.append(cell.getStringCellValue());
sb.append(' ');
} else if (cell.getCellType() == HSSFCell.CELL_TYPE_NUMERIC) {
Double d = new Double(cell.getNumericCellValue());
sb.append(d.toString());
sb.append(' ');
}
/* else if(cell.getCellType() == HSSFCell.CELL_TYPE_FORMULA){
resultText += cell.getCellFormula() + " ";
}
*/
}
}
}
}
return sb.toString();
-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general