[ https://issues.apache.org/jira/browse/TIKA-1132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13680747#comment-13680747 ]
Ryan Krueger commented on TIKA-1132:
------------------------------------

Running jvisualvm and pulling a thread dump I get the same trace each time:

"main" prio=10 tid=0x0000000000606800 nid=0x7799 runnable [0x00007fe26bf1d000]
   java.lang.Thread.State: RUNNABLE
        at org.apache.poi.ss.usermodel.DataFormatter$FractionFormat.format(DataFormatter.java:1009)
        at org.apache.poi.ss.usermodel.DataFormatter$FractionFormat.format(DataFormatter.java:1033)
        at java.text.Format.format(Format.java:157)
        at org.apache.poi.ss.usermodel.DataFormatter.formatRawCellContents(DataFormatter.java:699)
        at org.apache.poi.ss.usermodel.DataFormatter.formatRawCellContents(DataFormatter.java:669)
        at org.apache.poi.hssf.eventusermodel.FormatTrackingHSSFListener.formatNumberDateCell(FormatTrackingHSSFListener.java:129)
        at org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.internalProcessRecord(ExcelExtractor.java:419)
        at org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.processRecord(ExcelExtractor.java:323)
        at org.apache.poi.hssf.eventusermodel.FormatTrackingHSSFListener.processRecord(FormatTrackingHSSFListener.java:82)
        at org.apache.poi.hssf.eventusermodel.HSSFRequest.processRecord(HSSFRequest.java:112)
        at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.genericProcessEvents(HSSFEventFactory.java:147)
        at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.processEvents(HSSFEventFactory.java:106)
        at org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.processFile(ExcelExtractor.java:299)
        at org.apache.tika.parser.microsoft.ExcelExtractor.parse(ExcelExtractor.java:151)
        at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:194)
        at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:161)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
        at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:139)
        at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:415)
        at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:109)

Looking at POI 3.8 in grepcode I see the affected code. The methods appear to be unchanged in 3.9. I don't know what's causing the issue as it doesn't immediately appear to me to be an infinite loop. Here is the apparent section from org.apache.poi.ss.usermodel.DataFormatter.

1005     double minVal = 1.0;
1006     double currDenom = Math.pow(10 , fractParts[1].length()) - 1d;
1007     double currNeum = 0;
1008     for (int i = (int)(Math.pow(10, fractParts[1].length())- 1d); i > 0; i--) {
1009         for(int i2 = (int)(Math.pow(10, fractParts[1].length())- 1d); i2 > 0; i2--){
1010             if (minVal >= Math.abs((double)i2/(double)i - decPart)) {
1011                 currDenom = i;
1012                 currNeum = i2;
1013                 minVal = Math.abs((double)i2/(double)i - decPart);
1014             }
1015         }
1016     }

> Parsing some XLS documents hangs entire JVM, requires kill -9
> -------------------------------------------------------------
>
>                 Key: TIKA-1132
>                 URL: https://issues.apache.org/jira/browse/TIKA-1132
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.2, 1.3
>         Environment: Linux Suse:
> java version "1.7.0"
> Java(TM) SE Runtime Environment (build 1.7.0-b147)
> Java HotSpot(TM) 64-Bit Server VM (build 21.0-b17, mixed mode)
> OSX 10.8.3:
> java version "1.7.0_06"
> Java(TM) SE Runtime Environment (build 1.7.0_06-b24)
> Java HotSpot(TM) 64-Bit Server VM (build 23.2-b09, mixed mode)
>            Reporter: Ryan Krueger
>             Fix For: 1.1
>
>         Attachments: mod.xls
>
>
> Some XLS documents hang the entire JVM. A control-C or regular kill won't
> stop the JVM, a kill -9 is required.
> We're running within an email server application parsing documents to extract
> text of all attachments. When we hit a message with the affected attachment
> the entire JVM hangs and we mark the message to skip extracting the text from
> the affected message the next attempt.
> Unfortunately, it kills all email processing on the server until the
> internal watchdogs kill -9 the application.
> We have seen the issue for several months with different documents, but they
> are always Excel files. Some, but not all, trigger complaints from Excel
> when opened.
> In addition to experiencing the problem on our Linux servers, I have tested
> on OSX and experienced the same problem. I ran the Tika UI and selected the
> affected file, and also ran the CLI. The problem is the same.
> Tested with java -jar /path/to/tika-app-1.3.jar -t /path/to/file.xls
> When running on multi-CPU machines there are two threads running at 100%
> every time.
> I have attached a document that triggers the error.
> I have tested with 1.2 and 1.3 with the same result. Running 1.1, the text
> is accurately extracted.
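
On the question raised in the comment of whether the quoted DataFormatter loop is infinite: one possible reading, offered as a sketch and not as a confirmed diagnosis, is that the loop is finite but can be extremely long. Both nested loops count down from (int)(10^n - 1), where n is fractParts[1].length(), so the body runs (10^n - 1)^2 times. The class and method names below are hypothetical, and the value of n for mod.xls is not known from this report; the code only reproduces the loop-bound arithmetic from the quoted lines.

```java
// Hypothetical helper: estimates how often the inner loop body in the quoted
// DataFormatter snippet executes for a given fractParts[1].length().
class FractionLoopCost {

    // Same expression as the loop initialisers in the quoted snippet.
    // Note: casting a double above Integer.MAX_VALUE to int saturates.
    static int bound(int partLength) {
        return (int) (Math.pow(10, partLength) - 1d);
    }

    // Outer and inner loops share the same bound, so the body runs bound^2 times.
    static long iterations(int partLength) {
        long b = bound(partLength);
        return b * b;
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 10; n++) {
            System.out.printf("length %2d -> bound %d, iterations %d%n",
                    n, bound(n), iterations(n));
        }
    }
}
```

Under this reading, a length of 3 costs under a million iterations, while a length of 10 or more saturates the bound at Integer.MAX_VALUE and gives roughly 4.6e18 iterations, which would be indistinguishable from a hang in practice.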