[jira] [Resolved] (TIKA-1133) Ability to Allow Empty and Duplicate Tika Values for XML Elements
[ https://issues.apache.org/jira/browse/TIKA-1133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ray Gauss II resolved TIKA-1133.
--------------------------------
    Resolution: Fixed
    Fix Version/s: 1.4

Resolved in r1491680.

> Ability to Allow Empty and Duplicate Tika Values for XML Elements
> -----------------------------------------------------------------
>
>                 Key: TIKA-1133
>                 URL: https://issues.apache.org/jira/browse/TIKA-1133
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.3
>            Reporter: Ray Gauss II
>            Assignee: Ray Gauss II
>             Fix For: 1.4
>
>
> In some cases it is beneficial to allow empty and duplicate Tika metadata
> values for multi-valued XML elements like RDF bags.
> Consider an example where the original source metadata is structured
> something like:
> {code}
>
> John
> Smith
>
>
> Jane
> Doe
>
>
> Bob
>
>
> Kate
> Smith
>
> {code}
> and since Tika stores only flat metadata we transform that before invoking a
> parser to something like:
> {code}
>
>
> John
> Jane
> Bob
> Kate
>
>
>
>
> Smith
> Doe
>
> Smith
>
>
> {code}
> The current behavior ignores empties and duplicates and we don't know if Bob
> or Kate ever had last names. Empties or duplicates in other positions result
> in an incorrect mapping of data.
> We should allow the option to create an {{ElementMetadataHandler}} which
> allows empty and/or duplicate values.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (TIKA-1133) Ability to Allow Empty and Duplicate Tika Values for XML Elements
Ray Gauss II created TIKA-1133:
-------------------------------

             Summary: Ability to Allow Empty and Duplicate Tika Values for XML Elements
                 Key: TIKA-1133
                 URL: https://issues.apache.org/jira/browse/TIKA-1133
             Project: Tika
          Issue Type: Improvement
          Components: parser
    Affects Versions: 1.3
            Reporter: Ray Gauss II
            Assignee: Ray Gauss II

In some cases it is beneficial to allow empty and duplicate Tika metadata values for multi-valued XML elements like RDF bags.

Consider an example where the original source metadata is structured something like:
{code}
John
Smith

Jane
Doe

Bob

Kate
Smith
{code}
and since Tika stores only flat metadata we transform that before invoking a parser to something like:
{code}
John
Jane
Bob
Kate

Smith
Doe

Smith
{code}
The current behavior ignores empties and duplicates and we don't know if Bob or Kate ever had last names. Empties or duplicates in other positions result in an incorrect mapping of data.

We should allow the option to create an {{ElementMetadataHandler}} which allows empty and/or duplicate values.
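To make the mapping problem concrete, here is a minimal plain-Java sketch. It does not use Tika's API: `FlatMetadataSketch`, `dropEmptiesAndDuplicates`, and `pairByIndex` are hypothetical names, and the filter only stands in for the "ignores empties and duplicates" behavior described in the issue. It pairs the flattened first-name and last-name lists by position, once with every value kept and once after filtering.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;

// Hypothetical illustration only; this is not Tika code.
final class FlatMetadataSketch {

    /** Stand-in for the behavior described in the issue: drop empty and repeated values. */
    static List<String> dropEmptiesAndDuplicates(List<String> values) {
        LinkedHashSet<String> kept = new LinkedHashSet<>(); // keeps first-seen order
        for (String value : values) {
            if (!value.isEmpty()) {
                kept.add(value);
            }
        }
        return new ArrayList<>(kept);
    }

    /** Pairs the two flat lists by index; "?" marks a position with no value at all. */
    static List<String> pairByIndex(List<String> first, List<String> last) {
        List<String> people = new ArrayList<>();
        for (int i = 0; i < first.size(); i++) {
            String lastName = i < last.size() ? last.get(i) : "?";
            people.add((first.get(i) + " " + lastName).trim());
        }
        return people;
    }

    public static void main(String[] args) {
        List<String> first = Arrays.asList("John", "Jane", "Bob", "Kate");
        List<String> last = Arrays.asList("Smith", "Doe", "", "Smith");

        // Every value kept: positions line up.
        System.out.println(pairByIndex(first, last));
        // prints [John Smith, Jane Doe, Bob, Kate Smith]

        // Empties and duplicates dropped: only [Smith, Doe] survive.
        System.out.println(pairByIndex(first, dropEmptiesAndDuplicates(last)));
        // prints [John Smith, Jane Doe, Bob ?, Kate ?]
    }
}
```

With every value kept, the empty string holds Bob's slot and the second "Smith" lands on Kate. After filtering, nothing is left at Bob's or Kate's index. If the empty value sat earlier in the list, the later last names would shift onto the wrong people.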
[jira] [Commented] (TIKA-1132) Parsing some XLS documents hangs entire JVM, requires kill -9
[ https://issues.apache.org/jira/browse/TIKA-1132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13680021#comment-13680021 ]

Nick Burch commented on TIKA-1132:
----------------------------------

I can confirm that it goes into an infinite loop for me too.

Any chance that you could run it in a profiler or similar, and track down where the loop is happening? (My hunch is it'll be an edge case in POI / POI not handling a subtle form of corruption.)

> Parsing some XLS documents hangs entire JVM, requires kill -9
> -------------------------------------------------------------
>
>                 Key: TIKA-1132
>                 URL: https://issues.apache.org/jira/browse/TIKA-1132
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.2, 1.3
>         Environment: Linux Suse:
> java version "1.7.0"
> Java(TM) SE Runtime Environment (build 1.7.0-b147)
> Java HotSpot(TM) 64-Bit Server VM (build 21.0-b17, mixed mode)
> OSX 10.8.3:
> java version "1.7.0_06"
> Java(TM) SE Runtime Environment (build 1.7.0_06-b24)
> Java HotSpot(TM) 64-Bit Server VM (build 23.2-b09, mixed mode)
>            Reporter: Ryan Krueger
>             Fix For: 1.1
>
>         Attachments: mod.xls
>
>
> Some XLS documents hang the entire JVM. A control-C or regular kill won't
> stop the JVM, a kill -9 is required.
> We're running within an email server application parsing documents to extract
> text of all attachments. When we hit a message with the affected attachment
> the entire JVM hangs and we mark the message to skip extracting the text from
> the affected message the next attempt. Unfortunately, it kills all email
> processing on the server until the internal watchdogs kill -9 the application.
> We have seen the issue for several months with different documents, but they
> are always Excel files. Some get complaints from Excel when opening but not
> all.
> In addition to experiencing the problem on our Linux servers I have tested on
> OSX and experienced the same problems. I ran the Tika UI and select the
> affected file or run the CLI. The problem is the same.
> Tested with java -jar /path/to/tika-app-1.3.jar -t /path/to/file.xls
> When running on multi-CPU machines there are two threads running at 100%
> every time.
> I have attached a document that triggers the error.
> I have tested with 1.2 and 1.3 with the same result. Running 1.1 the text is
> accurately extracted.
[jira] [Comment Edited] (TIKA-1132) Parsing some XLS documents hangs entire JVM, requires kill -9
[ https://issues.apache.org/jira/browse/TIKA-1132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13680012#comment-13680012 ]

Ryan Krueger edited comment on TIKA-1132 at 6/10/13 10:51 PM:
--------------------------------------------------------------

This file triggers the error.

was (Author: mctoon):
This is not the original file, but after removing private information the error still occurs.
[jira] [Updated] (TIKA-1132) Parsing some XLS documents hangs entire JVM, requires kill -9
[ https://issues.apache.org/jira/browse/TIKA-1132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ryan Krueger updated TIKA-1132:
-------------------------------

Description:
Some XLS documents hang the entire JVM. A control-C or regular kill won't stop the JVM, a kill -9 is required.

We're running within an email server application parsing documents to extract text of all attachments. When we hit a message with the affected attachment the entire JVM hangs and we mark the message to skip extracting the text from the affected message the next attempt. Unfortunately, it kills all email processing on the server until the internal watchdogs kill -9 the application.

We have seen the issue for several months with different documents, but they are always Excel files. Some get complaints from Excel when opening but not all.

In addition to experiencing the problem on our Linux servers I have tested on OSX and experienced the same problems. I ran the Tika UI and select the affected file or run the CLI. The problem is the same.

Tested with java -jar /path/to/tika-app-1.3.jar -t /path/to/file.xls

When running on multi-CPU machines there are two threads running at 100% every time.

I have attached a document that triggers the error.

I have tested with 1.2 and 1.3 with the same result. Running 1.1 the text is accurately extracted.

was:
Some XLS documents hang the entire JVM. A control-C or regular kill won't stop the JVM, a kill -9 is required.

We're running within an email server application parsing documents to extract text of all attachments. When we hit a message with the affected attachment the entire JVM hangs and we mark the message to skip extracting the text from the affected message the next attempt. Unfortunately, it kills all email processing on the server until the internal watchdogs kill -9 the application.

We have seen the issue for several months with different documents, but they are always Excel files. Some get complaints from Excel when opening but not all.

In addition to experiencing the problem on our Linux servers I have tested on OSX and experienced the same problems. I ran the Tika UI and select the affected file or run the CLI. The problem is the same.

Tested with java -jar /path/to/tika-app-1.3.jar -t /path/to/file.xls

When running on multi-CPU machines there are two threads running at 100% every time.

Unfortunately, so far I am unable to post any affected documents publicly due to privacy concerns, however, I can provide samples privately for testing.

I have tested with 1.2 and 1.3 with the same result. Running 1.1 the text is accurately extracted.
[jira] [Updated] (TIKA-1132) Parsing some XLS documents hangs entire JVM, requires kill -9
[ https://issues.apache.org/jira/browse/TIKA-1132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ryan Krueger updated TIKA-1132:
-------------------------------
    Attachment: mod.xls

This is not the original file, but after removing private information the error still occurs.
[jira] [Created] (TIKA-1132) Parsing some XLS documents hangs entire JVM, requires kill -9
Ryan Krueger created TIKA-1132:
-------------------------------

             Summary: Parsing some XLS documents hangs entire JVM, requires kill -9
                 Key: TIKA-1132
                 URL: https://issues.apache.org/jira/browse/TIKA-1132
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 1.3, 1.2
         Environment: Linux Suse:
java version "1.7.0"
Java(TM) SE Runtime Environment (build 1.7.0-b147)
Java HotSpot(TM) 64-Bit Server VM (build 21.0-b17, mixed mode)
OSX 10.8.3:
java version "1.7.0_06"
Java(TM) SE Runtime Environment (build 1.7.0_06-b24)
Java HotSpot(TM) 64-Bit Server VM (build 23.2-b09, mixed mode)
            Reporter: Ryan Krueger
             Fix For: 1.1

Some XLS documents hang the entire JVM. A control-C or regular kill won't stop the JVM; a kill -9 is required.

We're running within an email server application, parsing documents to extract the text of all attachments. When we hit a message with an affected attachment, the entire JVM hangs, and we mark the message so that text extraction is skipped for it on the next attempt. Unfortunately, the hang stops all email processing on the server until the internal watchdogs kill -9 the application.

We have seen the issue for several months with different documents, but they are always Excel files. Some, but not all, trigger complaints from Excel when opened.

In addition to experiencing the problem on our Linux servers, I have tested on OSX and experienced the same problems. Whether I select the affected file in the Tika UI or run the CLI, the problem is the same.

Tested with: java -jar /path/to/tika-app-1.3.jar -t /path/to/file.xls

When running on multi-CPU machines, there are two threads running at 100% every time.

Unfortunately, so far I am unable to post any affected documents publicly due to privacy concerns; however, I can provide samples privately for testing.

I have tested with 1.2 and 1.3 with the same result. With 1.1, the text is extracted accurately.
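Until the underlying loop is found, one possible mitigation is to keep the parse out of the main JVM. The sketch below is an assumption, not something Tika provides: it launches the same CLI command quoted in the report as a child process and force-kills that process if it has not exited within a timeout. `IsolatedExtract` and `runWithTimeout` are hypothetical names, the 60-second limit is arbitrary, and `Process.waitFor(long, TimeUnit)` and `destroyForcibly()` need Java 8 or later.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical workaround sketch; not part of Tika.
final class IsolatedExtract {

    /**
     * Runs the command in a child process, writing its output to outputFile.
     * Returns true if the process exited within the timeout, false if it had to be killed.
     */
    static boolean runWithTimeout(List<String> command, File outputFile, long timeoutSeconds) {
        ProcessBuilder builder = new ProcessBuilder(command);
        builder.redirectErrorStream(true);   // merge stderr into stdout
        builder.redirectOutput(outputFile);  // write to a file so no pipe buffer can fill up
        Process process;
        try {
            process = builder.start();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        try {
            if (process.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
                return true;                 // exited on its own (exit code not inspected)
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // keep the interrupt visible to the caller
        }
        process.destroyForcibly();           // forcible termination of the child only
        return false;
    }

    public static void main(String[] args) throws IOException {
        File out = File.createTempFile("extracted", ".txt");
        // Command line taken from the report; the paths are placeholders.
        List<String> command = Arrays.asList(
                "java", "-jar", "/path/to/tika-app-1.3.jar", "-t", "/path/to/file.xls");
        boolean exited = runWithTimeout(command, out, 60);
        System.out.println(exited
                ? "parser process exited; output in " + out
                : "parser process timed out and was killed");
    }
}
```

The trade-off is one JVM start-up per document. In exchange, a hung parse only costs the child process, and the mail server's own JVM keeps running.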