[jira] [Resolved] (TIKA-1133) Ability to Allow Empty and Duplicate Tika Values for XML Elements

2013-06-10 Thread Ray Gauss II (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ray Gauss II resolved TIKA-1133.


   Resolution: Fixed
Fix Version/s: 1.4

Resolved in r1491680.

> Ability to Allow Empty and Duplicate Tika Values for XML Elements
> -
>
> Key: TIKA-1133
> URL: https://issues.apache.org/jira/browse/TIKA-1133
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Affects Versions: 1.3
> Reporter: Ray Gauss II
> Assignee: Ray Gauss II
> Fix For: 1.4
>
>
> In some cases it is beneficial to allow empty and duplicate Tika metadata 
> values for multi-valued XML elements like RDF bags.
> Consider an example where the original source metadata is structured 
> something like:
> {code}
> 
>   John
>   Smith
> 
> 
>   Jane
>   Doe
> 
> 
>   Bob
> 
> 
>   Kate
>   Smith
> 
> {code}
> and since Tika stores only flat metadata we transform that before invoking a 
> parser to something like:
> {code}
>  
>   
>John
>Jane
>Bob
>Kate
>   
>  
>  
>   
>Smith
>Doe
>
>Smith
>   
>  
> {code}
> The current behavior ignores empties and duplicates, so the last-name values
> collapse to Smith and Doe and we can no longer tell whether Bob or Kate ever
> had last names.  Empties or duplicates in other positions result in an
> incorrect mapping of data.
> We should allow the option to create an {{ElementMetadataHandler}} which 
> allows empty and/or duplicate values.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (TIKA-1133) Ability to Allow Empty and Duplicate Tika Values for XML Elements

2013-06-10 Thread Ray Gauss II (JIRA)
Ray Gauss II created TIKA-1133:
--

 Summary: Ability to Allow Empty and Duplicate Tika Values for XML 
Elements
 Key: TIKA-1133
 URL: https://issues.apache.org/jira/browse/TIKA-1133
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.3
Reporter: Ray Gauss II
Assignee: Ray Gauss II


In some cases it is beneficial to allow empty and duplicate Tika metadata 
values for multi-valued XML elements like RDF bags.

Consider an example where the original source metadata is structured something 
like:
{code}

  John
  Smith


  Jane
  Doe


  Bob


  Kate
  Smith

{code}

and since Tika stores only flat metadata we transform that before invoking a 
parser to something like:
{code}
 
  
   John
   Jane
   Bob
   Kate
  
 
 
  
   Smith
   Doe
   
   Smith
  
 
{code}

The current behavior ignores empties and duplicates, so the last-name values
collapse to Smith and Doe and we can no longer tell whether Bob or Kate ever
had last names.  Empties or duplicates in other positions result in an
incorrect mapping of data.

We should allow the option to create an {{ElementMetadataHandler}} which allows 
empty and/or duplicate values.
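
A minimal sketch of the misalignment described above, in plain Java. The
class FlatValueDemo and its collect method are hypothetical stand-ins written
for illustration; they are not Tika's ElementMetadataHandler and do not use
any Tika API. The two boolean flags only mimic the proposed option.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not part of Tika.
final class FlatValueDemo {

    // Collects values in document order, optionally dropping empty
    // strings and values that were already collected.
    static List<String> collect(List<String> values,
                                boolean allowEmpty,
                                boolean allowDuplicates) {
        List<String> out = new ArrayList<>();
        for (String v : values) {
            if (!allowEmpty && v.isEmpty()) {
                continue; // empty value is skipped
            }
            if (!allowDuplicates && out.contains(v)) {
                continue; // repeated value is skipped
            }
            out.add(v);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> first = Arrays.asList("John", "Jane", "Bob", "Kate");
        List<String> last = Arrays.asList("Smith", "Doe", "", "Smith");

        // Dropping empties and duplicates leaves 2 last names for 4 people.
        System.out.println(first.size() + " first names, "
                + collect(last, false, false).size() + " last names");

        // Keeping them preserves one last-name slot per first name.
        System.out.println(first.size() + " first names, "
                + collect(last, true, true).size() + " last names");
    }
}
```

With both flags off, the last-name list has two entries against four first
names, so position i in one list no longer corresponds to position i in the
other. With both flags on, the lists stay the same length and index-aligned.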



[jira] [Commented] (TIKA-1132) Parsing some XLS documents hangs entire JVM, requires kill -9

2013-06-10 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13680021#comment-13680021
 ] 

Nick Burch commented on TIKA-1132:
--

I can confirm that it goes into an infinite loop for me too.

Any chance that you could run it in a profiler or similar, and track down where
the loop is happening? (My hunch is that it'll be an edge case in POI, or POI
not handling a subtle form of corruption.)
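
One possible way to narrow down where the loop is, sketched with the standard
java.lang.management API only. The class name BusyThreadDump is hypothetical.
The idea is to call this from a watcher thread started before the parse and
compare a few dumps taken some seconds apart: frames that keep reappearing in
a RUNNABLE thread are the likely loop. Running the JDK's jstack tool against
the hung process is an alternative, assuming the JVM still responds to it.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical helper; uses only JDK classes, nothing from Tika or POI.
final class BusyThreadDump {

    // Returns the stack traces of all threads currently in RUNNABLE state.
    static String dumpRunnableThreads() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        StringBuilder sb = new StringBuilder();
        for (ThreadInfo info : mx.dumpAllThreads(false, false)) {
            if (info == null || info.getThreadState() != Thread.State.RUNNABLE) {
                continue; // waiting or blocked threads cannot be spinning
            }
            sb.append('"').append(info.getThreadName()).append("\"\n");
            for (StackTraceElement frame : info.getStackTrace()) {
                sb.append("    at ").append(frame).append('\n');
            }
        }
        return sb.toString();
    }
}
```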

> Parsing some XLS documents hangs entire JVM, requires kill -9
> -
>
> Key: TIKA-1132
> URL: https://issues.apache.org/jira/browse/TIKA-1132
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.2, 1.3
> Environment: Linux Suse:
> java version "1.7.0"
> Java(TM) SE Runtime Environment (build 1.7.0-b147)
> Java HotSpot(TM) 64-Bit Server VM (build 21.0-b17, mixed mode)
> OSX 10.8.3:
> java version "1.7.0_06"
> Java(TM) SE Runtime Environment (build 1.7.0_06-b24)
> Java HotSpot(TM) 64-Bit Server VM (build 23.2-b09, mixed mode)
> Reporter: Ryan Krueger
> Fix For: 1.1
>
> Attachments: mod.xls
>
>
> Some XLS documents hang the entire JVM.  A control-C or regular kill won't
> stop the JVM; a kill -9 is required.
> We're running within an email server application, parsing documents to
> extract the text of all attachments.  When we hit a message with the affected
> attachment the entire JVM hangs, and we mark the message so that text
> extraction is skipped for it on the next attempt.  Unfortunately, it kills
> all email processing on the server until the internal watchdogs kill -9 the
> application.
> We have seen the issue for several months with different documents, but they
> are always Excel files.  Some, but not all, trigger complaints from Excel
> when opened.
> In addition to experiencing the problem on our Linux servers, I have tested
> on OSX and experienced the same problem.  I ran the Tika UI and selected the
> affected file, and I also ran the CLI.  The problem is the same.
> Tested with: java -jar /path/to/tika-app-1.3.jar -t /path/to/file.xls
> When running on multi-CPU machines there are two threads running at 100%
> every time.
> I have attached a document that triggers the error.
> I have tested with 1.2 and 1.3 with the same result.  With 1.1 the text is
> extracted accurately.



[jira] [Comment Edited] (TIKA-1132) Parsing some XLS documents hangs entire JVM, requires kill -9

2013-06-10 Thread Ryan Krueger (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13680012#comment-13680012
 ] 

Ryan Krueger edited comment on TIKA-1132 at 6/10/13 10:51 PM:
--

This file triggers the error.

  was (Author: mctoon):
This is not the original file, but after removing private information the 
error still occurs.
  



[jira] [Updated] (TIKA-1132) Parsing some XLS documents hangs entire JVM, requires kill -9

2013-06-10 Thread Ryan Krueger (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Krueger updated TIKA-1132:
---

Description: 
Some XLS documents hang the entire JVM.  A control-C or regular kill won't stop
the JVM; a kill -9 is required.

We're running within an email server application, parsing documents to extract
the text of all attachments.  When we hit a message with the affected
attachment the entire JVM hangs, and we mark the message so that text
extraction is skipped for it on the next attempt.  Unfortunately, it kills all
email processing on the server until the internal watchdogs kill -9 the
application.

We have seen the issue for several months with different documents, but they
are always Excel files.  Some, but not all, trigger complaints from Excel when
opened.

In addition to experiencing the problem on our Linux servers, I have tested on
OSX and experienced the same problem.  I ran the Tika UI and selected the
affected file, and I also ran the CLI.  The problem is the same.
Tested with: java -jar /path/to/tika-app-1.3.jar -t /path/to/file.xls

When running on multi-CPU machines there are two threads running at 100% every
time.

I have attached a document that triggers the error.

I have tested with 1.2 and 1.3 with the same result.  With 1.1 the text is
extracted accurately.

  was:
Some XLS documents hang the entire JVM.  A control-C or regular kill won't stop
the JVM; a kill -9 is required.

We're running within an email server application, parsing documents to extract
the text of all attachments.  When we hit a message with the affected
attachment the entire JVM hangs, and we mark the message so that text
extraction is skipped for it on the next attempt.  Unfortunately, it kills all
email processing on the server until the internal watchdogs kill -9 the
application.

We have seen the issue for several months with different documents, but they
are always Excel files.  Some, but not all, trigger complaints from Excel when
opened.

In addition to experiencing the problem on our Linux servers, I have tested on
OSX and experienced the same problem.  I ran the Tika UI and selected the
affected file, and I also ran the CLI.  The problem is the same.
Tested with: java -jar /path/to/tika-app-1.3.jar -t /path/to/file.xls

When running on multi-CPU machines there are two threads running at 100% every
time.

Unfortunately, so far I am unable to post any affected documents publicly due
to privacy concerns; however, I can provide samples privately for testing.

I have tested with 1.2 and 1.3 with the same result.  With 1.1 the text is
extracted accurately.





[jira] [Updated] (TIKA-1132) Parsing some XLS documents hangs entire JVM, requires kill -9

2013-06-10 Thread Ryan Krueger (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Krueger updated TIKA-1132:
---

Attachment: mod.xls

This is not the original file, but after removing private information the error 
still occurs.




[jira] [Created] (TIKA-1132) Parsing some XLS documents hangs entire JVM, requires kill -9

2013-06-10 Thread Ryan Krueger (JIRA)
Ryan Krueger created TIKA-1132:
--

 Summary: Parsing some XLS documents hangs entire JVM, requires 
kill -9
 Key: TIKA-1132
 URL: https://issues.apache.org/jira/browse/TIKA-1132
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.3, 1.2
 Environment: Linux Suse:
java version "1.7.0"
Java(TM) SE Runtime Environment (build 1.7.0-b147)
Java HotSpot(TM) 64-Bit Server VM (build 21.0-b17, mixed mode)

OSX 10.8.3:
java version "1.7.0_06"
Java(TM) SE Runtime Environment (build 1.7.0_06-b24)
Java HotSpot(TM) 64-Bit Server VM (build 23.2-b09, mixed mode)

Reporter: Ryan Krueger
 Fix For: 1.1


Some XLS documents hang the entire JVM.  A control-C or regular kill won't stop
the JVM; a kill -9 is required.

We're running within an email server application, parsing documents to extract
the text of all attachments.  When we hit a message with the affected
attachment the entire JVM hangs, and we mark the message so that text
extraction is skipped for it on the next attempt.  Unfortunately, it kills all
email processing on the server until the internal watchdogs kill -9 the
application.

We have seen the issue for several months with different documents, but they
are always Excel files.  Some, but not all, trigger complaints from Excel when
opened.

In addition to experiencing the problem on our Linux servers, I have tested on
OSX and experienced the same problem.  I ran the Tika UI and selected the
affected file, and I also ran the CLI.  The problem is the same.
Tested with: java -jar /path/to/tika-app-1.3.jar -t /path/to/file.xls

When running on multi-CPU machines there are two threads running at 100% every
time.

Unfortunately, so far I am unable to post any affected documents publicly due
to privacy concerns; however, I can provide samples privately for testing.

I have tested with 1.2 and 1.3 with the same result.  With 1.1 the text is
extracted accurately.
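
Until the loop itself is fixed, one way to keep a hung extraction from taking
the whole server down is to run it in a child process and kill that process
after a deadline. The following is only a sketch under assumptions: the class
IsolatedRun and its method are hypothetical, the timeout value is up to the
caller, and Process.waitFor(long, TimeUnit) and destroyForcibly() require
Java 8 or later, which is newer than the Java 7 builds listed above.

```java
import java.io.File;
import java.io.IOException;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical helper; uses only JDK classes.
final class IsolatedRun {

    // Runs the command in a child process, writing its output to the
    // given file. Returns true if the process exited within the timeout;
    // otherwise kills it forcibly and returns false.
    static boolean runWithTimeout(List<String> command, File output,
                                  long timeout, TimeUnit unit)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(command);
        pb.redirectErrorStream(true);  // merge stderr into stdout
        pb.redirectOutput(output);     // a file cannot fill up like a pipe
        Process process = pb.start();
        if (process.waitFor(timeout, unit)) {
            return true;
        }
        process.destroyForcibly().waitFor(); // forcible kill of the child only
        return false;
    }
}
```

For the case in this report, the command list would be the same invocation as
above (java, -jar, /path/to/tika-app-1.3.jar, -t, /path/to/file.xls), and a
false return value would mark the attachment as one to skip.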
