[jira] [Created] (TIKA-1178) Improve docx multiple section handling - headers and footers

2013-10-04 Thread David Cole (JIRA)
David Cole created TIKA-1178:


 Summary: Improve docx multiple section handling - headers and 
footers
 Key: TIKA-1178
 URL: https://issues.apache.org/jira/browse/TIKA-1178
 Project: Tika
  Issue Type: Improvement
  Components: parser
Reporter: David Cole
Priority: Minor


Currently, docx-to-plain-text conversion is only accurate for single-page 
files. First off, the sectPr tag right above the closing body tag is not the 
overall document property; it is the section property of the last section (if 
there is only one section, then yes, it is the overall document property per 
se). Right now, if I had a large docx file (say, a book in which I made each 
chapter its own section), I would get the last chapter's header as the header 
at the beginning of the document.

Addressing sectPr tags inside paragraphs:
Why are we wrapping the paragraph with the header and footer?
We should be buffering up pages as we read the docx file until we hit a 
section property, at which point we decide how to wrap what we just consumed. 
I realize that it is difficult to determine page breaks when they are caused 
by overflow (rather than by explicit page breaks).
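
The buffering idea above can be sketched roughly as follows. This is only an 
illustration under simplifying assumptions, not Tika's actual docx handling: 
`Para`, `Sect`, `render` and `flush` are hypothetical names, header and footer 
text is assumed to be already resolved to strings, and wrapping is done per 
section rather than per page, since overflow page breaks cannot be determined 
from the XML alone.

```java
import java.util.ArrayList;
import java.util.List;

class SectionBufferSketch {

    /** Hypothetical stand-in for the header/footer text a w:sectPr resolves to. */
    static final class Sect {
        final String header;
        final String footer;
        Sect(String header, String footer) {
            this.header = header;
            this.footer = footer;
        }
    }

    /** Hypothetical stand-in for a w:p; sectPr is non-null only on the last paragraph of a section. */
    static final class Para {
        final String text;
        final Sect sectPr;
        Para(String text, Sect sectPr) {
            this.text = text;
            this.sectPr = sectPr;
        }
    }

    /** Buffers paragraph text and wraps each section once its section property is seen. */
    static String render(List<Para> body, Sect finalSect) {
        StringBuilder out = new StringBuilder();
        List<String> buffer = new ArrayList<>();
        for (Para p : body) {
            buffer.add(p.text);
            if (p.sectPr != null) {
                flush(out, buffer, p.sectPr); // this paragraph closes a section
            }
        }
        // The sectPr directly under the body describes only the last section.
        flush(out, buffer, finalSect);
        return out.toString();
    }

    private static void flush(StringBuilder out, List<String> buffer, Sect sect) {
        if (sect.header != null) {
            out.append(sect.header).append('\n');
        }
        for (String text : buffer) {
            out.append(text).append('\n');
        }
        if (sect.footer != null) {
            out.append(sect.footer).append('\n');
        }
        buffer.clear();
    }

    public static void main(String[] args) {
        List<Para> body = new ArrayList<>();
        body.add(new Para("Chapter 1 text", new Sect("Header 1", "Footer 1")));
        body.add(new Para("Chapter 2 text", null));
        System.out.print(render(body, new Sect("Header 2", "Footer 2")));
    }
}
```

With this arrangement each chapter is wrapped with its own header and footer, 
and the body-level sectPr is applied only to the final chapter.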

The time for completion is really dependent on how much improvement we want to 
add in this area.

Just for reference, my assumptions about how to interpret the Office Open XML 
structure come from the documentation on this site: 
http://www.ecma-international.org/publications/standards/Ecma-376.htm





--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Created] (TIKA-1179) A corrupt mp3 file can cause an infinite loop in Mp3Parser

2013-10-04 Thread Marius Dumitru Florea (JIRA)
Marius Dumitru Florea created TIKA-1179:
---

 Summary: A corrupt mp3 file can cause an infinite loop in Mp3Parser
 Key: TIKA-1179
 URL: https://issues.apache.org/jira/browse/TIKA-1179
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.4
Reporter: Marius Dumitru Florea
 Fix For: 1.5


I have a thread that indexes (among other things) files using Apache Solr. This 
thread hangs (still running but making no progress) when trying to extract 
metadata from the mp3 file attached to this issue. Here are a couple of thread 
dumps taken at various moments:

{noformat}
XWiki Solr index thread daemon prio=10 tid=0x03b72800 nid=0x64b5 runnable [0x7f46f4617000]
   java.lang.Thread.State: RUNNABLE
    at org.apache.commons.io.input.AutoCloseInputStream.close(AutoCloseInputStream.java:63)
    at org.apache.commons.io.input.AutoCloseInputStream.afterRead(AutoCloseInputStream.java:77)
    at org.apache.commons.io.input.ProxyInputStream.read(ProxyInputStream.java:99)
    at java.io.BufferedInputStream.fill(Unknown Source)
    at java.io.BufferedInputStream.read1(Unknown Source)
    at java.io.BufferedInputStream.read(Unknown Source)
    - locked 0xcb7094e8 (a java.io.BufferedInputStream)
    at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
    at java.io.FilterInputStream.read(Unknown Source)
    at org.apache.tika.io.TailStream.read(TailStream.java:117)
    at org.apache.tika.io.TailStream.skip(TailStream.java:140)
    at org.apache.tika.parser.mp3.MpegStream.skipStream(MpegStream.java:283)
    at org.apache.tika.parser.mp3.MpegStream.skipFrame(MpegStream.java:160)
    at org.apache.tika.parser.mp3.Mp3Parser.getAllTagHandlers(Mp3Parser.java:193)
    at org.apache.tika.parser.mp3.Mp3Parser.parse(Mp3Parser.java:71)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.tika.Tika.parseToString(Tika.java:380)
    ...
{noformat}

{noformat}
XWiki Solr index thread daemon prio=10 tid=0x03b72800 nid=0x64b5 runnable [0x7f46f4618000]
   java.lang.Thread.State: RUNNABLE
    at org.apache.tika.io.TailStream.skip(TailStream.java:133)
    at org.apache.tika.parser.mp3.MpegStream.skipStream(MpegStream.java:283)
    at org.apache.tika.parser.mp3.MpegStream.skipFrame(MpegStream.java:160)
    at org.apache.tika.parser.mp3.Mp3Parser.getAllTagHandlers(Mp3Parser.java:193)
    at org.apache.tika.parser.mp3.Mp3Parser.parse(Mp3Parser.java:71)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.tika.Tika.parseToString(Tika.java:380)
    ...
{noformat}

{noformat}
XWiki Solr index thread daemon prio=10 tid=0x03b72800 nid=0x64b5 runnable [0x7f46f4617000]
   java.lang.Thread.State: RUNNABLE
    at java.io.BufferedInputStream.read1(Unknown Source)
    at java.io.BufferedInputStream.read(Unknown Source)
    - locked 0xcb1be170 (a java.io.BufferedInputStream)
    at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
    at java.io.FilterInputStream.read(Unknown Source)
    at org.apache.tika.io.TailStream.read(TailStream.java:117)
    at org.apache.tika.io.TailStream.skip(TailStream.java:140)
    at org.apache.tika.parser.mp3.MpegStream.skipStream(MpegStream.java:283)
    at org.apache.tika.parser.mp3.MpegStream.skipFrame(MpegStream.java:160)
    at org.apache.tika.parser.mp3.Mp3Parser.getAllTagHandlers(Mp3Parser.java:193)
    at org.apache.tika.parser.mp3.Mp3Parser.parse(Mp3Parser.java:71)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.tika.Tika.parseToString(Tika.java:380)
    ...
{noformat}
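
All three dumps show the thread inside a skip call. The dumps alone do not 
establish the root cause, but one common way for a skip loop to spin forever 
is to keep retrying after the underlying stream has reached end of file. The 
following is a generic sketch of a skip loop that terminates at EOF; it is 
not the code of TailStream or MpegStream, and `skipFully` is a hypothetical 
helper name.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

class SkipSketch {

    /**
     * Skips up to n bytes and returns the number actually skipped.
     * Stops at end of stream instead of retrying forever.
     */
    static long skipFully(InputStream in, long n) throws IOException {
        long remaining = n;
        while (remaining > 0) {
            long skipped = in.skip(remaining);
            if (skipped > 0) {
                remaining -= skipped;
                continue;
            }
            // skip() may return 0 without being at EOF, so probe with a
            // single-byte read to tell "no progress yet" from "no more data".
            if (in.read() == -1) {
                break; // end of stream: give up rather than loop
            }
            remaining--; // the probe consumed one byte
        }
        return n - remaining;
    }

    public static void main(String[] args) throws IOException {
        InputStream truncated = new ByteArrayInputStream(new byte[10]);
        System.out.println(skipFully(truncated, 100));
    }
}
```

A caller can compare the return value with the requested count to detect a 
truncated file and stop parsing.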

This makes our Solr indexer very fragile, as it prevents other files from 
being indexed, which leads to incomplete search results.
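
Independently of any parser fix, an indexer can limit the damage of a hanging 
parse by running it with a time limit. The sketch below uses only 
java.util.concurrent; `callWithTimeout` is a hypothetical helper, and the 
task passed in would be whatever call performs the extraction (for example 
the `Tika.parseToString` call seen in the dumps). Note that cancelling only 
interrupts the worker thread; a loop that never checks for interruption will 
keep running, which is why the worker is a daemon thread here.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class ParseTimeoutSketch {

    /** Runs the task on a daemon worker and returns the fallback if it fails or overruns. */
    static <T> T callWithTimeout(Callable<T> task, long timeoutMillis, T fallback) {
        ExecutorService executor = Executors.newSingleThreadExecutor(runnable -> {
            Thread worker = new Thread(runnable, "parse-worker");
            worker.setDaemon(true); // a stuck worker must not keep the JVM alive
            return worker;
        });
        Future<T> future = executor.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // best effort: interrupts the worker
            return fallback;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return fallback;
        } catch (ExecutionException e) {
            return fallback; // the task itself threw
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        String text = callWithTimeout(() -> {
            Thread.sleep(5000); // stands in for a parse that never finishes
            return "parsed";
        }, 100, "");
        System.out.println(text.isEmpty() ? "skipped file" : text);
    }
}
```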





[jira] [Updated] (TIKA-1179) A corrupt mp3 file can cause an infinite loop in Mp3Parser

2013-10-04 Thread Marius Dumitru Florea (JIRA)

 [ https://issues.apache.org/jira/browse/TIKA-1179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Marius Dumitru Florea updated TIKA-1179:


Attachment: corrupt.mp3






[jira] [Assigned] (TIKA-1177) Add Matroska (mkv, mka) format detection

2013-10-04 Thread Ray Gauss II (JIRA)

 [ https://issues.apache.org/jira/browse/TIKA-1177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ray Gauss II reassigned TIKA-1177:
--

Assignee: Ray Gauss II

 Add Matroska (mkv, mka) format detection
 

 Key: TIKA-1177
 URL: https://issues.apache.org/jira/browse/TIKA-1177
 Project: Tika
  Issue Type: Improvement
  Components: mime
Affects Versions: 1.4
Reporter: Boris Naguet
Assignee: Ray Gauss II
Priority: Minor

 There's no MIME type detection for the Matroska format, although it's a 
 popular video format.
 Here is some code I added to my custom mimetypes to detect it:
 {code}
   <mime-type type="video/x-matroska">
     <glob pattern="*.mkv" />
     <magic priority="40">
       <match value="0x1A45DFA3934282886d6174726f736b61" type="string" offset="0" />
     </magic>
   </mime-type>
   <mime-type type="audio/x-matroska">
     <glob pattern="*.mka" />
   </mime-type>
 {code}
 I found the signature for mkv at: 
 http://www.garykessler.net/library/file_sigs.html
 I was not able to find a clear signature for mka, but detection by filename 
 is still useful.
 The full spec is available here:
 http://matroska.org/technical/specs/index.html
 Detection may need to be a bit more complex than this constant magic, but it 
 works on my test files.





[jira] [Resolved] (TIKA-1177) Add Matroska (mkv, mka) format detection

2013-10-04 Thread Ray Gauss II (JIRA)

 [ https://issues.apache.org/jira/browse/TIKA-1177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ray Gauss II resolved TIKA-1177.


   Resolution: Fixed
Fix Version/s: 1.5

Unfortunately, that magic doesn't seem to be required in all MKV files. I tried 
several utilities to convert various sources to MKV and none of the results 
contained that magic.

A magic value of {{0x1A45DFA3}} is present, but that's also present in WebM, 
which is extended from Matroska.

I've added Matroska mime-types based on just extension for now and also added 
the WebM mime-type.

We can open other issues, linked to this one, for data detection of MKV and 
WebM files if need be.

Resolved in r1529260.
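
For any follow-up issue on data detection, here is a rough sketch of how the 
two formats might be told apart. It rests on assumptions that should be 
checked against the Matroska spec: that {{0x1A45DFA3}} is the EBML header ID, 
that the bytes {{0x42 0x82}} introduce the DocType element, and that the 
DocType length fits in a single size byte. The class and method names are 
hypothetical, and the scan is deliberately naive; it is not a real EBML 
parser and is not what was committed.

```java
import java.nio.charset.StandardCharsets;

class EbmlSniffer {

    private static final byte[] EBML_MAGIC = {0x1A, 0x45, (byte) 0xDF, (byte) 0xA3};

    /** Returns true if the data starts with the 4-byte magic 0x1A45DFA3. */
    static boolean startsWithEbml(byte[] head) {
        if (head.length < EBML_MAGIC.length) {
            return false;
        }
        for (int i = 0; i < EBML_MAGIC.length; i++) {
            if (head[i] != EBML_MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    /**
     * Naively scans the leading bytes for 0x42 0x82 (assumed DocType ID)
     * followed by a one-byte size, and returns the DocType string or null.
     */
    static String sniffDocType(byte[] head) {
        if (!startsWithEbml(head)) {
            return null;
        }
        for (int i = EBML_MAGIC.length; i + 2 < head.length; i++) {
            if ((head[i] & 0xFF) == 0x42 && (head[i + 1] & 0xFF) == 0x82) {
                int sizeByte = head[i + 2] & 0xFF;
                if ((sizeByte & 0x80) == 0) {
                    return null; // multi-byte sizes are not handled in this sketch
                }
                int length = sizeByte & 0x7F;
                int start = i + 3;
                if (start + length > head.length) {
                    return null; // not enough data buffered
                }
                return new String(head, start, length, StandardCharsets.US_ASCII);
            }
        }
        return null;
    }

    /** Decodes a hex string such as the magic value quoted in the issue. */
    static byte[] hex(String s) {
        byte[] out = new byte[s.length() / 2];
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) Integer.parseInt(s.substring(2 * i, 2 * i + 2), 16);
        }
        return out;
    }

    public static void main(String[] args) {
        // The magic from the issue description, without the 0x prefix.
        System.out.println(sniffDocType(hex("1A45DFA3934282886d6174726f736b61")));
    }
}
```

Read this way, the last eight bytes of the magic proposed in the issue are 
the ASCII string "matroska", which would explain why a fixed 16-byte match at 
offset 0 only works for files that happen to lay out their header in exactly 
that order.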






[jira] [Resolved] (TIKA-1179) A corrupt mp3 file can cause an infinite loop in Mp3Parser

2013-10-04 Thread Ray Gauss II (JIRA)

 [ https://issues.apache.org/jira/browse/TIKA-1179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ray Gauss II resolved TIKA-1179.


Resolution: Cannot Reproduce
  Assignee: Ray Gauss II

I've just confirmed the described behavior in Tika 1.4; however, the file 
appears to be parsed just fine in 1.5!

You can verify by downloading a 1.5 snapshot of {{tika-app}} ([current 
link|https://repository.apache.org/content/groups/snapshots/org/apache/tika/tika-app/1.5-SNAPSHOT/tika-app-1.5-20130927.201341-30.jar]),
 running the app, i.e.:
{code}
java -jar tika-app-1.5-20130927.201341-30.jar
{code}
and dropping {{corrupt.mp3}} onto the app window.

 A corrupt mp3 file can cause an infinite loop in Mp3Parser
 --

 Key: TIKA-1179
 URL: https://issues.apache.org/jira/browse/TIKA-1179
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.4
Reporter: Marius Dumitru Florea
Assignee: Ray Gauss II
 Fix For: 1.5

 Attachments: corrupt.mp3


 I have a thread that indexes (among other things) files using Apache Solr. 
 This thread hangs (still running but making no progress) when trying to 
 extract metadata from the mp3 file attached to this issue. Here are a couple 
 of thread dumps taken at various moments:
 {noformat}
 XWiki Solr index thread daemon prio=10 tid=0x03b72800 nid=0x64b5 runnable [0x7f46f4617000]
    java.lang.Thread.State: RUNNABLE
     at org.apache.commons.io.input.AutoCloseInputStream.close(AutoCloseInputStream.java:63)
     at org.apache.commons.io.input.AutoCloseInputStream.afterRead(AutoCloseInputStream.java:77)
     at org.apache.commons.io.input.ProxyInputStream.read(ProxyInputStream.java:99)
     at java.io.BufferedInputStream.fill(Unknown Source)
     at java.io.BufferedInputStream.read1(Unknown Source)
     at java.io.BufferedInputStream.read(Unknown Source)
     - locked 0xcb7094e8 (a java.io.BufferedInputStream)
     at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
     at java.io.FilterInputStream.read(Unknown Source)
     at org.apache.tika.io.TailStream.read(TailStream.java:117)
     at org.apache.tika.io.TailStream.skip(TailStream.java:140)
     at org.apache.tika.parser.mp3.MpegStream.skipStream(MpegStream.java:283)
     at org.apache.tika.parser.mp3.MpegStream.skipFrame(MpegStream.java:160)
     at org.apache.tika.parser.mp3.Mp3Parser.getAllTagHandlers(Mp3Parser.java:193)
     at org.apache.tika.parser.mp3.Mp3Parser.parse(Mp3Parser.java:71)
     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
     at org.apache.tika.Tika.parseToString(Tika.java:380)
     ...
 {noformat}
 {noformat}
 XWiki Solr index thread daemon prio=10 tid=0x03b72800 nid=0x64b5 runnable [0x7f46f4618000]
    java.lang.Thread.State: RUNNABLE
     at org.apache.tika.io.TailStream.skip(TailStream.java:133)
     at org.apache.tika.parser.mp3.MpegStream.skipStream(MpegStream.java:283)
     at org.apache.tika.parser.mp3.MpegStream.skipFrame(MpegStream.java:160)
     at org.apache.tika.parser.mp3.Mp3Parser.getAllTagHandlers(Mp3Parser.java:193)
     at org.apache.tika.parser.mp3.Mp3Parser.parse(Mp3Parser.java:71)
     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
     at org.apache.tika.Tika.parseToString(Tika.java:380)
     ...
 {noformat}
 {noformat}
 XWiki Solr index thread daemon prio=10 tid=0x03b72800 nid=0x64b5 runnable [0x7f46f4617000]
    java.lang.Thread.State: RUNNABLE
     at java.io.BufferedInputStream.read1(Unknown Source)
     at java.io.BufferedInputStream.read(Unknown Source)
     - locked 0xcb1be170 (a java.io.BufferedInputStream)
     at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
     at java.io.FilterInputStream.read(Unknown Source)
     at org.apache.tika.io.TailStream.read(TailStream.java:117)
     at org.apache.tika.io.TailStream.skip(TailStream.java:140)
     at org.apache.tika.parser.mp3.MpegStream.skipStream(MpegStream.java:283)
     at org.apache.tika.parser.mp3.MpegStream.skipFrame(MpegStream.java:160)
     at org.apache.tika.parser.mp3.Mp3Parser.getAllTagHandlers(Mp3Parser.java:193)
     at org.apache.tika.parser.mp3.Mp3Parser.parse(Mp3Parser.java:71)
     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
   at