[jira] [Created] (TIKA-1178) Improve docx multiple section handling - headers and footers
David Cole created TIKA-1178:

Summary: Improve docx multiple section handling - headers and footers
Key: TIKA-1178
URL: https://issues.apache.org/jira/browse/TIKA-1178
Project: Tika
Issue Type: Improvement
Components: parser
Reporter: David Cole
Priority: Minor

Currently docx to plain text is only accurate for single-page files.

First off, the sectPr tag right above the closing body tag is not the overall document property; it is the section property of the last section (if there is only one section, then yes, it is the overall document property per se). Right now, if I had a large docx file (let's say a book in which I made each chapter its own section), I would get the last chapter's header as the header at the beginning of the document.

Addressing sectPr tags inside paragraphs: why are we wrapping the paragraph with the header and footer? We should be buffering up pages as we read the docx file until we hit a section property, at which point we decide how to wrap what we just consumed. I realize that it is difficult to determine page breaks when they are caused by overflow (not explicit page breaks).

The time for completion really depends on how much improvement we want to add in this area.

Just for reference, my assumptions about interpreting the Office Open XML structure come from the documentation on this site: http://www.ecma-international.org/publications/standards/Ecma-376.htm

-- This message was sent by Atlassian JIRA (v6.1#6144)
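The buffering idea proposed in TIKA-1178 could be sketched roughly as follows. This is only an illustration, not Tika's parser code: `SectionBuffer`, `Block` and `sectionHeader` are hypothetical names, and the sketch assumes the document body has already been flattened into a list of text blocks in which the block that ends a section carries that section's header text (taken from its sectPr). Footers are left out for brevity.

```java
import java.util.ArrayList;
import java.util.List;

public class SectionBuffer {

    /** Hypothetical flattened body element: text plus an optional end-of-section marker. */
    static final class Block {
        final String text;
        final String sectionHeader; // non-null only on the block that ends a section

        Block(String text, String sectionHeader) {
            this.text = text;
            this.sectionHeader = sectionHeader;
        }
    }

    /** Buffers blocks until a section end is seen, then emits the header followed by the buffered text. */
    static List<String> render(List<Block> body) {
        List<String> out = new ArrayList<>();
        List<String> pending = new ArrayList<>();
        for (Block b : body) {
            pending.add(b.text);
            if (b.sectionHeader != null) {
                // Section properties arrive at the END of a section, so the header
                // can only be written once everything before it has been buffered.
                out.add(b.sectionHeader);
                out.addAll(pending);
                pending.clear();
            }
        }
        // Content after the last marker has no known header; emit it unwrapped.
        out.addAll(pending);
        return out;
    }

    public static void main(String[] args) {
        List<Block> body = new ArrayList<>();
        body.add(new Block("Chapter 1 text", null));
        body.add(new Block("more chapter 1", "Header: Chapter 1"));
        body.add(new Block("Chapter 2 text", "Header: Chapter 2"));
        System.out.println(render(body));
    }
}
```

With this grouping, each chapter's header precedes that chapter's own text instead of the last section's header being applied to the start of the document.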
[jira] [Created] (TIKA-1179) A corrupt mp3 file can cause an infinite loop in Mp3Parser
Marius Dumitru Florea created TIKA-1179:

Summary: A corrupt mp3 file can cause an infinite loop in Mp3Parser
Key: TIKA-1179
URL: https://issues.apache.org/jira/browse/TIKA-1179
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 1.4
Reporter: Marius Dumitru Florea
Fix For: 1.5

I have a thread that indexes (among other things) files using Apache Solr. This thread hangs (still running but making no progress) when trying to extract metadata from the mp3 file attached to this issue. Here are a couple of thread dumps taken at various moments:

{noformat}
XWiki Solr index thread daemon prio=10 tid=0x03b72800 nid=0x64b5 runnable [0x7f46f4617000]
java.lang.Thread.State: RUNNABLE
at org.apache.commons.io.input.AutoCloseInputStream.close(AutoCloseInputStream.java:63)
at org.apache.commons.io.input.AutoCloseInputStream.afterRead(AutoCloseInputStream.java:77)
at org.apache.commons.io.input.ProxyInputStream.read(ProxyInputStream.java:99)
at java.io.BufferedInputStream.fill(Unknown Source)
at java.io.BufferedInputStream.read1(Unknown Source)
at java.io.BufferedInputStream.read(Unknown Source)
- locked 0xcb7094e8 (a java.io.BufferedInputStream)
at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
at java.io.FilterInputStream.read(Unknown Source)
at org.apache.tika.io.TailStream.read(TailStream.java:117)
at org.apache.tika.io.TailStream.skip(TailStream.java:140)
at org.apache.tika.parser.mp3.MpegStream.skipStream(MpegStream.java:283)
at org.apache.tika.parser.mp3.MpegStream.skipFrame(MpegStream.java:160)
at org.apache.tika.parser.mp3.Mp3Parser.getAllTagHandlers(Mp3Parser.java:193)
at org.apache.tika.parser.mp3.Mp3Parser.parse(Mp3Parser.java:71)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
at org.apache.tika.Tika.parseToString(Tika.java:380)
...
{noformat}

{noformat}
XWiki Solr index thread daemon prio=10 tid=0x03b72800 nid=0x64b5 runnable [0x7f46f4618000]
java.lang.Thread.State: RUNNABLE
at org.apache.tika.io.TailStream.skip(TailStream.java:133)
at org.apache.tika.parser.mp3.MpegStream.skipStream(MpegStream.java:283)
at org.apache.tika.parser.mp3.MpegStream.skipFrame(MpegStream.java:160)
at org.apache.tika.parser.mp3.Mp3Parser.getAllTagHandlers(Mp3Parser.java:193)
at org.apache.tika.parser.mp3.Mp3Parser.parse(Mp3Parser.java:71)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
at org.apache.tika.Tika.parseToString(Tika.java:380)
...
{noformat}

{noformat}
XWiki Solr index thread daemon prio=10 tid=0x03b72800 nid=0x64b5 runnable [0x7f46f4617000]
java.lang.Thread.State: RUNNABLE
at java.io.BufferedInputStream.read1(Unknown Source)
at java.io.BufferedInputStream.read(Unknown Source)
- locked 0xcb1be170 (a java.io.BufferedInputStream)
at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
at java.io.FilterInputStream.read(Unknown Source)
at org.apache.tika.io.TailStream.read(TailStream.java:117)
at org.apache.tika.io.TailStream.skip(TailStream.java:140)
at org.apache.tika.parser.mp3.MpegStream.skipStream(MpegStream.java:283)
at org.apache.tika.parser.mp3.MpegStream.skipFrame(MpegStream.java:160)
at org.apache.tika.parser.mp3.Mp3Parser.getAllTagHandlers(Mp3Parser.java:193)
at org.apache.tika.parser.mp3.Mp3Parser.parse(Mp3Parser.java:71)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
at org.apache.tika.Tika.parseToString(Tika.java:380)
...
{noformat}

This makes our Solr indexer very fragile, because the hang prevents it from indexing other files and so leads to incomplete search results.
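All three dumps in TIKA-1179 show the thread inside `TailStream.skip`, called from `MpegStream.skipStream`. One common way a skip loop fails to terminate is by retrying after the underlying stream has already reported end of file; whether that is the actual cause in Tika 1.4 is not established in this thread. As a minimal sketch under that assumption, a skip helper that always terminates might look like this (`BoundedSkip` and `skipFully` are hypothetical names, not Tika's API):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class BoundedSkip {

    /**
     * Skips up to n bytes and returns how many were actually skipped.
     * Stops as soon as read() reports end of stream, so an oversized
     * length taken from a corrupt frame header cannot keep the loop spinning.
     */
    static long skipFully(InputStream in, long n) throws IOException {
        byte[] scratch = new byte[4096];
        long remaining = n;
        while (remaining > 0) {
            int want = (int) Math.min(scratch.length, remaining);
            int got = in.read(scratch, 0, want);
            if (got < 0) {
                break; // end of stream: give up instead of retrying
            }
            remaining -= got;
        }
        return n - remaining;
    }

    public static void main(String[] args) throws IOException {
        // A 10-byte stream asked to skip far more than it contains.
        InputStream in = new ByteArrayInputStream(new byte[10]);
        System.out.println(skipFully(in, 1_000_000L));
    }
}
```

The caller can compare the return value with the requested count and treat a short skip as a truncated or corrupt file.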
[jira] [Updated] (TIKA-1179) A corrupt mp3 file can cause an infinite loop in Mp3Parser
[ https://issues.apache.org/jira/browse/TIKA-1179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Marius Dumitru Florea updated TIKA-1179:

Attachment: corrupt.mp3
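Independently of any parser fix, a caller such as the indexer described in TIKA-1179 can limit the damage from a hanging extraction by bounding each call with a timeout. The following is a sketch only: `TimedExtraction` and `runWithTimeout` are hypothetical names, and the `Callable` stands in for whatever extraction call the indexer makes.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimedExtraction {

    /** Runs the task on a worker thread; returns the fallback if it exceeds the time limit. */
    static String runWithTimeout(Callable<String> task, long millis, String fallback)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "extraction-worker");
            t.setDaemon(true); // a stuck worker must not keep the JVM alive
            return t;
        });
        try {
            Future<String> result = pool.submit(task);
            try {
                return result.get(millis, TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                // cancel(true) only interrupts the worker; a loop that never
                // checks the interrupt flag will keep running in the background.
                result.cancel(true);
                return fallback;
            }
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for a parser stuck in a loop (this one does honor interrupts).
        Callable<String> stuck = () -> {
            while (!Thread.currentThread().isInterrupted()) {
                // spin
            }
            return "never";
        };
        System.out.println(runWithTimeout(stuck, 200, "<skipped>"));
        System.out.println(runWithTimeout(() -> "text", 200, "<skipped>"));
    }
}
```

The indexing thread can then move on to the next file. Because a CPU-bound loop may ignore the interrupt, this does not reclaim the stuck worker; running extraction in a separate process is the more robust option when that matters.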
[jira] [Assigned] (TIKA-1177) Add Matroska (mkv, mka) format detection
[ https://issues.apache.org/jira/browse/TIKA-1177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ray Gauss II reassigned TIKA-1177:

Assignee: Ray Gauss II

> Add Matroska (mkv, mka) format detection
>
> Key: TIKA-1177
> URL: https://issues.apache.org/jira/browse/TIKA-1177
> Project: Tika
> Issue Type: Improvement
> Components: mime
> Affects Versions: 1.4
> Reporter: Boris Naguet
> Assignee: Ray Gauss II
> Priority: Minor
>
> There's no mimetype detection for the Matroska format, although it's a popular video format.
> Here is some code I added to my custom mimetypes to detect them:
> {code}
> <mime-type type="video/x-matroska">
>   <glob pattern="*.mkv"/>
>   <magic priority="40">
>     <match value="0x1A45DFA3934282886d6174726f736b61" type="string" offset="0"/>
>   </magic>
> </mime-type>
> <mime-type type="audio/x-matroska">
>   <glob pattern="*.mka"/>
> </mime-type>
> {code}
> I found the signature for mkv on: http://www.garykessler.net/library/file_sigs.html
> I was not able to find a clear signature for mka, but detection by filename is still useful.
> The full spec is available here: http://matroska.org/technical/specs/index.html
> Maybe detection is a bit more complex than this constant magic, but it works on my test files.
[jira] [Resolved] (TIKA-1177) Add Matroska (mkv, mka) format detection
[ https://issues.apache.org/jira/browse/TIKA-1177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ray Gauss II resolved TIKA-1177.

Resolution: Fixed
Fix Version/s: 1.5

Unfortunately that magic doesn't seem to be present in all MKV files. I tried several utilities to convert various sources to MKV and none of the results contained that magic.

A magic value of {{0x1A45DFA3}} is present, but that's also present in WebM, which is extended from Matroska.

I've added Matroska mime-types based on just the extension for now and also added the WebM mime-type.

We can open other issues, linked to this one, for data detection of MKV and WebM files if need be.

Resolved in r1529260.
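To make the limitation in the TIKA-1177 resolution concrete: the four bytes {{0x1A45DFA3}} are the EBML header ID that both Matroska and WebM files begin with, so a check on that prefix alone can say "EBML container" but cannot tell the two apart. A minimal sketch of such a prefix check follows; `EbmlMagic` and `startsWithEbmlMagic` are hypothetical names and this is not how Tika's mime detection is implemented.

```java
public class EbmlMagic {

    // The four-byte value quoted in the issue: 0x1A45DFA3.
    private static final byte[] MAGIC = {0x1A, 0x45, (byte) 0xDF, (byte) 0xA3};

    /** True if the buffer begins with the four-byte prefix shared by MKV and WebM. */
    static boolean startsWithEbmlMagic(byte[] head) {
        if (head == null || head.length < MAGIC.length) {
            return false;
        }
        for (int i = 0; i < MAGIC.length; i++) {
            if (head[i] != MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] sample = {0x1A, 0x45, (byte) 0xDF, (byte) 0xA3, 0x01, 0x00};
        System.out.println(startsWithEbmlMagic(sample));
        System.out.println(startsWithEbmlMagic(new byte[] {'I', 'D', '3'}));
    }
}
```

Distinguishing MKV from WebM would require reading further into the header, which is the follow-up data detection work the resolution leaves to later issues.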
[jira] [Resolved] (TIKA-1179) A corrupt mp3 file can cause an infinite loop in Mp3Parser
[ https://issues.apache.org/jira/browse/TIKA-1179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ray Gauss II resolved TIKA-1179.

Resolution: Cannot Reproduce
Assignee: Ray Gauss II

I've just confirmed the described behavior in Tika 1.4; however, the file appears to be parsed just fine in 1.5!

You can verify this by downloading a 1.5 snapshot of {{tika-app}} ([current link|https://repository.apache.org/content/groups/snapshots/org/apache/tika/tika-app/1.5-SNAPSHOT/tika-app-1.5-20130927.201341-30.jar]), running the app, i.e.:
{code}
java -jar tika-app-1.5-20130927.201341-30.jar
{code}
and dropping {{corrupt.mp3}} onto the app window.

A corrupt mp3 file can cause an infinite loop in Mp3Parser

Key: TIKA-1179
URL: https://issues.apache.org/jira/browse/TIKA-1179
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 1.4
Reporter: Marius Dumitru Florea
Assignee: Ray Gauss II
Fix For: 1.5
Attachments: corrupt.mp3

I have a thread that indexes (among other things) files using Apache Solr. This thread hangs (still running but making no progress) when trying to extract metadata from the mp3 file attached to this issue.
Here are a couple of thread dumps taken at various moments:

{noformat}
XWiki Solr index thread daemon prio=10 tid=0x03b72800 nid=0x64b5 runnable [0x7f46f4617000]
java.lang.Thread.State: RUNNABLE
at org.apache.commons.io.input.AutoCloseInputStream.close(AutoCloseInputStream.java:63)
at org.apache.commons.io.input.AutoCloseInputStream.afterRead(AutoCloseInputStream.java:77)
at org.apache.commons.io.input.ProxyInputStream.read(ProxyInputStream.java:99)
at java.io.BufferedInputStream.fill(Unknown Source)
at java.io.BufferedInputStream.read1(Unknown Source)
at java.io.BufferedInputStream.read(Unknown Source)
- locked 0xcb7094e8 (a java.io.BufferedInputStream)
at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
at java.io.FilterInputStream.read(Unknown Source)
at org.apache.tika.io.TailStream.read(TailStream.java:117)
at org.apache.tika.io.TailStream.skip(TailStream.java:140)
at org.apache.tika.parser.mp3.MpegStream.skipStream(MpegStream.java:283)
at org.apache.tika.parser.mp3.MpegStream.skipFrame(MpegStream.java:160)
at org.apache.tika.parser.mp3.Mp3Parser.getAllTagHandlers(Mp3Parser.java:193)
at org.apache.tika.parser.mp3.Mp3Parser.parse(Mp3Parser.java:71)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
at org.apache.tika.Tika.parseToString(Tika.java:380)
...
{noformat}

{noformat}
XWiki Solr index thread daemon prio=10 tid=0x03b72800 nid=0x64b5 runnable [0x7f46f4618000]
java.lang.Thread.State: RUNNABLE
at org.apache.tika.io.TailStream.skip(TailStream.java:133)
at org.apache.tika.parser.mp3.MpegStream.skipStream(MpegStream.java:283)
at org.apache.tika.parser.mp3.MpegStream.skipFrame(MpegStream.java:160)
at org.apache.tika.parser.mp3.Mp3Parser.getAllTagHandlers(Mp3Parser.java:193)
at org.apache.tika.parser.mp3.Mp3Parser.parse(Mp3Parser.java:71)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
at org.apache.tika.Tika.parseToString(Tika.java:380)
...
{noformat}

{noformat}
XWiki Solr index thread daemon prio=10 tid=0x03b72800 nid=0x64b5 runnable [0x7f46f4617000]
java.lang.Thread.State: RUNNABLE
at java.io.BufferedInputStream.read1(Unknown Source)
at java.io.BufferedInputStream.read(Unknown Source)
- locked 0xcb1be170 (a java.io.BufferedInputStream)
at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
at java.io.FilterInputStream.read(Unknown Source)
at org.apache.tika.io.TailStream.read(TailStream.java:117)
at org.apache.tika.io.TailStream.skip(TailStream.java:140)
at org.apache.tika.parser.mp3.MpegStream.skipStream(MpegStream.java:283)
at org.apache.tika.parser.mp3.MpegStream.skipFrame(MpegStream.java:160)
at org.apache.tika.parser.mp3.Mp3Parser.getAllTagHandlers(Mp3Parser.java:193)
at org.apache.tika.parser.mp3.Mp3Parser.parse(Mp3Parser.java:71)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
at