[jira] [Created] (TIKA-1428) Microsoft Word 97 - 2003 (.doc) footnote references are Unicode Replacement Character
Theodor Sjöstedt created TIKA-1428: -- Summary: Microsoft Word 97 - 2003 (.doc) footnote references are Unicode Replacement Character Key: TIKA-1428 URL: https://issues.apache.org/jira/browse/TIKA-1428 Project: Tika Issue Type: Bug Affects Versions: 1.6, 1.4 Reporter: Theodor Sjöstedt Priority: Minor Footnotes from {{.doc}} documents are extracted, but the references to the footnotes are replaced by the Unicode Replacement Character (�). I have tried this in 1.4 and 1.6. In 1.4, both reference in text and reference at footnote have been replaced. In 1.6, reference in text has disappeared completely. See attached image for original document, 1.4 Formatted text, and 1.6 Formatted text. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1428) Microsoft Word 97 - 2003 (.doc) footnote references are Unicode Replacement Character
[ https://issues.apache.org/jira/browse/TIKA-1428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Theodor Sjöstedt updated TIKA-1428: --- Attachment: TIKA-doc-footnotes-issue.png Original document to the left. TIKA 1.4 in Center TIKA 1.6 to the right
[jira] [Commented] (TIKA-1428) Microsoft Word 97 - 2003 (.doc) footnote references are Unicode Replacement Character
[ https://issues.apache.org/jira/browse/TIKA-1428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14147880#comment-14147880 ] Hong-Thai Nguyen commented on TIKA-1428: Thanks [~theoettheo], any chance to have a patch with a test case for this problem ?
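As a starting point for the test case requested above, here is a minimal sketch of the check such a test might make: counting Unicode Replacement Characters (U+FFFD) in text that has already been extracted. The class and method names are hypothetical and not part of Tika; how the text is obtained from the `.doc` file is deliberately left out.

```java
// Hypothetical helper, not part of Tika: inspects already-extracted text.
class ReplacementCharCheck {

    // The Unicode Replacement Character, U+FFFD.
    static final char REPLACEMENT = '\uFFFD';

    /** Counts how many times U+FFFD occurs in the given text. */
    static int countReplacementChars(String text) {
        int count = 0;
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == REPLACEMENT) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Made-up stand-in for the 1.4 output described in the report:
        // one bad reference in the body and one at the footnote itself.
        String extracted = "Body text\uFFFD continues.\n\uFFFD Footnote text.";
        System.out.println(countReplacementChars(extracted)); // prints 2
    }
}
```

A regression test built on this would assert a count of zero for the extracted text of a sample document containing footnotes.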
[jira] [Updated] (TIKA-1330) Add robust tika-batch code
[ https://issues.apache.org/jira/browse/TIKA-1330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1330: -- Attachment: TIKA-1330v1-patch.zip This is the first version of tika-batch. Much cleanup remains. This first patch is intended to start the framework and offer concrete classes for filesystem (FS) handling...single input directory and single output directory. I've included the patch against trunk, example log4jxml files, an example batch-config file and two sh scripts to kick off the two different processes. Any and all feedback is welcomed! Add robust tika-batch code -- Key: TIKA-1330 URL: https://issues.apache.org/jira/browse/TIKA-1330 Project: Tika Issue Type: Sub-task Components: cli, general, server Reporter: Tim Allison Assignee: Tim Allison Attachments: TIKA-1330v1-patch.zip In my current design plan, I see creating a separate component tika-batch that includes a small bit of configurable code to run Tika against a large batch of documents. This code should be robust against OOM and hangs, and it should have fairly robust logging. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (TIKA-1330) Add robust tika-batch code
[ https://issues.apache.org/jira/browse/TIKA-1330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14121454#comment-14121454 ] Tim Allison edited comment on TIKA-1330 at 9/25/14 4:18 PM: Started documentation on the [wiki|https://wiki.apache.org/tika/TikaBatchOverview]. Any and all feedback is welcomed. Will post patch to rb (if possible) or to this issue some time next week. was (Author: talli...@mitre.org): Started documentation on the [wiki|https://wiki.apache.org/tika/TikaBatch]. Any and all feedback is welcomed. Will post patch to rb (if possible) or to this issue some time next week.
[jira] [Commented] (TIKA-1330) Add robust tika-batch code
[ https://issues.apache.org/jira/browse/TIKA-1330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14147922#comment-14147922 ] Tim Allison commented on TIKA-1330: --- [~tilman], I leave it as an exercise to implement a FileResourceConsumer that uses pure PDFBox. ;) Seriously, though, I plan to add something like that in the tika examples module (at some point down the road), and all feedback is welcome.
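The TIKA-1330 description asks for code that is robust against hangs. A minimal sketch of one ingredient, a per-document time limit, could look like the following. It uses only java.util.concurrent; `TimedTask` and `runWithTimeout` are hypothetical names and are not taken from the tika-batch patch. A single-JVM sketch like this cannot recover from a real OutOfMemoryError or from a task that ignores interruption, so it illustrates the timeout idea only.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not part of tika-batch.
class TimedTask {

    /**
     * Runs the task on a separate daemon thread and returns its result,
     * or the fallback value if the task times out or throws.
     */
    static <T> T runWithTimeout(Callable<T> task, long timeoutMillis, T fallback) {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r);
            t.setDaemon(true); // a stuck task must not keep the JVM alive
            return t;
        });
        Future<T> future = executor.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupts the worker thread
            return fallback;
        } catch (ExecutionException e) {
            return fallback; // the task itself failed
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
            return fallback;
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // Stand-in for parsing one document; sleeps longer than the limit.
        String result = runWithTimeout(() -> {
            Thread.sleep(5_000);
            return "parsed";
        }, 200, "timed out");
        System.out.println(result); // prints "timed out"
    }
}
```

In a batch run, the fallback branch is where a failing file would be logged and skipped so that the remaining documents still get processed.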
Re: Tika at ApacheCon Europe - 2 months time!
Hey Nick,

On 22 Sep 2014, at 23:21, Nick Burch n...@apache.org wrote:

> It's only 2 months to go until ApacheCon Europe in Budapest. I'm simultaneously excited by all the great Tika stuff going on, and worried by how many talks I need to finish writing...
>
> As usual for an ApacheCon, we've a number of talks about Tika going on, and almost certainly a hackathon and/or meetup one evening. There are also lots of related talks, covering technologies that Tika builds on, and ones you can use Tika with.
>
> For a full schedule, see: http://events.linuxfoundation.org/events/apachecon-europe/program/schedule

All signed up for ApacheCon EU this year, so looking forward to the talks and up for a Tika hackathon.

See you then,
Dave
[jira] [Commented] (TIKA-1423) Build a parser to extract data from GRIB formats
[ https://issues.apache.org/jira/browse/TIKA-1423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14148398#comment-14148398 ] Vineet Ghatge commented on TIKA-1423: - Pulling up the data and JAR file and trying to set up an environment for the failing scenario.

Build a parser to extract data from GRIB formats Key: TIKA-1423 URL: https://issues.apache.org/jira/browse/TIKA-1423 Project: Tika Issue Type: New Feature Components: metadata, mime, parser Affects Versions: 1.6 Reporter: Vineet Ghatge Priority: Critical Labels: features, newbie Fix For: 1.7 Attachments: GribParser.java, gdas1.forecmwf.2014062612.grib2

The Arctic dataset contains a MIME format called GRIB - General Regularly-distributed Information in Binary form (http://en.wikipedia.org/wiki/GRIB). GRIB is a well-known, concise data format used in meteorology to store historical and forecast weather data. There are 2 different types of the format: GRIB 0 and GRIB 2. The focus will be on GRIB 2, which is the most prevalent.

Each GRIB record intended for either transmission or storage contains a single parameter with values located at an array of grid points, or represented as a set of spectral coefficients, for a single level (or layer), encoded as a continuous bit stream. Logical divisions of the record are designated as sections, each of which provides control information and/or data. A GRIB record consists of six sections, two of which are optional:

(0) Indicator Section
(1) Product Definition Section (PDS)
(2) Grid Description Section (GDS), optional
(3) Bit Map Section (BMS), optional
(4) Binary Data Section (BDS)
(5) '7777' (ASCII Characters)
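To make the record framing concrete, here is a minimal sketch that checks the two fixed markers of a GRIB message: the ASCII string "GRIB" that opens the Indicator Section and the ASCII string "7777" that closes the record. It assumes a single message starting at byte 0 and reads the edition number from octet 8, where the WMO layouts for editions 1 and 2 place it. `GribHeaderSketch` is a hypothetical name and is unrelated to the attached GribParser.java.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, unrelated to the attached GribParser.java.
class GribHeaderSketch {

    /**
     * Returns the edition number stored in octet 8 of the Indicator Section,
     * or -1 if the data does not start with the ASCII marker "GRIB".
     */
    static int readEdition(byte[] data) {
        if (data.length < 8) {
            return -1;
        }
        String magic = new String(data, 0, 4, StandardCharsets.US_ASCII);
        if (!"GRIB".equals(magic)) {
            return -1;
        }
        return data[7] & 0xFF; // octet 8 is at index 7; mask to read it unsigned
    }

    /** Returns true if the data ends with the ASCII end marker "7777". */
    static boolean hasEndMarker(byte[] data) {
        if (data.length < 4) {
            return false;
        }
        String tail = new String(data, data.length - 4, 4, StandardCharsets.US_ASCII);
        return "7777".equals(tail);
    }

    public static void main(String[] args) {
        // Synthetic bytes for illustration only, not a valid GRIB message.
        byte[] message = {
            'G', 'R', 'I', 'B', 0, 0, 0, 2,
            0, 0, 0, 0, 0, 0, 0, 20,
            '7', '7', '7', '7'
        };
        System.out.println(readEdition(message));  // prints 2
        System.out.println(hasEndMarker(message)); // prints true
    }
}
```

A check like this is the kind of cheap test a MIME detector can run before handing the stream to a full parser; real files may hold several messages back to back, which this sketch does not handle.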
[jira] [Commented] (TIKA-1415) PowerPoint2003 embedded with word. The embedded file can not be detected.
[ https://issues.apache.org/jira/browse/TIKA-1415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14148528#comment-14148528 ] sunxingzhe commented on TIKA-1415: -- The attachment contains the corrected result; please confirm. Attachment file: HSLFExtractor.java_diff.html PowerPoint2003 embedded with word. The embedded file can not be detected. - Key: TIKA-1415 URL: https://issues.apache.org/jira/browse/TIKA-1415 Project: Tika Issue Type: Bug Components: detector, parser Affects Versions: 1.5 Environment: Windows 7 Reporter: sunxingzhe Labels: Tika, poi Attachments: HSLFExtractor.java_diff.html, PowerPointParserTest.java, test.java, word2003.ppt, word2007.ppt When a Word 2003 or Word 2007 document is inserted into a PowerPoint 2003 file as an embedded file, the embedded file's type cannot be detected and its content cannot be parsed.
[jira] [Comment Edited] (TIKA-1415) PowerPoint2003 embedded with word. The embedded file can not be detected.
[ https://issues.apache.org/jira/browse/TIKA-1415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14148528#comment-14148528 ] sunxingzhe edited comment on TIKA-1415 at 9/26/14 2:44 AM: --- The attachment contains the modification result; please confirm. Attachment file: HSLFExtractor.java_diff.html was (Author: sunxingzhe359): The attachment contains the corrected result; please confirm. Attachment file: HSLFExtractor.java_diff.html
Apache Tika - JSON?
Hello all, I was wondering whether there is any built-in parser to help convert XHTML to JSON. My research showed that there is one named org.apache.io.json, which has just one method implemented. I also tried the GJSON library for this, but it does not seem to work with Tika. Any suggestions would be appreciated. Regards, Vineet
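One way to approach the question, sketched under stated assumptions: if the XHTML is available as a well-formed string without a DOCTYPE, the JDK's own XML parser can walk it and emit a simple JSON tree. Everything below is illustrative; `XhtmlToJsonSketch` and the `{"tag": ..., "children": [...]}` shape are made up for this example, attributes are ignored, and no Tika class is involved.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical converter using only the JDK; not a Tika API.
class XhtmlToJsonSketch {

    /** Parses well-formed XHTML and returns a JSON tree of tags and text. */
    static String toJson(String xhtml) {
        try {
            Element root = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)))
                    .getDocumentElement();
            StringBuilder out = new StringBuilder();
            writeElement(root, out);
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XHTML", e);
        }
    }

    // Writes {"tag":"name","children":[...]}; text nodes become JSON strings.
    private static void writeElement(Element element, StringBuilder out) {
        out.append("{\"tag\":").append(quote(element.getTagName()));
        out.append(",\"children\":[");
        NodeList children = element.getChildNodes();
        boolean first = true;
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                if (!first) out.append(',');
                writeElement((Element) child, out);
                first = false;
            } else if (child.getNodeType() == Node.TEXT_NODE
                    && !child.getNodeValue().trim().isEmpty()) {
                if (!first) out.append(',');
                out.append(quote(child.getNodeValue().trim()));
                first = false;
            }
        }
        out.append("]}");
    }

    // Minimal JSON string escaping for quotes, backslashes and control characters.
    private static String quote(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) sb.append(' '); // other control characters become spaces
                    else sb.append(c);
            }
        }
        return sb.append('"').toString();
    }

    public static void main(String[] args) {
        System.out.println(toJson("<html><body><p>Hello</p></body></html>"));
    }
}
```

A JSON library could replace the hand-written `quote` method; the tree walk would stay the same.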