[jira] [Comment Edited] (TIKA-1857) Enhance PDFParser to extract text from XFA forms
[ https://issues.apache.org/jira/browse/TIKA-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15153851#comment-15153851 ]

Maruan Sahyoun edited comment on TIKA-1857 at 2/19/16 7:33 AM:

Sorry for my delay in answering your question. May I propose the following strategy:

a) For static XFA, if there is datasets.data, use that content for the field values; otherwise extract from the AcroForm.
b) For dynamic XFA, scrape/extract info from the XFA.

Why a different proposal for a) from yours? Adobe Reader/Acrobat use the information from datasets.data for the field value over the possibly differing content in the AcroForm (which might happen if the form has been filled out with an XFA-aware processor and afterwards amended with a non-XFA-aware processor).

> Enhance PDFParser to extract text from XFA forms
>
> Key: TIKA-1857
> URL: https://issues.apache.org/jira/browse/TIKA-1857
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Reporter: Pascal Essiembre
> Priority: Trivial
> Labels: patch
> Fix For: 1.13
> Attachments: 041617_filled_out.pdf, xfa_in_govdocs1.txt
>
> Extract text from PDF Forms (XFA). Information about XFA:
> https://en.wikipedia.org/wiki/XFA

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
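Part a) of the strategy above can be sketched roughly as follows. This is only an illustration using the JDK's XML APIs, not Tika's or PDFBox's actual code: the class and method names are hypothetical, the XFA XML is assumed to have already been pulled out of the PDF as a string, the AcroForm values are assumed to be available as a map, and the namespace URI for the datasets packet is an assumption. Part b) (scraping dynamic XFA) is not covered.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: prefer values from datasets.data over AcroForm values.
public class XfaFieldValueSketch {

    // Assumed namespace of the XFA datasets packet.
    static final String XFA_DATA_NS = "http://www.xfa.org/schema/xfa-data/1.0/";

    // Collects leaf element names and their text from every xfa:data element.
    static Map<String, String> datasetsValues(String xfaXml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xfaXml)));
        Map<String, String> values = new LinkedHashMap<>();
        NodeList dataNodes = doc.getElementsByTagNameNS(XFA_DATA_NS, "data");
        for (int i = 0; i < dataNodes.getLength(); i++) {
            for (Node n = dataNodes.item(i).getFirstChild(); n != null; n = n.getNextSibling()) {
                if (n instanceof Element) {
                    collectLeaves((Element) n, values);
                }
            }
        }
        return values;
    }

    // Recursively records elements that have no child elements.
    static void collectLeaves(Element element, Map<String, String> values) {
        boolean hasChildElement = false;
        for (Node n = element.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Element) {
                hasChildElement = true;
                collectLeaves((Element) n, values);
            }
        }
        if (!hasChildElement) {
            values.put(element.getLocalName(), element.getTextContent().trim());
        }
    }

    // Static XFA: use datasets.data if it holds values, otherwise the AcroForm values.
    static Map<String, String> fieldValues(String xfaXml, Map<String, String> acroFormValues)
            throws Exception {
        Map<String, String> fromDatasets = datasetsValues(xfaXml);
        return fromDatasets.isEmpty() ? acroFormValues : fromDatasets;
    }
}
```

Keying the result on the leaf element name is a simplification; a real form can repeat field names in different subforms, so a fuller version would likely key on the element path instead.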
[jira] [Commented] (TIKA-1851) Tika 2.0 - Move test resources from core to test-resources
[ https://issues.apache.org/jira/browse/TIKA-1851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15153717#comment-15153717 ]

Hudson commented on TIKA-1851:

UNSTABLE: Integrated in tika-2.x #27 (See [https://builds.apache.org/job/tika-2.x/27/])
TIKA-1851: remove dependency in tika-examples on tika-core-tests.jar (tallison: rev 8debbe1c5441cdd0955ee9634f302f537be3d69e)
* tika-parser-modules/tika-parser-database-module/pom.xml
* CHANGES.txt

> Tika 2.0 - Move test resources from core to test-resources
>
> Key: TIKA-1851
> URL: https://issues.apache.org/jira/browse/TIKA-1851
> Project: Tika
> Issue Type: Sub-task
> Reporter: Tim Allison
> Assignee: Tim Allison
> Priority: Trivial
> Fix For: 2.0
> Attachments: tika_2x_test_files_and_modules.xlsx
>
> Let's try to move resources that are used for testing to the test-resources
> module if possible: MockParser, DummyParser, TikaTest and the unit tests for
> MockParser. That should also allow us to drop the test-jar goal in
> tika-core. Anything else?
> Haven't actually tried this yet; there may be surprises.
[jira] [Resolved] (TIKA-1861) Upgrade to sqlite-jdbc 3.8.11.2
[ https://issues.apache.org/jira/browse/TIKA-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison resolved TIKA-1861.

Resolution: Fixed
Fix Version/s: 1.13

> Upgrade to sqlite-jdbc 3.8.11.2
>
> Key: TIKA-1861
> URL: https://issues.apache.org/jira/browse/TIKA-1861
> Project: Tika
> Issue Type: Improvement
> Reporter: Tim Allison
> Priority: Trivial
> Fix For: 1.13
[jira] [Created] (TIKA-1861) Upgrade to sqlite-jdbc 3.8.11.2
Tim Allison created TIKA-1861:

Summary: Upgrade to sqlite-jdbc 3.8.11.2
Key: TIKA-1861
URL: https://issues.apache.org/jira/browse/TIKA-1861
Project: Tika
Issue Type: Improvement
Reporter: Tim Allison
Priority: Trivial
notes for individual parsers on our wiki
Chris et al. have done a great job on our wiki with instructions for the advanced parsers. I thought it might be helpful to add a section with notes on the "classic" parsers, covering their use (building, integrating, configuring) and anything that users might find surprising.

I created a link from our front page to: https://wiki.apache.org/tika/TikaParserNotes

Cheers,
Tim
[jira] [Comment Edited] (TIKA-1859) file poi reads tika does not bring the content
[ https://issues.apache.org/jira/browse/TIKA-1859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15153502#comment-15153502 ]

Tim Allison edited comment on TIKA-1859 at 2/19/16 1:37 AM:

To build Tika from trunk with the latest version of POI, see our new wiki page: https://wiki.apache.org/tika/MSOfficeParsers. Before you build Tika, you'll need to apply this patch. I'm going to wait to make any commits to Tika until the next version of POI is released.

If you have any questions about how to build and integrate both projects from trunk, please ask on the u...@tika.apache.org list.

> file poi reads tika does not bring the content
>
> Key: TIKA-1859
> URL: https://issues.apache.org/jira/browse/TIKA-1859
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.12
> Reporter: Movses
> Priority: Blocker
> Attachments: testing.Xlsx, upgrade_to_POI_3_14_beta2.patch
>
> I have a file xlsx I'm able to read and process in using poi but in tika it
> does not extract the content of the file
[jira] [Updated] (TIKA-1859) file poi reads tika does not bring the content
[ https://issues.apache.org/jira/browse/TIKA-1859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison updated TIKA-1859:

Attachment: upgrade_to_POI_3_14_beta2.patch

To build Tika from trunk with the latest version of POI, see our new wiki page: https://wiki.apache.org/tika/MSOfficeParsers. Before you build Tika, you'll need to apply this patch. I'm going to wait to make any commits to Tika until the next version of POI is released.

If you have any questions about how to build and integrate both projects from trunk, please ask on the u...@tika.apache.org list.

> file poi reads tika does not bring the content
>
> Key: TIKA-1859
> URL: https://issues.apache.org/jira/browse/TIKA-1859
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.12
> Reporter: Movses
> Priority: Blocker
> Attachments: testing.Xlsx, upgrade_to_POI_3_14_beta2.patch
>
> I have a file xlsx I'm able to read and process in using poi but in tika it
> does not extract the content of the file
[jira] [Commented] (TIKA-1860) Tika 2.0 - Create Module OSGi implementations to replace tika-bundle
[ https://issues.apache.org/jira/browse/TIKA-1860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15153497#comment-15153497 ]

Bob Paulin commented on TIKA-1860:

I'd like to propose a means of replacing the tika-bundle project by creating individual bundles for each module that would inline dependencies just as the tika-bundle did, except at the module level. My current thinking is we could do this with a classifier called bundle that would build an additional JAR file for each module. This goes slightly against the Maven model of one artifact per pom, but would avoid needing a separate project for each module as I have now (see https://github.com/apache/tika/tree/2.x/tika-parser-bundles/tika-multimedia-bundle ). Not sure if there are other opinions on this from the community.

The proposed changes are in a branch of 2.x here:
https://github.com/apache/tika/compare/2.x...bundle-classifier

> Tika 2.0 - Create Module OSGi implementations to replace tika-bundle
>
> Key: TIKA-1860
> URL: https://issues.apache.org/jira/browse/TIKA-1860
> Project: Tika
> Issue Type: Sub-task
> Reporter: Bob Paulin
> Assignee: Bob Paulin
>
> Create a replacement for the OSGi tika-bundle project out of the new
> tika-parser-* modules
[jira] [Created] (TIKA-1860) Tika 2.0 - Create Module OSGi implementations to replace tika-bundle
Bob Paulin created TIKA-1860:

Summary: Tika 2.0 - Create Module OSGi implementations to replace tika-bundle
Key: TIKA-1860
URL: https://issues.apache.org/jira/browse/TIKA-1860
Project: Tika
Issue Type: Sub-task
Reporter: Bob Paulin
Assignee: Bob Paulin

Create a replacement for the OSGi tika-bundle project out of the new tika-parser-* modules
[jira] [Commented] (TIKA-1859) file poi reads tika does not bring the content
[ https://issues.apache.org/jira/browse/TIKA-1859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15152825#comment-15152825 ]

Movses commented on TIKA-1859:

OK Tim, no problem. Just do the commit and give me the instructions, and I'll do it.

> file poi reads tika does not bring the content
>
> Key: TIKA-1859
> URL: https://issues.apache.org/jira/browse/TIKA-1859
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.12
> Reporter: Movses
> Priority: Blocker
> Attachments: testing.Xlsx
>
> I have a file xlsx I'm able to read and process in using poi but in tika it
> does not extract the content of the file
[jira] [Commented] (TIKA-1859) file poi reads tika does not bring the content
[ https://issues.apache.org/jira/browse/TIKA-1859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15152653#comment-15152653 ]

Tim Allison commented on TIKA-1859:

Hi [~mkiredjian],

I'm sorry, I can't do a personal build for you. If you ask on the users list, I can give you instructions on how to build both trunks (POI and Tika) and you'll have a hot-off-the-press tika-app.

Before that's possible, though, I need to commit one change to Tika's XSSFExcelExtractorDecorator to make the parser namespace aware; otherwise the fix in POI doesn't work.

> file poi reads tika does not bring the content
>
> Key: TIKA-1859
> URL: https://issues.apache.org/jira/browse/TIKA-1859
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.12
> Reporter: Movses
> Priority: Blocker
> Attachments: testing.Xlsx
>
> I have a file xlsx I'm able to read and process in using poi but in tika it
> does not extract the content of the file
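For readers unfamiliar with what "namespace aware" means here, the following is a small, generic sketch using the JDK's SAX API. It is not the actual change to XSSFExcelExtractorDecorator; the class name, method name, and sample element names are made up for illustration. With namespace awareness switched on, the handler receives the local element name regardless of which prefix the file happens to use, which is presumably why prefixed spreadsheet XML needs it.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; not part of Tika or POI.
public class NamespaceAwareSketch {

    // Returns the local names of all elements, parsed with namespace awareness on.
    static List<String> localNames(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        // Without this call, JAXP factories default to non-namespace-aware parsing.
        factory.setNamespaceAware(true);
        List<String> names = new ArrayList<>();
        factory.newSAXParser().parse(new InputSource(new StringReader(xml)),
                new DefaultHandler() {
                    @Override
                    public void startElement(String uri, String localName,
                                             String qName, Attributes atts) {
                        // localName has the prefix stripped; qName would be "x:c" etc.
                        names.add(localName);
                    }
                });
        return names;
    }
}
```

A handler that matches on local names then behaves the same whether a document writes `<c>` in a default namespace or `<x:c>` with an explicit prefix.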