[jira] [Comment Edited] (TIKA-1857) Enhance PDFParser to extract text from XFA forms

2016-02-18 Thread Maruan Sahyoun (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15153851#comment-15153851
 ] 

Maruan Sahyoun edited comment on TIKA-1857 at 2/19/16 7:33 AM:
---

Sorry for my delay in answering your question.

May I propose the following strategy:

a) For static XFA, if datasets.data is present, use its content for the field 
values; otherwise extract them from the AcroForm.
b) For dynamic XFA, scrape/extract the information from the XFA.

Why does my proposal for a) differ from yours? Adobe Reader/Acrobat prefer the 
information in datasets.data for the field value over possibly differing 
content in the AcroForm (the two can diverge if the form was filled out with an 
XFA-aware processor and afterwards amended with a non-XFA-aware processor).
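
The selection logic proposed above could be sketched roughly as follows. This is 
only an illustration: the class, enum, and method names are hypothetical and are 
not Tika or PDFBox APIs, and detecting whether a form is dynamic or whether 
datasets.data exists is assumed to happen elsewhere.

```java
/** Illustrative sketch only; none of these names exist in Tika or PDFBox. */
final class XfaStrategySketch {

    /** Where the field values would be read from. */
    enum Source { XFA_DATASETS, ACROFORM, XFA_SCRAPE }

    /**
     * Picks the source of field values.
     *
     * @param dynamicXfa      true if the form is dynamic XFA, false if static
     * @param hasDatasetsData true if the XFA contains datasets.data
     */
    static Source chooseSource(boolean dynamicXfa, boolean hasDatasetsData) {
        if (dynamicXfa) {
            // (b) dynamic XFA: scrape/extract from the XFA itself
            return Source.XFA_SCRAPE;
        }
        // (a) static XFA: prefer datasets.data, fall back to the AcroForm
        return hasDatasetsData ? Source.XFA_DATASETS : Source.ACROFORM;
    }

    public static void main(String[] args) {
        System.out.println(chooseSource(false, true));  // XFA_DATASETS
        System.out.println(chooseSource(false, false)); // ACROFORM
        System.out.println(chooseSource(true, true));   // XFA_SCRAPE
    }
}
```

Preferring datasets.data for static XFA mirrors the Adobe Reader/Acrobat 
behaviour described above.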


> Enhance PDFParser to extract text from XFA forms
> 
>
> Key: TIKA-1857
> URL: https://issues.apache.org/jira/browse/TIKA-1857
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Reporter: Pascal Essiembre
> Priority: Trivial
> Labels: patch
> Fix For: 1.13
>
> Attachments: 041617_filled_out.pdf, xfa_in_govdocs1.txt
>
>
> Extract text from PDF Forms (XFA).  Information about XFA: 
> https://en.wikipedia.org/wiki/XFA



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1857) Enhance PDFParser to extract text from XFA forms

2016-02-18 Thread Maruan Sahyoun (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15153851#comment-15153851
 ] 

Maruan Sahyoun commented on TIKA-1857:
--

Sorry for my delay in answering your question.

May I propose the following strategy:

a) For static XFA, if datasets.data is present, use its content for the field 
values; otherwise extract them from the AcroForm.
b) For dynamic XFA, scrape/extract the information from the XFA.

Why does my proposal for a) differ from yours? Adobe Reader/Acrobat prefer the 
information in datasets.data for the field value over possibly differing 
content in the AcroForm (the two can diverge if the form was filled out with an 
XFA-aware processor and afterwards amended with a non-XFA-aware processor).

> Enhance PDFParser to extract text from XFA forms
> 
>
> Key: TIKA-1857
> URL: https://issues.apache.org/jira/browse/TIKA-1857
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Reporter: Pascal Essiembre
> Priority: Trivial
> Labels: patch
> Fix For: 1.13
>
> Attachments: 041617_filled_out.pdf, xfa_in_govdocs1.txt
>
>
> Extract text from PDF Forms (XFA).  Information about XFA: 
> https://en.wikipedia.org/wiki/XFA





[jira] [Commented] (TIKA-1851) Tika 2.0 - Move test resources from core to test-resources

2016-02-18 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15153717#comment-15153717
 ] 

Hudson commented on TIKA-1851:
--

UNSTABLE: Integrated in tika-2.x #27 (See 
[https://builds.apache.org/job/tika-2.x/27/])
TIKA-1851: remove dependency in tika-examples on tika-core-tests.jar (tallison: 
rev 8debbe1c5441cdd0955ee9634f302f537be3d69e)
* tika-parser-modules/tika-parser-database-module/pom.xml
* CHANGES.txt


> Tika 2.0 - Move test resources from core to test-resources
> --
>
> Key: TIKA-1851
> URL: https://issues.apache.org/jira/browse/TIKA-1851
> Project: Tika
> Issue Type: Sub-task
> Reporter: Tim Allison
> Assignee: Tim Allison
> Priority: Trivial
> Fix For: 2.0
>
> Attachments: tika_2x_test_files_and_modules.xlsx
>
>
> Let's try to move resources that are used for testing to the test-resources 
> module if possible: MockParser, DummyParser, TikaTest and the unit tests for 
> MockParser.  That should also allow us to drop the test-jar goal in 
> tika-core.  Anything else?
> Haven't actually tried this yet; there may be surprises.





[jira] [Resolved] (TIKA-1861) Upgrade to sqlite-jdbc 3.8.11.2

2016-02-18 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-1861.
---
   Resolution: Fixed
Fix Version/s: 1.13

> Upgrade to sqlite-jdbc 3.8.11.2
> ---
>
> Key: TIKA-1861
> URL: https://issues.apache.org/jira/browse/TIKA-1861
> Project: Tika
> Issue Type: Improvement
> Reporter: Tim Allison
> Priority: Trivial
> Fix For: 1.13
>
>






[jira] [Created] (TIKA-1861) Upgrade to sqlite-jdbc 3.8.11.2

2016-02-18 Thread Tim Allison (JIRA)
Tim Allison created TIKA-1861:
-

 Summary: Upgrade to sqlite-jdbc 3.8.11.2
 Key: TIKA-1861
 URL: https://issues.apache.org/jira/browse/TIKA-1861
 Project: Tika
 Issue Type: Improvement
 Reporter: Tim Allison
 Priority: Trivial








notes for individual parsers on our wiki

2016-02-18 Thread Allison, Timothy B.
Chris et al. have done a great job on our wiki with instructions for the 
advanced parsers.

I thought it might be helpful to add a section for notes on the "classic" 
parsers: how to use them (building, integrating, configuring) and anything 
that users might find surprising.

I created a link from our front page to: 
https://wiki.apache.org/tika/TikaParserNotes

Cheers,

 Tim


[jira] [Comment Edited] (TIKA-1859) file poi reads tika does not bring the content

2016-02-18 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15153502#comment-15153502
 ] 

Tim Allison edited comment on TIKA-1859 at 2/19/16 1:37 AM:


To build Tika from trunk with the latest version of POI, see our new [wiki 
page|https://wiki.apache.org/tika/MSOfficeParsers].  Before you build Tika, 
you'll need to apply the attached patch (upgrade_to_POI_3_14_beta2.patch).  
I'm going to wait to make any commits to Tika until the next version of POI 
is released.

If you have any questions about how to build and integrate both projects from 
trunk, please ask on the u...@tika.apache.org list.


> file poi reads tika does not bring the content
> --
>
> Key: TIKA-1859
> URL: https://issues.apache.org/jira/browse/TIKA-1859
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.12
> Reporter: Movses
> Priority: Blocker
> Attachments: testing.Xlsx, upgrade_to_POI_3_14_beta2.patch
>
>
> I have an xlsx file that I'm able to read and process using POI, but Tika 
> does not extract the content of the file.





[jira] [Comment Edited] (TIKA-1859) file poi reads tika does not bring the content

2016-02-18 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15153502#comment-15153502
 ] 

Tim Allison edited comment on TIKA-1859 at 2/19/16 1:36 AM:


To build Tika from trunk with the latest version of POI, see our new 
[[https://wiki.apache.org/tika/MSOfficeParsers|wiki page]].  Before you build 
Tika, you'll need to apply this patch.  I'm going to wait to make any commits 
to Tika until the next version of POI is released.

If you have any questions about how to build and integrate both projects from 
trunk, please ask on the u...@tika.apache.org list.


> file poi reads tika does not bring the content
> --
>
> Key: TIKA-1859
> URL: https://issues.apache.org/jira/browse/TIKA-1859
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.12
> Reporter: Movses
> Priority: Blocker
> Attachments: testing.Xlsx, upgrade_to_POI_3_14_beta2.patch
>
>
> I have an xlsx file that I'm able to read and process using POI, but Tika 
> does not extract the content of the file.





[jira] [Updated] (TIKA-1859) file poi reads tika does not bring the content

2016-02-18 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1859:
--
Attachment: upgrade_to_POI_3_14_beta2.patch

To build Tika from trunk with the latest version of POI, see our new 
[https://wiki.apache.org/tika/MSOfficeParsers|wiki page].  Before you build 
Tika, you'll need to apply this patch.  I'm going to wait to make any commits 
to Tika until the next version of POI is released.

If you have any questions about how to build and integrate both projects from 
trunk, please ask on the u...@tika.apache.org list.

> file poi reads tika does not bring the content
> --
>
> Key: TIKA-1859
> URL: https://issues.apache.org/jira/browse/TIKA-1859
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.12
> Reporter: Movses
> Priority: Blocker
> Attachments: testing.Xlsx, upgrade_to_POI_3_14_beta2.patch
>
>
> I have an xlsx file that I'm able to read and process using POI, but Tika 
> does not extract the content of the file.





[jira] [Commented] (TIKA-1860) Tika 2.0 - Create Module OSGi implementations to replace tika-bundle

2016-02-18 Thread Bob Paulin (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15153497#comment-15153497
 ] 

Bob Paulin commented on TIKA-1860:
--

I'd like to propose replacing the tika-bundle project by creating individual 
bundles for each module that would inline dependencies just as tika-bundle 
did, except at the module level.  My current thinking is that we could do 
this with a classifier called bundle that would build an additional JAR file 
for each module.  This goes slightly against the Maven model of one artifact 
per pom, but it would avoid separate projects for each module as I have now (see 
https://github.com/apache/tika/tree/2.x/tika-parser-bundles/tika-multimedia-bundle
 ). I'm not sure whether there are other opinions on this from the community.  
The proposed changes are in a branch of 2.x here:

https://github.com/apache/tika/compare/2.x...bundle-classifier

> Tika 2.0 - Create Module OSGi implementations to replace tika-bundle
> 
>
> Key: TIKA-1860
> URL: https://issues.apache.org/jira/browse/TIKA-1860
> Project: Tika
> Issue Type: Sub-task
> Reporter: Bob Paulin
> Assignee: Bob Paulin
>
> Create a replacement for the OSGi tika-bundle project out of the new 
> tika-parser-* modules





[jira] [Created] (TIKA-1860) Tika 2.0 - Create Module OSGi implementations to replace tika-bundle

2016-02-18 Thread Bob Paulin (JIRA)
Bob Paulin created TIKA-1860:


 Summary: Tika 2.0 - Create Module OSGi implementations to replace tika-bundle
 Key: TIKA-1860
 URL: https://issues.apache.org/jira/browse/TIKA-1860
 Project: Tika
 Issue Type: Sub-task
 Reporter: Bob Paulin
 Assignee: Bob Paulin


Create a replacement for the OSGi tika-bundle project out of the new 
tika-parser-* modules





[jira] [Commented] (TIKA-1859) file poi reads tika does not bring the content

2016-02-18 Thread Movses (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15152825#comment-15152825
 ] 

Movses commented on TIKA-1859:
--

OK Tim, no problem. Just do the commit and give me the instructions, and I'll 
build it.


> file poi reads tika does not bring the content
> --
>
> Key: TIKA-1859
> URL: https://issues.apache.org/jira/browse/TIKA-1859
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.12
> Reporter: Movses
> Priority: Blocker
> Attachments: testing.Xlsx
>
>
> I have an xlsx file that I'm able to read and process using POI, but Tika 
> does not extract the content of the file.





[jira] [Commented] (TIKA-1859) file poi reads tika does not bring the content

2016-02-18 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15152653#comment-15152653
 ] 

Tim Allison commented on TIKA-1859:
---

Hi [~mkiredjian],
  I'm sorry, I can't do a personal build for you.  If you ask on the users 
list, I can give you instructions on how to build both trunks (POI and Tika) 
and you'll have a hot-off-the-press tika-app.
  Before that's possible, though, I need to commit one change to Tika's 
XSSFExcelExtractorDecorator to make the parser namespace aware, otherwise the 
fix in POI doesn't work.
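
For readers unfamiliar with what "namespace aware" means for a SAX parser, here 
is a minimal, generic JAXP sketch. It is not the actual change to 
XSSFExcelExtractorDecorator, which is not shown in this thread; the class name, 
method name, and sample XML are made up for illustration. With namespace 
awareness switched on, the handler receives the local name of each element 
without its prefix; with it off, the SAX contract does not guarantee a local 
name, so a handler that matches on local names may not match anything.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

/** Illustrative sketch only; not code from Tika or POI. */
final class NamespaceAwareSketch {

    /** Parses the XML and collects the localName reported for each start tag. */
    static List<String> localNames(String xml, boolean namespaceAware) {
        List<String> names = new ArrayList<>();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            // The default is false, so this must be requested explicitly.
            factory.setNamespaceAware(namespaceAware);
            SAXParser parser = factory.newSAXParser();
            parser.parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)),
                    new DefaultHandler() {
                        @Override
                        public void startElement(String uri, String localName,
                                                 String qName, Attributes atts) {
                            names.add(localName);
                        }
                    });
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse sample XML", e);
        }
        return names;
    }

    public static void main(String[] args) {
        // Hypothetical prefixed XML; the namespace URI is a placeholder.
        String xml = "<x:sheet xmlns:x=\"urn:example\"><x:c/></x:sheet>";
        System.out.println(localNames(xml, true)); // [sheet, c]
    }
}
```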

> file poi reads tika does not bring the content
> --
>
> Key: TIKA-1859
> URL: https://issues.apache.org/jira/browse/TIKA-1859
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.12
> Reporter: Movses
> Priority: Blocker
> Attachments: testing.Xlsx
>
>
> I have an xlsx file that I'm able to read and process using POI, but Tika 
> does not extract the content of the file.


