[jira] [Commented] (TIKA-1857) Enhance PDFParser to extract text from XFA forms

2017-02-13 Thread Kenneth Lui (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15864963#comment-15864963
 ] 

Kenneth Lui commented on TIKA-1857:
---

Hi, I tried to use this feature, but it doesn't seem to work. I understand this 
is not the right place for troubleshooting questions, so I put the details on 
Stack Overflow:
http://stackoverflow.com/questions/42217327/apache-tika-extract-only-field-names-from-pdf-xfa-forms-but-not-the-text-content
Could you please help me determine whether I misconfigured Tika or whether 
this is an issue with the feature implementation? Thanks!

> Enhance PDFParser to extract text from XFA forms
> ---
>
> Key: TIKA-1857
> URL: https://issues.apache.org/jira/browse/TIKA-1857
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Reporter: Pascal Essiembre
> Labels: patch
> Fix For: 1.13
>
> Attachments: 041617_filled_out.pdf, govdocs1_xfas.zip, 
> xfa_in_govdocs1.txt
>
>
> Extract text from PDF Forms (XFA).  Information about XFA: 
> https://en.wikipedia.org/wiki/XFA



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (TIKA-2264) Better handling of footnotes/endnotes for ODF files

2017-02-13 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15863841#comment-15863841
 ] 

Tim Allison commented on TIKA-2264:
---

Can you submit an example document and some unit tests for what behavior you 
want?  Thank you, again!

> Better handling of footnotes/endnotes for ODF files
> ---
>
> Key: TIKA-2264
> URL: https://issues.apache.org/jira/browse/TIKA-2264
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Affects Versions: 1.14
> Environment: N/A
> Reporter: Mike Rodent
> Priority: Minor
> Labels: newbie
> Attachments: ImprovedODFContentParser.java
>
>
> Springs from my question here 
> (http://stackoverflow.com/questions/42031237/modify-apache-tika-parsing-of-old-1997-2003-ms-word-docs)
>  ... I have improved the class OpenDocumentContentParser so that it puts 
> footnotes/endnotes at the end of the line to which they belong and doesn't 
> break up the line in question.  As with .docx parsing, the notes can be linked 
> to the reference easily.  The respondent on Stack Overflow suggested I open an 
> issue here...





[jira] [Commented] (TIKA-2265) Problem with footnotes/endnotes in Tika.parseToString with MS Word (.docx) files

2017-02-13 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15863722#comment-15863722
 ] 

Tim Allison commented on TIKA-2265:
---

Are you able to share a triggering docx? The ones in our unit tests aren't 
showing the problem. Thank you, again.

> Problem with footnotes/endnotes in Tika.parseToString with MS Word (.docx) 
> files
> ---
>
> Key: TIKA-2265
> URL: https://issues.apache.org/jira/browse/TIKA-2265
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.14
> Environment: N/A
> Reporter: Mike Rodent
> Assignee: Tim Allison
> Priority: Minor
> Labels: newbie
>
> It seems that a footnote numbered "1" in the real document is output by 
> Tika.parseToString() as "2", both in the footnote reference and in the 
> corresponding footnote body text; real footnote "2" becomes "3", "3" becomes 
> "4", etc.  I have not yet looked at the source code, but I can't imagine it 
> would be difficult to correct.





[jira] [Assigned] (TIKA-2265) Problem with footnotes/endnotes in Tika.parseToString with MS Word (.docx) files

2017-02-13 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison reassigned TIKA-2265:
---

Assignee: Tim Allison






[jira] [Commented] (TIKA-2242) opendocument parsing produces malformed xml

2017-02-13 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15863680#comment-15863680
 ] 

Tim Allison commented on TIKA-2242:
---

Ha.  Sorry, now, for my delay.  I'll work on this over the next few days in 
coordination with TIKA-2264.

> opendocument parsing produces malformed xml
> ---
>
> Key: TIKA-2242
> URL: https://issues.apache.org/jira/browse/TIKA-2242
> Project: Tika
> Issue Type: Bug
> Components: handler, parser
> Affects Versions: 1.13, 1.14
> Reporter: Jan Van Raemdonck
> Assignee: Tim Allison
> Fix For: 2.0, 1.15
>
> Attachments: 2017-01-02-16B833-16B833VANCAUTEREN.odt, 
> 2017-02-01-15B96Ghijsens-17B96GHIJSENS.odt
>
>
> For some ODT documents, parsing produces malformed XML.





[jira] [Commented] (TIKA-2265) Problem with footnotes/endnotes in Tika.parseToString with MS Word (.docx) files

2017-02-13 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15863681#comment-15863681
 ] 

Tim Allison commented on TIKA-2265:
---

Thank you for opening this.  I'll take a look.






[jira] [Commented] (TIKA-2264) Better handling of footnotes/endnotes for ODF files

2017-02-13 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15863678#comment-15863678
 ] 

Tim Allison commented on TIKA-2264:
---

Thank you for opening this and submitting a patch.  I'll take a look over the 
next few days...prob. Wed or after that.






[jira] [Commented] (TIKA-2264) Better handling of footnotes/endnotes for ODF files

2017-02-13 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15863677#comment-15863677
 ] 

Tim Allison commented on TIKA-2264:
---

These two are likely related.



