[jira] [Commented] (TIKA-1857) Enhance PDFParser to extract text from XFA forms
[ https://issues.apache.org/jira/browse/TIKA-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15864963#comment-15864963 ]

Kenneth Lui commented on TIKA-1857:
---
Hi, I tried to use this feature, but it doesn't seem to work. I understand this is not the right place for troubleshooting questions, so I put the details at http://stackoverflow.com/questions/42217327/apache-tika-extract-only-field-names-from-pdf-xfa-forms-but-not-the-text-content . Could you please help me determine whether I misconfigured Tika or whether this is an issue with the feature implementation? Thanks!

> Enhance PDFParser to extract text from XFA forms
>
> Key: TIKA-1857
> URL: https://issues.apache.org/jira/browse/TIKA-1857
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Reporter: Pascal Essiembre
> Labels: patch
> Fix For: 1.13
> Attachments: 041617_filled_out.pdf, govdocs1_xfas.zip, xfa_in_govdocs1.txt
>
> Extract text from PDF Forms (XFA). Information about XFA: https://en.wikipedia.org/wiki/XFA

--
This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (TIKA-2264) Better handling of footnotes/endnotes for ODF files
[ https://issues.apache.org/jira/browse/TIKA-2264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15863841#comment-15863841 ]

Tim Allison commented on TIKA-2264:
---
Can you submit an example document and some unit tests for what behavior you want? Thank you, again!

> Better handling of footnotes/endnotes for ODF files
> ---
>
> Key: TIKA-2264
> URL: https://issues.apache.org/jira/browse/TIKA-2264
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Affects Versions: 1.14
> Environment: N/A
> Reporter: Mike Rodent
> Priority: Minor
> Labels: newbie
> Attachments: ImprovedODFContentParser.java
>
> Springs from my question here (http://stackoverflow.com/questions/42031237/modify-apache-tika-parsing-of-old-1997-2003-ms-word-docs) ... I have improved the class OpenDocumentContentParser so that it puts footnotes/endnotes at the end of the line to which they belong and doesn't break up the line in question. As with .docx parsing the notes can be linked to the reference easily. The respondee in Stack Overflow suggested I open an issue here...
[jira] [Commented] (TIKA-2265) Problem with footnotes/endnotes in Tika.parseToString with MS Word (.docx) files
[ https://issues.apache.org/jira/browse/TIKA-2265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15863722#comment-15863722 ]

Tim Allison commented on TIKA-2265:
---
Are you able to share a triggering docx? The ones in our unit tests aren't showing the problem. Thank you, again.

> Problem with footnotes/endnotes in Tika.parseToString with MS Word (.docx) files
>
> Key: TIKA-2265
> URL: https://issues.apache.org/jira/browse/TIKA-2265
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.14
> Environment: N/A
> Reporter: Mike Rodent
> Assignee: Tim Allison
> Priority: Minor
> Labels: newbie
>
> It seems to be the case that a footnote numbered "1" in the real document will be output by Tika.parseToString() as "2" in the footnote reference, and "2" in the corresponding footnote body text; real footnote "2" becomes "3", "3" becomes "4", etc. Have not yet looked at source code ... I can't imagine it would be difficult to correct this.
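The symptom reported in TIKA-2265 (every footnote number shifted up by one) can be checked mechanically once the extracted text is in hand. The following is only a minimal sketch in plain Java, not Tika code: the class `FootnoteShiftCheck`, its method names, and the idea of passing the marker format as a regular expression are all assumptions made for illustration, since the thread does not say how the footnote references appear in the output.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.OptionalInt;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tika: compares the footnote numbers
// expected from the real document with the numbers found in extracted text.
final class FootnoteShiftCheck {

    // Collects the integers captured by group 1 of markerPattern,
    // in order of appearance. The marker format is supplied by the caller
    // because the actual format in Tika's output is not stated in the thread.
    static List<Integer> extractNumbers(String text, Pattern markerPattern) {
        List<Integer> numbers = new ArrayList<>();
        Matcher m = markerPattern.matcher(text);
        while (m.find()) {
            numbers.add(Integer.parseInt(m.group(1)));
        }
        return numbers;
    }

    // Returns the offset (observed minus expected) when it is the same for
    // every pair; returns empty when the lists are empty, differ in size,
    // or the offset varies.
    static OptionalInt constantShift(List<Integer> expected, List<Integer> observed) {
        if (expected.isEmpty() || expected.size() != observed.size()) {
            return OptionalInt.empty();
        }
        int shift = observed.get(0) - expected.get(0);
        for (int i = 1; i < expected.size(); i++) {
            if (observed.get(i) - expected.get(i) != shift) {
                return OptionalInt.empty();
            }
        }
        return OptionalInt.of(shift);
    }
}
```

In a unit test, the string returned by Tika.parseToString() for a sample .docx would be passed to `extractNumbers`, and a non-zero result from `constantShift` would indicate the kind of renumbering described in the report.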
[jira] [Assigned] (TIKA-2265) Problem with footnotes/endnotes in Tika.parseToString with MS Word (.docx) files
[ https://issues.apache.org/jira/browse/TIKA-2265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison reassigned TIKA-2265:
---
Assignee: Tim Allison

> Problem with footnotes/endnotes in Tika.parseToString with MS Word (.docx) files
>
> Key: TIKA-2265
> URL: https://issues.apache.org/jira/browse/TIKA-2265
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.14
> Environment: N/A
> Reporter: Mike Rodent
> Assignee: Tim Allison
> Priority: Minor
> Labels: newbie
>
> It seems to be the case that a footnote numbered "1" in the real document will be output by Tika.parseToString() as "2" in the footnote reference, and "2" in the corresponding footnote body text; real footnote "2" becomes "3", "3" becomes "4", etc. Have not yet looked at source code ... I can't imagine it would be difficult to correct this.
[jira] [Commented] (TIKA-2242) opendocument parsing produces malformed xml
[ https://issues.apache.org/jira/browse/TIKA-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15863680#comment-15863680 ]

Tim Allison commented on TIKA-2242:
---
Ha. Sorry, now, for my delay. I'll work on this over the next few days in coordination with TIKA-2264.

> opendocument parsing produces malformed xml
> ---
>
> Key: TIKA-2242
> URL: https://issues.apache.org/jira/browse/TIKA-2242
> Project: Tika
> Issue Type: Bug
> Components: handler, parser
> Affects Versions: 1.13, 1.14
> Reporter: Jan Van Raemdonck
> Assignee: Tim Allison
> Fix For: 2.0, 1.15
> Attachments: 2017-01-02-16B833-16B833VANCAUTEREN.odt, 2017-02-01-15B96Ghijsens-17B96GHIJSENS.odt
>
> For some odt documents, malformed XML is produced when parsing.
[jira] [Commented] (TIKA-2265) Problem with footnotes/endnotes in Tika.parseToString with MS Word (.docx) files
[ https://issues.apache.org/jira/browse/TIKA-2265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15863681#comment-15863681 ]

Tim Allison commented on TIKA-2265:
---
Thank you for opening this. I'll take a look.

> Problem with footnotes/endnotes in Tika.parseToString with MS Word (.docx) files
>
> Key: TIKA-2265
> URL: https://issues.apache.org/jira/browse/TIKA-2265
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.14
> Environment: N/A
> Reporter: Mike Rodent
> Assignee: Tim Allison
> Priority: Minor
> Labels: newbie
>
> It seems to be the case that a footnote numbered "1" in the real document will be output by Tika.parseToString() as "2" in the footnote reference, and "2" in the corresponding footnote body text; real footnote "2" becomes "3", "3" becomes "4", etc. Have not yet looked at source code ... I can't imagine it would be difficult to correct this.
[jira] [Commented] (TIKA-2264) Better handling of footnotes/endnotes for ODF files
[ https://issues.apache.org/jira/browse/TIKA-2264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15863678#comment-15863678 ]

Tim Allison commented on TIKA-2264:
---
Thank you for opening this and submitting a patch. I'll take a look over the next few days...prob. Wed or after that.

> Better handling of footnotes/endnotes for ODF files
> ---
>
> Key: TIKA-2264
> URL: https://issues.apache.org/jira/browse/TIKA-2264
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Affects Versions: 1.14
> Environment: N/A
> Reporter: Mike Rodent
> Priority: Minor
> Labels: newbie
> Attachments: ImprovedODFContentParser.java
>
> Springs from my question here (http://stackoverflow.com/questions/42031237/modify-apache-tika-parsing-of-old-1997-2003-ms-word-docs) ... I have improved the class OpenDocumentContentParser so that it puts footnotes/endnotes at the end of the line to which they belong and doesn't break up the line in question. As with .docx parsing the notes can be linked to the reference easily. The respondee in Stack Overflow suggested I open an issue here...
[jira] [Commented] (TIKA-2264) Better handling of footnotes/endnotes for ODF files
[ https://issues.apache.org/jira/browse/TIKA-2264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15863677#comment-15863677 ]

Tim Allison commented on TIKA-2264:
---
These two are likely related.

> Better handling of footnotes/endnotes for ODF files
> ---
>
> Key: TIKA-2264
> URL: https://issues.apache.org/jira/browse/TIKA-2264
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Affects Versions: 1.14
> Environment: N/A
> Reporter: Mike Rodent
> Priority: Minor
> Labels: newbie
> Attachments: ImprovedODFContentParser.java
>
> Springs from my question here (http://stackoverflow.com/questions/42031237/modify-apache-tika-parsing-of-old-1997-2003-ms-word-docs) ... I have improved the class OpenDocumentContentParser so that it puts footnotes/endnotes at the end of the line to which they belong and doesn't break up the line in question. As with .docx parsing the notes can be linked to the reference easily. The respondee in Stack Overflow suggested I open an issue here...