[jira] [Commented] (TIKA-4171) Tika server only returns last value for PDFs that have multiple of the same key

2023-11-17 Thread Cassandra Xia (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17787339#comment-17787339
 ] 

Cassandra Xia commented on TIKA-4171:
-

Reattaching the screenshot! !screenshot.png!

> Tika server only returns last value for PDFs that have multiple of the same 
> key
> ---
>
> Key: TIKA-4171
> URL: https://issues.apache.org/jira/browse/TIKA-4171
> Project: Tika
>  Issue Type: Bug
>  Components: tika-server
>Reporter: Cassandra Xia
>Priority: Major
> Attachments: 20230801-5207_QF20-270 East River Solar Form 556 recert 
> FINAL.pdf, screenshot.png
>
>
> Thanks for the great work on Tika server, it is the only OSS that can handle 
> Adobe's protected form format that FERC uses. 
> One problem that I'm hitting is that the FERC form that I am parsing has 
> multiple values for the same key name, e.g. in the screenshot below line 1-7 
> all have the same key name. When Tika Server parses this PDF, it only returns 
> the value in row 7 (losing the previous 6 values).
> My hunch is that somewhere in Tika Server, the values are getting stored in 
> some dictionary object, so the final value is the only survivor. Would it be 
> possible to return the extra values as a list from Tika Server? 
> Example PDF attached - thank you for taking a look!
> !https://mail.google.com/mail/u/0?ui=2&ik=ee87dc4bd1&attid=0.0.7&permmsgid=msg-f:1782641700487887488&th=18bd372e8760fa80&view=fimg&fur=ip&sz=s0-l75-ft&attbid=ANGjdJ9qEkw6kZ9yBDfMBOUuvFB1Tk8Pti0rRvReEq-eWUoJQxLA6rZ0TQvWCsKUySaDPjjrSi-IiyKseDYpFGzF44A3iSaFw9sOanoBdFMNEZciDnaGhsUFvLSIH_0&disp=emb&realattid=ii_lmdun7ff6!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4171) Tika server only returns last value for PDFs that have multiple of the same key

2023-11-17 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17787376#comment-17787376
 ] 

Tim Allison commented on TIKA-4171:
---

Y, we were overwriting key values in a map in our XFAExtractor.  How's the 
attached look?

> Tika server only returns last value for PDFs that have multiple of the same 
> key
> ---
>
> Key: TIKA-4171
> URL: https://issues.apache.org/jira/browse/TIKA-4171
> Project: Tika
>  Issue Type: Bug
>  Components: tika-server
>Reporter: Cassandra Xia
>Priority: Major
> Attachments: 20230801-5207_QF20-270 East River Solar Form 556 recert 
> FINAL.pdf, example-output.txt, screenshot.png
>
>
> Thanks for the great work on Tika server, it is the only OSS that can handle 
> Adobe's protected form format that FERC uses. 
> One problem that I'm hitting is that the FERC form that I am parsing has 
> multiple values for the same key name, e.g. in the screenshot below line 1-7 
> all have the same key name. When Tika Server parses this PDF, it only returns 
> the value in row 7 (losing the previous 6 values).
> My hunch is that somewhere in Tika Server, the values are getting stored in 
> some dictionary object, so the final value is the only survivor. Would it be 
> possible to return the extra values as a list from Tika Server? 
> Example PDF attached - thank you for taking a look!
> !https://mail.google.com/mail/u/0?ui=2&ik=ee87dc4bd1&attid=0.0.7&permmsgid=msg-f:1782641700487887488&th=18bd372e8760fa80&view=fimg&fur=ip&sz=s0-l75-ft&attbid=ANGjdJ9qEkw6kZ9yBDfMBOUuvFB1Tk8Pti0rRvReEq-eWUoJQxLA6rZ0TQvWCsKUySaDPjjrSi-IiyKseDYpFGzF44A3iSaFw9sOanoBdFMNEZciDnaGhsUFvLSIH_0&disp=emb&realattid=ii_lmdun7ff6!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4171) Tika server only returns last value for PDFs that have multiple of the same key

2023-11-17 Thread Cassandra Xia (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17787379#comment-17787379
 ] 

Cassandra Xia commented on TIKA-4171:
-

The attached looks perfect!! Thank you for the incredibly speedy response 
[~tallison] !! 

> Tika server only returns last value for PDFs that have multiple of the same 
> key
> ---
>
> Key: TIKA-4171
> URL: https://issues.apache.org/jira/browse/TIKA-4171
> Project: Tika
>  Issue Type: Bug
>  Components: tika-server
>Reporter: Cassandra Xia
>Priority: Major
> Attachments: 20230801-5207_QF20-270 East River Solar Form 556 recert 
> FINAL.pdf, example-output.txt, screenshot.png
>
>
> Thanks for the great work on Tika server, it is the only OSS that can handle 
> Adobe's protected form format that FERC uses. 
> One problem that I'm hitting is that the FERC form that I am parsing has 
> multiple values for the same key name, e.g. in the screenshot below line 1-7 
> all have the same key name. When Tika Server parses this PDF, it only returns 
> the value in row 7 (losing the previous 6 values).
> My hunch is that somewhere in Tika Server, the values are getting stored in 
> some dictionary object, so the final value is the only survivor. Would it be 
> possible to return the extra values as a list from Tika Server? 
> Example PDF attached - thank you for taking a look!
> !https://mail.google.com/mail/u/0?ui=2&ik=ee87dc4bd1&attid=0.0.7&permmsgid=msg-f:1782641700487887488&th=18bd372e8760fa80&view=fimg&fur=ip&sz=s0-l75-ft&attbid=ANGjdJ9qEkw6kZ9yBDfMBOUuvFB1Tk8Pti0rRvReEq-eWUoJQxLA6rZ0TQvWCsKUySaDPjjrSi-IiyKseDYpFGzF44A3iSaFw9sOanoBdFMNEZciDnaGhsUFvLSIH_0&disp=emb&realattid=ii_lmdun7ff6!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4171) Tika server only returns last value for PDFs that have multiple of the same key

2023-11-17 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17787380#comment-17787380
 ] 

ASF GitHub Bot commented on TIKA-4171:
--

tballison opened a new pull request, #1460:
URL: https://github.com/apache/tika/pull/1460

   
   
   Thanks for your contribution to [Apache Tika](https://tika.apache.org/)! 
Your help is appreciated!
   
   Before opening the pull request, please verify that
   * there is an open issue on the [Tika issue 
tracker](https://issues.apache.org/jira/projects/TIKA) which describes the 
problem or the improvement. We cannot accept pull requests without an issue 
because the change wouldn't be listed in the release notes.
   * the issue ID (`TIKA-`)
 - is referenced in the title of the pull request
 - and placed in front of your commit messages surrounded by square 
brackets (`[TIKA-] Issue or pull request title`)
   * commits are squashed into a single one (or few commits for larger changes)
   * Tika is successfully built and unit tests pass by running `mvn clean test`
   * there should be no conflicts when merging the pull request branch into the 
*recent* `main` branch. If there are conflicts, please try to rebase the pull 
request branch on top of a freshly pulled `main` branch
   * if you add new module that downstream users will depend upon add it to 
relevant group in `tika-bom/pom.xml`.
   
   We will be able to faster integrate your pull request if these conditions 
are met. If you have any questions how to fix your problem or about using Tika 
in general, please sign up for the [Tika mailing 
list](http://tika.apache.org/mail-lists.html). Thanks!
   




> Tika server only returns last value for PDFs that have multiple of the same 
> key
> ---
>
> Key: TIKA-4171
> URL: https://issues.apache.org/jira/browse/TIKA-4171
> Project: Tika
>  Issue Type: Bug
>  Components: tika-server
>Reporter: Cassandra Xia
>Priority: Major
> Attachments: 20230801-5207_QF20-270 East River Solar Form 556 recert 
> FINAL.pdf, example-output.txt, screenshot.png
>
>
> Thanks for the great work on Tika server, it is the only OSS that can handle 
> Adobe's protected form format that FERC uses. 
> One problem that I'm hitting is that the FERC form that I am parsing has 
> multiple values for the same key name, e.g. in the screenshot below line 1-7 
> all have the same key name. When Tika Server parses this PDF, it only returns 
> the value in row 7 (losing the previous 6 values).
> My hunch is that somewhere in Tika Server, the values are getting stored in 
> some dictionary object, so the final value is the only survivor. Would it be 
> possible to return the extra values as a list from Tika Server? 
> Example PDF attached - thank you for taking a look!
> !https://mail.google.com/mail/u/0?ui=2&ik=ee87dc4bd1&attid=0.0.7&permmsgid=msg-f:1782641700487887488&th=18bd372e8760fa80&view=fimg&fur=ip&sz=s0-l75-ft&attbid=ANGjdJ9qEkw6kZ9yBDfMBOUuvFB1Tk8Pti0rRvReEq-eWUoJQxLA6rZ0TQvWCsKUySaDPjjrSi-IiyKseDYpFGzF44A3iSaFw9sOanoBdFMNEZciDnaGhsUFvLSIH_0&disp=emb&realattid=ii_lmdun7ff6!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4171) Tika server only returns last value for PDFs that have multiple of the same key

2023-11-17 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17787390#comment-17787390
 ] 

ASF GitHub Bot commented on TIKA-4171:
--

tballison merged PR #1460:
URL: https://github.com/apache/tika/pull/1460




> Tika server only returns last value for PDFs that have multiple of the same 
> key
> ---
>
> Key: TIKA-4171
> URL: https://issues.apache.org/jira/browse/TIKA-4171
> Project: Tika
>  Issue Type: Bug
>  Components: tika-server
>Reporter: Cassandra Xia
>Priority: Major
> Attachments: 20230801-5207_QF20-270 East River Solar Form 556 recert 
> FINAL.pdf, example-output.txt, screenshot.png
>
>
> Thanks for the great work on Tika server, it is the only OSS that can handle 
> Adobe's protected form format that FERC uses. 
> One problem that I'm hitting is that the FERC form that I am parsing has 
> multiple values for the same key name, e.g. in the screenshot below line 1-7 
> all have the same key name. When Tika Server parses this PDF, it only returns 
> the value in row 7 (losing the previous 6 values).
> My hunch is that somewhere in Tika Server, the values are getting stored in 
> some dictionary object, so the final value is the only survivor. Would it be 
> possible to return the extra values as a list from Tika Server? 
> Example PDF attached - thank you for taking a look!
> !https://mail.google.com/mail/u/0?ui=2&ik=ee87dc4bd1&attid=0.0.7&permmsgid=msg-f:1782641700487887488&th=18bd372e8760fa80&view=fimg&fur=ip&sz=s0-l75-ft&attbid=ANGjdJ9qEkw6kZ9yBDfMBOUuvFB1Tk8Pti0rRvReEq-eWUoJQxLA6rZ0TQvWCsKUySaDPjjrSi-IiyKseDYpFGzF44A3iSaFw9sOanoBdFMNEZciDnaGhsUFvLSIH_0&disp=emb&realattid=ii_lmdun7ff6!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4171) Tika server only returns last value for PDFs that have multiple of the same key

2023-11-17 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17787426#comment-17787426
 ] 

Hudson commented on TIKA-4171:
--

SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk11 #1396 (See 
[https://ci-builds.apache.org/job/Tika/job/tika-main-jdk11/1396/])
TIKA-4171 (#1460) (github: 
[https://github.com/apache/tika/commit/4684912a9e3cc15c02a7654f4e6f8f65003cf510])
* (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java
* (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/XFAExtractor.java


> Tika server only returns last value for PDFs that have multiple of the same 
> key
> ---
>
> Key: TIKA-4171
> URL: https://issues.apache.org/jira/browse/TIKA-4171
> Project: Tika
>  Issue Type: Bug
>  Components: tika-server
>Reporter: Cassandra Xia
>Priority: Major
> Fix For: 3.0.0-BETA, 2.9.2
>
> Attachments: 20230801-5207_QF20-270 East River Solar Form 556 recert 
> FINAL.pdf, example-output.txt, screenshot.png
>
>
> Thanks for the great work on Tika server, it is the only OSS that can handle 
> Adobe's protected form format that FERC uses. 
> One problem that I'm hitting is that the FERC form that I am parsing has 
> multiple values for the same key name, e.g. in the screenshot below line 1-7 
> all have the same key name. When Tika Server parses this PDF, it only returns 
> the value in row 7 (losing the previous 6 values).
> My hunch is that somewhere in Tika Server, the values are getting stored in 
> some dictionary object, so the final value is the only survivor. Would it be 
> possible to return the extra values as a list from Tika Server? 
> Example PDF attached - thank you for taking a look!
> !https://mail.google.com/mail/u/0?ui=2&ik=ee87dc4bd1&attid=0.0.7&permmsgid=msg-f:1782641700487887488&th=18bd372e8760fa80&view=fimg&fur=ip&sz=s0-l75-ft&attbid=ANGjdJ9qEkw6kZ9yBDfMBOUuvFB1Tk8Pti0rRvReEq-eWUoJQxLA6rZ0TQvWCsKUySaDPjjrSi-IiyKseDYpFGzF44A3iSaFw9sOanoBdFMNEZciDnaGhsUFvLSIH_0&disp=emb&realattid=ii_lmdun7ff6!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4171) Tika server only returns last value for PDFs that have multiple of the same key

2024-03-23 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17830110#comment-17830110
 ] 

Tilman Hausherr commented on TIKA-4171:
---

We have a regression with the file [^876503.pdf] in the XFAExtractor class. 
What happens is that {{displayFieldName}} is now lost if {{fieldValues}} is 
empty. Because of that, the text "Enter the full name of the conveying party or 
parties" is missing for field the "conname1".

I'm not saying that this is wrong, I just wonder if this is intended.

> Tika server only returns last value for PDFs that have multiple of the same 
> key
> ---
>
> Key: TIKA-4171
> URL: https://issues.apache.org/jira/browse/TIKA-4171
> Project: Tika
>  Issue Type: Bug
>  Components: tika-server
>Reporter: Cassandra Xia
>Priority: Major
> Fix For: 3.0.0-BETA, 2.9.2
>
> Attachments: 20230801-5207_QF20-270 East River Solar Form 556 recert 
> FINAL.pdf, 876503.pdf, example-output.txt, screenshot.png
>
>
> Thanks for the great work on Tika server, it is the only OSS that can handle 
> Adobe's protected form format that FERC uses. 
> One problem that I'm hitting is that the FERC form that I am parsing has 
> multiple values for the same key name, e.g. in the screenshot below line 1-7 
> all have the same key name. When Tika Server parses this PDF, it only returns 
> the value in row 7 (losing the previous 6 values).
> My hunch is that somewhere in Tika Server, the values are getting stored in 
> some dictionary object, so the final value is the only survivor. Would it be 
> possible to return the extra values as a list from Tika Server? 
> Example PDF attached - thank you for taking a look!
> !https://mail.google.com/mail/u/0?ui=2&ik=ee87dc4bd1&attid=0.0.7&permmsgid=msg-f:1782641700487887488&th=18bd372e8760fa80&view=fimg&fur=ip&sz=s0-l75-ft&attbid=ANGjdJ9qEkw6kZ9yBDfMBOUuvFB1Tk8Pti0rRvReEq-eWUoJQxLA6rZ0TQvWCsKUySaDPjjrSi-IiyKseDYpFGzF44A3iSaFw9sOanoBdFMNEZciDnaGhsUFvLSIH_0&disp=emb&realattid=ii_lmdun7ff6!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4171) Tika server only returns last value for PDFs that have multiple of the same key

2024-03-23 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17830113#comment-17830113
 ] 

Tilman Hausherr commented on TIKA-4171:
---

Proposed change: add these 3 lines before the last one in this code segment
{code:java}
if (fieldValues.length == 0) {
fieldValues = new String[]{""};
}
for (String fieldValue : fieldValues) {
{code}
This is the result of the file from testXFAExtractionBasic 
(testPDF_XFA_govdocs1_258578.pdf) which now has 27 fields instead of 24:  
[^testPDF_XFA_govdocs1_258578.pdf.html] 

> Tika server only returns last value for PDFs that have multiple of the same 
> key
> ---
>
> Key: TIKA-4171
> URL: https://issues.apache.org/jira/browse/TIKA-4171
> Project: Tika
>  Issue Type: Bug
>  Components: tika-server
>Reporter: Cassandra Xia
>Priority: Major
> Fix For: 3.0.0-BETA, 2.9.2
>
> Attachments: 20230801-5207_QF20-270 East River Solar Form 556 recert 
> FINAL.pdf, 876503.pdf, example-output.txt, screenshot.png, 
> testPDF_XFA_govdocs1_258578.pdf.html
>
>
> Thanks for the great work on Tika server, it is the only OSS that can handle 
> Adobe's protected form format that FERC uses. 
> One problem that I'm hitting is that the FERC form that I am parsing has 
> multiple values for the same key name, e.g. in the screenshot below line 1-7 
> all have the same key name. When Tika Server parses this PDF, it only returns 
> the value in row 7 (losing the previous 6 values).
> My hunch is that somewhere in Tika Server, the values are getting stored in 
> some dictionary object, so the final value is the only survivor. Would it be 
> possible to return the extra values as a list from Tika Server? 
> Example PDF attached - thank you for taking a look!
> !https://mail.google.com/mail/u/0?ui=2&ik=ee87dc4bd1&attid=0.0.7&permmsgid=msg-f:1782641700487887488&th=18bd372e8760fa80&view=fimg&fur=ip&sz=s0-l75-ft&attbid=ANGjdJ9qEkw6kZ9yBDfMBOUuvFB1Tk8Pti0rRvReEq-eWUoJQxLA6rZ0TQvWCsKUySaDPjjrSi-IiyKseDYpFGzF44A3iSaFw9sOanoBdFMNEZciDnaGhsUFvLSIH_0&disp=emb&realattid=ii_lmdun7ff6!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4171) Tika server only returns last value for PDFs that have multiple of the same key

2024-03-23 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17830147#comment-17830147
 ] 

Tim Allison commented on TIKA-4171:
---

Thank you, [~tilman]! I'll make the update before rerunning the regression 
tests.

> Tika server only returns last value for PDFs that have multiple of the same 
> key
> ---
>
> Key: TIKA-4171
> URL: https://issues.apache.org/jira/browse/TIKA-4171
> Project: Tika
>  Issue Type: Bug
>  Components: tika-server
>Reporter: Cassandra Xia
>Priority: Major
> Fix For: 3.0.0-BETA, 2.9.2
>
> Attachments: 20230801-5207_QF20-270 East River Solar Form 556 recert 
> FINAL.pdf, 876503.pdf, example-output.txt, screenshot.png, 
> testPDF_XFA_govdocs1_258578.pdf.html
>
>
> Thanks for the great work on Tika server, it is the only OSS that can handle 
> Adobe's protected form format that FERC uses. 
> One problem that I'm hitting is that the FERC form that I am parsing has 
> multiple values for the same key name, e.g. in the screenshot below line 1-7 
> all have the same key name. When Tika Server parses this PDF, it only returns 
> the value in row 7 (losing the previous 6 values).
> My hunch is that somewhere in Tika Server, the values are getting stored in 
> some dictionary object, so the final value is the only survivor. Would it be 
> possible to return the extra values as a list from Tika Server? 
> Example PDF attached - thank you for taking a look!
> !https://mail.google.com/mail/u/0?ui=2&ik=ee87dc4bd1&attid=0.0.7&permmsgid=msg-f:1782641700487887488&th=18bd372e8760fa80&view=fimg&fur=ip&sz=s0-l75-ft&attbid=ANGjdJ9qEkw6kZ9yBDfMBOUuvFB1Tk8Pti0rRvReEq-eWUoJQxLA6rZ0TQvWCsKUySaDPjjrSi-IiyKseDYpFGzF44A3iSaFw9sOanoBdFMNEZciDnaGhsUFvLSIH_0&disp=emb&realattid=ii_lmdun7ff6!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4171) Tika server only returns last value for PDFs that have multiple of the same key

2024-03-23 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17830151#comment-17830151
 ] 

ASF GitHub Bot commented on TIKA-4171:
--

tballison opened a new pull request, #1678:
URL: https://github.com/apache/tika/pull/1678

   
   
   Thanks for your contribution to [Apache Tika](https://tika.apache.org/)! 
Your help is appreciated!
   
   Before opening the pull request, please verify that
   * there is an open issue on the [Tika issue 
tracker](https://issues.apache.org/jira/projects/TIKA) which describes the 
problem or the improvement. We cannot accept pull requests without an issue 
because the change wouldn't be listed in the release notes.
   * the issue ID (`TIKA-`)
 - is referenced in the title of the pull request
 - and placed in front of your commit messages surrounded by square 
brackets (`[TIKA-] Issue or pull request title`)
   * commits are squashed into a single one (or few commits for larger changes)
   * Tika is successfully built and unit tests pass by running `mvn clean test`
   * there should be no conflicts when merging the pull request branch into the 
*recent* `main` branch. If there are conflicts, please try to rebase the pull 
request branch on top of a freshly pulled `main` branch
   * if you add new module that downstream users will depend upon add it to 
relevant group in `tika-bom/pom.xml`.
   
   We will be able to faster integrate your pull request if these conditions 
are met. If you have any questions how to fix your problem or about using Tika 
in general, please sign up for the [Tika mailing 
list](http://tika.apache.org/mail-lists.html). Thanks!
   




> Tika server only returns last value for PDFs that have multiple of the same 
> key
> ---
>
> Key: TIKA-4171
> URL: https://issues.apache.org/jira/browse/TIKA-4171
> Project: Tika
>  Issue Type: Bug
>  Components: tika-server
>Reporter: Cassandra Xia
>Priority: Major
> Fix For: 3.0.0-BETA, 2.9.2
>
> Attachments: 20230801-5207_QF20-270 East River Solar Form 556 recert 
> FINAL.pdf, 876503.pdf, example-output.txt, screenshot.png, 
> testPDF_XFA_govdocs1_258578.pdf.html
>
>
> Thanks for the great work on Tika server, it is the only OSS that can handle 
> Adobe's protected form format that FERC uses. 
> One problem that I'm hitting is that the FERC form that I am parsing has 
> multiple values for the same key name, e.g. in the screenshot below line 1-7 
> all have the same key name. When Tika Server parses this PDF, it only returns 
> the value in row 7 (losing the previous 6 values).
> My hunch is that somewhere in Tika Server, the values are getting stored in 
> some dictionary object, so the final value is the only survivor. Would it be 
> possible to return the extra values as a list from Tika Server? 
> Example PDF attached - thank you for taking a look!
> !https://mail.google.com/mail/u/0?ui=2&ik=ee87dc4bd1&attid=0.0.7&permmsgid=msg-f:1782641700487887488&th=18bd372e8760fa80&view=fimg&fur=ip&sz=s0-l75-ft&attbid=ANGjdJ9qEkw6kZ9yBDfMBOUuvFB1Tk8Pti0rRvReEq-eWUoJQxLA6rZ0TQvWCsKUySaDPjjrSi-IiyKseDYpFGzF44A3iSaFw9sOanoBdFMNEZciDnaGhsUFvLSIH_0&disp=emb&realattid=ii_lmdun7ff6!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4171) Tika server only returns last value for PDFs that have multiple of the same key

2024-03-23 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17830153#comment-17830153
 ] 

ASF GitHub Bot commented on TIKA-4171:
--

tballison opened a new pull request, #1679:
URL: https://github.com/apache/tika/pull/1679

   
   
   Thanks for your contribution to [Apache Tika](https://tika.apache.org/)! 
Your help is appreciated!
   
   Before opening the pull request, please verify that
   * there is an open issue on the [Tika issue 
tracker](https://issues.apache.org/jira/projects/TIKA) which describes the 
problem or the improvement. We cannot accept pull requests without an issue 
because the change wouldn't be listed in the release notes.
   * the issue ID (`TIKA-`)
 - is referenced in the title of the pull request
 - and placed in front of your commit messages surrounded by square 
brackets (`[TIKA-] Issue or pull request title`)
   * commits are squashed into a single one (or few commits for larger changes)
   * Tika is successfully built and unit tests pass by running `mvn clean test`
   * there should be no conflicts when merging the pull request branch into the 
*recent* `main` branch. If there are conflicts, please try to rebase the pull 
request branch on top of a freshly pulled `main` branch
   * if you add new module that downstream users will depend upon add it to 
relevant group in `tika-bom/pom.xml`.
   
   We will be able to faster integrate your pull request if these conditions 
are met. If you have any questions how to fix your problem or about using Tika 
in general, please sign up for the [Tika mailing 
list](http://tika.apache.org/mail-lists.html). Thanks!
   




> Tika server only returns last value for PDFs that have multiple of the same 
> key
> ---
>
> Key: TIKA-4171
> URL: https://issues.apache.org/jira/browse/TIKA-4171
> Project: Tika
>  Issue Type: Bug
>  Components: tika-server
>Reporter: Cassandra Xia
>Priority: Major
> Fix For: 3.0.0-BETA, 2.9.2
>
> Attachments: 20230801-5207_QF20-270 East River Solar Form 556 recert 
> FINAL.pdf, 876503.pdf, example-output.txt, screenshot.png, 
> testPDF_XFA_govdocs1_258578.pdf.html
>
>
> Thanks for the great work on Tika server, it is the only OSS that can handle 
> Adobe's protected form format that FERC uses. 
> One problem that I'm hitting is that the FERC form that I am parsing has 
> multiple values for the same key name, e.g. in the screenshot below line 1-7 
> all have the same key name. When Tika Server parses this PDF, it only returns 
> the value in row 7 (losing the previous 6 values).
> My hunch is that somewhere in Tika Server, the values are getting stored in 
> some dictionary object, so the final value is the only survivor. Would it be 
> possible to return the extra values as a list from Tika Server? 
> Example PDF attached - thank you for taking a look!
> !https://mail.google.com/mail/u/0?ui=2&ik=ee87dc4bd1&attid=0.0.7&permmsgid=msg-f:1782641700487887488&th=18bd372e8760fa80&view=fimg&fur=ip&sz=s0-l75-ft&attbid=ANGjdJ9qEkw6kZ9yBDfMBOUuvFB1Tk8Pti0rRvReEq-eWUoJQxLA6rZ0TQvWCsKUySaDPjjrSi-IiyKseDYpFGzF44A3iSaFw9sOanoBdFMNEZciDnaGhsUFvLSIH_0&disp=emb&realattid=ii_lmdun7ff6!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4171) Tika server only returns last value for PDFs that have multiple of the same key

2024-03-23 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17830152#comment-17830152
 ] 

ASF GitHub Bot commented on TIKA-4171:
--

tballison closed pull request #1678: TIKA-4171 

> Tika server only returns last value for PDFs that have multiple of the same 
> key
> ---
>
> Key: TIKA-4171
> URL: https://issues.apache.org/jira/browse/TIKA-4171
> Project: Tika
>  Issue Type: Bug
>  Components: tika-server
>Reporter: Cassandra Xia
>Priority: Major
> Fix For: 3.0.0-BETA, 2.9.2
>
> Attachments: 20230801-5207_QF20-270 East River Solar Form 556 recert 
> FINAL.pdf, 876503.pdf, example-output.txt, screenshot.png, 
> testPDF_XFA_govdocs1_258578.pdf.html
>
>
> Thanks for the great work on Tika server, it is the only OSS that can handle 
> Adobe's protected form format that FERC uses. 
> One problem that I'm hitting is that the FERC form that I am parsing has 
> multiple values for the same key name, e.g. in the screenshot below line 1-7 
> all have the same key name. When Tika Server parses this PDF, it only returns 
> the value in row 7 (losing the previous 6 values).
> My hunch is that somewhere in Tika Server, the values are getting stored in 
> some dictionary object, so the final value is the only survivor. Would it be 
> possible to return the extra values as a list from Tika Server? 
> Example PDF attached - thank you for taking a look!
> !https://mail.google.com/mail/u/0?ui=2&ik=ee87dc4bd1&attid=0.0.7&permmsgid=msg-f:1782641700487887488&th=18bd372e8760fa80&view=fimg&fur=ip&sz=s0-l75-ft&attbid=ANGjdJ9qEkw6kZ9yBDfMBOUuvFB1Tk8Pti0rRvReEq-eWUoJQxLA6rZ0TQvWCsKUySaDPjjrSi-IiyKseDYpFGzF44A3iSaFw9sOanoBdFMNEZciDnaGhsUFvLSIH_0&disp=emb&realattid=ii_lmdun7ff6!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4171) Tika server only returns last value for PDFs that have multiple of the same key

2024-03-25 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17830572#comment-17830572
 ] 

ASF GitHub Bot commented on TIKA-4171:
--

tballison merged PR #1679:
URL: https://github.com/apache/tika/pull/1679




> Tika server only returns last value for PDFs that have multiple of the same 
> key
> ---
>
> Key: TIKA-4171
> URL: https://issues.apache.org/jira/browse/TIKA-4171
> Project: Tika
>  Issue Type: Bug
>  Components: tika-server
>Reporter: Cassandra Xia
>Priority: Major
> Fix For: 3.0.0-BETA, 2.9.2
>
> Attachments: 20230801-5207_QF20-270 East River Solar Form 556 recert 
> FINAL.pdf, 876503.pdf, example-output.txt, screenshot.png, 
> testPDF_XFA_govdocs1_258578.pdf.html
>
>
> Thanks for the great work on Tika server, it is the only OSS that can handle 
> Adobe's protected form format that FERC uses. 
> One problem that I'm hitting is that the FERC form that I am parsing has 
> multiple values for the same key name, e.g. in the screenshot below line 1-7 
> all have the same key name. When Tika Server parses this PDF, it only returns 
> the value in row 7 (losing the previous 6 values).
> My hunch is that somewhere in Tika Server, the values are getting stored in 
> some dictionary object, so the final value is the only survivor. Would it be 
> possible to return the extra values as a list from Tika Server? 
> Example PDF attached - thank you for taking a look!
> !https://mail.google.com/mail/u/0?ui=2&ik=ee87dc4bd1&attid=0.0.7&permmsgid=msg-f:1782641700487887488&th=18bd372e8760fa80&view=fimg&fur=ip&sz=s0-l75-ft&attbid=ANGjdJ9qEkw6kZ9yBDfMBOUuvFB1Tk8Pti0rRvReEq-eWUoJQxLA6rZ0TQvWCsKUySaDPjjrSi-IiyKseDYpFGzF44A3iSaFw9sOanoBdFMNEZciDnaGhsUFvLSIH_0&disp=emb&realattid=ii_lmdun7ff6!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4171) Tika server only returns last value for PDFs that have multiple of the same key

2024-03-25 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17830677#comment-17830677
 ] 

Hudson commented on TIKA-4171:
--

SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk11 #1571 (See 
[https://ci-builds.apache.org/job/Tika/job/tika-main-jdk11/1571/])
TIKA-4171 -- fix regression when field names are missing in the XFAExtractor 
(#1679) (tallison: 
[https://github.com/apache/tika/commit/b9ab4813ed16f53a0bf3aa61883da2cebdf7f3a1])
* (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java
* (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/XFAExtractor.java


> Tika server only returns last value for PDFs that have multiple of the same 
> key
> ---
>
> Key: TIKA-4171
> URL: https://issues.apache.org/jira/browse/TIKA-4171
> Project: Tika
>  Issue Type: Bug
>  Components: tika-server
>Reporter: Cassandra Xia
>Priority: Major
> Fix For: 3.0.0-BETA, 2.9.2
>
> Attachments: 20230801-5207_QF20-270 East River Solar Form 556 recert 
> FINAL.pdf, 876503.pdf, example-output.txt, screenshot.png, 
> testPDF_XFA_govdocs1_258578.pdf.html
>
>
> Thanks for the great work on Tika server, it is the only OSS that can handle 
> Adobe's protected form format that FERC uses. 
> One problem that I'm hitting is that the FERC form that I am parsing has 
> multiple values for the same key name, e.g. in the screenshot below line 1-7 
> all have the same key name. When Tika Server parses this PDF, it only returns 
> the value in row 7 (losing the previous 6 values).
> My hunch is that somewhere in Tika Server, the values are getting stored in 
> some dictionary object, so the final value is the only survivor. Would it be 
> possible to return the extra values as a list from Tika Server? 
> Example PDF attached - thank you for taking a look!
> !https://mail.google.com/mail/u/0?ui=2&ik=ee87dc4bd1&attid=0.0.7&permmsgid=msg-f:1782641700487887488&th=18bd372e8760fa80&view=fimg&fur=ip&sz=s0-l75-ft&attbid=ANGjdJ9qEkw6kZ9yBDfMBOUuvFB1Tk8Pti0rRvReEq-eWUoJQxLA6rZ0TQvWCsKUySaDPjjrSi-IiyKseDYpFGzF44A3iSaFw9sOanoBdFMNEZciDnaGhsUFvLSIH_0&disp=emb&realattid=ii_lmdun7ff6!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)