[jira] [Commented] (TIKA-3544) Extraction of long sequences of digits from Excel spreadsheets using Tika 1.20 doesn’t yield the expected results

2021-09-10 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17413470#comment-17413470
 ] 

Tilman Hausherr commented on TIKA-3544:
---

No. Use strings, that is the issue.

> Extraction of long sequences of digits from Excel spreadsheets using Tika 
> 1.20 doesn’t yield the expected results
> -
>
> Key: TIKA-3544
> URL: https://issues.apache.org/jira/browse/TIKA-3544
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.20
>Reporter: Jitin Jindal
>Priority: Major
> Attachments: Credit Card Numbers.xlsx
>
>
> If an Excel spreadsheet contains a long sequence of digits, such as a credit 
> card number, Tika 1.13 will emit the said sequence in scientific notation.
> For example, the credit card number “6011799905775830” is extracted from the 
> attached spreadsheet as 6.480195344642784E15, which clearly is not the 
> desired output.
> I think the impact of this issue is significant. There’s plenty of 
> information that can no longer be reliably extracted from spreadsheets. Think 
> credit card numbers, telephone numbers and product identifiers to name a few.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (TIKA-3544) Extraction of long sequences of digits from Excel spreadsheets using Tika 1.20 doesn’t yield the expected results

2021-09-10 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr closed TIKA-3544.
-
Resolution: Won't Fix



[jira] [Commented] (TIKA-3544) Extraction of long sequences of digits from Excel spreadsheets using Tika 1.20 doesn’t yield the expected results

2021-09-10 Thread Jitin Jindal (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17413469#comment-17413469
 ] 

Jitin Jindal commented on TIKA-3544:


So we aren’t fixing this issue?



[jira] [Commented] (TIKA-3544) Extraction of long sequences of digits from Excel spreadsheets using Tika 1.20 doesn’t yield the expected results

2021-09-08 Thread Dave Fisher (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17412084#comment-17412084
 ] 

Dave Fisher commented on TIKA-3544:
---

The OP's source 
[https://getcreditcardnumbers.com|https://getcreditcardnumbers.com/] produces 
invalid numbers. In JSON and JavaScript, numbers are always double-precision 
floating point.

See [https://www.w3schools.com/js/js_numbers.asp]

 



[jira] [Commented] (TIKA-3544) Extraction of long sequences of digits from Excel spreadsheets using Tika 1.20 doesn’t yield the expected results

2021-09-08 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17412072#comment-17412072
 ] 

Tim Allison commented on TIKA-3544:
---

>Use strings.

+1



[jira] [Commented] (TIKA-3544) Extraction of long sequences of digits from Excel spreadsheets using Tika 1.20 doesn’t yield the expected results

2021-09-08 Thread Dave Fisher (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17412035#comment-17412035
 ] 

Dave Fisher commented on TIKA-3544:
---

See [https://en.wikipedia.org/wiki/Double-precision_floating-point_format]

A double can only keep 15 to 17 significant decimal digits of precision. I 
think you have to leave things at 15 digits or do a more precise analysis, 
which would be slower.

There is a reason why floating point has an error term called epsilon.

Credit card numbers are strings of numeric characters. Use strings, just as 
you have to for US ZIP codes because of the leading '0'.
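
To make the two points above concrete, here is a minimal Java sketch. It is not Tika or POI code; the class and method names (DoubleLimits, survivesDouble, roundTripAsNumber) are made up for illustration. It relies only on the fact that a double has a 53-bit significand, so whole numbers are exact up to 2^53 (about 9.007e15) and not reliably beyond that, and it shows a leading zero vanishing as soon as a digit string is parsed as a number.

```java
public class DoubleLimits {

    // True if the whole number comes back unchanged after a trip through
    // a double. Only meaningful well below Long.MAX_VALUE, because the
    // cast back to long saturates at the extremes.
    static boolean survivesDouble(long n) {
        return (long) (double) n == n;
    }

    // Parse a digit string as a number and print it again, the way a
    // numeric cell would; leading zeros are lost on the way.
    static String roundTripAsNumber(String digits) {
        return String.valueOf(Long.parseLong(digits));
    }

    public static void main(String[] args) {
        // The 16-digit value from this issue is below 2^53.
        System.out.println(survivesDouble(6480195344642780L));
        // 2^53 + 1 cannot be represented as a double.
        System.out.println(survivesDouble(9007199254740993L));
        // A ZIP code with a leading zero, treated as a number.
        System.out.println(roundTripAsNumber("02134"));
    }
}
```

So a 16-digit number may or may not fit in a double depending on its magnitude, while a string keeps every character regardless of length or leading zeros.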



[jira] [Commented] (TIKA-3544) Extraction of long sequences of digits from Excel spreadsheets using Tika 1.20 doesn’t yield the expected results

2021-09-08 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17412025#comment-17412025
 ] 

Tim Allison commented on TIKA-3544:
---

Oh, this is hilarious, if I type '6480195344542781' (16 digits), Excel 
automatically floors that to '6480195344542780' which means Excel is corrupting 
16 digit credit card numbers that do not happen to end in zero!   

I note that Excel is not rounding; it also floors '6480195344542789' to 
'6480195344542780'

So, yes, we could bump it to 16, but that would be wrong 90% of the time...  I'm 
now inclined to propose that we not do anything here.

Note: This is Excel for Mac (16.52), your mileage may vary.
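
The flooring described above can be imitated in a few lines of Java. This is only a sketch of the observed behaviour, assuming "keep 15 significant digits and zero the rest"; it is not Excel's actual algorithm, and the names FifteenDigitFloor and floorTo15Digits are hypothetical.

```java
import java.math.BigDecimal;
import java.math.MathContext;
import java.math.RoundingMode;

public class FifteenDigitFloor {

    // Keep the first 15 significant digits and replace the rest with
    // zeros (RoundingMode.DOWN truncates instead of rounding).
    static String floorTo15Digits(String digits) {
        MathContext fifteenDown = new MathContext(15, RoundingMode.DOWN);
        return new BigDecimal(digits).round(fifteenDown).toPlainString();
    }

    public static void main(String[] args) {
        // Both 16-digit inputs from the comment above collapse to ...780.
        System.out.println(floorTo15Digits("6480195344542781"));
        System.out.println(floorTo15Digits("6480195344542789"));
        // A 15-digit number is left untouched.
        System.out.println(floorTo15Digits("344850003945824"));
    }
}
```

Of the ten possible final digits of a 16-digit number, only 0 comes through unchanged, which is where the "wrong 90% of the time" estimate comes from.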



[jira] [Comment Edited] (TIKA-3544) Extraction of long sequences of digits from Excel spreadsheets using Tika 1.20 doesn’t yield the expected results

2021-09-08 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17412012#comment-17412012
 ] 

Tim Allison edited comment on TIKA-3544 at 9/8/21, 3:40 PM:


In TIKA-2025 (which is nearly exactly this issue), we added a custom 
TikaExcelDataFormatter that allowed us to inject TikaExcelGeneralFormat.  This 
overrode Excel's and POI's default handling so that up to 15 digits can be 
extracted. 

When I look at the underlying xml of the attached file, 6480195344642780 is, in 
fact, stored there.  If we bump our custom handling to 16 digits this problem 
would be solved _for this file_ and for numbers with 16 digits.

As Tilman and Nick note, though, Excel is really bad for numbers that might 
start with leading zeros, like credit card #s, etc.  You have to be really 
careful to enter them as strings or, better yet, use an actual database.




[jira] [Commented] (TIKA-3544) Extraction of long sequences of digits from Excel spreadsheets using Tika 1.20 doesn’t yield the expected results

2021-09-08 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17412018#comment-17412018
 ] 

Tim Allison commented on TIKA-3544:
---

So, should we bump 15->16?



[jira] [Commented] (TIKA-3544) Extraction of long sequences of digits from Excel spreadsheets using Tika 1.20 doesn’t yield the expected results

2021-09-08 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17412016#comment-17412016
 ] 

Tim Allison commented on TIKA-3544:
---

Yes, I just tried bumping 15->16, and we get this output:

Credit Card Numbers (Source: 
http://www.getcreditcardnumbers.com/)
6480195344642780
30295201231669
30082494556063
344850003945824
358338792630
3587385370593640



[jira] [Commented] (TIKA-3544) Extraction of long sequences of digits from Excel spreadsheets using Tika 1.20 doesn’t yield the expected results

2021-09-08 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17412012#comment-17412012
 ] 

Tim Allison commented on TIKA-3544:
---

In TIKA-2025 (which is nearly exactly this issue), we added a custom 
TikaExcelDataFormatter that allowed us to inject TikaExcelGeneralFormat.  This 
overrode Excel's and POI's default handling so that up to 15 digits can be 
extracted. 

When I look at the underlying xml, 6480195344642780 is, in fact, stored there.  
If we bump our custom handling to 16 digits this problem would be solved _for 
this file_ and for numbers with 16 digits.

As Tilman and Nick note, though, Excel is really bad for numbers that might 
start with leading zeros, like credit card #s, etc.  You have to be really 
careful to enter them as strings or, better yet, use an actual database.



[jira] [Commented] (TIKA-3544) Extraction of long sequences of digits from Excel spreadsheets using Tika 1.20 doesn’t yield the expected results

2021-09-08 Thread Nick Burch (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17411814#comment-17411814
 ] 

Nick Burch commented on TIKA-3544:
--

Apache POI provides the DataFormatter class, which attempts to turn the number 
into a string similar to the one shown in Excel, based on the formatting rules 
applied to the cell. Tika ought to be using that. It doesn't help completely 
if Excel has already thrown away the last few digits, though...



[jira] [Commented] (TIKA-3544) Extraction of long sequences of digits from Excel spreadsheets using Tika 1.20 doesn’t yield the expected results

2021-09-08 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17411806#comment-17411806
 ] 

Tilman Hausherr commented on TIKA-3544:
---

Yeah, it's that crazy. I have a spreadsheet from a client with staff ID numbers. 
These are stored as numbers, so I use Apache POI (and so does Tika) and I have 
to call {{row.getCell(0).getNumericCellValue()}}, which returns a double. Using 
{{getStringCellValue()}} instead throws an IllegalStateException "Cannot get a 
STRING value from a NUMERIC cell".
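
Once the value has arrived as a double, one way to get the digits back without an exponent is to go through BigDecimal. The sketch below uses only the JDK; the class NumericCellText, the method toPlainText and the sample ID are hypothetical, and the double passed in merely stands in for what getNumericCellValue() would return. It cannot restore digits that were already lost before the value was stored.

```java
import java.math.BigDecimal;

public class NumericCellText {

    // Render a double as plain decimal digits. new BigDecimal(double) is
    // exact, so whole numbers print without an exponent or a trailing ".0".
    // Fractional values print their full binary expansion, so this is only
    // suitable for ID-like whole numbers.
    static String toPlainText(double cellValue) {
        return new BigDecimal(cellValue).toPlainString();
    }

    public static void main(String[] args) {
        double staffId = 100234567.0;            // made-up staff ID
        System.out.println(String.valueOf(staffId)); // default double rendering
        System.out.println(toPlainText(staffId));    // plain digits
        System.out.println(toPlainText(6480195344642780d));
    }
}
```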



[jira] [Commented] (TIKA-3544) Extraction of long sequences of digits from Excel spreadsheets using Tika 1.20 doesn’t yield the expected results

2021-09-08 Thread Nick Burch (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17411774#comment-17411774
 ] 

Nick Burch commented on TIKA-3544:
--

You need to be aware that Excel itself only stores numbers-as-numbers with a 
certain amount of precision (~15 digits). Any very long number risks losing 
digits if it is stored as a number in Excel. You need to store such values as 
strings (e.g. with a ' prefix) to avoid data loss.

See 
[https://www.microsoft.com/en-us/microsoft-365/blog/2008/04/10/understanding-floating-point-precision-aka-why-does-excel-give-me-seemingly-wrong-answers/]
 for more info on this from Microsoft, which you may wish to share with the 
people generating your spreadsheets to warn them about the risk of data loss.



[jira] [Commented] (TIKA-3544) Extraction of long sequences of digits from Excel spreadsheets using Tika 1.20 doesn’t yield the expected results

2021-09-07 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17411710#comment-17411710
 ] 

Tilman Hausherr commented on TIKA-3544:
---

It seems to depend on the value:
{noformat}
Payments - Payment Details
Payment Details
Credit Card Numbers (Source: http://www.getcreditcardnumbers.com/)
6,48019534464278E+15
30295201231669
30082494556063
344850003945824
3,5833879263E+15
3,58738537059364E+15

&"Helvetica,Regular"&12&K00&P
http://www.getcreditcardnumbers.com/
{noformat}




[jira] [Comment Edited] (TIKA-3544) Extraction of long sequences of digits from Excel spreadsheets using Tika 1.20 doesn’t yield the expected results

2021-09-07 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17411710#comment-17411710
 ] 

Tilman Hausherr edited comment on TIKA-3544 at 9/8/21, 6:15 AM:


It seems to depend on the value (this output done with 2.1.1):
{noformat}
Payments - Payment Details
Payment Details
Credit Card Numbers (Source: http://www.getcreditcardnumbers.com/)
6,48019534464278E+15
30295201231669
30082494556063
344850003945824
3,5833879263E+15
3,58738537059364E+15

&"Helvetica,Regular"&12&K00&P
http://www.getcreditcardnumbers.com/
{noformat}





[jira] [Updated] (TIKA-3544) Extraction of long sequences of digits from Excel spreadsheets using Tika 1.20 doesn’t yield the expected results

2021-09-07 Thread Jitin Jindal (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jitin Jindal updated TIKA-3544:
---
Attachment: Credit Card Numbers.xlsx



[jira] [Updated] (TIKA-3544) Extraction of long sequences of digits from Excel spreadsheets using Tika 1.20 doesn’t yield the expected results

2021-09-07 Thread Jitin Jindal (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jitin Jindal updated TIKA-3544:
---
Attachment: (was: Book1.xlsx)



[jira] [Updated] (TIKA-3544) Extraction of long sequences of digits from Excel spreadsheets using Tika 1.20 doesn’t yield the expected results

2021-09-07 Thread Jitin Jindal (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jitin Jindal updated TIKA-3544:
---
Attachment: Book1.xlsx



[jira] [Created] (TIKA-3544) Extraction of long sequences of digits from Excel spreadsheets using Tika 1.20 doesn’t yield the expected results

2021-09-07 Thread Jitin Jindal (Jira)
Jitin Jindal created TIKA-3544:
--

 Summary: Extraction of long sequences of digits from Excel 
spreadsheets using Tika 1.20 doesn’t yield the expected results
 Key: TIKA-3544
 URL: https://issues.apache.org/jira/browse/TIKA-3544
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.20
Reporter: Jitin Jindal


If an Excel spreadsheet contains a long sequence of digits, such as a credit 
card number, Tika 1.13 will emit the said sequence in scientific notation.

For example, the credit card number “6011799905775830” is extracted from the 
attached spreadsheet as 6.480195344642784E15, which clearly is not the desired 
output.

I think the impact of this issue is significant. There’s plenty of information 
that can no longer be reliably extracted from spreadsheets. Think credit card 
numbers, telephone numbers and product identifiers to name a few.
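
To make the symptom above concrete, here is a minimal Java sketch. It is not Tika's or POI's actual extraction code, and the class and method names are hypothetical. It assumes the spreadsheet cell was stored as a number, so the digits reach the caller as a `double`. Default double formatting switches to scientific notation for large magnitudes, while `BigDecimal.toPlainString` keeps plain digits for values a double can hold exactly. Storing such identifiers as text cells sidesteps the question entirely.

```java
import java.math.BigDecimal;

final class LongDigitsSketch {
    // Default formatting: Double.toString uses scientific notation
    // for magnitudes of 10^7 and above.
    static String asDefaultString(double value) {
        return Double.toString(value);
    }

    // Plain decimal digits without an exponent. The digits match the
    // spreadsheet only if the double represents the number exactly.
    static String asPlainString(double value) {
        return new BigDecimal(value).toPlainString();
    }

    public static void main(String[] args) {
        // Hypothetical numeric cell value taken from the report above.
        double cell = 6011799905775830d;
        System.out.println("default: " + asDefaultString(cell));
        System.out.println("plain:   " + asPlainString(cell));
    }
}
```

The plain-string route only helps while the value fits a double without rounding; identifiers with leading zeros or more digits still need to be stored as strings.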





[jira] [Resolved] (TIKA-2877) Tika 1.20 suffer from 3 separate CVE vulnerabilities

2019-05-19 Thread Tim Allison (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-2877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-2877.
---
   Resolution: Fixed
 Assignee: Tim Allison
Fix Version/s: 1.21

Will commit updates to site shortly and announce release of 1.21.

> Tika 1.20 suffer from 3 separate CVE vulnerabilities
> 
>
> Key: TIKA-2877
> URL: https://issues.apache.org/jira/browse/TIKA-2877
> Project: Tika
>  Issue Type: Bug
>  Components: app
>Affects Versions: 1.20
> Environment: These are generic issues.
>Reporter: Pat cashman
>Assignee: Tim Allison
>Priority: Critical
> Fix For: 1.21
>
>
> Tika 1.20 third-party dependencies suffer from 3 separate CVE 
> vulnerabilities, outlined below.
> I am aware that these are already included in a separate ticket which deals 
> with the generic problem of outdated 3rd party libraries. 
> [https://issues.apache.org/jira/projects/TIKA/issues/TIKA-2854]
>  At the very least you should update your security page with the details and 
> potentially release 1.21 to correct these issues. 
> [https://tika.apache.org/security.html]
>  
> *a) GUAVA v_17 -> - CVE-2018-10237*
> Unbounded memory allocation in Google Guava 11.0 through 24.x before 24.1.1 
> allows remote attackers to conduct denial of service attacks against servers
> [https://nvd.nist.gov/vuln/detail//CVE-2018-10237]
>  
> *b) jackson-databind v_2.9.7 -> CVE-2018-19362*
> FasterXML jackson-databind 2.x before 2.9.8 might allow attackers to have 
> unspecified impact by leveraging failure to block the jboss-common-core class 
> from polymorphic deserialization.
> [https://nvd.nist.gov/vuln/detail/CVE-2018-19362]
>  
> *c) sqlite-jdbc v_3.25.2 ->CVE-2018-20346*
> SQLite before 3.25.3, when the FTS3 extension is enabled, encounters an 
> integer overflow (and resultant buffer overflow) for FTS3 queries that occur 
> after crafted changes to FTS3 shadow tables, allowing remote attackers to 
> execute arbitrary code by leveraging the ability to run arbitrary SQL 
> statements (such as in certain WebSQL use cases), aka Magellan.
> [https://nvd.nist.gov/vuln/detail/CVE-2018-20346]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TIKA-2877) Tika 1.20 suffer from 3 separate CVE vulnerabilities

2019-05-16 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16841492#comment-16841492
 ] 

Tim Allison commented on TIKA-2877:
---

Voting is underway for 1.21 : 
https://lists.apache.org/thread.html/2c027535156cc6862149490b289552d72ba5a9bff985fb7cce794e21@%3Cdev.tika.apache.org%3E

I can add a new table for dependency vulnerabilities on our security page.  
Thank you.



[jira] [Created] (TIKA-2877) Tika 1.20 suffer from 3 separate CVE vulnerabilities

2019-05-16 Thread Pat cashman (JIRA)
Pat cashman created TIKA-2877:
-

 Summary: Tika 1.20 suffer from 3 separate CVE vulnerabilities
 Key: TIKA-2877
 URL: https://issues.apache.org/jira/browse/TIKA-2877
 Project: Tika
  Issue Type: Bug
  Components: app
Affects Versions: 1.20
 Environment: These are generic issues.
Reporter: Pat cashman


Tika 1.20 third-party dependencies suffer from 3 separate CVE 
vulnerabilities, outlined below.

I am aware that these are already included in a separate ticket which deals 
with the generic problem of outdated 3rd party libraries. 
[https://issues.apache.org/jira/projects/TIKA/issues/TIKA-2854]

 At the very least you should update your security page with the details and 
potentially release 1.21 to correct these issues. 

[https://tika.apache.org/security.html]

 

*a) GUAVA v_17 -> - CVE-2018-10237*

Unbounded memory allocation in Google Guava 11.0 through 24.x before 24.1.1 
allows remote attackers to conduct denial of service attacks against servers

[https://nvd.nist.gov/vuln/detail//CVE-2018-10237]

 

*b) jackson-databind v_2.9.7 -> CVE-2018-19362*

FasterXML jackson-databind 2.x before 2.9.8 might allow attackers to have 
unspecified impact by leveraging failure to block the jboss-common-core class 
from polymorphic deserialization.

[https://nvd.nist.gov/vuln/detail/CVE-2018-19362]

 

*c) sqlite-jdbc v_3.25.2 ->CVE-2018-20346*

SQLite before 3.25.3, when the FTS3 extension is enabled, encounters an integer 
overflow (and resultant buffer overflow) for FTS3 queries that occur after 
crafted changes to FTS3 shadow tables, allowing remote attackers to execute 
arbitrary code by leveraging the ability to run arbitrary SQL statements (such 
as in certain WebSQL use cases), aka Magellan.

[https://nvd.nist.gov/vuln/detail/CVE-2018-20346]
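
The three findings above follow one pattern: the bundled version is lower than the first fixed version named in the advisory. A rough Java sketch of that comparison follows. `VersionCheckSketch`, `compare` and `isBelow` are hypothetical helpers written for illustration, not part of Tika or of any dependency scanner, and the sketch assumes purely numeric, dot-separated version strings.

```java
final class VersionCheckSketch {
    // Compares dot-separated numeric versions part by part;
    // missing trailing parts are treated as 0 (so "17" equals "17.0").
    static int compare(String a, String b) {
        String[] x = a.split("\\.");
        String[] y = b.split("\\.");
        int n = Math.max(x.length, y.length);
        for (int i = 0; i < n; i++) {
            int xi = i < x.length ? Integer.parseInt(x[i]) : 0;
            int yi = i < y.length ? Integer.parseInt(y[i]) : 0;
            if (xi != yi) {
                return Integer.compare(xi, yi);
            }
        }
        return 0;
    }

    // True when the bundled version predates the first fixed version.
    static boolean isBelow(String bundled, String firstFixed) {
        return compare(bundled, firstFixed) < 0;
    }

    public static void main(String[] args) {
        // Version pairs as quoted in the report above.
        System.out.println("guava 17 below 24.1.1: " + isBelow("17", "24.1.1"));
        System.out.println("jackson-databind 2.9.7 below 2.9.8: " + isBelow("2.9.7", "2.9.8"));
        System.out.println("sqlite 3.25.2 below 3.25.3: " + isBelow("3.25.2", "3.25.3"));
    }
}
```

Real version schemes also carry qualifiers such as `-jre` or `-rc1`, which this sketch deliberately does not handle.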





[jira] [Commented] (TIKA-2869) Can't parse pdf in version 1.20 - Pkcs7Parser (DEF length 465542 object truncated by 465479)

2019-05-13 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16838843#comment-16838843
 ] 

Tim Allison commented on TIKA-2869:
---

I doubly confirmed that this file now parses with 1.21-rc1: 
https://lists.apache.org/thread.html/36529c7df113e81ace51301175528120884af73b78edd40764a88cf8@%3Cdev.tika.apache.org%3E



[jira] [Commented] (TIKA-2869) Can't parse pdf in version 1.20 - Pkcs7Parser (DEF length 465542 object truncated by 465479)

2019-05-10 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16837434#comment-16837434
 ] 

Tim Allison commented on TIKA-2869:
---

Fix made on master wasn't merged in {{branch_1x}}: 10d380ae

> Can't parse pdf in version 1.20 - Pkcs7Parser (DEF length 465542 object 
> truncated by 465479)
> 
>
> Key: TIKA-2869
> URL: https://issues.apache.org/jira/browse/TIKA-2869
> Project: Tika
>  Issue Type: Bug
>  Components: app, cli, parser
>Affects Versions: 1.20
> Environment: Windows 10 (1809 - 17763.437)
> Java(TM) SE Runtime Environment (build 1.8.0_121-b13)
> Java HotSpot(TM) Client VM (build 25.121-b13, mixed mode)
>Reporter: Edans Sandes
>Priority: Major
> Attachments: 0001.127_342_5_7955.pdf
>
>
> I could convert the attached pdf using tika-app-1.19.1.jar, but now, in 
> version tika-app-1.20.jar, it stopped working.
> {{java -jar {color:#ff}tika-app-1.20.jar{color} 0001.127_342_5_7955.pdf}}
> mai 10, 2019 11:36:23 AM org.apache.tika.config.InitializableProblemHandler$3 
> handleInitializableProblem
> ADVERTÊNCIA: J2KImageReader not loaded. JPEG2000 files will not be processed.
> See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
> for optional dependencies.
> mai 10, 2019 11:36:23 AM org.apache.tika.config.InitializableProblemHandler$3 
> handleInitializableProblem
> ADVERTÊNCIA: org.xerial's sqlite-jdbc is not loaded.
> Please provide the jar on your classpath to parse sqlite files.
> See tika-parsers/pom.xml for the correct version.
> Exception in thread "main" org.apache.tika.exception.TikaException: TIKA-198: 
> Illegal IOException from org.apache.tika.parser.crypto.Pkcs7Parser@1c43f4e
>  at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:286)
>  at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>  at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
>  at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:209)
>  at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:496)
>  at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:149)
> Caused by: org.apache.tika.io.TaggedIOException: DEF length 465542 object 
> truncated by 465479
>  at 
> org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133)
>  at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:63)
>  at 
> org.apache.tika.parser.microsoft.POIFSContainerDetector.detect(POIFSContainerDetector.java:437)
>  at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:84)
>  at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:116)
>  at org.apache.tika.parser.crypto.Pkcs7Parser.parse(Pkcs7Parser.java:86)
>  at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>  ... 5 more
> Caused by: java.io.EOFException: DEF length 465542 object truncated by 465479
>  at org.bouncycastle.asn1.DefiniteLengthInputStream.read(Unknown Source)
>  at org.bouncycastle.asn1.DefiniteLengthInputStream.read(Unknown Source)
>  at org.bouncycastle.asn1.DefiniteLengthInputStream.read(Unknown Source)
>  at org.bouncycastle.asn1.DefiniteLengthInputStream.read(Unknown Source)
>  at org.bouncycastle.asn1.DefiniteLengthInputStream.read(Unknown Source)
>  at org.bouncycastle.asn1.DefiniteLengthInputStream.read(Unknown Source)
>  at java.io.BufferedInputStream.read1(Unknown Source)
>  at java.io.BufferedInputStream.read(Unknown Source)
>  at org.bouncycastle.util.io.Streams.readFully(Unknown Source)
>  at org.bouncycastle.cms.CMSTypedStream$FullReaderStream.read(Unknown Source)
>  at java.io.BufferedInputStream.fill(Unknown Source)
>  at java.io.BufferedInputStream.read(Unknown Source)
>  at java.io.FilterInputStream.read(Unknown Source)
>  at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:59)
>  ... 10 more
>  
>  
> {{java -jar {color:#14892c}tika-app-1.19.1.jar{color} 
> 0001.127_342_5_7955.pdf}}{{mai 10, 2019 11:26:28 AM 
> org.apache.tika.config.InitializableProblemHandler$3 
> handleInitializableProblem}}
>  {{ADVERTÊNCIA: J2KImageReader not loaded. JPEG2000 files will not be 
> processed.}}
>  {{See [https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io]}}
>  {{for optional dependencies.}}{{mai 10, 2019 11:26:28 AM 
> org.apache.tika.config.InitializableProblemHandler$3 
> handleInitializableProblem}}
>  {{ADVERTÊNCIA: org.xerial's sqlite-jdbc is not loaded.}}
>  {{Please provide the jar on your classpath to parse sqlite files.}}
>  {{See tika-parsers/pom.xml for the correct version.}}{{<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">}}
>  {{}}
>  {{}}{{...CORRECT XML 
> OUTPUT...}}





[jira] [Resolved] (TIKA-2869) Can't parse pdf in version 1.20 - Pkcs7Parser (DEF length 465542 object truncated by 465479)

2019-05-10 Thread Tim Allison (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-2869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-2869.
---
   Resolution: Fixed
 Assignee: Tim Allison
Fix Version/s: 1.21



[jira] [Commented] (TIKA-2869) Can't parse pdf in version 1.20 - Pkcs7Parser (DEF length 465542 object truncated by 465479)

2019-05-10 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16837420#comment-16837420
 ] 

Tim Allison commented on TIKA-2869:
---

I'm able to reproduce this in our 1.x branch but not in our master branch.  
I'll take a look.  Thank you for opening this issue and sharing a triggering 
file!


[jira] [Updated] (TIKA-2869) Can't parse pdf in version 1.20 - Pkcs7Parser (DEF length 465542 object truncated by 465479)

2019-05-10 Thread Edans Sandes (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-2869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Edans Sandes updated TIKA-2869:
---
Description: 
I could convert the attached pdf using tika-app-1.19.1.jar, but now, in version 
tika-app-1.20.jar, it stopped working.

{{java -jar {color:#ff}tika-app-1.20.jar{color} 0001.127_342_5_7955.pdf}}

{{mai 10, 2019 11:36:23 AM org.apache.tika.config.InitializableProblemHandler$3 
handleInitializableProblem}}
{{ADVERTÊNCIA: J2KImageReader not loaded. JPEG2000 files will not be 
processed.}}
{{See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io}}
{{for optional dependencies.}}{{mai 10, 2019 11:36:23 AM 
org.apache.tika.config.InitializableProblemHandler$3 
handleInitializableProblem}}
{{ADVERTÊNCIA: org.xerial's sqlite-jdbc is not loaded.}}
{{Please provide the jar on your classpath to parse sqlite files.}}
{{See tika-parsers/pom.xml for the correct version.}}
{{Exception in thread "main" org.apache.tika.exception.TikaException: TIKA-198: 
Illegal IOException from org.apache.tika.parser.crypto.Pkcs7Parser@1c43f4e}}
{{ at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:286)}}
{{ at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)}}
{{ at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)}}
{{ at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:209)}}
{{ at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:496)}}
{{ at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:149)}}
{{Caused by: org.apache.tika.io.TaggedIOException: DEF length 465542 object 
truncated by 465479}}
{{ at 
org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133)}}
{{ at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:63)}}
{{ at 
org.apache.tika.parser.microsoft.POIFSContainerDetector.detect(POIFSContainerDetector.java:437)}}
{{ at 
org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:84)}}
{{ at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:116)}}
{{ at org.apache.tika.parser.crypto.Pkcs7Parser.parse(Pkcs7Parser.java:86)}}
{{ at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)}}
{{ ... 5 more}}
{{Caused by: java.io.EOFException: DEF length 465542 object truncated by 
465479}}
{{ at org.bouncycastle.asn1.DefiniteLengthInputStream.read(Unknown Source)}}
{{ at org.bouncycastle.asn1.DefiniteLengthInputStream.read(Unknown Source)}}
{{ at org.bouncycastle.asn1.DefiniteLengthInputStream.read(Unknown Source)}}
{{ at org.bouncycastle.asn1.DefiniteLengthInputStream.read(Unknown Source)}}
{{ at org.bouncycastle.asn1.DefiniteLengthInputStream.read(Unknown Source)}}
{{ at org.bouncycastle.asn1.DefiniteLengthInputStream.read(Unknown Source)}}
{{ at java.io.BufferedInputStream.read1(Unknown Source)}}
{{ at java.io.BufferedInputStream.read(Unknown Source)}}
{{ at org.bouncycastle.util.io.Streams.readFully(Unknown Source)}}
{{ at org.bouncycastle.cms.CMSTypedStream$FullReaderStream.read(Unknown 
Source)}}
{{ at java.io.BufferedInputStream.fill(Unknown Source)}}
{{ at java.io.BufferedInputStream.read(Unknown Source)}}
{{ at java.io.FilterInputStream.read(Unknown Source)}}
{{ at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:59)}}
{{ ... 10 more}}
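
The trace above nests three exceptions: a TikaException wraps a TaggedIOException, which in turn wraps the EOFException raised inside Bouncy Castle. A small JDK-only sketch of walking such a chain down to its root cause follows. The exception types used here are plain `Exception` and `IOException` stand-ins rather than the Tika classes, and `rootCause` and `sampleChain` are hypothetical helpers for illustration.

```java
import java.io.EOFException;
import java.io.IOException;

final class RootCauseSketch {
    // Follows getCause() until no further cause is available.
    static Throwable rootCause(Throwable t) {
        Throwable current = t;
        while (current.getCause() != null && current.getCause() != current) {
            current = current.getCause();
        }
        return current;
    }

    // Builds a three-level chain shaped like the trace in the report.
    static Exception sampleChain() {
        EOFException eof =
                new EOFException("DEF length 465542 object truncated by 465479");
        IOException tagged = new IOException(eof.getMessage(), eof);
        return new Exception("Illegal IOException from parser", tagged);
    }

    public static void main(String[] args) {
        Throwable root = rootCause(sampleChain());
        System.out.println(root.getClass().getName() + ": " + root.getMessage());
    }
}
```

Logging the root cause alongside the outermost message is often enough to tell a truncated embedded object apart from a genuine parser bug.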

 

 

{{java -jar {color:#14892c}tika-app-1.19.1.jar{color} 
0001.127_342_5_7955.pdf}}{{mai 10, 2019 11:26:28 AM 
org.apache.tika.config.InitializableProblemHandler$3 
handleInitializableProblem}}
 {{ADVERTÊNCIA: J2KImageReader not loaded. JPEG2000 files will not be 
processed.}}
 {{See [https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io]}}
 {{for optional dependencies.}}{{mai 10, 2019 11:26:28 AM 
org.apache.tika.config.InitializableProblemHandler$3 
handleInitializableProblem}}
 {{ADVERTÊNCIA: org.xerial's sqlite-jdbc is not loaded.}}
 {{Please provide the jar on your classpath to parse sqlite files.}}
 {{See tika-parsers/pom.xml for the correct version.}}{{<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">}}
 {{}}
 {{}}{{...CORRECT XML 
OUTPUT...}}

  was:
I could convert the attached pdf using tika-app-1.19.1.jar, but now, in version 
tika-app-1.20.jar, it stopped working.

{{java -jar {color:#FF}tika-app-1.20.jar{color} 0001.127_342_5_7955.pdf}}

{{}}{{mai 10, 2019 11:20:40 AM 
org.apache.tika.config.InitializableProblemHandler$3 
handleInitializableProblem}}
{{ADVERTÊNCIA: J2KImageReader not loaded. JPEG2000 files will not be 
processed.}}
{{See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io}}
{{for optional dependencies.}}{{mai 10, 2019 11:20:40 AM 
org.apache.tika.config.InitializableProblemHandler$3 
handleInitializableProblem}}
{{ADVERT╩NCIA: org.xerial's sqlite-jdbc is not loaded.}}
{{Please provide the jar on your classpath to parse sqlite files.}}
{{See tika-parsers/pom.xml for the correct version.}}
{{Exception in thread "main" org.ap

[jira] [Updated] (TIKA-2869) Can't parse pdf in version 1.20 - Pkcs7Parser (DEF length 465542 object truncated by 465479)

2019-05-10 Thread Edans Sandes (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-2869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Edans Sandes updated TIKA-2869:
---
Description: 
I could convert the attached pdf using tika-app-1.19.1.jar, but now, in version 
tika-app-1.20.jar, it stopped working.

{{java -jar {color:#ff}tika-app-1.20.jar{color} 0001.127_342_5_7955.pdf}}

mai 10, 2019 11:36:23 AM org.apache.tika.config.InitializableProblemHandler$3 
handleInitializableProblem
ADVERTÊNCIA: J2KImageReader not loaded. JPEG2000 files will not be processed.
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.

mai 10, 2019 11:36:23 AM org.apache.tika.config.InitializableProblemHandler$3 
handleInitializableProblem
ADVERTÊNCIA: org.xerial's sqlite-jdbc is not loaded.
Please provide the jar on your classpath to parse sqlite files.
See tika-parsers/pom.xml for the correct version.
Exception in thread "main" org.apache.tika.exception.TikaException: TIKA-198: 
Illegal IOException from org.apache.tika.parser.crypto.Pkcs7Parser@1c43f4e
 at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:286)
 at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
 at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
 at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:209)
 at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:496)
 at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:149)
Caused by: org.apache.tika.io.TaggedIOException: DEF length 465542 object 
truncated by 465479
 at 
org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133)
 at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:63)
 at 
org.apache.tika.parser.microsoft.POIFSContainerDetector.detect(POIFSContainerDetector.java:437)
 at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:84)
 at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:116)
 at org.apache.tika.parser.crypto.Pkcs7Parser.parse(Pkcs7Parser.java:86)
 at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
 ... 5 more
Caused by: java.io.EOFException: DEF length 465542 object truncated by 465479
 at org.bouncycastle.asn1.DefiniteLengthInputStream.read(Unknown Source)
 at org.bouncycastle.asn1.DefiniteLengthInputStream.read(Unknown Source)
 at org.bouncycastle.asn1.DefiniteLengthInputStream.read(Unknown Source)
 at org.bouncycastle.asn1.DefiniteLengthInputStream.read(Unknown Source)
 at org.bouncycastle.asn1.DefiniteLengthInputStream.read(Unknown Source)
 at org.bouncycastle.asn1.DefiniteLengthInputStream.read(Unknown Source)
 at java.io.BufferedInputStream.read1(Unknown Source)
 at java.io.BufferedInputStream.read(Unknown Source)
 at org.bouncycastle.util.io.Streams.readFully(Unknown Source)
 at org.bouncycastle.cms.CMSTypedStream$FullReaderStream.read(Unknown Source)
 at java.io.BufferedInputStream.fill(Unknown Source)
 at java.io.BufferedInputStream.read(Unknown Source)
 at java.io.FilterInputStream.read(Unknown Source)
 at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:59)
 ... 10 more

 

 

{{java -jar {color:#14892c}tika-app-1.19.1.jar{color} 
0001.127_342_5_7955.pdf}}{{mai 10, 2019 11:26:28 AM 
org.apache.tika.config.InitializableProblemHandler$3 
handleInitializableProblem}}
 {{ADVERTÊNCIA: J2KImageReader not loaded. JPEG2000 files will not be 
processed.}}
 {{See [https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io]}}
 {{for optional dependencies.}}{{mai 10, 2019 11:26:28 AM 
org.apache.tika.config.InitializableProblemHandler$3 
handleInitializableProblem}}
 {{ADVERTÊNCIA: org.xerial's sqlite-jdbc is not loaded.}}
 {{Please provide the jar on your classpath to parse sqlite files.}}
 {{See tika-parsers/pom.xml for the correct version.}}{{<html xmlns="http://www.w3.org/1999/xhtml">}}
 {{}}
 {{}}{{...CORRECT XML 
OUTPUT...}}


[jira] [Created] (TIKA-2869) Can't parse pdf in version 1.20 - Pkcs7Parser (DEF length 465542 object truncated by 465479)

2019-05-10 Thread Edans Sandes (JIRA)
Edans Sandes created TIKA-2869:
--

 Summary: Can't parse pdf in version 1.20 - Pkcs7Parser (DEF length 
465542 object truncated by 465479)
 Key: TIKA-2869
 URL: https://issues.apache.org/jira/browse/TIKA-2869
 Project: Tika
  Issue Type: Bug
  Components: app, cli, parser
Affects Versions: 1.20
 Environment: Windows 10 (1809 - 17763.437)

Java(TM) SE Runtime Environment (build 1.8.0_121-b13)
Java HotSpot(TM) Client VM (build 25.121-b13, mixed mode)
Reporter: Edans Sandes
 Attachments: 0001.127_342_5_7955.pdf

I could convert the attached pdf using tika-app-1.19.1.jar, but now, in version 
tika-app-1.20.jar, it stopped working.

{{java -jar {color:#FF}tika-app-1.20.jar{color} 0001.127_342_5_7955.pdf}}

{{}}{{mai 10, 2019 11:20:40 AM 
org.apache.tika.config.InitializableProblemHandler$3 
handleInitializableProblem}}
{{ADVERTÊNCIA: J2KImageReader not loaded. JPEG2000 files will not be 
processed.}}
{{See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io}}
{{for optional dependencies.}}{{mai 10, 2019 11:20:40 AM 
org.apache.tika.config.InitializableProblemHandler$3 
handleInitializableProblem}}
{{ADVERTÊNCIA: org.xerial's sqlite-jdbc is not loaded.}}
{{Please provide the jar on your classpath to parse sqlite files.}}
{{See tika-parsers/pom.xml for the correct version.}}
{{Exception in thread "main" org.apache.tika.exception.TikaException: TIKA-198: 
Illegal IOException from org.apache.tika.parser.crypto.Pkcs7Parser@1c43f4e}}
{{ at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:286)}}
{{ at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)}}
{{ at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)}}
{{ at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:209)}}
{{ at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:496)}}
{{ at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:149)}}
{{Caused by: org.apache.tika.io.TaggedIOException: DEF length 465542 object 
truncated by 465479}}
{{ at 
org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133)}}
{{ at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:63)}}
{{ at 
org.apache.tika.parser.microsoft.POIFSContainerDetector.detect(POIFSContainerDetector.java:437)}}
{{ at 
org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:84)}}
{{ at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:116)}}
{{ at org.apache.tika.parser.crypto.Pkcs7Parser.parse(Pkcs7Parser.java:86)}}
{{ at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)}}
{{ ... 5 more}}
{{Caused by: java.io.EOFException: DEF length 465542 object truncated by 
465479}}
{{ at org.bouncycastle.asn1.DefiniteLengthInputStream.read(Unknown Source)}}
{{ at org.bouncycastle.asn1.DefiniteLengthInputStream.read(Unknown Source)}}
{{ at org.bouncycastle.asn1.DefiniteLengthInputStream.read(Unknown Source)}}
{{ at org.bouncycastle.asn1.DefiniteLengthInputStream.read(Unknown Source)}}
{{ at org.bouncycastle.asn1.DefiniteLengthInputStream.read(Unknown Source)}}
{{ at org.bouncycastle.asn1.DefiniteLengthInputStream.read(Unknown Source)}}
{{ at java.io.BufferedInputStream.read1(Unknown Source)}}
{{ at java.io.BufferedInputStream.read(Unknown Source)}}
{{ at org.bouncycastle.util.io.Streams.readFully(Unknown Source)}}
{{ at org.bouncycastle.cms.CMSTypedStream$FullReaderStream.read(Unknown 
Source)}}
{{ at java.io.BufferedInputStream.fill(Unknown Source)}}
{{ at java.io.BufferedInputStream.read(Unknown Source)}}
{{ at java.io.FilterInputStream.read(Unknown Source)}}
{{ at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:59)}}
{{ ... 10 more}}

 

 

{{java -jar {color:#14892c}tika-app-1.19.1.jar{color} 
0001.127_342_5_7955.pdf}}{{mai 10, 2019 11:26:28 AM 
org.apache.tika.config.InitializableProblemHandler$3 
handleInitializableProblem}}
{{ADVERTÊNCIA: J2KImageReader not loaded. JPEG2000 files will not be 
processed.}}
{{See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io}}
{{for optional dependencies.}}{{mai 10, 2019 11:26:28 AM 
org.apache.tika.config.InitializableProblemHandler$3 
handleInitializableProblem}}
{{ADVERTÊNCIA: org.xerial's sqlite-jdbc is not loaded.}}
{{Please provide the jar on your classpath to parse sqlite files.}}
{{See tika-parsers/pom.xml for the correct version.}}{{<html xmlns="http://www.w3.org/1999/xhtml">}}
{{}}
{{}}{{...CORRECT XML 
OUTPUT...}}
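
For readers unfamiliar with the innermost exception in the 1.20 trace above, here is a minimal sketch of how a definite-length reader can end up reporting "DEF length N object truncated by M". It is written in Python for brevity and is not the BouncyCastle or Tika code; the class name DefiniteLengthReader and the helper read_all are hypothetical, and the sketch assumes that M is the number of promised bytes that were never delivered.

```python
import io


class DefiniteLengthReader:
    """Reads exactly `length` bytes from `source` (hypothetical illustration)."""

    def __init__(self, source, length):
        self._source = source
        self._original_length = length
        self._remaining = length

    def read(self, size=-1):
        if self._remaining == 0:
            return b""  # the declared length has been fully consumed
        if size < 0 or size > self._remaining:
            size = self._remaining
        chunk = self._source.read(size)
        if not chunk:
            # The length header promised more bytes than the stream holds.
            raise EOFError(
                f"DEF length {self._original_length} "
                f"object truncated by {self._remaining}"
            )
        self._remaining -= len(chunk)
        return chunk


def read_all(reader, chunk_size=4096):
    """Drain the reader, returning everything it yields."""
    parts = []
    while True:
        chunk = reader.read(chunk_size)
        if not chunk:
            return b"".join(parts)
        parts.append(chunk)


# A header that declares 10 bytes over a stream that only holds 4.
try:
    read_all(DefiniteLengthReader(io.BytesIO(b"abcd"), 10))
except EOFError as err:
    print(err)  # DEF length 10 object truncated by 6
```

If the message in the trace has the same meaning, a declared length of 465542 with 465479 missing would mean that only 63 bytes were available before the stream ended. That is an inference from the numbers, not something the report states.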



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (TIKA-2855) pdfbox version used by both Apache Tika 1.19.1 and 1.20 is vulnerable

2019-04-19 Thread Tim Allison (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-2855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-2855.
---
Resolution: Duplicate

Thank you!

> pdfbox version used by both Apache Tika 1.19.1 and 1.20 is vulnerable
> -
>
> Key: TIKA-2855
> URL: https://issues.apache.org/jira/browse/TIKA-2855
> Project: Tika
>  Issue Type: Bug
>  Components: core
>Affects Versions: 1.19.1
>Reporter: Abhijit Rajwade
>Priority: Major
>
> According to Sonatype Nexus Auditor, pdfbox versions up to and including 2.0.14
> are vulnerable to "CVE-2019-0228: possible XML External Entity (XXE) attack".
> The recommended fix is to upgrade to pdfbox 2.0.15. See the following pdfbox issue,
>   https://issues.apache.org/jira/browse/PDFBOX-4505
> which is fixed in version 2.0.15.
> Can you please upgrade Apache Tika to use pdfbox 2.0.15?
> The following details are from the Sonatype Nexus scan report:
> Issue: CVE-2019-0228 
> Severity: Sonatype CVSS 3.0: 7.3 
> Weakness: Sonatype CWE: 611 
> Source: National Vulnerability Database 
> Categories: Data 
> Description from CVE: apache pdfbox - XML External Entity (XXE) 
> Root Cause: pdfbox-2.0.12.jar : ( , 2.0.15) 
> Advisories:
> Project: https://github.com/apache/pdfbox-docs/commit/b7869c3e4c62c5d...
> Project: https://issues.apache.org/jira/browse/PDFBOX-4505
> Third Party: https://bugzilla.redhat.com/show_bug.cgi?id=1699740 
> CVSS Details:
> Sonatype CVSS 3.0: 7.3
> CVSS Vector: CVSS:3.0/AV:N/AC:L/PR:N/UI:N/S:U/C:L/I:L/A:L 





[jira] [Created] (TIKA-2855) pdfbox version used by both Apache Tika 1.19.1 and 1.20 is vulnerable

2019-04-18 Thread Abhijit Rajwade (JIRA)
Abhijit Rajwade created TIKA-2855:
-

 Summary: pdfbox version used by both Apache Tika 1.19.1 and 1.20 
is vulnerable
 Key: TIKA-2855
 URL: https://issues.apache.org/jira/browse/TIKA-2855
 Project: Tika
  Issue Type: Bug
  Components: core
Affects Versions: 1.19.1
Reporter: Abhijit Rajwade


According to Sonatype Nexus Auditor, pdfbox versions up to and including 2.0.14
are vulnerable to "CVE-2019-0228: possible XML External Entity (XXE) attack".

The recommended fix is to upgrade to pdfbox 2.0.15. See the following pdfbox issue,
  https://issues.apache.org/jira/browse/PDFBOX-4505
which is fixed in version 2.0.15.

Can you please upgrade Apache Tika to use pdfbox 2.0.15?

The following details are from the Sonatype Nexus scan report:

Issue: CVE-2019-0228 
Severity: Sonatype CVSS 3.0: 7.3 
Weakness: Sonatype CWE: 611 
Source: National Vulnerability Database 
Categories: Data 

Description from CVE: apache pdfbox - XML External Entity (XXE) 
Root Cause: pdfbox-2.0.12.jar : ( , 2.0.15) 
Advisories:
Project: https://github.com/apache/pdfbox-docs/commit/b7869c3e4c62c5d...
Project: https://issues.apache.org/jira/browse/PDFBOX-4505
Third Party: https://bugzilla.redhat.com/show_bug.cgi?id=1699740 
CVSS Details:
Sonatype CVSS 3.0: 7.3
CVSS Vector: CVSS:3.0/AV:N/AC:L/PR:N/UI:N/S:U/C:L/I:L/A:L 
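
The "Root Cause" line above gives the affected range as ( , 2.0.15), which reads as every version strictly below 2.0.15. A minimal sketch of that check follows, assuming plain dotted numeric versions; the function names parse_version and is_below are hypothetical and are not part of any scanner's API.

```python
def parse_version(text):
    """Turn '2.0.12' into (2, 0, 12) so that tuples compare numerically."""
    return tuple(int(part) for part in text.split("."))


def is_below(version, fixed_in):
    """True when `version` falls in the open-ended range ( , fixed_in)."""
    return parse_version(version) < parse_version(fixed_in)


print(is_below("2.0.12", "2.0.15"))  # True: the bundled jar is in range
print(is_below("2.0.15", "2.0.15"))  # False: the fixed release is excluded
```

Comparing integer tuples rather than raw strings matters here, because a plain string comparison would sort "2.0.9" after "2.0.15".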






Re: [VOTE] Release Apache Tika 1.20 Candidate #1

2018-12-22 Thread Tim Allison
Thank you, Oleg and Ken!

On Sat, Dec 22, 2018 at 6:57 AM Oleg Tikhonov  wrote:
>
> *stuff
>
> On Sat, Dec 22, 2018, 11:01 Oleg Tikhonov wrote:
> > All basic staff passed.
> > +1.
> > Oleg
> >
> > On Fri, Dec 21, 2018, 22:02 Ken Krugler wrote:
> >
> >> Hi Tim,
> >>
> >> Thanks for rolling the release.
> >>
> >> Built & validated on Mac OS X 10.12
> >>
> >> Updated flink-crawler, all tests pass.
> >>
> >> So here’s my +1
> >>
> >> — Ken
> >>
> >>
> >> > On Dec 17, 2018, at 6:14 PM, Tim Allison  wrote:
> >> >
> >> > A candidate for the Tika 1.20 release is available at:
> >> >
> >> >  https://dist.apache.org/repos/dist/dev/tika/
> >> >
> >> > The release candidate is a zip archive of the sources in:
> >> >  https://github.com/apache/tika/tree/1.20-rc1/
> >> >
> >> > The SHA-512 checksum of the archive is
> >> >
> >> add29bebe0486f01bd57fe5bec5405df1af9f319a87c74295aea2e628c5ebc0d49c09570511033e30ac44c4a6e09ec81cbba9f485b3c21a35b01014a74f1852e.
> >> >
> >> > In addition, a staged maven repository is available here:
> >> >
> >> https://repository.apache.org/content/repositories/orgapachetika-1046/org/apache/tika
> >> >
> >> >
> >> > Please vote on releasing this package as Apache Tika 1.20.
> >> >
> >> > The vote is open for the next 72 hours and passes if a majority of at
> >> > least three +1 Tika PMC votes are cast.
> >> >
> >> > [ ] +1 Release this package as Apache Tika 1.20
> >> > [ ] -1 Do not release this package because...
> >> >
> >> > Here's my +1.
> >> >
> >> > Cheers,
> >> >
> >> >  Tim
> >>
> >> --
> >> Ken Krugler
> >> +1 530-210-6378
> >> http://www.scaleunlimited.com
> >> Custom big data solutions & training
> >> Flink, Solr, Hadoop, Cascading & Cassandra
> >>
> >>


[ANNOUNCE] Apache Tika 1.20 released

2018-12-22 Thread Tim Allison
The Apache Tika project is pleased to announce the release of Apache Tika
1.20. The release contents have been pushed out to the main Apache
release site and to the Maven Central sync, so the releases should be
available as soon as the mirrors get the syncs.

Apache Tika is a toolkit for detecting and extracting metadata and
structured text content from various documents using existing parser
libraries.

Apache Tika 1.20 contains a number of improvements and bug fixes.
Details can be found in the changes file:
https://www.apache.org/dist/tika/CHANGES-1.20.txt

Apache Tika is available on the download page:
https://tika.apache.org/download.html

Apache Tika is also available in binary form or for use using Maven 2
from the Central Repository:
https://repo1.maven.org/maven2/org/apache/tika/

In the initial 48 hours, the release may not be available on all mirrors.
When downloading from a mirror site, please remember to verify the
downloads using signatures found:
https://www.apache.org/dist/tika/KEYS

For more information on Apache Tika, visit the project home page:
https://tika.apache.org/

-- Tim Allison, on behalf of the Apache Tika community


[RESULT][VOTE] Release Apache Tika 1.20 Candidate #1

2018-12-22 Thread Tim Allison
The vote has passed:
+1 from
Oleg Tikhonov
Ken Krugler
Tim Allison

no -1

Cheers,

   Tim
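
As a rough illustration of the tally above, the sketch below encodes one reading of the rule quoted from the vote email ("a majority of at least three +1 Tika PMC votes"): at least three +1 votes, and more +1 than -1. That reading is an assumption, as is treating all three listed votes as PMC votes, which the message does not state. The function name vote_passes is hypothetical.

```python
def vote_passes(votes, minimum_plus_ones=3):
    """votes maps a voter name to +1 or -1 (counted votes only)."""
    plus = sum(1 for value in votes.values() if value == +1)
    minus = sum(1 for value in votes.values() if value == -1)
    # Assumed rule: enough +1 votes, and +1 votes outnumber -1 votes.
    return plus >= minimum_plus_ones and plus > minus


result = vote_passes({"Oleg Tikhonov": +1, "Ken Krugler": +1, "Tim Allison": +1})
print(result)  # True
```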

On Sat, Dec 22, 2018 at 6:57 AM Oleg Tikhonov  wrote:
>
> *stuff
>
> On Sat, Dec 22, 2018, 11:01 Oleg Tikhonov wrote:
> > All basic staff passed.
> > +1.
> > Oleg
> >
> > On Fri, Dec 21, 2018, 22:02 Ken Krugler wrote:
> >
> >> Hi Tim,
> >>
> >> Thanks for rolling the release.
> >>
> >> Built & validated on Mac OS X 10.12
> >>
> >> Updated flink-crawler, all tests pass.
> >>
> >> So here’s my +1
> >>
> >> — Ken
> >>
> >>
> >> > On Dec 17, 2018, at 6:14 PM, Tim Allison  wrote:
> >> >
> >> > A candidate for the Tika 1.20 release is available at:
> >> >
> >> >  https://dist.apache.org/repos/dist/dev/tika/
> >> >
> >> > The release candidate is a zip archive of the sources in:
> >> >  https://github.com/apache/tika/tree/1.20-rc1/
> >> >
> >> > The SHA-512 checksum of the archive is
> >> >
> >> add29bebe0486f01bd57fe5bec5405df1af9f319a87c74295aea2e628c5ebc0d49c09570511033e30ac44c4a6e09ec81cbba9f485b3c21a35b01014a74f1852e.
> >> >
> >> > In addition, a staged maven repository is available here:
> >> >
> >> https://repository.apache.org/content/repositories/orgapachetika-1046/org/apache/tika
> >> >
> >> >
> >> > Please vote on releasing this package as Apache Tika 1.20.
> >> >
> >> > The vote is open for the next 72 hours and passes if a majority of at
> >> > least three +1 Tika PMC votes are cast.
> >> >
> >> > [ ] +1 Release this package as Apache Tika 1.20
> >> > [ ] -1 Do not release this package because...
> >> >
> >> > Here's my +1.
> >> >
> >> > Cheers,
> >> >
> >> >  Tim
> >>
> >> --
> >> Ken Krugler
> >> +1 530-210-6378
> >> http://www.scaleunlimited.com
> >> Custom big data solutions & training
> >> Flink, Solr, Hadoop, Cascading & Cassandra
> >>
> >>


Re: [VOTE] Release Apache Tika 1.20 Candidate #1

2018-12-22 Thread Oleg Tikhonov
*stuff

On Sat, Dec 22, 2018, 11:01 Oleg Tikhonov wrote:

> All basic staff passed.
> +1.
> Oleg
>
> On Fri, Dec 21, 2018, 22:02 Ken Krugler  wrote:
>
>> Hi Tim,
>>
>> Thanks for rolling the release.
>>
>> Built & validated on Mac OS X 10.12
>>
>> Updated flink-crawler, all tests pass.
>>
>> So here’s my +1
>>
>> — Ken
>>
>>
>> > On Dec 17, 2018, at 6:14 PM, Tim Allison  wrote:
>> >
>> > A candidate for the Tika 1.20 release is available at:
>> >
>> >  https://dist.apache.org/repos/dist/dev/tika/
>> >
>> > The release candidate is a zip archive of the sources in:
>> >  https://github.com/apache/tika/tree/1.20-rc1/
>> >
>> > The SHA-512 checksum of the archive is
>> >
>> add29bebe0486f01bd57fe5bec5405df1af9f319a87c74295aea2e628c5ebc0d49c09570511033e30ac44c4a6e09ec81cbba9f485b3c21a35b01014a74f1852e.
>> >
>> > In addition, a staged maven repository is available here:
>> >
>> https://repository.apache.org/content/repositories/orgapachetika-1046/org/apache/tika
>> >
>> >
>> > Please vote on releasing this package as Apache Tika 1.20.
>> >
>> > The vote is open for the next 72 hours and passes if a majority of at
>> > least three +1 Tika PMC votes are cast.
>> >
>> > [ ] +1 Release this package as Apache Tika 1.20
>> > [ ] -1 Do not release this package because...
>> >
>> > Here's my +1.
>> >
>> > Cheers,
>> >
>> >  Tim
>>
>> --
>> Ken Krugler
>> +1 530-210-6378
>> http://www.scaleunlimited.com
>> Custom big data solutions & training
>> Flink, Solr, Hadoop, Cascading & Cassandra
>>
>>


Re: [VOTE] Release Apache Tika 1.20 Candidate #1

2018-12-22 Thread Oleg Tikhonov
All basic staff passed.
+1.
Oleg

On Fri, Dec 21, 2018, 22:02 Ken Krugler wrote:

> Hi Tim,
>
> Thanks for rolling the release.
>
> Built & validated on Mac OS X 10.12
>
> Updated flink-crawler, all tests pass.
>
> So here’s my +1
>
> — Ken
>
>
> > On Dec 17, 2018, at 6:14 PM, Tim Allison  wrote:
> >
> > A candidate for the Tika 1.20 release is available at:
> >
> >  https://dist.apache.org/repos/dist/dev/tika/
> >
> > The release candidate is a zip archive of the sources in:
> >  https://github.com/apache/tika/tree/1.20-rc1/
> >
> > The SHA-512 checksum of the archive is
> >
> add29bebe0486f01bd57fe5bec5405df1af9f319a87c74295aea2e628c5ebc0d49c09570511033e30ac44c4a6e09ec81cbba9f485b3c21a35b01014a74f1852e.
> >
> > In addition, a staged maven repository is available here:
> >
> https://repository.apache.org/content/repositories/orgapachetika-1046/org/apache/tika
> >
> >
> > Please vote on releasing this package as Apache Tika 1.20.
> >
> > The vote is open for the next 72 hours and passes if a majority of at
> > least three +1 Tika PMC votes are cast.
> >
> > [ ] +1 Release this package as Apache Tika 1.20
> > [ ] -1 Do not release this package because...
> >
> > Here's my +1.
> >
> > Cheers,
> >
> >  Tim
>
> --
> Ken Krugler
> +1 530-210-6378
> http://www.scaleunlimited.com
> Custom big data solutions & training
> Flink, Solr, Hadoop, Cascading & Cassandra
>
>


Re: [VOTE] Release Apache Tika 1.20 Candidate #1

2018-12-21 Thread Ken Krugler
Hi Tim,

Thanks for rolling the release.

Built & validated on Mac OS X 10.12

Updated flink-crawler, all tests pass.

So here’s my +1

— Ken


> On Dec 17, 2018, at 6:14 PM, Tim Allison  wrote:
> 
> A candidate for the Tika 1.20 release is available at:
> 
>  https://dist.apache.org/repos/dist/dev/tika/
> 
> The release candidate is a zip archive of the sources in:
>  https://github.com/apache/tika/tree/1.20-rc1/
> 
> The SHA-512 checksum of the archive is
> add29bebe0486f01bd57fe5bec5405df1af9f319a87c74295aea2e628c5ebc0d49c09570511033e30ac44c4a6e09ec81cbba9f485b3c21a35b01014a74f1852e.
> 
> In addition, a staged maven repository is available here:
> https://repository.apache.org/content/repositories/orgapachetika-1046/org/apache/tika
> 
> 
> Please vote on releasing this package as Apache Tika 1.20.
> 
> The vote is open for the next 72 hours and passes if a majority of at
> least three +1 Tika PMC votes are cast.
> 
> [ ] +1 Release this package as Apache Tika 1.20
> [ ] -1 Do not release this package because...
> 
> Here's my +1.
> 
> Cheers,
> 
>  Tim

--
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
Custom big data solutions & training
Flink, Solr, Hadoop, Cascading & Cassandra



Re: 1.20?

2018-12-18 Thread Tim Allison
Reports on mp4s, junrar, msaccess and a random subset of the
regression corpus are available here:
http://162.242.228.174/reports/reports_tika_1_20-rc1_subset.tgz


On Thu, Dec 13, 2018 at 5:34 PM Tim Allison  wrote:
>
> Let me actually take a look before answering. Sorry!
>
> On Thu, Dec 13, 2018 at 5:30 PM Tim Allison  wrote:
>>
>>  Thank you for reading the reports!!!
>>
>> The files are very likely broken.  I can take a look.  The change was
>> probably because of an "upgrade" to junrar.  Should I revert to the
>> version we used in 1.19.1?
>> On Thu, Dec 13, 2018 at 1:34 PM Luís Filipe Nassif  
>> wrote:
>> >
>> > Hi Tim,
>> >
>> > Reading your great reports, I also saw some new exceptions with RAR files
>> > in likely broken folder, but seems tika was able to extract some text from
>> > them before. Do you know if those files are really broken and why tika
>> > extracted text from them before?
>> >
>> > Thank you,
>> > Luis
>> >
>> > Em qui, 13 de dez de 2018 às 13:02, Tim Allison 
>> > escreveu:
>> >
>> > > Reports are here:
>> > >
>> > > http://162.242.228.174/reports/tika_1_20-pre-rc1.zip
>> > >
>> > > I'm going to revert the mp4 parser, and commit the few dependency
>> > > upgrades I ran.
>> > >
>> > > The _major_ difference in content for ppt is explained by the
>> > > duplication of header/footer info.  To confirm this, note that the
>> > > values for "num_unique_tokens_a" and "num_unique_tokens_b" are
>> > > identical for nearly all ppt->ppt, but there are far more tokens in
>> > > "num_tokens_a" vs "num_tokens_b".
>> > >
>> > > I also see that we're losing content in x-java and x-groovy, etc., but
>> > > that's because we're now suppressing the style markup that our parser
>> > > was (incorrectly, IMHO, inserting) -- check the values in
>> > > "top_10_unique_token_diffs_a", e.g.: rgb: 15 | color: 14 | font: 9 |
>> > > 0,0,0: 4 | background: 4 | 147,147,147: 3 | 247,247,247: 3 | bold: 3 |
>> > > weight: 3 | family: 2
>> > >
>> > > In short, I think we're good to go.  Will roll rc1 later today or
>> > > (more likely) tomorrow unless there are objections.
>> > > On Mon, Dec 10, 2018 at 9:37 PM Tim Allison  wrote:
>> > > >
>> > > > Any blockers on 1.20?  I'm going to kick off the regression tests
>> > > shortly.
>> > > > On Fri, Nov 30, 2018 at 7:39 PM  wrote:
>> > > > >
>> > > > > Hi,
>> > > > > On Wed, 21 Nov 2018 at 13:00, Tim Allison  
>> > > > > wrote:
>> > > > >
>> > > > > > Dave,
>> > > > > >   Should I try to get the Docker plugin working again?
>> > > > > >
>> > > > >
>> > > > > That would be great. I think I may have went down the wrong path
>> > > building
>> > > > > an image at package time, as there doesn't seem to be an easy way to
>> > > > > publish it as an Apache labelled org on Dockerhub unless it builds 
>> > > > > from
>> > > > > source.
>> > > > >
>> > > > > I have some time over the weekend, so could update to where I got to
>> > > and
>> > > > > see what you think.
>> > > > >
>> > > > > Cheers,
>> > > > > Dave
>> > >


[VOTE] Release Apache Tika 1.20 Candidate #1

2018-12-17 Thread Tim Allison
A candidate for the Tika 1.20 release is available at:

  https://dist.apache.org/repos/dist/dev/tika/

The release candidate is a zip archive of the sources in:
  https://github.com/apache/tika/tree/1.20-rc1/

The SHA-512 checksum of the archive is
add29bebe0486f01bd57fe5bec5405df1af9f319a87c74295aea2e628c5ebc0d49c09570511033e30ac44c4a6e09ec81cbba9f485b3c21a35b01014a74f1852e.

In addition, a staged maven repository is available here:
https://repository.apache.org/content/repositories/orgapachetika-1046/org/apache/tika


Please vote on releasing this package as Apache Tika 1.20.

The vote is open for the next 72 hours and passes if a majority of at
least three +1 Tika PMC votes are cast.

[ ] +1 Release this package as Apache Tika 1.20
[ ] -1 Do not release this package because...

Here's my +1.

Cheers,

  Tim
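
For anyone checking the candidate, the SHA-512 line in the message above can be verified along these lines. This is a minimal sketch using Python's standard hashlib; the function names are hypothetical and the message does not name the archive file, so any path passed in is the reader's own. Note that the published value ends with a sentence period, which the sketch strips before comparing.

```python
import hashlib


def sha512_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large archives need not fit in memory."""
    digest = hashlib.sha512()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_matches(path, published):
    """Compare a file against a published hex digest."""
    # Drop surrounding whitespace and the trailing sentence period.
    expected = published.strip().rstrip(".").lower()
    return sha512_of(path) == expected
```

A call would look like checksum_matches("path/to/downloaded-archive.zip", "add29beb...") with the full digest pasted from the email; the path shown is a placeholder.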


Re: 1.20?

2018-12-14 Thread Tim Allison
Thank you, again, Luís Filipe Nassif!  There's no point in having
reports unless we pay attention to them :P.  I reverted junrar to
where it was in 1.19.1. I also reverted jackcess based on the reports.

All,
  On the theory that it isn't a great idea to push to production on a
Friday, I'm going to let the recent changes rest over the weekend.
I'll rerun some tests on a subset of the regression corpus on Monday
and then roll rc1.  If anyone wants to kick the tires on the recent
version changes, including parsers that depend on the upgraded guava,
that'd be great!

Onward!

Cheers,

   Tim

On Thu, Dec 13, 2018 at 5:34 PM Tim Allison  wrote:
>
> Let me actually take a look before answering. Sorry!
>
> On Thu, Dec 13, 2018 at 5:30 PM Tim Allison  wrote:
>>
>>  Thank you for reading the reports!!!
>>
>> The files are very likely broken.  I can take a look.  The change was
>> probably because of an "upgrade" to junrar.  Should I revert to the
>> version we used in 1.19.1?
>> On Thu, Dec 13, 2018 at 1:34 PM Luís Filipe Nassif  
>> wrote:
>> >
>> > Hi Tim,
>> >
>> > Reading your great reports, I also saw some new exceptions with RAR files
>> > in likely broken folder, but seems tika was able to extract some text from
>> > them before. Do you know if those files are really broken and why tika
>> > extracted text from them before?
>> >
>> > Thank you,
>> > Luis
>> >
>> > Em qui, 13 de dez de 2018 às 13:02, Tim Allison 
>> > escreveu:
>> >
>> > > Reports are here:
>> > >
>> > > http://162.242.228.174/reports/tika_1_20-pre-rc1.zip
>> > >
>> > > I'm going to revert the mp4 parser, and commit the few dependency
>> > > upgrades I ran.
>> > >
>> > > The _major_ difference in content for ppt is explained by the
>> > > duplication of header/footer info.  To confirm this, note that the
>> > > values for "num_unique_tokens_a" and "num_unique_tokens_b" are
>> > > identical for nearly all ppt->ppt, but there are far more tokens in
>> > > "num_tokens_a" vs "num_tokens_b".
>> > >
>> > > I also see that we're losing content in x-java and x-groovy, etc., but
>> > > that's because we're now suppressing the style markup that our parser
>> > > was (incorrectly, IMHO, inserting) -- check the values in
>> > > "top_10_unique_token_diffs_a", e.g.: rgb: 15 | color: 14 | font: 9 |
>> > > 0,0,0: 4 | background: 4 | 147,147,147: 3 | 247,247,247: 3 | bold: 3 |
>> > > weight: 3 | family: 2
>> > >
>> > > In short, I think we're good to go.  Will roll rc1 later today or
>> > > (more likely) tomorrow unless there are objections.
>> > > On Mon, Dec 10, 2018 at 9:37 PM Tim Allison  wrote:
>> > > >
>> > > > Any blockers on 1.20?  I'm going to kick off the regression tests
>> > > shortly.
>> > > > On Fri, Nov 30, 2018 at 7:39 PM  wrote:
>> > > > >
>> > > > > Hi,
>> > > > > On Wed, 21 Nov 2018 at 13:00, Tim Allison  
>> > > > > wrote:
>> > > > >
>> > > > > > Dave,
>> > > > > >   Should I try to get the Docker plugin working again?
>> > > > > >
>> > > > >
>> > > > > That would be great. I think I may have went down the wrong path
>> > > building
>> > > > > an image at package time, as there doesn't seem to be an easy way to
>> > > > > publish it as an Apache labelled org on Dockerhub unless it builds 
>> > > > > from
>> > > > > source.
>> > > > >
>> > > > > I have some time over the weekend, so could update to where I got to
>> > > and
>> > > > > see what you think.
>> > > > >
>> > > > > Cheers,
>> > > > > Dave
>> > >


Re: 1.20?

2018-12-13 Thread Tim Allison
Let me actually take a look before answering. Sorry!

On Thu, Dec 13, 2018 at 5:30 PM Tim Allison  wrote:

>  Thank you for reading the reports!!!
>
> The files are very likely broken.  I can take a look.  The change was
> probably because of an "upgrade" to junrar.  Should I revert to the
> version we used in 1.19.1?
> On Thu, Dec 13, 2018 at 1:34 PM Luís Filipe Nassif 
> wrote:
> >
> > Hi Tim,
> >
> > Reading your great reports, I also saw some new exceptions with RAR files
> > in likely broken folder, but seems tika was able to extract some text
> from
> > them before. Do you know if those files are really broken and why tika
> > extracted text from them before?
> >
> > Thank you,
> > Luis
> >
> > Em qui, 13 de dez de 2018 às 13:02, Tim Allison 
> > escreveu:
> >
> > > Reports are here:
> > >
> > > http://162.242.228.174/reports/tika_1_20-pre-rc1.zip
> > >
> > > I'm going to revert the mp4 parser, and commit the few dependency
> > > upgrades I ran.
> > >
> > > The _major_ difference in content for ppt is explained by the
> > > duplication of header/footer info.  To confirm this, note that the
> > > values for "num_unique_tokens_a" and "num_unique_tokens_b" are
> > > identical for nearly all ppt->ppt, but there are far more tokens in
> > > "num_tokens_a" vs "num_tokens_b".
> > >
> > > I also see that we're losing content in x-java and x-groovy, etc., but
> > > that's because we're now suppressing the style markup that our parser
> > > was (incorrectly, IMHO, inserting) -- check the values in
> > > "top_10_unique_token_diffs_a", e.g.: rgb: 15 | color: 14 | font: 9 |
> > > 0,0,0: 4 | background: 4 | 147,147,147: 3 | 247,247,247: 3 | bold: 3 |
> > > weight: 3 | family: 2
> > >
> > > In short, I think we're good to go.  Will roll rc1 later today or
> > > (more likely) tomorrow unless there are objections.
> > > On Mon, Dec 10, 2018 at 9:37 PM Tim Allison wrote:
> > > >
> > > > Any blockers on 1.20?  I'm going to kick off the regression tests shortly.
> > > > On Fri, Nov 30, 2018 at 7:39 PM  wrote:
> > > > >
> > > > > Hi,
> > > > > On Wed, 21 Nov 2018 at 13:00, Tim Allison wrote:
> > > > >
> > > > > > Dave,
> > > > > >   Should I try to get the Docker plugin working again?
> > > > > >
> > > > >
> > > > > That would be great. I think I may have went down the wrong path building
> > > > > an image at package time, as there doesn't seem to be an easy way to
> > > > > publish it as an Apache labelled org on Dockerhub unless it builds from
> > > > > source.
> > > > >
> > > > > I have some time over the weekend, so could update to where I got to and
> > > > > see what you think.
> > > > >
> > > > > Cheers,
> > > > > Dave
> > >
>


Re: 1.20?

2018-12-13 Thread Tim Allison
 Thank you for reading the reports!!!

The files are very likely broken.  I can take a look.  The change was
probably because of an "upgrade" to junrar.  Should I revert to the
version we used in 1.19.1?
On Thu, Dec 13, 2018 at 1:34 PM Luís Filipe Nassif  wrote:
>
> Hi Tim,
>
> Reading your great reports, I also saw some new exceptions with RAR files
> in likely broken folder, but seems tika was able to extract some text from
> them before. Do you know if those files are really broken and why tika
> extracted text from them before?
>
> Thank you,
> Luis
>
> On Thu, 13 Dec 2018 at 13:02, Tim Allison wrote:
>
> > Reports are here:
> >
> > http://162.242.228.174/reports/tika_1_20-pre-rc1.zip
> >
> > I'm going to revert the mp4 parser, and commit the few dependency
> > upgrades I ran.
> >
> > The _major_ difference in content for ppt is explained by the
> > duplication of header/footer info.  To confirm this, note that the
> > values for "num_unique_tokens_a" and "num_unique_tokens_b" are
> > identical for nearly all ppt->ppt, but there are far more tokens in
> > "num_tokens_a" vs "num_tokens_b".
> >
> > I also see that we're losing content in x-java and x-groovy, etc., but
> > that's because we're now suppressing the style markup that our parser
> > was (incorrectly, IMHO, inserting) -- check the values in
> > "top_10_unique_token_diffs_a", e.g.: rgb: 15 | color: 14 | font: 9 |
> > 0,0,0: 4 | background: 4 | 147,147,147: 3 | 247,247,247: 3 | bold: 3 |
> > weight: 3 | family: 2
> >
> > In short, I think we're good to go.  Will roll rc1 later today or
> > (more likely) tomorrow unless there are objections.
> > On Mon, Dec 10, 2018 at 9:37 PM Tim Allison  wrote:
> > >
> > > Any blockers on 1.20?  I'm going to kick off the regression tests shortly.
> > > On Fri, Nov 30, 2018 at 7:39 PM  wrote:
> > > >
> > > > Hi,
> > > > On Wed, 21 Nov 2018 at 13:00, Tim Allison  wrote:
> > > >
> > > > > Dave,
> > > > >   Should I try to get the Docker plugin working again?
> > > > >
> > > >
> > > > That would be great. I think I may have went down the wrong path building
> > > > an image at package time, as there doesn't seem to be an easy way to
> > > > publish it as an Apache labelled org on Dockerhub unless it builds from
> > > > source.
> > > >
> > > > I have some time over the weekend, so could update to where I got to and
> > > > see what you think.
> > > >
> > > > Cheers,
> > > > Dave
> >


Re: 1.20?

2018-12-13 Thread Luís Filipe Nassif
Hi Tim,

Reading your great reports, I also saw some new exceptions with RAR files
in likely broken folder, but seems tika was able to extract some text from
them before. Do you know if those files are really broken and why tika
extracted text from them before?

Thank you,
Luis

On Thu, 13 Dec 2018 at 13:02, Tim Allison wrote:

> Reports are here:
>
> http://162.242.228.174/reports/tika_1_20-pre-rc1.zip
>
> I'm going to revert the mp4 parser, and commit the few dependency
> upgrades I ran.
>
> The _major_ difference in content for ppt is explained by the
> duplication of header/footer info.  To confirm this, note that the
> values for "num_unique_tokens_a" and "num_unique_tokens_b" are
> identical for nearly all ppt->ppt, but there are far more tokens in
> "num_tokens_a" vs "num_tokens_b".
>
> I also see that we're losing content in x-java and x-groovy, etc., but
> that's because we're now suppressing the style markup that our parser
> was (incorrectly, IMHO, inserting) -- check the values in
> "top_10_unique_token_diffs_a", e.g.: rgb: 15 | color: 14 | font: 9 |
> 0,0,0: 4 | background: 4 | 147,147,147: 3 | 247,247,247: 3 | bold: 3 |
> weight: 3 | family: 2
>
> In short, I think we're good to go.  Will roll rc1 later today or
> (more likely) tomorrow unless there are objections.
> On Mon, Dec 10, 2018 at 9:37 PM Tim Allison  wrote:
> >
> > Any blockers on 1.20?  I'm going to kick off the regression tests shortly.
> > On Fri, Nov 30, 2018 at 7:39 PM  wrote:
> > >
> > > Hi,
> > > On Wed, 21 Nov 2018 at 13:00, Tim Allison  wrote:
> > >
> > > > Dave,
> > > >   Should I try to get the Docker plugin working again?
> > > >
> > >
> > > That would be great. I think I may have went down the wrong path building
> > > an image at package time, as there doesn't seem to be an easy way to
> > > publish it as an Apache labelled org on Dockerhub unless it builds from
> > > source.
> > >
> > > I have some time over the weekend, so could update to where I got to and
> > > see what you think.
> > >
> > > Cheers,
> > > Dave
>


Re: 1.20?

2018-12-13 Thread Chris Mattmann
Roll forward! Yay!

From: Tim Allison
Reply-To: "dev@tika.apache.org"
Date: Thursday, December 13, 2018 at 7:02 AM
To: "dev@tika.apache.org"
Subject: Re: 1.20?

Reports are here:

http://162.242.228.174/reports/tika_1_20-pre-rc1.zip

I'm going to revert the mp4 parser, and commit the few dependency
upgrades I ran.

The _major_ difference in content for ppt is explained by the
duplication of header/footer info.  To confirm this, note that the
values for "num_unique_tokens_a" and "num_unique_tokens_b" are
identical for nearly all ppt->ppt, but there are far more tokens in
"num_tokens_a" vs "num_tokens_b".

I also see that we're losing content in x-java and x-groovy, etc., but
that's because we're now suppressing the style markup that our parser
was (incorrectly, IMHO, inserting) -- check the values in
"top_10_unique_token_diffs_a", e.g.: rgb: 15 | color: 14 | font: 9 |
0,0,0: 4 | background: 4 | 147,147,147: 3 | 247,247,247: 3 | bold: 3 |
weight: 3 | family: 2

In short, I think we're good to go.  Will roll rc1 later today or
(more likely) tomorrow unless there are objections.
On Mon, Dec 10, 2018 at 9:37 PM Tim Allison  wrote:

Any blockers on 1.20?  I'm going to kick off the regression tests shortly.
On Fri, Nov 30, 2018 at 7:39 PM  wrote:
>
> Hi,
> On Wed, 21 Nov 2018 at 13:00, Tim Allison  wrote:
>
> > Dave,
> >   Should I try to get the Docker plugin working again?
> >
>
> That would be great. I think I may have went down the wrong path building
> an image at package time, as there doesn't seem to be an easy way to
> publish it as an Apache labelled org on Dockerhub unless it builds from
> source.
>
> I have some time over the weekend, so could update to where I got to and
> see what you think.
>
> Cheers,
> Dave



Re: 1.20?

2018-12-13 Thread Tim Allison
Reports are here:

http://162.242.228.174/reports/tika_1_20-pre-rc1.zip

I'm going to revert the mp4 parser, and commit the few dependency
upgrades I ran.

The _major_ difference in content for ppt is explained by the
duplication of header/footer info.  To confirm this, note that the
values for "num_unique_tokens_a" and "num_unique_tokens_b" are
identical for nearly all ppt->ppt, but there are far more tokens in
"num_tokens_a" vs "num_tokens_b".

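The check described above can be sketched in a few lines of Python. This is only an illustration: the four column names come from the report, but the row layout (a list of dicts), the file names, the numbers, and the helper name `looks_like_duplication` are assumptions made for the example and are not part of the reports.

```python
def looks_like_duplication(row):
    """Return True when both extracts have the same number of unique tokens
    but extract A has more total tokens than extract B."""
    same_unique_count = row["num_unique_tokens_a"] == row["num_unique_tokens_b"]
    more_tokens_in_a = row["num_tokens_a"] > row["num_tokens_b"]
    return same_unique_count and more_tokens_in_a


# Hypothetical rows; real values would be read from the comparison report.
rows = [
    {"file": "deck1.ppt", "num_unique_tokens_a": 120, "num_unique_tokens_b": 120,
     "num_tokens_a": 900, "num_tokens_b": 600},
    {"file": "deck2.ppt", "num_unique_tokens_a": 80, "num_unique_tokens_b": 75,
     "num_tokens_a": 300, "num_tokens_b": 280},
]

# Files whose extra tokens are repeats of tokens that were already present.
suspects = [r["file"] for r in rows if looks_like_duplication(r)]
print(suspects)
```

Equal unique-token counts with a higher total suggest that the additional tokens are repeats rather than new content, which is consistent with duplicated header/footer text.
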
I also see that we're losing content in x-java and x-groovy, etc., but
that's because we're now suppressing the style markup that our parser
was (incorrectly, IMHO) inserting -- check the values in
"top_10_unique_token_diffs_a", e.g.: rgb: 15 | color: 14 | font: 9 |
0,0,0: 4 | background: 4 | 147,147,147: 3 | 247,247,247: 3 | bold: 3 |
weight: 3 | family: 2
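
For anyone who wants to tally such a cell rather than eyeball it, here is a minimal Python sketch. It assumes the cell is a `token: count` list separated by `|`, as in the example above; the helper name `parse_token_diffs` and the `style_words` set are hypothetical and chosen only for illustration.

```python
def parse_token_diffs(cell):
    """Parse 'token: count | token: count' into a dict of token -> int."""
    counts = {}
    for part in cell.split("|"):
        part = part.strip()
        if not part:
            continue
        # Split on the last colon so the token itself may contain punctuation.
        token, _, count = part.rpartition(":")
        counts[token.strip()] = int(count)
    return counts


cell = ("rgb: 15 | color: 14 | font: 9 | 0,0,0: 4 | background: 4 | "
        "147,147,147: 3 | 247,247,247: 3 | bold: 3 | weight: 3 | family: 2")
diffs = parse_token_diffs(cell)

# Hypothetical list of CSS-like words used to spot style markup.
style_words = {"rgb", "color", "font", "background", "bold", "weight", "family"}
style_hits = sum(n for tok, n in diffs.items() if tok in style_words)
print(style_hits, "of", sum(diffs.values()), "token occurrences look like style markup")
```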

In short, I think we're good to go.  Will roll rc1 later today or
(more likely) tomorrow unless there are objections.
On Mon, Dec 10, 2018 at 9:37 PM Tim Allison  wrote:
>
> Any blockers on 1.20?  I'm going to kick off the regression tests shortly.
> On Fri, Nov 30, 2018 at 7:39 PM  wrote:
> >
> > Hi,
> > On Wed, 21 Nov 2018 at 13:00, Tim Allison  wrote:
> >
> > > Dave,
> > >   Should I try to get the Docker plugin working again?
> > >
> >
> > That would be great. I think I may have went down the wrong path building
> > an image at package time, as there doesn't seem to be an easy way to
> > publish it as an Apache labelled org on Dockerhub unless it builds from
> > source.
> >
> > I have some time over the weekend, so could update to where I got to and
> > see what you think.
> >
> > Cheers,
> > Dave


Re: 1.20?

2018-12-10 Thread Tim Allison
Any blockers on 1.20?  I'm going to kick off the regression tests shortly.
On Fri, Nov 30, 2018 at 7:39 PM  wrote:
>
> Hi,
> On Wed, 21 Nov 2018 at 13:00, Tim Allison  wrote:
>
> > Dave,
> >   Should I try to get the Docker plugin working again?
> >
>
> That would be great. I think I may have went down the wrong path building
> an image at package time, as there doesn't seem to be an easy way to
> publish it as an Apache labelled org on Dockerhub unless it builds from
> source.
>
> I have some time over the weekend, so could update to where I got to and
> see what you think.
>
> Cheers,
> Dave


Re: 1.20?

2018-11-30 Thread loompa
Hi,
On Wed, 21 Nov 2018 at 13:00, Tim Allison  wrote:

> Dave,
>   Should I try to get the Docker plugin working again?
>

That would be great. I think I may have gone down the wrong path building
an image at package time, as there doesn't seem to be an easy way to
publish it as an Apache labelled org on Dockerhub unless it builds from
source.

I have some time over the weekend, so could update to where I got to and
see what you think.

Cheers,
Dave


Re: 1.20?

2018-11-28 Thread Lewis John McGibbney
+1. It would be nice to get the recent ENVI work released as well, folks.

On 2018/11/20 23:04:29, Tim Allison  wrote: 
> All,
>    POI 4.0.1 will be out shortly with some important bug fixes.  What would
> you all think of targeting 1st/2nd week of December for 1.20?
> 
>  Cheers,
>  Tim
> 


Re: 1.20?

2018-11-21 Thread Tim Allison
Dave,
  Should I try to get the Docker plugin working again?

On Tue, Nov 20, 2018 at 6:21 PM Chris Mattmann  wrote:

> Love it and I can align tika-python with that too ☺
>
> From: Tim Allison
> Reply-To: "dev@tika.apache.org"
> Date: Tuesday, November 20, 2018 at 3:04 PM
> To: "dev@tika.apache.org"
> Subject: 1.20?
>
> All,
>    POI 4.0.1 will be out shortly with some important bug fixes.  What would
> you all think of targeting 1st/2nd week of December for 1.20?
>
>  Cheers,
>  Tim


Re: 1.20?

2018-11-20 Thread Chris Mattmann
Love it and I can align tika-python with that too ☺

From: Tim Allison
Reply-To: "dev@tika.apache.org"
Date: Tuesday, November 20, 2018 at 3:04 PM
To: "dev@tika.apache.org"
Subject: 1.20?

All,
   POI 4.0.1 will be out shortly with some important bug fixes.  What would
you all think of targeting 1st/2nd week of December for 1.20?

 Cheers,
 Tim



1.20?

2018-11-20 Thread Tim Allison
All,
   POI 4.0.1 will be out shortly with some important bug fixes.  What would
you all think of targeting 1st/2nd week of December for 1.20?

 Cheers,
 Tim