[jira] [Commented] (TIKA-3544) Extraction of long sequences of digits from Excel spreadsheets using Tika 1.20 doesn’t yield the expected results
[ https://issues.apache.org/jira/browse/TIKA-3544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17413470#comment-17413470 ]

Tilman Hausherr commented on TIKA-3544:

No. Use strings, that is the issue.

> Extraction of long sequences of digits from Excel spreadsheets using Tika 1.20 doesn’t yield the expected results
>
> Key: TIKA-3544
> URL: https://issues.apache.org/jira/browse/TIKA-3544
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.20
> Reporter: Jitin Jindal
> Priority: Major
> Attachments: Credit Card Numbers.xlsx
>
> If an Excel spreadsheet contains a long sequence of digits, such as a credit card number, Tika 1.13 will emit the said sequence in scientific notation. For example, the credit card number “6011799905775830” is extracted from the attached spreadsheet as 6.480195344642784E15, which clearly is not the desired output.
>
> I think the impact of this issue is significant. There’s plenty of information that can no longer be reliably extracted from spreadsheets. Think credit card numbers, telephone numbers and product identifiers to name a few.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (TIKA-3544) Extraction of long sequences of digits from Excel spreadsheets using Tika 1.20 doesn’t yield the expected results
[ https://issues.apache.org/jira/browse/TIKA-3544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr closed TIKA-3544.

Resolution: Won't Fix
[jira] [Commented] (TIKA-3544) Extraction of long sequences of digits from Excel spreadsheets using Tika 1.20 doesn’t yield the expected results
[ https://issues.apache.org/jira/browse/TIKA-3544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17413469#comment-17413469 ]

Jitin Jindal commented on TIKA-3544:

So we aren’t fixing this issue?
[jira] [Commented] (TIKA-3544) Extraction of long sequences of digits from Excel spreadsheets using Tika 1.20 doesn’t yield the expected results
[ https://issues.apache.org/jira/browse/TIKA-3544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17412084#comment-17412084 ]

Dave Fisher commented on TIKA-3544:

The OP's source [https://getcreditcardnumbers.com/] produces invalid numbers.

In JavaScript, and in JSON as most parsers read it, numbers are double-precision floating point. See [https://www.w3schools.com/js/js_numbers.asp]
[jira] [Commented] (TIKA-3544) Extraction of long sequences of digits from Excel spreadsheets using Tika 1.20 doesn’t yield the expected results
[ https://issues.apache.org/jira/browse/TIKA-3544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17412072#comment-17412072 ]

Tim Allison commented on TIKA-3544:

> Use strings.

+1
[jira] [Commented] (TIKA-3544) Extraction of long sequences of digits from Excel spreadsheets using Tika 1.20 doesn’t yield the expected results
[ https://issues.apache.org/jira/browse/TIKA-3544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17412035#comment-17412035 ]

Dave Fisher commented on TIKA-3544:

See [https://en.wikipedia.org/wiki/Double-precision_floating-point_format]

A double can keep only 15 to 17 significant decimal digits of precision. I think you have to leave things at 15 digits or do a more precise analysis, which would be slower. There is a reason floating point has an error term called epsilon.

Credit card numbers are strings of numeric characters. Use strings, just as you have to for US ZIP codes because of the leading '0'.
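The precision limit described in this comment can be checked with plain Java, without any spreadsheet library. The following is a minimal sketch, not Tika or POI code; the class name `DoublePrecisionDemo` and its method names are made up for illustration. It checks whether a `long` survives a round trip through `double`, which is guaranteed only up to 2^53, and shows that parsing a ZIP code as a number drops its leading zero.

```java
public class DoublePrecisionDemo {

    // Every whole number from 0 up to 2^53 has an exact double representation.
    static final long MAX_EXACT = 1L << 53; // 9007199254740992

    // Returns true if the value is unchanged after being stored as a double.
    static boolean survivesDouble(long value) {
        return (long) (double) value == value;
    }

    // Parsing a digit string as a number discards any leading zeros.
    static String viaInteger(String digits) {
        return Integer.toString(Integer.parseInt(digits));
    }

    public static void main(String[] args) {
        System.out.println(survivesDouble(6480195344642780L)); // 16 digits, below 2^53
        System.out.println(survivesDouble(MAX_EXACT + 1));      // first whole number that is lost
        System.out.println(survivesDouble(9999999999999999L)); // 16 digits, above 2^53
        System.out.println(viaInteger("02134"));                // leading zero is gone
    }
}
```

Keeping such identifiers as `String` values end to end avoids both problems.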
[jira] [Commented] (TIKA-3544) Extraction of long sequences of digits from Excel spreadsheets using Tika 1.20 doesn’t yield the expected results
[ https://issues.apache.org/jira/browse/TIKA-3544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17412025#comment-17412025 ]

Tim Allison commented on TIKA-3544:

Oh, this is hilarious: if I type '6480195344542781' (16 digits), Excel automatically floors that to '6480195344542780', which means Excel is corrupting 16-digit credit card numbers that do not happen to end in zero! I note that Excel is not rounding; it also floors '6480195344542789' to '6480195344542780'.

So, yes, we could bump it to 16, but that would be wrong 90% of the time... I'm now inclined to propose that we not do anything here.

Note: This is Excel for Mac (16.52); your mileage may vary.
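The flooring described in this comment can be imitated on a digit string. This is only a sketch of the observed behaviour (keep the first 15 digits, zero the rest), not Excel's actual implementation; `ExcelDigitFloor` and `floorTo15Digits` are hypothetical names, and the input is assumed to be a digit-only string without leading zeros.

```java
public class ExcelDigitFloor {

    // Number of leading digits retained, per the observation in the comment.
    static final int KEPT_DIGITS = 15;

    // Keeps the first 15 digits of a digit-only string and replaces the rest with zeros.
    static String floorTo15Digits(String digits) {
        if (digits.length() <= KEPT_DIGITS) {
            return digits;
        }
        StringBuilder sb = new StringBuilder(digits.substring(0, KEPT_DIGITS));
        for (int i = KEPT_DIGITS; i < digits.length(); i++) {
            sb.append('0');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(floorTo15Digits("6480195344542781"));
        System.out.println(floorTo15Digits("6480195344542789"));

        // Count how many of the ten possible final digits come back changed.
        int changed = 0;
        for (char last = '0'; last <= '9'; last++) {
            String original = "648019534454278" + last;
            if (!floorTo15Digits(original).equals(original)) {
                changed++;
            }
        }
        System.out.println(changed + " of 10 final digits are altered");
    }
}
```

Only a 16-digit value that already ends in zero passes through unchanged, which is where the "wrong 90% of the time" estimate comes from.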
[jira] [Comment Edited] (TIKA-3544) Extraction of long sequences of digits from Excel spreadsheets using Tika 1.20 doesn’t yield the expected results
[ https://issues.apache.org/jira/browse/TIKA-3544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17412012#comment-17412012 ]

Tim Allison edited comment on TIKA-3544 at 9/8/21, 3:40 PM:

In TIKA-2025 (which is nearly exactly this issue), we added a custom TikaExcelDataFormatter that allowed us to inject TikaExcelGeneralFormat. This departs from Excel's and POI's default handling and allows up to 15 digits to be extracted.

When I look at the underlying xml of the attached file, 6480195344642780 is, in fact, stored there. If we bump our custom handling to 16 digits, this problem would be solved _for this file_ and for numbers with 16 digits.

As Tilman and Nick note, though, Excel is really bad for numbers that might start with leading zeros, like credit card #s, etc. You have to be really careful to enter them as strings or, better yet, use an actual database.
[jira] [Commented] (TIKA-3544) Extraction of long sequences of digits from Excel spreadsheets using Tika 1.20 doesn’t yield the expected results
[ https://issues.apache.org/jira/browse/TIKA-3544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17412018#comment-17412018 ]

Tim Allison commented on TIKA-3544:

So, should we bump 15->16?
[jira] [Commented] (TIKA-3544) Extraction of long sequences of digits from Excel spreadsheets using Tika 1.20 doesn’t yield the expected results
[ https://issues.apache.org/jira/browse/TIKA-3544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17412016#comment-17412016 ]

Tim Allison commented on TIKA-3544:

Yes, I just tried bumping 15->16, and we get this output:

Credit Card Numbers (Source: http://www.getcreditcardnumbers.com/)
6480195344642780
30295201231669
30082494556063
344850003945824
358338792630
3587385370593640
[jira] [Commented] (TIKA-3544) Extraction of long sequences of digits from Excel spreadsheets using Tika 1.20 doesn’t yield the expected results
[ https://issues.apache.org/jira/browse/TIKA-3544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17412012#comment-17412012 ]

Tim Allison commented on TIKA-3544:

In TIKA-2025 (which is nearly exactly this issue), we added a custom TikaExcelDataFormatter that allowed us to inject TikaExcelGeneralFormat. This departs from Excel's and POI's default handling and allows up to 15 digits to be extracted.

When I look at the underlying xml, 6480195344642780 is, in fact, stored there. If we bump our custom handling to 16 digits, this problem would be solved _for this file_ and for numbers with 16 digits.

As Tilman and Nick note, though, Excel is really bad for numbers that might start with leading zeros, like credit card #s, etc. You have to be really careful to enter them as strings or, better yet, use an actual database.
[jira] [Commented] (TIKA-3544) Extraction of long sequences of digits from Excel spreadsheets using Tika 1.20 doesn’t yield the expected results
[ https://issues.apache.org/jira/browse/TIKA-3544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17411814#comment-17411814 ]

Nick Burch commented on TIKA-3544:

Apache POI provides the DataFormatter class, which attempts to turn the number into a string similar to the one shown in Excel, based on the formatting rules applied to the cell. Tika ought to be using that.

It doesn't help completely if Excel has already thrown away the last few digits, though...
[jira] [Commented] (TIKA-3544) Extraction of long sequences of digits from Excel spreadsheets using Tika 1.20 doesn’t yield the expected results
[ https://issues.apache.org/jira/browse/TIKA-3544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17411806#comment-17411806 ]

Tilman Hausherr commented on TIKA-3544:

Yeah, it's that crazy. I have a spreadsheet from a client with staff id numbers. These are stored as numbers, so I use Apache POI (and so does Tika) and I have to call {{row.getCell(0).getNumericCellValue()}}, which returns a double. Using {{getStringCellValue()}} instead throws an IllegalStateException "Cannot get a STRING value from a NUMERIC cell".
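Once `getNumericCellValue()` has handed back a `double`, the way it is turned into text decides whether scientific notation appears. The sketch below uses only the JDK, so a `double` literal stands in for the cell value; `PlainDigits` and `toPlainDigits` are hypothetical names, and this is not a description of what Tika does internally. `Double.toString` switches to scientific notation for magnitudes of 10^7 and above, while `BigDecimal.toPlainString` writes out the value the `double` actually holds.

```java
import java.math.BigDecimal;

public class PlainDigits {

    // Writes the exact value held by the double without an exponent.
    // Intended for whole-number values such as IDs; fractions would print
    // their full binary expansion.
    static String toPlainDigits(double value) {
        return new BigDecimal(value).toPlainString();
    }

    public static void main(String[] args) {
        // Stand-in for row.getCell(0).getNumericCellValue()
        double cellValue = 6480195344642780d;

        System.out.println(Double.toString(cellValue)); // scientific notation
        System.out.println(toPlainDigits(cellValue));   // plain digits
    }
}
```

This changes only the presentation. Digits that Excel dropped before the file was saved cannot be recovered this way.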
[jira] [Commented] (TIKA-3544) Extraction of long sequences of digits from Excel spreadsheets using Tika 1.20 doesn’t yield the expected results
[ https://issues.apache.org/jira/browse/TIKA-3544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17411774#comment-17411774 ]

Nick Burch commented on TIKA-3544:

You need to be aware that Excel itself only stores numbers-as-numbers with a certain amount of precision (~15 digits). Any very long number always risks losing data and precision if it is stored as a number in Excel. You need to store those as strings (e.g. with a ' prefix) to avoid data loss.

See [https://www.microsoft.com/en-us/microsoft-365/blog/2008/04/10/understanding-floating-point-precision-aka-why-does-excel-give-me-seemingly-wrong-answers/] for more info on this from Microsoft, which you may wish to share with the people generating your spreadsheets so that they are aware of the risk of data loss.
[jira] [Commented] (TIKA-3544) Extraction of long sequences of digits from Excel spreadsheets using Tika 1.20 doesn’t yield the expected results
[ https://issues.apache.org/jira/browse/TIKA-3544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17411710#comment-17411710 ]

Tilman Hausherr commented on TIKA-3544:

It seems to depend on the value:
{noformat}
Payments - Payment Details Payment Details
Credit Card Numbers (Source: http://www.getcreditcardnumbers.com/)
6,48019534464278E+15
30295201231669
30082494556063
344850003945824
3,5833879263E+15
3,58738537059364E+15
&"Helvetica,Regular"&12&K00&P
http://www.getcreditcardnumbers.com/
{noformat}
[jira] [Comment Edited] (TIKA-3544) Extraction of long sequences of digits from Excel spreadsheets using Tika 1.20 doesn’t yield the expected results
[ https://issues.apache.org/jira/browse/TIKA-3544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17411710#comment-17411710 ]

Tilman Hausherr edited comment on TIKA-3544 at 9/8/21, 6:15 AM:

It seems to depend on the value (this output done with 2.1.1):
{noformat}
Payments - Payment Details Payment Details
Credit Card Numbers (Source: http://www.getcreditcardnumbers.com/)
6,48019534464278E+15
30295201231669
30082494556063
344850003945824
3,5833879263E+15
3,58738537059364E+15
&"Helvetica,Regular"&12&K00&P
http://www.getcreditcardnumbers.com/
{noformat}
[jira] [Updated] (TIKA-3544) Extraction of long sequences of digits from Excel spreadsheets using Tika 1.20 doesn’t yield the expected results
[ https://issues.apache.org/jira/browse/TIKA-3544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jitin Jindal updated TIKA-3544:

Attachment: Credit Card Numbers.xlsx
[jira] [Updated] (TIKA-3544) Extraction of long sequences of digits from Excel spreadsheets using Tika 1.20 doesn’t yield the expected results
[ https://issues.apache.org/jira/browse/TIKA-3544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jitin Jindal updated TIKA-3544:

Attachment: (was: Book1.xlsx)
[jira] [Updated] (TIKA-3544) Extraction of long sequences of digits from Excel spreadsheets using Tika 1.20 doesn’t yield the expected results
[ https://issues.apache.org/jira/browse/TIKA-3544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jitin Jindal updated TIKA-3544:

Attachment: Book1.xlsx
[jira] [Created] (TIKA-3544) Extraction of long sequences of digits from Excel spreadsheets using Tika 1.20 doesn’t yield the expected results
Jitin Jindal created TIKA-3544:

Summary: Extraction of long sequences of digits from Excel spreadsheets using Tika 1.20 doesn’t yield the expected results
Key: TIKA-3544
URL: https://issues.apache.org/jira/browse/TIKA-3544
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 1.20
Reporter: Jitin Jindal

If an Excel spreadsheet contains a long sequence of digits, such as a credit card number, Tika 1.13 will emit the said sequence in scientific notation. For example, the credit card number “6011799905775830” is extracted from the attached spreadsheet as 6.480195344642784E15, which clearly is not the desired output.

I think the impact of this issue is significant. There’s plenty of information that can no longer be reliably extracted from spreadsheets. Think credit card numbers, telephone numbers and product identifiers to name a few.
[jira] [Resolved] (TIKA-2877) Tika 1.20 suffer from 3 separate CVE vulnerabilities
[ https://issues.apache.org/jira/browse/TIKA-2877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-2877. --- Resolution: Fixed Assignee: Tim Allison Fix Version/s: 1.21 Will commit updates to site shortly and announce release of 1.21. > Tika 1.20 suffer from 3 separate CVE vulnerabilities > > > Key: TIKA-2877 > URL: https://issues.apache.org/jira/browse/TIKA-2877 > Project: Tika > Issue Type: Bug > Components: app >Affects Versions: 1.20 > Environment: These are generic issues. >Reporter: Pat cashman >Assignee: Tim Allison >Priority: Critical > Fix For: 1.21 > > > Tika 1.20 third-party dependencies suffer from 3 separate CVE > vulnerabilities, outlined below. > I am aware that these are already included in a separate ticket which deals > with the generic problem of outdated 3rd party libraries. > [https://issues.apache.org/jira/projects/TIKA/issues/TIKA-2854] > At the very least you should update your security page with the details and > potentially release 1.21 to correct these issues. > [https://tika.apache.org/security.html] > > *a) GUAVA v_17 -> CVE-2018-10237* > Unbounded memory allocation in Google Guava 11.0 through 24.x before 24.1.1 > allows remote attackers to conduct denial of service attacks against servers > [https://nvd.nist.gov/vuln/detail//CVE-2018-10237] > > *b) jackson-databind v_2.9.7 -> CVE-2018-19362* > FasterXML jackson-databind 2.x before 2.9.8 might allow attackers to have > unspecified impact by leveraging failure to block the jboss-common-core class > from polymorphic deserialization. 
> [https://nvd.nist.gov/vuln/detail/CVE-2018-19362] > > *c) sqlite-jdbc v_3.25.2 ->CVE-2018-20346* > SQLite before 3.25.3, when the FTS3 extension is enabled, encounters an > integer overflow (and resultant buffer overflow) for FTS3 queries that occur > after crafted changes to FTS3 shadow tables, allowing remote attackers to > execute arbitrary code by leveraging the ability to run arbitrary SQL > statements (such as in certain WebSQL use cases), aka Magellan. > [https://nvd.nist.gov/vuln/detail/CVE-2018-20346] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TIKA-2877) Tika 1.20 suffer from 3 separate CVE vulnerabilities
[ https://issues.apache.org/jira/browse/TIKA-2877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16841492#comment-16841492 ] Tim Allison commented on TIKA-2877: --- Voting is underway for 1.21: https://lists.apache.org/thread.html/2c027535156cc6862149490b289552d72ba5a9bff985fb7cce794e21@%3Cdev.tika.apache.org%3E I can add a new table for dependency vulnerabilities on our security page. Thank you. > Tika 1.20 suffer from 3 separate CVE vulnerabilities > > > Key: TIKA-2877 > URL: https://issues.apache.org/jira/browse/TIKA-2877 > Project: Tika > Issue Type: Bug > Components: app >Affects Versions: 1.20 > Environment: These are generic issues. >Reporter: Pat cashman >Priority: Critical > > Tika 1.20 third-party dependencies suffer from 3 separate CVE > vulnerabilities, outlined below. > I am aware that these are already included in a separate ticket which deals > with the generic problem of outdated 3rd party libraries. > [https://issues.apache.org/jira/projects/TIKA/issues/TIKA-2854] > At the very least you should update your security page with the details and > potentially release 1.21 to correct these issues. > [https://tika.apache.org/security.html] > > *a) GUAVA v_17 -> CVE-2018-10237* > Unbounded memory allocation in Google Guava 11.0 through 24.x before 24.1.1 > allows remote attackers to conduct denial of service attacks against servers > [https://nvd.nist.gov/vuln/detail//CVE-2018-10237] > > *b) jackson-databind v_2.9.7 -> CVE-2018-19362* > FasterXML jackson-databind 2.x before 2.9.8 might allow attackers to have > unspecified impact by leveraging failure to block the jboss-common-core class > from polymorphic deserialization. 
> [https://nvd.nist.gov/vuln/detail/CVE-2018-19362] > > *c) sqlite-jdbc v_3.25.2 ->CVE-2018-20346* > SQLite before 3.25.3, when the FTS3 extension is enabled, encounters an > integer overflow (and resultant buffer overflow) for FTS3 queries that occur > after crafted changes to FTS3 shadow tables, allowing remote attackers to > execute arbitrary code by leveraging the ability to run arbitrary SQL > statements (such as in certain WebSQL use cases), aka Magellan. > [https://nvd.nist.gov/vuln/detail/CVE-2018-20346] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (TIKA-2877) Tika 1.20 suffer from 3 separate CVE vulnerabilities
Pat cashman created TIKA-2877: - Summary: Tika 1.20 suffer from 3 separate CVE vulnerabilities Key: TIKA-2877 URL: https://issues.apache.org/jira/browse/TIKA-2877 Project: Tika Issue Type: Bug Components: app Affects Versions: 1.20 Environment: These are generic issues. Reporter: Pat cashman Tika 1.20 third-party dependencies suffer from 3 separate CVE vulnerabilities, outlined below. I am aware that these are already included in a separate ticket which deals with the generic problem of outdated 3rd party libraries. [https://issues.apache.org/jira/projects/TIKA/issues/TIKA-2854] At the very least you should update your security page with the details and potentially release 1.21 to correct these issues. [https://tika.apache.org/security.html] *a) GUAVA v_17 -> CVE-2018-10237* Unbounded memory allocation in Google Guava 11.0 through 24.x before 24.1.1 allows remote attackers to conduct denial of service attacks against servers [https://nvd.nist.gov/vuln/detail//CVE-2018-10237] *b) jackson-databind v_2.9.7 -> CVE-2018-19362* FasterXML jackson-databind 2.x before 2.9.8 might allow attackers to have unspecified impact by leveraging failure to block the jboss-common-core class from polymorphic deserialization. [https://nvd.nist.gov/vuln/detail/CVE-2018-19362] *c) sqlite-jdbc v_3.25.2 -> CVE-2018-20346* SQLite before 3.25.3, when the FTS3 extension is enabled, encounters an integer overflow (and resultant buffer overflow) for FTS3 queries that occur after crafted changes to FTS3 shadow tables, allowing remote attackers to execute arbitrary code by leveraging the ability to run arbitrary SQL statements (such as in certain WebSQL use cases), aka Magellan. [https://nvd.nist.gov/vuln/detail/CVE-2018-20346] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TIKA-2869) Can't parse pdf in version 1.20 - Pkcs7Parser (DEF length 465542 object truncated by 465479)
[ https://issues.apache.org/jira/browse/TIKA-2869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16838843#comment-16838843 ] Tim Allison commented on TIKA-2869: --- I doubly confirmed that this file now parses with 1.21-rc1: https://lists.apache.org/thread.html/36529c7df113e81ace51301175528120884af73b78edd40764a88cf8@%3Cdev.tika.apache.org%3E > Can't parse pdf in version 1.20 - Pkcs7Parser (DEF length 465542 object > truncated by 465479) > > > Key: TIKA-2869 > URL: https://issues.apache.org/jira/browse/TIKA-2869 > Project: Tika > Issue Type: Bug > Components: app, cli, parser >Affects Versions: 1.20 > Environment: Windows 10 (1809 - 17763.437) > Java(TM) SE Runtime Environment (build 1.8.0_121-b13) > Java HotSpot(TM) Client VM (build 25.121-b13, mixed mode) >Reporter: Edans Sandes >Assignee: Tim Allison >Priority: Major > Fix For: 1.21 > > Attachments: 0001.127_342_5_7955.pdf > > > I could convert the attached pdf using tika-app-1.19.1.jar, but now, in > version tika-app-1.20.jar, it stopped working. > {{java -jar {color:#ff}tika-app-1.20.jar{color} 0001.127_342_5_7955.pdf}} > mai 10, 2019 11:36:23 AM org.apache.tika.config.InitializableProblemHandler$3 > handleInitializableProblem > ADVERTÊNCIA: J2KImageReader not loaded. JPEG2000 files will not be processed. > See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io > for optional dependencies. > mai 10, 2019 11:36:23 AM org.apache.tika.config.InitializableProblemHandler$3 > handleInitializableProblem > ADVERTÊNCIA: org.xerial's sqlite-jdbc is not loaded. > Please provide the jar on your classpath to parse sqlite files. > See tika-parsers/pom.xml for the correct version. 
> Exception in thread "main" org.apache.tika.exception.TikaException: TIKA-198: > Illegal IOException from org.apache.tika.parser.crypto.Pkcs7Parser@1c43f4e > at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:286) > at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) > at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143) > at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:209) > at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:496) > at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:149) > Caused by: org.apache.tika.io.TaggedIOException: DEF length 465542 object > truncated by 465479 > at > org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133) > at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:63) > at > org.apache.tika.parser.microsoft.POIFSContainerDetector.detect(POIFSContainerDetector.java:437) > at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:84) > at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:116) > at org.apache.tika.parser.crypto.Pkcs7Parser.parse(Pkcs7Parser.java:86) > at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) > ... 
5 more > Caused by: java.io.EOFException: DEF length 465542 object truncated by 465479 > at org.bouncycastle.asn1.DefiniteLengthInputStream.read(Unknown Source) > at org.bouncycastle.asn1.DefiniteLengthInputStream.read(Unknown Source) > at org.bouncycastle.asn1.DefiniteLengthInputStream.read(Unknown Source) > at org.bouncycastle.asn1.DefiniteLengthInputStream.read(Unknown Source) > at org.bouncycastle.asn1.DefiniteLengthInputStream.read(Unknown Source) > at org.bouncycastle.asn1.DefiniteLengthInputStream.read(Unknown Source) > at java.io.BufferedInputStream.read1(Unknown Source) > at java.io.BufferedInputStream.read(Unknown Source) > at org.bouncycastle.util.io.Streams.readFully(Unknown Source) > at org.bouncycastle.cms.CMSTypedStream$FullReaderStream.read(Unknown Source) > at java.io.BufferedInputStream.fill(Unknown Source) > at java.io.BufferedInputStream.read(Unknown Source) > at java.io.FilterInputStream.read(Unknown Source) > at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:59) > ... 10 more > > > {{java -jar {color:#14892c}tika-app-1.19.1.jar{color} > 0001.127_342_5_7955.pdf}}{{mai 10, 2019 11:26:28 AM > org.apache.tika.config.InitializableProblemHandler$3 > handleInitializableProblem}} > {{ADVERT╩NCIA: J2KImageReader not loaded. JPEG2000 files will not be > processed.}} > {{See [https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io]}} > {{for optional dependencies.}}{{mai 10, 2019 11:26:28 AM > or
[jira] [Commented] (TIKA-2869) Can't parse pdf in version 1.20 - Pkcs7Parser (DEF length 465542 object truncated by 465479)
[ https://issues.apache.org/jira/browse/TIKA-2869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16837434#comment-16837434 ] Tim Allison commented on TIKA-2869: --- The fix made on master wasn't merged into {{branch_1x}}: 10d380ae > Can't parse pdf in version 1.20 - Pkcs7Parser (DEF length 465542 object > truncated by 465479) > > > Key: TIKA-2869 > URL: https://issues.apache.org/jira/browse/TIKA-2869 > Project: Tika > Issue Type: Bug > Components: app, cli, parser >Affects Versions: 1.20 > Environment: Windows 10 (1809 - 17763.437) > Java(TM) SE Runtime Environment (build 1.8.0_121-b13) > Java HotSpot(TM) Client VM (build 25.121-b13, mixed mode) >Reporter: Edans Sandes >Priority: Major > Attachments: 0001.127_342_5_7955.pdf > > > I could convert the attached pdf using tika-app-1.19.1.jar, but now, in > version tika-app-1.20.jar, it stopped working. > {{java -jar {color:#ff}tika-app-1.20.jar{color} 0001.127_342_5_7955.pdf}} > mai 10, 2019 11:36:23 AM org.apache.tika.config.InitializableProblemHandler$3 > handleInitializableProblem > ADVERTÊNCIA: J2KImageReader not loaded. JPEG2000 files will not be processed. > See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io > for optional dependencies. > mai 10, 2019 11:36:23 AM org.apache.tika.config.InitializableProblemHandler$3 > handleInitializableProblem > ADVERTÊNCIA: org.xerial's sqlite-jdbc is not loaded. > Please provide the jar on your classpath to parse sqlite files. > See tika-parsers/pom.xml for the correct version. 
> Exception in thread "main" org.apache.tika.exception.TikaException: TIKA-198: > Illegal IOException from org.apache.tika.parser.crypto.Pkcs7Parser@1c43f4e > at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:286) > at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) > at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143) > at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:209) > at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:496) > at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:149) > Caused by: org.apache.tika.io.TaggedIOException: DEF length 465542 object > truncated by 465479 > at > org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133) > at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:63) > at > org.apache.tika.parser.microsoft.POIFSContainerDetector.detect(POIFSContainerDetector.java:437) > at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:84) > at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:116) > at org.apache.tika.parser.crypto.Pkcs7Parser.parse(Pkcs7Parser.java:86) > at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) > ... 
5 more > Caused by: java.io.EOFException: DEF length 465542 object truncated by 465479 > at org.bouncycastle.asn1.DefiniteLengthInputStream.read(Unknown Source) > at org.bouncycastle.asn1.DefiniteLengthInputStream.read(Unknown Source) > at org.bouncycastle.asn1.DefiniteLengthInputStream.read(Unknown Source) > at org.bouncycastle.asn1.DefiniteLengthInputStream.read(Unknown Source) > at org.bouncycastle.asn1.DefiniteLengthInputStream.read(Unknown Source) > at org.bouncycastle.asn1.DefiniteLengthInputStream.read(Unknown Source) > at java.io.BufferedInputStream.read1(Unknown Source) > at java.io.BufferedInputStream.read(Unknown Source) > at org.bouncycastle.util.io.Streams.readFully(Unknown Source) > at org.bouncycastle.cms.CMSTypedStream$FullReaderStream.read(Unknown Source) > at java.io.BufferedInputStream.fill(Unknown Source) > at java.io.BufferedInputStream.read(Unknown Source) > at java.io.FilterInputStream.read(Unknown Source) > at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:59) > ... 10 more > > > {{java -jar {color:#14892c}tika-app-1.19.1.jar{color} > 0001.127_342_5_7955.pdf}}{{mai 10, 2019 11:26:28 AM > org.apache.tika.config.InitializableProblemHandler$3 > handleInitializableProblem}} > {{ADVERT╩NCIA: J2KImageReader not loaded. JPEG2000 files will not be > processed.}} > {{See [https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io]}} > {{for optional dependencies.}}{{mai 10, 2019 11:26:28 AM > org.apache.tika.config.InitializableProblemHandler$3 > handleInitializableProblem}} > {{ADVERT╩NCIA: org.xerial's sqlite-jdbc is not loaded.}} > {{Please provide the jar on your classpath to parse sqlite files.}} > {{See tika-parsers/pom.xml for the correct version.}}{{ encoding="UTF-8"?>http://www.w3.org/1999/xhtml";>}} > {{}} > {{}}{{...CORRECT XML > OUTPUT...}} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (TIKA-2869) Can't parse pdf in version 1.20 - Pkcs7Parser (DEF length 465542 object truncated by 465479)
[ https://issues.apache.org/jira/browse/TIKA-2869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-2869. --- Resolution: Fixed Assignee: Tim Allison Fix Version/s: 1.21 > Can't parse pdf in version 1.20 - Pkcs7Parser (DEF length 465542 object > truncated by 465479) > > > Key: TIKA-2869 > URL: https://issues.apache.org/jira/browse/TIKA-2869 > Project: Tika > Issue Type: Bug > Components: app, cli, parser >Affects Versions: 1.20 > Environment: Windows 10 (1809 - 17763.437) > Java(TM) SE Runtime Environment (build 1.8.0_121-b13) > Java HotSpot(TM) Client VM (build 25.121-b13, mixed mode) >Reporter: Edans Sandes >Assignee: Tim Allison >Priority: Major > Fix For: 1.21 > > Attachments: 0001.127_342_5_7955.pdf > > > I could convert the attached pdf using tika-app-1.19.1.jar, but now, in > version tika-app-1.20.jar, it stopped working. > {{java -jar {color:#ff}tika-app-1.20.jar{color} 0001.127_342_5_7955.pdf}} > mai 10, 2019 11:36:23 AM org.apache.tika.config.InitializableProblemHandler$3 > handleInitializableProblem > ADVERTÊNCIA: J2KImageReader not loaded. JPEG2000 files will not be processed. > See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io > for optional dependencies. > mai 10, 2019 11:36:23 AM org.apache.tika.config.InitializableProblemHandler$3 > handleInitializableProblem > ADVERTÊNCIA: org.xerial's sqlite-jdbc is not loaded. > Please provide the jar on your classpath to parse sqlite files. > See tika-parsers/pom.xml for the correct version. 
> Exception in thread "main" org.apache.tika.exception.TikaException: TIKA-198: > Illegal IOException from org.apache.tika.parser.crypto.Pkcs7Parser@1c43f4e > at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:286) > at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) > at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143) > at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:209) > at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:496) > at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:149) > Caused by: org.apache.tika.io.TaggedIOException: DEF length 465542 object > truncated by 465479 > at > org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133) > at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:63) > at > org.apache.tika.parser.microsoft.POIFSContainerDetector.detect(POIFSContainerDetector.java:437) > at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:84) > at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:116) > at org.apache.tika.parser.crypto.Pkcs7Parser.parse(Pkcs7Parser.java:86) > at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) > ... 
5 more > Caused by: java.io.EOFException: DEF length 465542 object truncated by 465479 > at org.bouncycastle.asn1.DefiniteLengthInputStream.read(Unknown Source) > at org.bouncycastle.asn1.DefiniteLengthInputStream.read(Unknown Source) > at org.bouncycastle.asn1.DefiniteLengthInputStream.read(Unknown Source) > at org.bouncycastle.asn1.DefiniteLengthInputStream.read(Unknown Source) > at org.bouncycastle.asn1.DefiniteLengthInputStream.read(Unknown Source) > at org.bouncycastle.asn1.DefiniteLengthInputStream.read(Unknown Source) > at java.io.BufferedInputStream.read1(Unknown Source) > at java.io.BufferedInputStream.read(Unknown Source) > at org.bouncycastle.util.io.Streams.readFully(Unknown Source) > at org.bouncycastle.cms.CMSTypedStream$FullReaderStream.read(Unknown Source) > at java.io.BufferedInputStream.fill(Unknown Source) > at java.io.BufferedInputStream.read(Unknown Source) > at java.io.FilterInputStream.read(Unknown Source) > at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:59) > ... 10 more > > > {{java -jar {color:#14892c}tika-app-1.19.1.jar{color} > 0001.127_342_5_7955.pdf}}{{mai 10, 2019 11:26:28 AM > org.apache.tika.config.InitializableProblemHandler$3 > handleInitializableProblem}} > {{ADVERT╩NCIA: J2KImageReader not loaded. JPEG2000 files will not be > processed.}} > {{See [https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io]}} > {{for optional dependencies.}}{{mai 10, 2019 11:26:28 AM > org.apache.tika.config.InitializableProblemHandler$3 > handleInitializableProblem}} > {{ADVERT╩NCIA: org.xerial's sqlite-jdbc is not loaded.}} > {{Please provide the jar on your classpath to parse sqlite files.}} > {{See tika-parsers/pom.xml for the correct version.}}{{ encoding="UTF-8"?>http://www.w3.org/1999/xhtml";>}} > {{}} > {{}}{{...CORRECT XML > OUTPUT...}} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TIKA-2869) Can't parse pdf in version 1.20 - Pkcs7Parser (DEF length 465542 object truncated by 465479)
[ https://issues.apache.org/jira/browse/TIKA-2869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16837420#comment-16837420 ] Tim Allison commented on TIKA-2869: --- I'm able to reproduce this in our 1.x branch but not in our master branch. I'll take a look. Thank you for opening this issue and sharing a triggering file! > Can't parse pdf in version 1.20 - Pkcs7Parser (DEF length 465542 object > truncated by 465479) > > > Key: TIKA-2869 > URL: https://issues.apache.org/jira/browse/TIKA-2869 > Project: Tika > Issue Type: Bug > Components: app, cli, parser >Affects Versions: 1.20 > Environment: Windows 10 (1809 - 17763.437) > Java(TM) SE Runtime Environment (build 1.8.0_121-b13) > Java HotSpot(TM) Client VM (build 25.121-b13, mixed mode) >Reporter: Edans Sandes >Priority: Major > Attachments: 0001.127_342_5_7955.pdf > > > I could convert the attached pdf using tika-app-1.19.1.jar, but now, in > version tika-app-1.20.jar, it stopped working. > {{java -jar {color:#ff}tika-app-1.20.jar{color} 0001.127_342_5_7955.pdf}} > mai 10, 2019 11:36:23 AM org.apache.tika.config.InitializableProblemHandler$3 > handleInitializableProblem > ADVERTÊNCIA: J2KImageReader not loaded. JPEG2000 files will not be processed. > See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io > for optional dependencies. > mai 10, 2019 11:36:23 AM org.apache.tika.config.InitializableProblemHandler$3 > handleInitializableProblem > ADVERTÊNCIA: org.xerial's sqlite-jdbc is not loaded. > Please provide the jar on your classpath to parse sqlite files. > See tika-parsers/pom.xml for the correct version. 
> Exception in thread "main" org.apache.tika.exception.TikaException: TIKA-198: > Illegal IOException from org.apache.tika.parser.crypto.Pkcs7Parser@1c43f4e > at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:286) > at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) > at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143) > at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:209) > at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:496) > at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:149) > Caused by: org.apache.tika.io.TaggedIOException: DEF length 465542 object > truncated by 465479 > at > org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133) > at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:63) > at > org.apache.tika.parser.microsoft.POIFSContainerDetector.detect(POIFSContainerDetector.java:437) > at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:84) > at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:116) > at org.apache.tika.parser.crypto.Pkcs7Parser.parse(Pkcs7Parser.java:86) > at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) > ... 
5 more > Caused by: java.io.EOFException: DEF length 465542 object truncated by 465479 > at org.bouncycastle.asn1.DefiniteLengthInputStream.read(Unknown Source) > at org.bouncycastle.asn1.DefiniteLengthInputStream.read(Unknown Source) > at org.bouncycastle.asn1.DefiniteLengthInputStream.read(Unknown Source) > at org.bouncycastle.asn1.DefiniteLengthInputStream.read(Unknown Source) > at org.bouncycastle.asn1.DefiniteLengthInputStream.read(Unknown Source) > at org.bouncycastle.asn1.DefiniteLengthInputStream.read(Unknown Source) > at java.io.BufferedInputStream.read1(Unknown Source) > at java.io.BufferedInputStream.read(Unknown Source) > at org.bouncycastle.util.io.Streams.readFully(Unknown Source) > at org.bouncycastle.cms.CMSTypedStream$FullReaderStream.read(Unknown Source) > at java.io.BufferedInputStream.fill(Unknown Source) > at java.io.BufferedInputStream.read(Unknown Source) > at java.io.FilterInputStream.read(Unknown Source) > at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:59) > ... 10 more > > > {{java -jar {color:#14892c}tika-app-1.19.1.jar{color} > 0001.127_342_5_7955.pdf}}{{mai 10, 2019 11:26:28 AM > org.apache.tika.config.InitializableProblemHandler$3 > handleInitializableProblem}} > {{ADVERT╩NCIA: J2KImageReader not loaded. JPEG2000 files will not be > processed.}} > {{See [https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io]}} > {{for optional dependencies.}}{{mai 10, 2019 11:26:28 AM > org.apache.tika.config.InitializableProblemHandler$3 > handleInitializableProblem}} &g
[jira] [Updated] (TIKA-2869) Can't parse pdf in version 1.20 - Pkcs7Parser (DEF length 465542 object truncated by 465479)
[ https://issues.apache.org/jira/browse/TIKA-2869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Edans Sandes updated TIKA-2869: --- Description: I could convert the attached pdf using tika-app-1.19.1.jar, but now, in version tika-app-1.20.jar, it stopped working. {{java -jar {color:#ff}tika-app-1.20.jar{color} 0001.127_342_5_7955.pdf}} {{mai 10, 2019 11:36:23 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem}} {{ADVERT╩NCIA: J2KImageReader not loaded. JPEG2000 files will not be processed.}} {{See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io}} {{for optional dependencies.}}{{mai 10, 2019 11:36:23 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem}} {{ADVERT╩NCIA: org.xerial's sqlite-jdbc is not loaded.}} {{Please provide the jar on your classpath to parse sqlite files.}} {{See tika-parsers/pom.xml for the correct version.}} {{Exception in thread "main" org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.crypto.Pkcs7Parser@1c43f4e}} {{ at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:286)}} {{ at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)}} {{ at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)}} {{ at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:209)}} {{ at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:496)}} {{ at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:149)}} {{Caused by: org.apache.tika.io.TaggedIOException: DEF length 465542 object truncated by 465479}} {{ at org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133)}} {{ at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:63)}} {{ at org.apache.tika.parser.microsoft.POIFSContainerDetector.detect(POIFSContainerDetector.java:437)}} {{ at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:84)}} {{ at 
org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:116)}} {{ at org.apache.tika.parser.crypto.Pkcs7Parser.parse(Pkcs7Parser.java:86)}} {{ at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)}} {{ ... 5 more}} {{Caused by: java.io.EOFException: DEF length 465542 object truncated by 465479}} {{ at org.bouncycastle.asn1.DefiniteLengthInputStream.read(Unknown Source)}} {{ at org.bouncycastle.asn1.DefiniteLengthInputStream.read(Unknown Source)}} {{ at org.bouncycastle.asn1.DefiniteLengthInputStream.read(Unknown Source)}} {{ at org.bouncycastle.asn1.DefiniteLengthInputStream.read(Unknown Source)}} {{ at org.bouncycastle.asn1.DefiniteLengthInputStream.read(Unknown Source)}} {{ at org.bouncycastle.asn1.DefiniteLengthInputStream.read(Unknown Source)}} {{ at java.io.BufferedInputStream.read1(Unknown Source)}} {{ at java.io.BufferedInputStream.read(Unknown Source)}} {{ at org.bouncycastle.util.io.Streams.readFully(Unknown Source)}} {{ at org.bouncycastle.cms.CMSTypedStream$FullReaderStream.read(Unknown Source)}} {{ at java.io.BufferedInputStream.fill(Unknown Source)}} {{ at java.io.BufferedInputStream.read(Unknown Source)}} {{ at java.io.FilterInputStream.read(Unknown Source)}} {{ at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:59)}} {{ ... 10 more}} {{java -jar {color:#14892c}tika-app-1.19.1.jar{color} 0001.127_342_5_7955.pdf}}{{mai 10, 2019 11:26:28 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem}} {{ADVERT╩NCIA: J2KImageReader not loaded. 
JPEG2000 files will not be processed.}} {{See [https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io]}} {{for optional dependencies.}}{{mai 10, 2019 11:26:28 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem}} {{ADVERT╩NCIA: org.xerial's sqlite-jdbc is not loaded.}} {{Please provide the jar on your classpath to parse sqlite files.}} {{See tika-parsers/pom.xml for the correct version.}}{{http://www.w3.org/1999/xhtml";>}} {{}} {{}}{{...CORRECT XML OUTPUT...}} was: I could convert the attached pdf using tika-app-1.19.1.jar, but now, in version tika-app-1.20.jar, it stopped working. {{java -jar {color:#FF}tika-app-1.20.jar{color} 0001.127_342_5_7955.pdf}} {{}}{{mai 10, 2019 11:20:40 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem}} {{ADVERT╩NCIA: J2KImageReader not loaded. JPEG2000 files will not be processed.}} {{See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io}} {{for optional dependencies.}}{{mai 10, 2019 11:20:40 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem}} {{ADVERT╩NCIA: org.xerial's sqlite-jdbc is not loaded.}} {{Please provide the jar on your classpath to parse sqlite files.}} {{See tika-parsers/pom.xml for the correct version.}} {{Exception in thread "main" org.ap
[jira] [Updated] (TIKA-2869) Can't parse pdf in version 1.20 - Pkcs7Parser (DEF length 465542 object truncated by 465479)
[ https://issues.apache.org/jira/browse/TIKA-2869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Edans Sandes updated TIKA-2869:
-------------------------------
Description:

I could convert the attached pdf using tika-app-1.19.1.jar, but now, in version tika-app-1.20.jar, it stopped working.

java -jar tika-app-1.20.jar 0001.127_342_5_7955.pdf

mai 10, 2019 11:36:23 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
ADVERTÊNCIA: J2KImageReader not loaded. JPEG2000 files will not be processed.
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.
mai 10, 2019 11:36:23 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
ADVERTÊNCIA: org.xerial's sqlite-jdbc is not loaded.
Please provide the jar on your classpath to parse sqlite files.
See tika-parsers/pom.xml for the correct version.
Exception in thread "main" org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.crypto.Pkcs7Parser@1c43f4e
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:286)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
    at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:209)
    at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:496)
    at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:149)
Caused by: org.apache.tika.io.TaggedIOException: DEF length 465542 object truncated by 465479
    at org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133)
    at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:63)
    at org.apache.tika.parser.microsoft.POIFSContainerDetector.detect(POIFSContainerDetector.java:437)
    at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:84)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:116)
    at org.apache.tika.parser.crypto.Pkcs7Parser.parse(Pkcs7Parser.java:86)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    ... 5 more
Caused by: java.io.EOFException: DEF length 465542 object truncated by 465479
    at org.bouncycastle.asn1.DefiniteLengthInputStream.read(Unknown Source)
    at org.bouncycastle.asn1.DefiniteLengthInputStream.read(Unknown Source)
    at org.bouncycastle.asn1.DefiniteLengthInputStream.read(Unknown Source)
    at org.bouncycastle.asn1.DefiniteLengthInputStream.read(Unknown Source)
    at org.bouncycastle.asn1.DefiniteLengthInputStream.read(Unknown Source)
    at org.bouncycastle.asn1.DefiniteLengthInputStream.read(Unknown Source)
    at java.io.BufferedInputStream.read1(Unknown Source)
    at java.io.BufferedInputStream.read(Unknown Source)
    at org.bouncycastle.util.io.Streams.readFully(Unknown Source)
    at org.bouncycastle.cms.CMSTypedStream$FullReaderStream.read(Unknown Source)
    at java.io.BufferedInputStream.fill(Unknown Source)
    at java.io.BufferedInputStream.read(Unknown Source)
    at java.io.FilterInputStream.read(Unknown Source)
    at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:59)
    ... 10 more

java -jar tika-app-1.19.1.jar 0001.127_342_5_7955.pdf

mai 10, 2019 11:26:28 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
ADVERTÊNCIA: J2KImageReader not loaded. JPEG2000 files will not be processed.
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.
mai 10, 2019 11:26:28 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
ADVERTÊNCIA: org.xerial's sqlite-jdbc is not loaded.
Please provide the jar on your classpath to parse sqlite files.
See tika-parsers/pom.xml for the correct version.
<html xmlns="http://www.w3.org/1999/xhtml">
...CORRECT XML OUTPUT...
[jira] [Created] (TIKA-2869) Can't parse pdf in version 1.20 - Pkcs7Parser (DEF length 465542 object truncated by 465479)
Edans Sandes created TIKA-2869:
----------------------------------

Summary: Can't parse pdf in version 1.20 - Pkcs7Parser (DEF length 465542 object truncated by 465479)
Key: TIKA-2869
URL: https://issues.apache.org/jira/browse/TIKA-2869
Project: Tika
Issue Type: Bug
Components: app, cli, parser
Affects Versions: 1.20
Environment: Windows 10 (1809 - 17763.437)
Java(TM) SE Runtime Environment (build 1.8.0_121-b13)
Java HotSpot(TM) Client VM (build 25.121-b13, mixed mode)
Reporter: Edans Sandes
Attachments: 0001.127_342_5_7955.pdf

I could convert the attached pdf using tika-app-1.19.1.jar, but now, in version tika-app-1.20.jar, it stopped working.

java -jar tika-app-1.20.jar 0001.127_342_5_7955.pdf

mai 10, 2019 11:20:40 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
ADVERTÊNCIA: J2KImageReader not loaded. JPEG2000 files will not be processed.
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.
mai 10, 2019 11:20:40 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
ADVERTÊNCIA: org.xerial's sqlite-jdbc is not loaded.
Please provide the jar on your classpath to parse sqlite files.
See tika-parsers/pom.xml for the correct version.
Exception in thread "main" org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.crypto.Pkcs7Parser@1c43f4e
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:286)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
    at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:209)
    at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:496)
    at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:149)
Caused by: org.apache.tika.io.TaggedIOException: DEF length 465542 object truncated by 465479
    at org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133)
    at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:63)
    at org.apache.tika.parser.microsoft.POIFSContainerDetector.detect(POIFSContainerDetector.java:437)
    at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:84)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:116)
    at org.apache.tika.parser.crypto.Pkcs7Parser.parse(Pkcs7Parser.java:86)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    ... 5 more
Caused by: java.io.EOFException: DEF length 465542 object truncated by 465479
    at org.bouncycastle.asn1.DefiniteLengthInputStream.read(Unknown Source)
    at org.bouncycastle.asn1.DefiniteLengthInputStream.read(Unknown Source)
    at org.bouncycastle.asn1.DefiniteLengthInputStream.read(Unknown Source)
    at org.bouncycastle.asn1.DefiniteLengthInputStream.read(Unknown Source)
    at org.bouncycastle.asn1.DefiniteLengthInputStream.read(Unknown Source)
    at org.bouncycastle.asn1.DefiniteLengthInputStream.read(Unknown Source)
    at java.io.BufferedInputStream.read1(Unknown Source)
    at java.io.BufferedInputStream.read(Unknown Source)
    at org.bouncycastle.util.io.Streams.readFully(Unknown Source)
    at org.bouncycastle.cms.CMSTypedStream$FullReaderStream.read(Unknown Source)
    at java.io.BufferedInputStream.fill(Unknown Source)
    at java.io.BufferedInputStream.read(Unknown Source)
    at java.io.FilterInputStream.read(Unknown Source)
    at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:59)
    ... 10 more

java -jar tika-app-1.19.1.jar 0001.127_342_5_7955.pdf

mai 10, 2019 11:26:28 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
ADVERTÊNCIA: J2KImageReader not loaded. JPEG2000 files will not be processed.
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.
mai 10, 2019 11:26:28 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
ADVERTÊNCIA: org.xerial's sqlite-jdbc is not loaded.
Please provide the jar on your classpath to parse sqlite files.
See tika-parsers/pom.xml for the correct version.
<html xmlns="http://www.w3.org/1999/xhtml">
...CORRECT XML OUTPUT...

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
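The trace in this report is three exceptions deep: a TikaException wrapping a TaggedIOException wrapping a java.io.EOFException. As a rough illustration only, the sketch below walks a cause chain down to the innermost exception using nothing but the JDK. The exceptions built in main are hand-made stand-ins, not real Tika or BouncyCastle objects, and the class and method names are hypothetical. The arithmetic helper assumes the message reads as "declared length 465542, 465479 bytes missing", which is how the wording appears but is not confirmed anywhere in the report.

```java
import java.io.EOFException;
import java.io.IOException;

public class RootCauseDemo {

    /** Follows getCause() links until the innermost throwable is reached. */
    public static Throwable rootCause(Throwable t) {
        Throwable current = t;
        // Guard against a throwable that reports itself as its own cause.
        while (current.getCause() != null && current.getCause() != current) {
            current = current.getCause();
        }
        return current;
    }

    /**
     * Assumption: the first number in the message is the declared length and
     * the second is the number of bytes still missing when the stream ended.
     */
    public static long bytesAvailable(long declaredLength, long missing) {
        return declaredLength - missing;
    }

    public static void main(String[] args) {
        // Hand-built stand-ins for the three layers shown in the trace.
        Throwable eof = new EOFException("DEF length 465542 object truncated by 465479");
        Throwable tagged = new IOException(eof.getMessage(), eof);
        Throwable outer = new Exception("TIKA-198: Illegal IOException", tagged);

        Throwable root = rootCause(outer);
        System.out.println(root.getClass().getName() + ": " + root.getMessage());
        System.out.println("bytes available: " + bytesAvailable(465542L, 465479L));
    }
}
```

Under that reading of the message, only 63 of the declared 465542 bytes were present when the embedded stream ran out.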
[jira] [Resolved] (TIKA-2855) pdfbox version used by both Apache Tika 1.19.1 and 1.20 is vulnerable
[ https://issues.apache.org/jira/browse/TIKA-2855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison resolved TIKA-2855.
-------------------------------
Resolution: Duplicate

Thank you!

> pdfbox version used by both Apache Tika 1.19.1 and 1.20 is vulnerable
> ----------------------------------------------------------------------
>
> Key: TIKA-2855
> URL: https://issues.apache.org/jira/browse/TIKA-2855
> Project: Tika
> Issue Type: Bug
> Components: core
> Affects Versions: 1.19.1
> Reporter: Abhijit Rajwade
> Priority: Major
>
> As per Sonatype Nexus Auditor, pdfbox versions up to 2.0.14 are vulnerable to
> "CVE-2019-0228: possible XML External Entity (XXE) attack".
> The recommended fix is to upgrade to pdfbox version 2.0.15.
> Refer to the following pdfbox issue, which is fixed in version 2.0.15:
> https://issues.apache.org/jira/browse/PDFBOX-4505
> Can you please upgrade Apache Tika to use pdfbox 2.0.15?
> Following are details from the Sonatype Nexus scan report:
> Issue: CVE-2019-0228
> Severity: Sonatype CVSS 3.0: 7.3
> Weakness: Sonatype CWE: 611
> Source: National Vulnerability Database
> Categories: Data
> Description from CVE: apache pdfbox - XML External Entity (XXE)
> Root Cause: pdfbox-2.0.12.jar : ( , 2.0.15)
> Advisories:
> Project: https://github.com/apache/pdfbox-docs/commit/b7869c3e4c62c5d...
> Project: https://issues.apache.org/jira/browse/PDFBOX-4505
> Third Party: https://bugzilla.redhat.com/show_bug.cgi?id=1699740
> CVSS Details:
> Sonatype CVSS 3.0: 7.3
> CVSS Vector: CVSS:3.0/AV:N/AC:L/PR:N/UI:N/S:U/C:L/I:L/A:L
[jira] [Created] (TIKA-2855) pdfbox version used by both Apache Tika 1.19.1 and 1.20 is vulnerable
Abhijit Rajwade created TIKA-2855:
-------------------------------------

Summary: pdfbox version used by both Apache Tika 1.19.1 and 1.20 is vulnerable
Key: TIKA-2855
URL: https://issues.apache.org/jira/browse/TIKA-2855
Project: Tika
Issue Type: Bug
Components: core
Affects Versions: 1.19.1
Reporter: Abhijit Rajwade

As per Sonatype Nexus Auditor, pdfbox versions up to 2.0.14 are vulnerable to "CVE-2019-0228: possible XML External Entity (XXE) attack".

The recommended fix is to upgrade to pdfbox version 2.0.15. Refer to the following pdfbox issue, which is fixed in version 2.0.15:
https://issues.apache.org/jira/browse/PDFBOX-4505

Can you please upgrade Apache Tika to use pdfbox 2.0.15?

Following are details from the Sonatype Nexus scan report:

Issue: CVE-2019-0228
Severity: Sonatype CVSS 3.0: 7.3
Weakness: Sonatype CWE: 611
Source: National Vulnerability Database
Categories: Data
Description from CVE: apache pdfbox - XML External Entity (XXE)
Root Cause: pdfbox-2.0.12.jar : ( , 2.0.15)
Advisories:
Project: https://github.com/apache/pdfbox-docs/commit/b7869c3e4c62c5d...
Project: https://issues.apache.org/jira/browse/PDFBOX-4505
Third Party: https://bugzilla.redhat.com/show_bug.cgi?id=1699740
CVSS Details:
Sonatype CVSS 3.0: 7.3
CVSS Vector: CVSS:3.0/AV:N/AC:L/PR:N/UI:N/S:U/C:L/I:L/A:L
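For readers unfamiliar with the XXE class of issue named in the report, here is a minimal sketch of the usual defensive pattern with the JDK's built-in XML parser: refuse any document that carries a DOCTYPE, so external entities can never be declared. This is not the pdfbox fix from PDFBOX-4505 and says nothing about how pdfbox or Tika configure their parsers. The class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;

public class XxeSafeParsing {

    /** Builds a DOM parser that rejects DOCTYPE declarations outright. */
    public static DocumentBuilder newHardenedBuilder() throws ParserConfigurationException {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        // Xerces feature, supported by the JDK's default parser implementation.
        factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        factory.setXIncludeAware(false);
        factory.setExpandEntityReferences(false);
        return factory.newDocumentBuilder();
    }

    /** Returns true if the hardened parser refuses the given document. */
    public static boolean rejects(String xml) throws Exception {
        try {
            newHardenedBuilder().parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return false;
        } catch (SAXException e) {
            return true;
        }
    }

    public static void main(String[] args) throws Exception {
        String hostile = "<?xml version=\"1.0\"?>"
                + "<!DOCTYPE a [<!ENTITY xxe SYSTEM \"file:///etc/passwd\">]>"
                + "<a>&xxe;</a>";
        System.out.println("hostile rejected: " + rejects(hostile));
        System.out.println("plain rejected:   " + rejects("<a>ok</a>"));
    }
}
```

Rejecting DOCTYPE entirely is the bluntest option. It suits inputs that never legitimately need a DTD, which is an assumption that has to be checked per format.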
Re: [VOTE] Release Apache Tika 1.20 Candidate #1
Thank you, Oleg and Ken!

On Sat, Dec 22, 2018 at 6:57 AM Oleg Tikhonov wrote:
> *stuff
[ANNOUNCE] Apache Tika 1.20 released
The Apache Tika project is pleased to announce the release of Apache Tika 1.20. The release contents have been pushed out to the main Apache release site and to the Maven Central sync, so the releases should be available as soon as the mirrors get the syncs.

Apache Tika is a toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries.

Apache Tika 1.20 contains a number of improvements and bug fixes. Details can be found in the changes file:
https://www.apache.org/dist/tika/CHANGES-1.20.txt

Apache Tika is available on the download page:
https://tika.apache.org/download.html

Apache Tika is also available in binary form or for use with Maven 2 from the Central Repository:
https://repo1.maven.org/maven2/org/apache/tika/

In the initial 48 hours, the release may not be available on all mirrors. When downloading from a mirror site, please remember to verify the downloads using the signatures found at:
https://www.apache.org/dist/tika/KEYS

For more information on Apache Tika, visit the project home page:
https://tika.apache.org/

-- Tim Allison, on behalf of the Apache Tika community
[RESULT][VOTE] Release Apache Tika 1.20 Candidate #1
The vote has passed:

+1 from
Oleg Tikhonov
Ken Krugler
Tim Allison

no -1

Cheers,
Tim

On Sat, Dec 22, 2018 at 6:57 AM Oleg Tikhonov wrote:
> *stuff
Re: [VOTE] Release Apache Tika 1.20 Candidate #1
*stuff

On Sat, Dec 22, 2018, 11:01 Oleg Tikhonov wrote:
> All basic staff passed.
> +1.
> Oleg
Re: [VOTE] Release Apache Tika 1.20 Candidate #1
All basic staff passed.
+1.
Oleg

On Fri, Dec 21, 2018, 22:02 Ken Krugler wrote:
> Hi Tim,
>
> Thanks for rolling the release.
>
> Built & validated on Mac OS X 10.12
>
> Updated flink-crawler, all tests pass.
>
> So here’s my +1
>
> -- Ken
Re: [VOTE] Release Apache Tika 1.20 Candidate #1
Hi Tim,

Thanks for rolling the release.

Built & validated on Mac OS X 10.12

Updated flink-crawler, all tests pass.

So here’s my +1

-- Ken

> On Dec 17, 2018, at 6:14 PM, Tim Allison wrote:
>
> A candidate for the Tika 1.20 release is available at:
> https://dist.apache.org/repos/dist/dev/tika/
>
> The release candidate is a zip archive of the sources in:
> https://github.com/apache/tika/tree/1.20-rc1/
>
> The SHA-512 checksum of the archive is
> add29bebe0486f01bd57fe5bec5405df1af9f319a87c74295aea2e628c5ebc0d49c09570511033e30ac44c4a6e09ec81cbba9f485b3c21a35b01014a74f1852e.
>
> In addition, a staged maven repository is available here:
> https://repository.apache.org/content/repositories/orgapachetika-1046/org/apache/tika
>
> Please vote on releasing this package as Apache Tika 1.20.
>
> The vote is open for the next 72 hours and passes if a majority of at
> least three +1 Tika PMC votes are cast.
>
> [ ] +1 Release this package as Apache Tika 1.20
> [ ] -1 Do not release this package because...
>
> Here's my +1.
>
> Cheers,
>
> Tim

--
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
Custom big data solutions & training
Flink, Solr, Hadoop, Cascading & Cassandra
Re: 1.20?
Reports on mp4s, junrar, msaccess and a random subset of the regression corpus are available here:

http://162.242.228.174/reports/reports_tika_1_20-rc1_subset.tgz

On Thu, Dec 13, 2018 at 5:34 PM Tim Allison wrote:
> Let me actually take a look before answering. Sorry!
[VOTE] Release Apache Tika 1.20 Candidate #1
A candidate for the Tika 1.20 release is available at:
https://dist.apache.org/repos/dist/dev/tika/

The release candidate is a zip archive of the sources in:
https://github.com/apache/tika/tree/1.20-rc1/

The SHA-512 checksum of the archive is
add29bebe0486f01bd57fe5bec5405df1af9f319a87c74295aea2e628c5ebc0d49c09570511033e30ac44c4a6e09ec81cbba9f485b3c21a35b01014a74f1852e.

In addition, a staged maven repository is available here:
https://repository.apache.org/content/repositories/orgapachetika-1046/org/apache/tika

Please vote on releasing this package as Apache Tika 1.20.

The vote is open for the next 72 hours and passes if a majority of at least three +1 Tika PMC votes are cast.

[ ] +1 Release this package as Apache Tika 1.20
[ ] -1 Do not release this package because...

Here's my +1.

Cheers,

Tim
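Checking a downloaded archive against the SHA-512 value published in the vote mail can be sketched with the JDK alone, as below. The archive's file name is not given in the mail, so the path and the expected digest are taken from the command line. The class and method names are hypothetical, and this is only one way to do the comparison.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Sha512Check {

    /** Streams the input through SHA-512 and returns the digest as lower-case hex. */
    public static String sha512Hex(InputStream in) throws IOException, NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance("SHA-512");
        byte[] buffer = new byte[8192];
        int read;
        while ((read = in.read(buffer)) != -1) {
            digest.update(buffer, 0, read);
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    /** Compares two hex digests, ignoring case and surrounding whitespace. */
    public static boolean matches(String actualHex, String publishedHex) {
        return actualHex.trim().equalsIgnoreCase(publishedHex.trim());
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("usage: java Sha512Check <archive> <published-sha512-hex>");
            return;
        }
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            String actual = sha512Hex(in);
            System.out.println(matches(actual, args[1]) ? "checksum OK" : "checksum MISMATCH");
        }
    }
}
```

A matching checksum only shows the download is intact. It does not replace verifying the release signature.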
Re: 1.20?
Thank you, again, Luís Filipe Nassif! There's no point in having reports unless we pay attention to them :P. I reverted junrar to where it was in 1.19.1. I also reverted jackcess based on the reports.

All,

On the theory that it isn't a great idea to push to production on a Friday, I'm going to let the recent changes rest over the weekend. I'll rerun some tests on a subset of the regression corpus on Monday and then roll rc1. If anyone wants to kick the tires on the recent version changes, including parsers that depend on the upgraded guava, that'd be great!

Onward!

Cheers,
Tim

On Thu, Dec 13, 2018 at 5:34 PM Tim Allison wrote:
> Let me actually take a look before answering. Sorry!
Re: 1.20?
Let me actually take a look before answering. Sorry!

On Thu, Dec 13, 2018 at 5:30 PM Tim Allison wrote:
> Thank you for reading the reports!!!
>
> The files are very likely broken. I can take a look. The change was
> probably because of an "upgrade" to junrar. Should I revert to the
> version we used in 1.19.1?
Re: 1.20?
Thank you for reading the reports!!!

The files are very likely broken. I can take a look. The change was probably because of an "upgrade" to junrar. Should I revert to the version we used in 1.19.1?

On Thu, Dec 13, 2018 at 1:34 PM Luís Filipe Nassif wrote:
> Hi Tim,
>
> Reading your great reports, I also saw some new exceptions with RAR files
> in the likely broken folder, but it seems Tika was able to extract some text
> from them before. Do you know if those files are really broken and why Tika
> extracted text from them before?
>
> Thank you,
> Luis
Re: 1.20?
Hi Tim,

Reading your great reports, I also saw some new exceptions with RAR files in the likely broken folder, but it seems Tika was able to extract some text from them before. Do you know if those files are really broken and why Tika extracted text from them before?

Thank you,
Luis

On Thu, Dec 13, 2018 at 13:02, Tim Allison wrote:
> Reports are here:
>
> http://162.242.228.174/reports/tika_1_20-pre-rc1.zip
>
> I'm going to revert the mp4 parser, and commit the few dependency
> upgrades I ran.
>
> The _major_ difference in content for ppt is explained by the
> duplication of header/footer info. To confirm this, note that the
> values for "num_unique_tokens_a" and "num_unique_tokens_b" are
> identical for nearly all ppt->ppt, but there are far more tokens in
> "num_tokens_a" vs "num_tokens_b".
>
> I also see that we're losing content in x-java and x-groovy, etc., but
> that's because we're now suppressing the style markup that our parser
> was (incorrectly, IMHO) inserting -- check the values in
> "top_10_unique_token_diffs_a", e.g.: rgb: 15 | color: 14 | font: 9 |
> 0,0,0: 4 | background: 4 | 147,147,147: 3 | 247,247,247: 3 | bold: 3 |
> weight: 3 | family: 2
>
> In short, I think we're good to go. Will roll rc1 later today or
> (more likely) tomorrow unless there are objections.
Re: 1.20?
Roll forward! Yay!

From: Tim Allison
Reply-To: "dev@tika.apache.org"
Date: Thursday, December 13, 2018 at 7:02 AM
To: "dev@tika.apache.org"
Subject: Re: 1.20?

Reports are here:

http://162.242.228.174/reports/tika_1_20-pre-rc1.zip

I'm going to revert the mp4 parser, and commit the few dependency upgrades I ran.

The _major_ difference in content for ppt is explained by the duplication of header/footer info. To confirm this, note that the values for "num_unique_tokens_a" and "num_unique_tokens_b" are identical for nearly all ppt->ppt, but there are far more tokens in "num_tokens_a" vs "num_tokens_b".

I also see that we're losing content in x-java and x-groovy, etc., but that's because we're now suppressing the style markup that our parser was (incorrectly, IMHO) inserting -- check the values in "top_10_unique_token_diffs_a", e.g.:

rgb: 15 | color: 14 | font: 9 | 0,0,0: 4 | background: 4 | 147,147,147: 3 | 247,247,247: 3 | bold: 3 | weight: 3 | family: 2

In short, I think we're good to go. Will roll rc1 later today or (more likely) tomorrow unless there are objections.

On Mon, Dec 10, 2018 at 9:37 PM Tim Allison wrote:

Any blockers on 1.20? I'm going to kick off the regression tests shortly.

On Fri, Nov 30, 2018 at 7:39 PM wrote:
> Hi,
>
> On Wed, 21 Nov 2018 at 13:00, Tim Allison wrote:
> > Dave,
> > Should I try to get the Docker plugin working again?
>
> That would be great. I think I may have gone down the wrong path building an image at package time, as there doesn't seem to be an easy way to publish it as an Apache labelled org on Dockerhub unless it builds from source.
>
> I have some time over the weekend, so could update to where I got to and see what you think.
>
> Cheers,
> Dave
Re: 1.20?
Reports are here:

http://162.242.228.174/reports/tika_1_20-pre-rc1.zip

I'm going to revert the mp4 parser, and commit the few dependency upgrades I ran.

The _major_ difference in content for ppt is explained by the duplication of header/footer info. To confirm this, note that the values for "num_unique_tokens_a" and "num_unique_tokens_b" are identical for nearly all ppt->ppt, but there are far more tokens in "num_tokens_a" vs "num_tokens_b".

I also see that we're losing content in x-java and x-groovy, etc., but that's because we're now suppressing the style markup that our parser was (incorrectly, IMHO) inserting -- check the values in "top_10_unique_token_diffs_a", e.g.:

rgb: 15 | color: 14 | font: 9 | 0,0,0: 4 | background: 4 | 147,147,147: 3 | 247,247,247: 3 | bold: 3 | weight: 3 | family: 2

In short, I think we're good to go. Will roll rc1 later today or (more likely) tomorrow unless there are objections.

On Mon, Dec 10, 2018 at 9:37 PM Tim Allison wrote:
> Any blockers on 1.20? I'm going to kick off the regression tests shortly.
>
> On Fri, Nov 30, 2018 at 7:39 PM wrote:
> > Hi,
> >
> > On Wed, 21 Nov 2018 at 13:00, Tim Allison wrote:
> > > Dave,
> > > Should I try to get the Docker plugin working again?
> >
> > That would be great. I think I may have gone down the wrong path building an image at package time, as there doesn't seem to be an easy way to publish it as an Apache labelled org on Dockerhub unless it builds from source.
> >
> > I have some time over the weekend, so could update to where I got to and see what you think.
> >
> > Cheers,
> > Dave
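
For readers without the report zip to hand, the reasoning above can be sketched in a few lines. This is only an illustration, not the regression tooling itself: the tokenizer (lowercase, split on whitespace), the function names token_stats and top_unique_diffs, and the sample strings are all assumptions made up for the example. Only the column names num_tokens_*, num_unique_tokens_* and top_10_unique_token_diffs_a come from the message.

```python
from collections import Counter


def token_stats(text):
    """Return (num_tokens, num_unique_tokens) for one extract.

    Assumption: tokens are lowercased, whitespace-separated words.
    The real report may tokenize differently.
    """
    tokens = text.lower().split()
    return len(tokens), len(set(tokens))


def top_unique_diffs(text_a, text_b, n=10):
    """Tokens that occur in extract A but never in extract B, most frequent first."""
    counts_a = Counter(text_a.lower().split())
    seen_b = set(text_b.lower().split())
    only_a = Counter({tok: c for tok, c in counts_a.items() if tok not in seen_b})
    return only_a.most_common(n)


# Hypothetical ppt extracts: A repeats the footer on every slide, B emits it once.
extract_a = "intro slide acme confidential results slide acme confidential"
extract_b = "intro slide results slide acme confidential"

num_tokens_a, num_unique_tokens_a = token_stats(extract_a)
num_tokens_b, num_unique_tokens_b = token_stats(extract_b)

# Same vocabulary but a different total: the signature of duplicated text.
print(num_unique_tokens_a == num_unique_tokens_b, num_tokens_a, num_tokens_b)

# Hypothetical source-code extracts: A carries style markup, B does not.
markup_a = "color rgb public class color rgb font"
markup_b = "public class"
print(top_unique_diffs(markup_a, markup_b))
```

In the first pair the unique-token counts match while the totals differ, which is the pattern described for ppt->ppt. In the second pair the tokens unique to A are all style words, which is the pattern described for x-java and x-groovy.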
Re: 1.20?
Any blockers on 1.20? I'm going to kick off the regression tests shortly.

On Fri, Nov 30, 2018 at 7:39 PM wrote:
> Hi,
>
> On Wed, 21 Nov 2018 at 13:00, Tim Allison wrote:
> > Dave,
> > Should I try to get the Docker plugin working again?
>
> That would be great. I think I may have gone down the wrong path building an image at package time, as there doesn't seem to be an easy way to publish it as an Apache labelled org on Dockerhub unless it builds from source.
>
> I have some time over the weekend, so could update to where I got to and see what you think.
>
> Cheers,
> Dave
Re: 1.20?
Hi,

On Wed, 21 Nov 2018 at 13:00, Tim Allison wrote:
> Dave,
> Should I try to get the Docker plugin working again?

That would be great. I think I may have gone down the wrong path building an image at package time, as there doesn't seem to be an easy way to publish it as an Apache labelled org on Dockerhub unless it builds from source.

I have some time over the weekend, so could update to where I got to and see what you think.

Cheers,
Dave
Re: 1.20?
+1. It would be nice to get the recent ENVI work released as well, folks.

On 2018/11/20 23:04:29, Tim Allison wrote:
> All,
> POI 4.0.1 will be out shortly with some important bug fixes. What would you all think of targeting 1st/2nd week of December for 1.20?
>
> Cheers,
> Tim
Re: 1.20?
Dave,
Should I try to get the Docker plugin working again?

On Tue, Nov 20, 2018 at 6:21 PM Chris Mattmann wrote:
> Love it and I can align tika-python with that too ☺
>
> From: Tim Allison
> Reply-To: "dev@tika.apache.org"
> Date: Tuesday, November 20, 2018 at 3:04 PM
> To: "dev@tika.apache.org"
> Subject: 1.20?
>
> All,
> POI 4.0.1 will be out shortly with some important bug fixes. What would you all think of targeting 1st/2nd week of December for 1.20?
>
> Cheers,
> Tim
Re: 1.20?
Love it and I can align tika-python with that too ☺

From: Tim Allison
Reply-To: "dev@tika.apache.org"
Date: Tuesday, November 20, 2018 at 3:04 PM
To: "dev@tika.apache.org"
Subject: 1.20?

All,
POI 4.0.1 will be out shortly with some important bug fixes. What would you all think of targeting 1st/2nd week of December for 1.20?

Cheers,
Tim
1.20?
All,
POI 4.0.1 will be out shortly with some important bug fixes. What would you all think of targeting 1st/2nd week of December for 1.20?

Cheers,
Tim