How about this: https://corpora.tika.apache.org/base/reports/tika-2.8.1-pre-rc1-v3.tgz

On Thu, Jul 20, 2023 at 11:00 AM Tim Allison <[email protected]> wrote:

> How about these:
> https://corpora.tika.apache.org/base/reports/tika-2.8.1-pre-rc1-v2.tgz
>
> That adds the "file_name", which is either the container file name or, in
> the case of an embedded file, Tika's best guess as to what the container
> file named the embedded file. The "file_name" is problematic because it is
> "user defined data" and can be messy or malicious. We should include the
> embedded file id path, which is just numbers and slashes, but that'll take
> a bit more work.
>
> So, let me know if this helps any...
>
> On Thu, Jul 20, 2023 at 9:32 AM Tim Allison <[email protected]> wrote:
>
>> Y, again, those are embedded files. I'll add this report to the ticket
>> as well to include embedded resource path. Thank you!
>>
>> On Wed, Jul 19, 2023 at 11:56 PM Tilman Hausherr <[email protected]>
>> wrote:
>>
>>> On 19.07.2023 19:19, Tim Allison wrote:
>>> > Results are here:
>>> > https://corpora.tika.apache.org/base/reports/tika-2.8.1-pre-rc1.tgz
>>>
>>> govdocs1/974/974098.ppt
>>>
>>> appears twice in the content_diffs_no_exceptions.xlsx file but with
>>> different content?!
>>>
>>> govdocs1/974/974098.ppt 304128 application/msword application/vnd.ms-excel
>>> 44 567 66 1421 eng 48 eng 39 -9
>>> the: 6 | of: 5 | standard: 4 | 90: 3 | energy: 3 | 75: 2 | a: 2 | and: 2 | by: 2 | criteria: 2
>>> 330: 42 | 21008: 12 | 24371: 12 | 29977: 12 | energy: 9 | lcc: 8 | 1: 6 | 10: 6 | 11: 6 | 12: 6
>>> BASIC_LATIN: 437
>>> BASIC_LATIN: 7140
>>> the: 6 | of: 5 | standard: 4 | a: 2 | and: 2 | by: 2 | criteria: 2 | technical: 2 | 1980: 1 | adopted: 1
>>> 330: 42 | 21008: 12 | 24371: 12 | 29977: 12 | lcc: 8 | 1: 6 | 10: 6 | 11: 6 | 12: 6 | 13: 6
>>> the: 6 | of: 5 | standard: 4 | a: 2 | and: 2 | by: 2 | criteria: 2 | technical: 2 | 1980: 1 | adopted: 1
>>> 330: 42 | 21008: 12 | 24371: 12 | 29977: 12 | lcc: 8 | 1: 6 | 10: 6 | 11: 6 | 12: 6 | 13: 6
>>> 0,013 0,012
>>>
>>> govdocs1/974/974098.ppt 304128 application/vnd.ms-excel application/vnd.ms-excel
>>> 567 148 1421 262 eng 39 eng 21 -18
>>> 330: 42 | 21008: 12 | 24371: 12 | 29977: 12 | energy: 9 | lcc: 8 | 1: 6 | 10: 6 | 11: 6 | 12: 6
>>> 1: 10 | 30: 7 | 25: 5 | 218.06175: 4 | 436.1235: 4 | 654.18525: 4 | 872.247: 4 | energy: 4 | national: 4 | 50: 3
>>> BASIC_LATIN: 7140
>>> BASIC_LATIN: 1717
>>> 330: 42 | 21008: 12 | 24371: 12 | 29977: 12 | lcc: 8 | 31: 6 | 32: 6 | 33: 6 | 34: 6 | 35: 6
>>> 218.06175: 4 | 436.1235: 4 | 654.18525: 4 | 872.247: 4 | national: 4 | 90.1: 3 | ashrae: 3 | 1265.469: 2 | 141.157: 2 | 149.503: 2
>>> 330: 42 | 21008: 12 | 24371: 12 | 29977: 12 | lcc: 8 | 31: 6 | 32: 6 | 33: 6 | 34: 6 | 35: 6
>>> 1: 4 | 218.06175: 4 | 436.1235: 4 | 654.18525: 4 | 872.247: 4 | national: 4 | 90.1: 3 | ashrae: 3 | 1265.469: 2 | 141.157: 2
>>> 0,092 0,096
