[jira] [Commented] (TIKA-3992) Add common missing mimes based on Common Crawl data
[ https://issues.apache.org/jira/browse/TIKA-3992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17706400#comment-17706400 ]

Andrew Jackson commented on TIKA-3992:
--------------------------------------

Sounds interesting! Just wanted to note that Siegfried (and DROID/etc) signatures often require end-of-file matches as well as beginning-of-file, so if you do truncate the files you'll get the best results by chopping out the middle. I'd imagine the first and last few KB should do it.

> Add common missing mimes based on Common Crawl data
> ---------------------------------------------------
>
> Key: TIKA-3992
> URL: https://issues.apache.org/jira/browse/TIKA-3992
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Major
>
> In the latest Common Crawl crawl, there are ~600k 'octet-stream' files as detected by Tika. It would be useful to extract those (even if truncated) and run 'file' and 'siegfried' against those file types that are unknown to Tika. We can prioritize the most common file formats as identified by file and siegfried for addition to our mime-types.xml.
> Separately, we might also want to do the same thing for `application/zip`...there are likely zip-based file types that we could do a better job on.
> Thanks to [~snagel] for a dump of stats on the most recent crawl.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
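The suggestion above, keeping both ends of a file and dropping the middle, could look roughly like the following. This is only a sketch: the function name `truncate_middle` and the 4 KB default are assumptions for illustration, not anything taken from Tika, Siegfried or DROID.

```python
def truncate_middle(data: bytes, keep: int = 4096) -> bytes:
    """Keep the first and last `keep` bytes and drop whatever lies between.

    `keep` is a hypothetical default standing in for "a few KB".
    """
    if keep <= 0:
        raise ValueError("keep must be positive")
    if len(data) <= 2 * keep:
        # Short files are kept whole, so head and tail never overlap.
        return data
    return data[:keep] + data[-keep:]
```

Signatures anchored to the start or the end of the file should still line up after this kind of truncation; anything that relies on an absolute offset deeper into the file would not.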
[jira] [Commented] (TIKA-2632) Analyze unknown govdocs files
[ https://issues.apache.org/jira/browse/TIKA-2632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16441077#comment-16441077 ]

Andrew Jackson commented on TIKA-2632:
--------------------------------------

It would be great to see the old PowerPoint sigs added to Tika, and AFAICT the false-positive rate for them is nothing to worry about (every matching file in our collection appears to be an old PowerPoint file).

FWIW I think elsewhere (PDF?) we've used version identifiers of the form:

{{application/vnd.ms-excel.sheet; version="2"}}

But that may not be a good idea if it will confuse clients into thinking they can parse it using the usual parsers.

> Analyze unknown govdocs files
> -----------------------------
>
> Key: TIKA-2632
> URL: https://issues.apache.org/jira/browse/TIKA-2632
> Project: Tika
> Issue Type: Improvement
> Reporter: Andreas Meier
> Priority: Minor
>
> I recently started to analyze random govdocs1 files that could not be recognized properly by TIKA.
>
> This ticket should be used to identify problems with old or proprietary files and to extend TIKA step-by-step if needed.
>
> Stumbled across the following filetypes/files:
>
> 1. Old PowerPoint files (I expect Version 2.0 or 3.0) are not recognized properly:
> Found some mysterious files starting with 0xeddead0b and 0x0baddeed
> Turned out that someone else already investigated this case a month ago:
> [http://anjackson.net/2018/03/15/story-of-a-bad-deed/]
> The files are old PowerPoint. (PowerPoint 3.0 or 2.0)
> I think these magic strings should be added to tika-mimetypes.xml, as well as another PowerPoint mime-type. (maybe application/vnd.ms-powerpoint.2 or application/vnd.ms-powerpoint.3 ?)
> Example files in govdocs1:
> 144/144504.unk
> 272/272490.unk
> 430/430427.unk
> (several more...)
> 2. Proprietary File Format: SigmaPlot Exchange File .jxf:
> Magic: 0x000c4a5846
> Example file in govdocs1:
> 975/975382.unk
> 975/975383.unk
> (several more...)
> 3. There are two old excel file types which are not recognized at the moment (application/vnd.ms-excel.sheet.2):
> 376/376222.unk and 622/62252.unk start with 0x0900040007001000 instead of 0x090004001000
> 224/224485.unk and 615/615187.unk start with 0x0900040002001000 instead of 0x090004001000
> The magic for application/vnd.ms-excel.sheet.2 should be adapted:
> 0x02001000
> and
> 0x07001000
> must be added.
> Furthermore we have to check whether the parser can be adapted to process all the mentioned files.
> (LibreOffice can open all of these files)
> 4. Special Header/Wrapper in front of application/vnd.ms-excel.sheet.3
> In file 611/611703.unk I found a 128-byte long header in front of the excel file.
> Therefore the file could not be recognized correctly by TIKA.
> After I cut the header, the file could be recognized and converted by TIKA.
> 5. SAS Data file
> Example file:
> 020/020505.unk
> 6. AirSar Data (Airborne Synthetic Aperture Radar)
> Example file:
> 348/349489.unk (several more...)
> 7. Advanced Data Format (ADF)
> Used in CGNS (CFD General Notation System .cgns)
> Example file:
> 363/363966.unk
> 8. Unknown Microsoft Word Document
> Example file:
> 202/202718.unk
> (Recognized as Microsoft Word Document by Linux Magic)
> 9. Unknown PowerPoint 3.0 file?
> Example file:
> 388/388212.unk
> 10. Microsoft Compound File Binary File Format?
> Example file:
> 857/857353.unk
> Let me know if I should open a separate ticket for cases 1 and 3!
> If there is any better place (except the mailing lists) to publish the analysis results, let me know.
>
> Regards
>
> Andreas

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
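For anyone wanting to triage the govdocs1 `.unk` files by hand, the leading-byte signatures quoted in the ticket can be checked with a few lines of Python. This is a rough sketch, not how Tika's detector works: the byte values come from the ticket, while the labels and the `identify` helper are made up for illustration.

```python
# Leading-byte signatures quoted in the ticket; the labels are descriptive
# placeholders, not registered MIME types.
SIGNATURES = [
    (bytes.fromhex("eddead0b"), "old PowerPoint (2.0/3.0)"),
    (bytes.fromhex("0baddeed"), "old PowerPoint (2.0/3.0)"),
    (bytes.fromhex("000c4a5846"), "SigmaPlot Exchange File (.jxf)"),
    (bytes.fromhex("0900040007001000"), "old Excel sheet variant"),
    (bytes.fromhex("0900040002001000"), "old Excel sheet variant"),
]


def identify(header: bytes):
    """Return a label if `header` starts with a known signature, else None."""
    for magic, label in SIGNATURES:
        if header.startswith(magic):
            return label
    return None
```

Reading the first 16 bytes of each file and passing them to `identify` would be enough for these particular signatures; the wrapped file in case 4 would need the 128-byte header skipped first.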
[jira] [Commented] (TIKA-1678) PDF metadata extraction fails to spot UTF-16 encoded title
[ https://issues.apache.org/jira/browse/TIKA-1678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14635032#comment-14635032 ]

Andrew Jackson commented on TIKA-1678:
--------------------------------------

Sorry for the delay. Here are the results:

* title starts with \376\377: 252,903 out of 21,204,500 PDFs.
* title starts with \377: 0 out of 21,204,500 PDFs.
* title starts with \357: 0 out of 21,204,500 PDFs.

There is a tiny handful of mixed-up oddities that look like this:

{code}
{
url:http://www.praksis.gr/assets/files/h_PRAKSIS_sto_big_march.pdf,
wayback_date:20141205021311,
title:(Microsoft Word - \\323\\365\\354\\354\\345\\364\\357\\367\\336 \\364\\347\\362 PRAKSIS \\363\\364\\357 BIG MARCH _1_),
generator:[PScript5.dll Version 5.2.2, GPL Ghostscript 8.15]},
{code}

(see the original here: http://web.archive.org/web/20150721122710/http://www.praksis.gr/assets/files/h_PRAKSIS_sto_big_march.pdf)

But these are such minor exceptions I don't think it's worth pursuing.
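The three prefixes counted in the comment above are the first bytes of the usual byte-order marks: octal \376\377 is FE FF (UTF-16 big-endian), \377 is the first byte of FF FE (UTF-16 little-endian), and \357 is the first byte of EF BB BF (UTF-8). A tally of that kind could be sketched as below; the helper names are hypothetical and the input is assumed to be raw, already unescaped title bytes.

```python
from collections import Counter


def bom_class(raw: bytes) -> str:
    """Classify a raw metadata value by its leading byte-order mark."""
    if raw.startswith(b"\xfe\xff"):
        return "utf-16-be"
    if raw.startswith(b"\xff\xfe"):
        return "utf-16-le"
    if raw.startswith(b"\xef\xbb\xbf"):
        return "utf-8"
    return "none"


def tally(titles):
    """Count how many titles fall into each BOM class."""
    return Counter(bom_class(title) for title in titles)
```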
[jira] [Commented] (TIKA-1678) PDF metadata extraction fails to spot UTF-16 encoded title
[ https://issues.apache.org/jira/browse/TIKA-1678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14627960#comment-14627960 ]

Andrew Jackson commented on TIKA-1678:
--------------------------------------

As far as I can tell, the PDF spec seems to imply that when you encode UTF-16 data into text strings in the PDF document, you escape non-ASCII characters into octal. So U+0042 is treated as two separate bytes, and encoded as \000 and then B. Ick.
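To make the escaping described in this comment concrete, here is a minimal sketch of decoding a PDF literal string: undo the backslash escapes (octal and single-character), then decode as UTF-16BE if the result starts with the FE FF byte-order mark. It is not Tika's or PDFBox's implementation. The function names are made up, and falling back to Latin-1 when there is no BOM is a simplifying assumption (PDF defines its own PDFDocEncoding for that case).

```python
import re

# Single-character escapes allowed inside a PDF literal string.
_SIMPLE = {
    b"n": b"\n", b"r": b"\r", b"t": b"\t", b"b": b"\b", b"f": b"\f",
    b"(": b"(", b")": b")", b"\\": b"\\",
}

# A backslash followed by 1-3 octal digits, a line break, or any other byte.
_ESCAPE = re.compile(rb"\\(?:([0-7]{1,3})|(\r\n|[\s\S]))")


def _unescape(match):
    octal, other = match.group(1), match.group(2)
    if octal is not None:
        return bytes([int(octal, 8) & 0xFF])  # \ddd is a single byte
    if other in (b"\r\n", b"\r", b"\n"):
        return b""  # backslash + line break is a line continuation
    return _SIMPLE.get(other, other)  # unknown escapes: drop the backslash


def decode_pdf_string(literal: bytes) -> str:
    """Decode the bytes found between the parentheses of a PDF literal string."""
    raw = _ESCAPE.sub(_unescape, literal)
    if raw.startswith(b"\xfe\xff"):
        return raw[2:].decode("utf-16-be")
    return raw.decode("latin-1")  # assumption: stand-in for PDFDocEncoding
```

Applied to the /Author value from the example document, `decode_pdf_string(rb"\376\377\000T\000e\000t\000t\000i")` gives `Tetti`.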
[jira] [Created] (TIKA-1678) PDF metadata extraction fails to spot UTF-16 encoded data
Andrew Jackson created TIKA-1678:
---------------------------------

Summary: PDF metadata extraction fails to spot UTF-16 encoded data
Key: TIKA-1678
URL: https://issues.apache.org/jira/browse/TIKA-1678
Project: Tika
Issue Type: Bug
Components: metadata
Affects Versions: 1.9
Reporter: Andrew Jackson
Priority: Minor

When extracting metadata from PDFs, we see some odd behaviour for a minority of the documents. The PDF metadata can be encoded as UTF-16 octets, but is not always being decoded as such.

A specific example is here: http://mqug.org.uk/downloads/201207/201207%20-%20TEC02%20-%20Introduction%20to%20Worklight.pdf

Which contains this (literal file content):

{noformat}
443 0 obj <</Type/Metadata /Subtype/XML/Length 1978>>stream
<?xpacket begin='' id='W5M0MpCehiHzreSzNTczkc9d'?>
<?adobe-xap-filters esc=CRLF?>
<x:xmpmeta xmlns:x='adobe:ns:meta/' x:xmptk='XMP toolkit 2.9.1-13, framework 1.6'>
<rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#' xmlns:iX='http://ns.adobe.com/iX/1.0/'>
<rdf:Description rdf:about='ac9f232e-d341-11e1--ba905bfc4694' xmlns:pdf='http://ns.adobe.com/pdf/1.3/' pdf:Producer='\376\377\000B\000u\000l\000l\000z\000i\000p\000 \000P\000D\000F\000 \000P\000r\000i\000n\000t\000e\000r\000 \000/\000 \000w\000w\000w\000.\000b\000u\000l\000l\000z\000i\000p\000.\000c\000o\000m\000 \000/\000 \000F\000r\000e\000e\000w\000a\000r\000e\000 \000E\000d\000i\000t\000i\000o\000n'/>
<rdf:Description rdf:about='ac9f232e-d341-11e1--ba905bfc4694' xmlns:xmp='http://ns.adobe.com/xap/1.0/'><xmp:ModifyDate>2012-07-18T15:38:01+01:00</xmp:ModifyDate>
<xmp:CreateDate>2012-07-18T15:38:01+01:00</xmp:CreateDate>
<xmp:CreatorTool>UnknownApplication</xmp:CreatorTool></rdf:Description>
<rdf:Description rdf:about='ac9f232e-d341-11e1--ba905bfc4694' xmlns:xapMM='http://ns.adobe.com/xap/1.0/mm/' xapMM:DocumentID='ac9f232e-d341-11e1--ba905bfc4694'/>
<rdf:Description rdf:about='ac9f232e-d341-11e1--ba905bfc4694' xmlns:dc='http://purl.org/dc/elements/1.1/' dc:format='application/pdf'><dc:title><rdf:Alt><rdf:li xml:lang='x-default'>\376\377\000M\000i\000c\000r\000o\000s\000o\000f\000t\000 \000P\000o\000w\000e\000r\000P\000o\000i\000n\000t\000 \000-\000 \000I\000n\000t\000r\000o\000d\000u\000c\000t\000i\000o\000n\000 \000t\000o\000 \000W\000o\000r\000k\000l\000i\000g\000h\000t\000 \000\(\000S\000R\000D\000\)\000.\000p\000p\000t\000x</rdf:li></rdf:Alt></dc:title><dc:creator><rdf:Seq><rdf:li>\376\377\000T\000e\000t\000t\000i</rdf:li></rdf:Seq></dc:creator></rdf:Description>
</rdf:RDF>
</x:xmpmeta>
<?xpacket end='w'?>
endstream
endobj
2 0 obj
<</Producer(\376\377\000B\000u\000l\000l\000z\000i\000p\000 \000P\000D\000F\000 \000P\000r\000i\000n\000t\000e\000r\000 \000/\000 \000w\000w\000w\000.\000b\000u\000l\000l\000z\000i\000p\000.\000c\000o\000m\000 \000/\000 \000F\000r\000e\000e\000w\000a\000r\000e\000 \000E\000d\000i\000t\000i\000o\000n)
/CreationDate(D:20120718153801+01'00')
/ModDate(D:20120718153801+01'00')
/Title(\376\377\000M\000i\000c\000r\000o\000s\000o\000f\000t\000 \000P\000o\000w\000e\000r\000P\000o\000i\000n\000t\000 \000-\000 \000I\000n\000t\000r\000o\000d\000u\000c\000t\000i\000o\000n\000 \000t\000o\000 \000W\000o\000r\000k\000l\000i\000g\000h\000t\000 \000\(\000S\000R\000D\000\)\000.\000p\000p\000t\000x)
/Author(\376\377\000T\000e\000t\000t\000i)>>endobj
{noformat}

Presumably, embedding these UTF-16 octet sequences in the XMP RDF is an error, but the ones encoded in the actual PDF metadata fields should be extracted accurately.

When extracted, we get:

{noformat}
...
dc:title: \376\377\000M\000i\000c\000r\000o\000s\000o\000f\000t\000 \000P\000o\000w\000e\000r\000P\000o\000i\000n\000t\000 \000-\000 \000I\000n\000t\000r\000o\000d\000u\000c\000t\000i\000o\000n\000 \000t\000o\000 \000W\000o\000r\000k\000l\000i\000g\000h\000t\000 \000\(\000S\000R\000D\000\)\000.\000p\000p\000t\000x
title: \376\377\000M\000i\000c\000r\000o\000s\000o\000f\000t\000 \000P\000o\000w\000e\000r\000P\000o\000i\000n\000t\000 \000-\000 \000I\000n\000t\000r\000o\000d\000u\000c\000t\000i\000o\000n\000 \000t\000o\000 \000W\000o\000r\000k\000l\000i\000g\000h\000t\000 \000\(\000S\000R\000D\000\)\000.\000p\000p\000t\000x
meta:author: \376\377\000T\000e\000t\000t\000i
meta:author: Tetti
...
{noformat}

So, the author appears to be decoded correctly once, but the title is not. Is the XML dc:title being used to override the PDF title field? Or is one of the title fields being decoded incorrectly?

(I accept that although this is a real PDF document from the web, it is also a malformed one, so maybe there is not much to be done here.)

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Updated] (TIKA-1678) PDF metadata extraction fails to spot UTF-16 encoded title
[ https://issues.apache.org/jira/browse/TIKA-1678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Jackson updated TIKA-1678:
---------------------------------
Summary: PDF metadata extraction fails to spot UTF-16 encoded title (was: PDF metadata extraction fails to spot UTF-16 encoded data)
[jira] [Commented] (TIKA-1154) Tika hangs on format detection of malformed HTML file.
[ https://issues.apache.org/jira/browse/TIKA-1154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14368858#comment-14368858 ]

Andrew Jackson commented on TIKA-1154:
--------------------------------------

Yes, thanks - that's the behaviour I'd hoped for.

> Tika hangs on format detection of malformed HTML file.
> ------------------------------------------------------
>
> Key: TIKA-1154
> URL: https://issues.apache.org/jira/browse/TIKA-1154
> Project: Tika
> Issue Type: Bug
> Components: mime
> Affects Versions: 1.4
> Reporter: Andrew Jackson
> Priority: Minor
> Attachments: tika-breaker.html
>
> We are using Tika on large web archives, which also happen to contain some malformed files. In particular, we found an HTML file with binary characters in the DOCTYPE declaration. This hangs Tika, either embedded or from the command line, during format detection.
> An example file is attached.
[jira] [Updated] (TIKA-1486) Minor issues with the Tika MIME type magic file
[ https://issues.apache.org/jira/browse/TIKA-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Jackson updated TIKA-1486:
---------------------------------
Attachment: tika-mime-info-extensions-namespace.patch

The attached patch adds a namespace declaration for the Tika extensions to the MIME Info tags.
[jira] [Commented] (TIKA-1302) Let's run Tika against a large batch of docs nightly
[ https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14226415#comment-14226415 ]

Andrew Jackson commented on TIKA-1302:
--------------------------------------

We have two more sets of data. One is the same as the 1996-2010 stuff, but from 2010 to April 2013, and for each item a copy can generally be accessed via the Internet Archive. We are planning to extend our indexing to the entire 1996-2013 dataset soon, but in reality it's going to be a few months yet due to technical difficulties and other priorities.

The second set of data runs from 2013 onwards and, due to the legal constraints on that material, cannot be made available. However, for the next year or two, most of it will still be available on the live web, so that's the fallback option. That material has been indexed (although with an older Tika version), but we're going to re-index that too shortly, so we should also be able to make that available. (n.b. 'shortly' still means weeks or months!)

Both of these data sets are large and contain more large files. There were c. 2 billion resources in the 1996-2010 chunk, and there are 1.5-2 billion in the 2010-2013 chunk, and over 2 billion per year since then, and in contrast to the early material, we do not limit the size per resource. So that should be interesting.

However, it would be good to run against a broader range of material, given that I stop Tika from recursively processing ZIPs etc. and that web archives are rather weak on A/V files, system files, software, etc. I'm not aware of a good A/V corpus, but on the systems and software side, there are the system images [also held at digitalcorpora.org|http://digitalcorpora.org/] and the [various files used by a RedHat dev to regression test the 'file' command|https://fedorahosted.org/file-tests/]. There is also [this small corpus of example files|https://github.com/openpreserve/format-corpus] that I have been contributing to lately, the [evolt browser archive|http://browsers.evolt.org/] and the [disktype filesystem image samples|http://disktype.cvs.sourceforge.net/viewvc/disktype/file-system-sampler/].

> Let's run Tika against a large batch of docs nightly
> ----------------------------------------------------
>
> Key: TIKA-1302
> URL: https://issues.apache.org/jira/browse/TIKA-1302
> Project: Tika
> Issue Type: Improvement
> Components: cli, general, server
> Reporter: Tim Allison
> Attachments: wayback_exception_summaries.xlsx
>
> Many thanks to [~lewismc] for TIKA-1301! Once we get nightly builds up and running again, it might be fun to run Tika regularly against a large set of docs and report metrics.
> One excellent candidate corpus is govdocs1: http://digitalcorpora.org/corpora/files.
> Any other candidate corpora?
> [~willp-bl], have anything handy you'd like to contribute? [http://www.openplanetsfoundation.org/blogs/2014-03-21-tika-ride-characterising-web-content-nanite] ;)
[jira] [Created] (TIKA-1486) Minor issues with the Tika MIME type magic file
Andrew Jackson created TIKA-1486:
---------------------------------

Summary: Minor issues with the Tika MIME type magic file
Key: TIKA-1486
URL: https://issues.apache.org/jira/browse/TIKA-1486
Project: Tika
Issue Type: Improvement
Components: detector
Affects Versions: 1.6
Reporter: Andrew Jackson
Priority: Minor

I've started running some routine tests on format information held in a number of tools, including [Tika|http://www.digipres.org/formats/sources/tika/issues/]. This uncovered a number of minor issues when working with the tika-mimetypes.xml file:

* Duplicate MIME type application/gzip-compressed for type application/gzip.
* Duplicate MIME type image/vnd.dwg for type image/vnd.dwg.
* Error when parsing XML: Namespace prefix tika on link is not defined, line 169, column 15
* Format application/dita+xml;format=task has itself as a supertype!
* Glob '^owl$' for entry application/rdf+xml does not appear to be a valid filename specification.
* Glob '^rdf$' for entry application/rdf+xml does not appear to be a valid filename specification.

With the last two, it's really a matter of consistency. The other full-filename globs do *not* use the ^ and $ start and end markers, but owl and rdf do.
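The duplicate and self-supertype findings in the list above are the kind of thing a small lint pass can catch. The sketch below is an illustration only, not the tool that produced the report: the `lint` function is hypothetical, and `SAMPLE` is a made-up fragment in the shared MIME-info style (`mime-type`, `alias`, `sub-class-of`), not content copied from tika-mimetypes.xml.

```python
import xml.etree.ElementTree as ET

# Made-up sample reproducing two of the reported problem shapes.
SAMPLE = """<mime-info>
  <mime-type type="image/vnd.dwg">
    <alias type="image/vnd.dwg"/>
  </mime-type>
  <mime-type type="application/dita+xml;format=task">
    <sub-class-of type="application/dita+xml;format=task"/>
  </mime-type>
  <mime-type type="application/gzip">
    <alias type="application/x-gzip"/>
  </mime-type>
</mime-info>"""


def lint(xml_text):
    """Report type names declared more than once and self-referential supertypes."""
    problems = []
    seen = {}  # type or alias name -> the mime-type entry that first declared it
    root = ET.fromstring(xml_text)
    for entry in root.iter("mime-type"):
        name = entry.get("type")
        declared = [name] + [alias.get("type") for alias in entry.findall("alias")]
        for candidate in declared:
            if candidate in seen:
                problems.append(f"duplicate MIME type {candidate} for type {name}")
            else:
                seen[candidate] = name
        for parent in entry.findall("sub-class-of"):
            if parent.get("type") == name:
                problems.append(f"{name} has itself as a supertype")
    return problems
```

Running `lint(SAMPLE)` reports the alias that repeats its own type and the entry that names itself as its supertype, and stays quiet about the clean gzip entry.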
[jira] [Commented] (TIKA-1486) Minor issues with the Tika MIME type magic file
[ https://issues.apache.org/jira/browse/TIKA-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14224714#comment-14224714 ] Andrew Jackson commented on TIKA-1486: -- There's no problem with adding an XML namespace in principle - I'm not using a MIME-info specific parser or anything. It's just that because the namespace is not declared, the document is not [namespace-well-formed|http://stackoverflow.com/questions/14871752/is-xml-document-with-undeclared-prefix-well-formed], and this upsets some parsers. It's not critical - it just makes it harder to parse the document with an off-the-shelve XML parser configuration. On the globs, is there a functional difference between the ^rdf$ and rdf globs? If not, I'll just configure my analyser to strip out the ^ and $. Minor issues with the Tika MIME type magic file --- Key: TIKA-1486 URL: https://issues.apache.org/jira/browse/TIKA-1486 Project: Tika Issue Type: Improvement Components: detector Affects Versions: 1.6 Reporter: Andrew Jackson Priority: Minor I've started running some routine tests on format information held in a number of tools, including [Tika|http://www.digipres.org/formats/sources/tika/issues/]. This uncovered a number of minor issues when working with the tika-mimetypes.xml file: * Duplicate MIME type application/gzip-compressed for type application/gzip. * Duplicate MIME type image/vnd.dwg for type image/vnd.dwg. * Error when parsing XML: Namespace prefix tika on link is not defined, line 169, column 15 * Format application/dita+xml;format=task has itself as a supertype! * Glob '^owl$' for entry application/rdf+xml does not appear to be a valid filename specification. * Glob '^rdf$' for entry application/rdf+xml does not appear to be a valid filename specification. With the last two, it's really a matter of consistency. The other full-filename globs do *not* use the ^ and $ start and end markers, but owl and rdf do. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
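For anyone wanting to reproduce the namespace complaint in the comment above, here is a minimal sketch in Python. The two XML fragments and the namespace URI are made up for illustration (they are not copied from tika-mimetypes.xml); the point is only that a namespace-aware parser in its default configuration rejects a prefix that has not been declared, and accepts the same markup once any declaration is present.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragments: element names and the namespace URI are illustrative only.
UNDECLARED = "<mime-info><tika:link>http://example.org/</tika:link></mime-info>"
DECLARED = (
    '<mime-info xmlns:tika="http://example.org/made-up-namespace">'
    "<tika:link>http://example.org/</tika:link></mime-info>"
)


def parses_with_namespaces(xml_text):
    """Return True if a namespace-aware parser accepts the document."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        # An undeclared prefix is reported as a parse error ("unbound prefix").
        return False


print(parses_with_namespaces(UNDECLARED))
print(parses_with_namespaces(DECLARED))
```

The namespace URI in the second fragment does not point at anything real, which is fine: the parser only needs the prefix to be bound.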
[jira] [Commented] (TIKA-1486) Minor issues with the Tika MIME type magic file
[ https://issues.apache.org/jira/browse/TIKA-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14224745#comment-14224745 ] Andrew Jackson commented on TIKA-1486: -- A-ha! I didn't notice the {{isregex=true}} attribute - thank you! I'll modify my parser accordingly. FWIW, you don't need to make a schema to use a namespace, and it does not need to resolve to anything. But as I say, it's not crucial - I suppose all XML parsers can be configured to ignore the issue. Thanks again. Minor issues with the Tika MIME type magic file --- Key: TIKA-1486 URL: https://issues.apache.org/jira/browse/TIKA-1486 Project: Tika Issue Type: Improvement Components: detector Affects Versions: 1.6 Reporter: Andrew Jackson Priority: Minor I've started running some routine tests on format information held in a number of tools, including [Tika|http://www.digipres.org/formats/sources/tika/issues/]. This uncovered a number of minor issues when working with the tika-mimetypes.xml file: * Duplicate MIME type application/gzip-compressed for type application/gzip. * Duplicate MIME type image/vnd.dwg for type image/vnd.dwg. * Error when parsing XML: Namespace prefix tika on link is not defined, line 169, column 15 * Format application/dita+xml;format=task has itself as a supertype! * Glob '^owl$' for entry application/rdf+xml does not appear to be a valid filename specification. * Glob '^rdf$' for entry application/rdf+xml does not appear to be a valid filename specification. With the last two, it's really a matter of consistency. The other full-filename globs do *not* use the ^ and $ start and end markers, but owl and rdf do. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
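The glob handling described in the comment above could be sketched along these lines. This is an illustration of how an external analyser might treat the two kinds of entry, not a description of Tika's own matching code: the function name and the `is_regex` parameter are hypothetical, and the use of `re.search` for regex entries is an assumption.

```python
import fnmatch
import re


def glob_matches(filename, pattern, is_regex=False):
    """Match a filename against a glob entry.

    is_regex mirrors the isregex=true attribute mentioned in the thread;
    how Tika itself applies such patterns is not shown here.
    """
    if is_regex:
        # Assumption: regex entries are applied as a search, so ^ and $ do the anchoring.
        return re.search(pattern, filename) is not None
    # Plain entries are treated as shell-style globs.
    return fnmatch.fnmatchcase(filename, pattern)


print(glob_matches("rdf", "^rdf$", is_regex=True))
print(glob_matches("notes.rdf", "*.rdf"))
print(glob_matches("rdf", "^rdf$"))
```

Read as a plain glob, '^rdf$' only matches a file literally named `^rdf$`, which is why it looked like an invalid filename specification before the attribute was noticed.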
[jira] [Commented] (TIKA-1302) Let's run Tika against a large batch of docs nightly
[ https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14209757#comment-14209757 ] Andrew Jackson commented on TIKA-1302: -- [~talli...@apache.org] I've created a download folder on our own site, and included a dump of about 1/8th of the SAX errors, here: http://www.webarchive.org.uk/datasets/ukwa.ds.2/for-tika/ Looking through the SAX exceptions, they do seem to be from resources that are identified as XML (application/*xml) by Tika. i.e. the exceptions do *not* seem to be coming from malformed HTML, which is consistent with the standard Tika configuration you described above (which I can confirm is what we ran with). Unfortunately, I can't recover the full stack traces from that run, and it's not clear if we'll be able to do that in the future because of the way we're doing the indexing, but we'll look at it and hopefully be able to record the full error in the future. For now, you'll have to re-run the source item through Tika to reproduce the error - sorry about that. Let's run Tika against a large batch of docs nightly Key: TIKA-1302 URL: https://issues.apache.org/jira/browse/TIKA-1302 Project: Tika Issue Type: Improvement Components: cli, general, server Reporter: Tim Allison Many thanks to [~lewismc] for TIKA-1301! Once we get nightly builds up and running again, it might be fun to run Tika regularly against a large set of docs and report metrics. One excellent candidate corpus is govdocs1: http://digitalcorpora.org/corpora/files. Any other candidate corpora? [~willp-bl], have anything handy you'd like to contribute? [http://www.openplanetsfoundation.org/blogs/2014-03-21-tika-ride-characterising-web-content-nanite] ;) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (TIKA-1302) Let's run Tika against a large batch of docs nightly
[ https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14209757#comment-14209757 ] Andrew Jackson edited comment on TIKA-1302 at 11/13/14 1:42 PM: [~talli...@apache.org] I've created a download folder on our own site, and included a dump of about 1/8th of the SAX errors, here: http://www.webarchive.org.uk/datasets/ukwa.ds.2/for-tika/ Looking through the SAX exceptions, they do seem to be from resources that are identified as XML (application/\*xml) by Tika. i.e. the exceptions do *not* seem to be coming from malformed HTML, which is consistent with the standard Tika configuration you described above (which I can confirm is what we ran with). Unfortunately, I can't recover the full stack traces from that run, and it's not clear if we'll be able to do that in the future because of the way we're doing the indexing, but we'll look at it and hopefully be able to record the full error in the future. For now, you'll have to re-run the source item through Tika to reproduce the error - sorry about that. 
Let's run Tika against a large batch of docs nightly Key: TIKA-1302 URL: https://issues.apache.org/jira/browse/TIKA-1302 Project: Tika Issue Type: Improvement Components: cli, general, server Reporter: Tim Allison Many thanks to [~lewismc] for TIKA-1301! Once we get nightly builds up and running again, it might be fun to run Tika regularly against a large set of docs and report metrics. One excellent candidate corpus is govdocs1: http://digitalcorpora.org/corpora/files. Any other candidate corpora? [~willp-bl], have anything handy you'd like to contribute? [http://www.openplanetsfoundation.org/blogs/2014-03-21-tika-ride-characterising-web-content-nanite] ;) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1302) Let's run Tika against a large batch of docs nightly
[ https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14186718#comment-14186718 ] Andrew Jackson commented on TIKA-1302: -- Shall I go ahead and extract the XML errors? Or would you rather I waited until we've re-run with the new version that will catch the permanent hangs and regenerate all the data? Let's run Tika against a large batch of docs nightly Key: TIKA-1302 URL: https://issues.apache.org/jira/browse/TIKA-1302 Project: Tika Issue Type: Improvement Components: cli, general, server Reporter: Tim Allison Many thanks to [~lewismc] for TIKA-1301! Once we get nightly builds up and running again, it might be fun to run Tika regularly against a large set of docs and report metrics. One excellent candidate corpus is govdocs1: http://digitalcorpora.org/corpora/files. Any other candidate corpora? [~willp-bl], have anything handy you'd like to contribute? [http://www.openplanetsfoundation.org/blogs/2014-03-21-tika-ride-characterising-web-content-nanite] ;) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1302) Let's run Tika against a large batch of docs nightly
[ https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14178361#comment-14178361 ] Andrew Jackson commented on TIKA-1302: -- Okay, so the c.300,000 exceptions are here: https://www.dropbox.com/s/ka19fguaxflp725/parse_errors.csv.gz?dl=0 - let me know if you'd like it placed elsewhere (it's 14MB of compressed CSV). This conversation has helped me spot a gap in our code. We currently do a Tika.detect() before we do a Tika.parse(), and only do the latter if the former succeeded. Sadly, the version of the code that I used to generate this data did not record the Tika exception for the .detect() step, only the .parse() step. This will explain why there are no hung-thread events in this result set - the interrupted .detect() was not recorded properly. We'll be re-running this scan soonish, so I'll make sure the next version records all the exceptions. IIRC, from looking at the MIME types, the permanent hangs were mostly ZIPs, Office documents, and maybe some PDFs. Note that the CSV includes the Content-Type from the .detect() step, and this should indicate which module was run on the resource (i.e. whatever the Tika 1.5 mapping was for that MIME type). I don't think we changed the parse configuration significantly, so it seems HTML and XHTML and XML should all have gone through the HtmlParser (I'm not 100% sure about this, and will try to check). I'm not sure it's worth giving you all the SAX exceptions, as there are a lot of repeats of the same problems. I think a random sample of about 50,000 should be plenty. Does that sound okay to you? Let's run Tika against a large batch of docs nightly Key: TIKA-1302 URL: https://issues.apache.org/jira/browse/TIKA-1302 Project: Tika Issue Type: Improvement Components: cli, general, server Reporter: Tim Allison Many thanks to [~lewismc] for TIKA-1301! 
Once we get nightly builds up and running again, it might be fun to run Tika regularly against a large set of docs and report metrics. One excellent candidate corpus is govdocs1: http://digitalcorpora.org/corpora/files. Any other candidate corpora? [~willp-bl], have anything handy you'd like to contribute? [http://www.openplanetsfoundation.org/blogs/2014-03-21-tika-ride-characterising-web-content-nanite] ;) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
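The gap described in the comment above (exceptions from the detect step not being recorded) could be closed with something like the following sketch. It is written in Python for brevity and does not use the Tika API: `detect` and `parse` are hypothetical callables standing in for the two steps, and the record layout is an assumption.

```python
def process(resource, detect, parse):
    """Run detect then parse, recording a failure from either step.

    detect and parse are hypothetical stand-ins for the two Tika calls.
    """
    record = {"content_type": None, "detect_error": None, "parse_error": None}
    try:
        record["content_type"] = detect(resource)
    except Exception as exc:
        # Previously this failure was lost; keep it and skip the parse step.
        record["detect_error"] = f"{type(exc).__name__}: {exc}"
        return record
    try:
        parse(resource)
    except Exception as exc:
        record["parse_error"] = f"{type(exc).__name__}: {exc}"
    return record


def hanging_detect(resource):
    # Stand-in for a detection call that was interrupted after hanging.
    raise TimeoutError("detection interrupted")


result = process(b"example bytes", hanging_detect, lambda r: None)
print(result["detect_error"])
```

With both fields recorded, a hung or interrupted detection shows up in the output instead of silently producing no row at all.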
[jira] [Comment Edited] (TIKA-1302) Let's run Tika against a large batch of docs nightly
[ https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14178361#comment-14178361 ] Andrew Jackson edited comment on TIKA-1302 at 10/21/14 12:59 PM: - Okay, so the c.300,000 exceptions are here: https://www.dropbox.com/s/ka19fguaxflp725/parse_errors.csv.gz?dl=0 - let me know if you'd like it placed elsewhere (it's 14MB of compressed CSV). This conversation has helped me spot a gap in our code. We currently do a Tika.detect() before we do a Tika.parse(), and only do the latter if the former succeeded. Sadly, the version of the code that I used to generate this data did not record the Tika exception for the .detect() step, only the .parse() step. This will explain why there are no hung-thread events in this result set - the interrupted .detect() was not recorded properly. We'll be re-running this scan soonish, so I'll make sure the next version records all the exceptions. IIRC, from looking at the MIME types, the permanent hangs were mostly ZIPs, Office documents, and maybe some PDFs. Note that the CSV includes the Content-Type from the .detect() step, and this should indicate which module was run on the resource (i.e. whatever the Tika 1.5 mapping was for that MIME type). I don't think we changed the parse configuration significantly, so it seems HTML and XHTML and XML should all have gone through the HtmlParser (I'm not 100% sure about this, and will try to check). I'm not sure it's worth giving you all the SAX exceptions, as there are a lot of repeats of the same problems. I think a random sample of about 50,000 should be plenty. Does that sound okay to you? EDIT: Oh, and I meant to say, I'm glad to hear about [~gostep] and [~talli...@apache.org]'s efforts to run this on GovDocs, and would be interested in comparing results. We already publish format profile data about web archives, and would love to have more data to refer to. 
Let's run Tika against a large batch of docs nightly Key: TIKA-1302 URL: https://issues.apache.org/jira/browse/TIKA-1302 Project: Tika Issue Type: Improvement Components: cli, general, server Reporter: Tim Allison Many thanks to [~lewismc] for TIKA-1301! Once we get nightly builds up and running again, it might be fun to run Tika regularly against a large set of docs and report metrics. One excellent candidate corpus is govdocs1: http://digitalcorpora.org/corpora/files. Any other candidate corpora? 
[~willp-bl], have anything handy you'd like to contribute? [http://www.openplanetsfoundation.org/blogs/2014-03-21-tika-ride-characterising-web-content-nanite] ;) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
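Since many of the exceptions are repeats of the same problem, one quick way to get an overview of a dump like this is to tally rows by exception class. The sketch below assumes the five-column layout used in the samples posted in this thread (`wayback_date,url,content_length,content_type_tika,parse_error`) and that literal commas inside a field are escaped as `\,`; the sample rows are made up for the test and are not taken from the real file.

```python
import csv
from collections import Counter


def tally_exceptions(lines):
    """Count rows per exception class, reading the parse_error column."""
    counts = Counter()
    # QUOTE_NONE keeps stray quote characters literal; backslash escapes commas.
    reader = csv.reader(lines, escapechar="\\", quoting=csv.QUOTE_NONE)
    header = next(reader)
    col = header.index("parse_error")
    for row in reader:
        if len(row) <= col:
            continue  # skip short or damaged rows
        exception_class = row[col].split(":", 1)[0].strip()
        counts[exception_class] += 1
    return counts


# Made-up rows in the assumed layout.
sample = [
    "wayback_date,url,content_length,content_type_tika,parse_error",
    "20000101000000,http://example.org/a.xml,10,application/xml,"
    "org.xml.sax.SAXParseException: The entity nbsp was referenced\\, but not declared.",
    "20000101000001,http://example.org/b.xml,11,application/xml,"
    "org.xml.sax.SAXParseException: The markup must be well-formed.",
    "20000101000002,http://example.org/c.doc,12,application/msword,"
    "java.lang.ArrayIndexOutOfBoundsException: -1",
    "20000101000003,http://example.org/d.rss,13,application/rss+xml,"
    "java.lang.NullPointerException: null",
]
counts = tally_exceptions(sample)
print(counts.most_common())
```

To read the compressed file directly, the same function can be fed a text-mode handle from `gzip.open(path, "rt")`.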
[jira] [Commented] (TIKA-1302) Let's run Tika against a large batch of docs nightly
[ https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14176892#comment-14176892 ] Andrew Jackson commented on TIKA-1302: -- At the UK Web Archive we run Apache Tika over all our collections (it's been run over about 4 billion resources so far). We record the results in Apache Solr, to act as a search facet, and we also collect the Exceptions that are thrown when Tika fails. We can't make the content available to you directly, but perhaps there are datasets we can produce that would be useful to you? e.g. would a list of the exceptions that we've seen (along with the URL to the resource that caused the exception) be of interest? Let's run Tika against a large batch of docs nightly Key: TIKA-1302 URL: https://issues.apache.org/jira/browse/TIKA-1302 Project: Tika Issue Type: Improvement Components: cli, general, server Reporter: Tim Allison Many thanks to [~lewismc] for TIKA-1301! Once we get nightly builds up and running again, it might be fun to run Tika regularly against a large set of docs and report metrics. One excellent candidate corpus is govdocs1: http://digitalcorpora.org/corpora/files. Any other candidate corpora? [~willp-bl], have anything handy you'd like to contribute? [http://www.openplanetsfoundation.org/blogs/2014-03-21-tika-ride-characterising-web-content-nanite] ;) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1302) Let's run Tika against a large batch of docs nightly
[ https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14176934#comment-14176934 ] Andrew Jackson commented on TIKA-1302: --
I have 2,358,167 errors from one collection (2 billion resources), but the majority are SAXParseExceptions. It's made up of UK web archive content from 1996-2010, so there's lots of broken HTML/XML in there. If I strip out the SAXParseExceptions, there are just 317,548 miscellaneous errors, which are perhaps more interesting.
Here's an example including the SAX exceptions:
{code:none}
wayback_date,url,content_length,content_type_tika,parse_error
20100713041445,http://www.expedia.co.uk:80/pub/agent.dll/qscr=dspv/nojs=1/htid=2737187,org.xml.sax.SAXParseException: The markup in the document following the root element must be well-formed.
20091017141202,http://www.expedia.co.uk:80/pub/agent.dll/qscr=dspv/nojs=1/htid=34830/crti=4/hotel-pictures,org.xml.sax.SAXParseException: Open quote is expected for attribute ID associated with an element type COMMENT.
20091017143741,http://www.madfun.co.uk:80/-10?ref=31,org.xml.sax.SAXParseException: The markup in the document following the root element must be well-formed.
20061020021825,http://reservations.talkingcities.co.uk:80/nexres/hotels/map_hotels.cgi?hid=10055548map_only=yestype=overview,org.xml.sax.SAXParseException: The markup in the document following the root element must be well-formed.
2006102004,http://www.ravensportal.co.uk:80/forum/index.php?PHPSESSID=1688184d9bb881cfc73600b1670ecaf5amp;type=rss;action=.xml,org.xml.sax.SAXParseException: The character reference must end with the ';' delimiter.
20101227142905,http://www.etc-online.co.uk:80/style4.asp?pn=coursessn=26,org.xml.sax.SAXParseException: The markup in the document following the root element must be well-formed.
20060926015856,http://www.qca.org.uk/4412.html,org.xml.sax.SAXParseException: The entity nbsp was referenced\, but not declared.
20040827075658,http://users.ox.ac.uk:80/~sedm1731/Work/Ex%20parte%20St%20Germain.doc,java.lang.ArrayIndexOutOfBoundsException: -1
20030124193820,http://www.mgcars.org.uk:80/cgi-bin/gen5?runprog=portercov=mode=buyo=4854130936code=9123cu=,org.xml.sax.SAXParseException: The element type META must be terminated by the matching end-tag /META.
20100121205831,http://www.epupz.co.uk:80/clas/viewdetails.asp?view=307389,org.xml.sax.SAXParseException: The entity name must immediately follow the '' in the entity reference.
{code}
...and for the others...
{code:none}
wayback_date,url,content_length,content_type_tika,parse_error
20100928070438,http://redtyger.co.uk/discuss/projectexternal.php,7524,application/rss+xml,java.lang.NullPointerException: null
20040827075658,http://users.ox.ac.uk:80/~sedm1731/Work/Ex%20parte%20St%20Germain.doc,44997,application/msword,java.lang.ArrayIndexOutOfBoundsException: -1
20060303154606,http://www.dfes.gov.uk:80/rsgateway/DB/SFR/s000286/sfr37-2001.doc,562004,application/msword,java.lang.IllegalArgumentException: Position 698368 past the end of the file
20041225033311,http://members.lycos.co.uk:80/worldofradio/distance.pdf,57891,application/pdf,org.apache.pdfbox.exceptions.CryptographyException: Error: The supplied password does not match either the owner or user password in the document.
20041121095540,http://scom.hud.ac.uk:80/scomzl/conference/chenhua/040528_01E/PDP2148.pdf,191115,application/pdf,java.io.IOException: Error: Expected a long type\, actual='25#0/'
20041121095849,http://scom.hud.ac.uk:80/scomzl/conference/chenhua/040528_01E/SER2549.pdf,157148,application/pdf,java.util.zip.DataFormatException: oversubscribed literal/length tree
2004112115,http://scom.hud.ac.uk:80/scomzl/conference/chenhua/040528_01E/MSV_Foreword.pdf,12773,application/pdf,java.util.zip.DataFormatException: oversubscribed dynamic bit lengths tree
20060925090249,http://www2.rgu.ac.uk/library_edocs/resource/exam/0405engineering/EN3581%20OFFSHORE%20ENGINEERING.pdf,1684742,application/pdf,org.apache.pdfbox.exceptions.CryptographyException: Error: The supplied password does not match either the owner or user password in the document.
20060925091406,http://www2.rgu.ac.uk/library_edocs/resource/exam/0304engineering/EE31060304s1.pdf,149238,application/pdf,org.apache.pdfbox.exceptions.CryptographyException: Error: The supplied password does not match either the owner or user password in the document.
20040612212128,http://www.swhst.org.uk:80/Linked%20Files/spr%20contact%20addresses.xls,23040,application/vnd.ms-excel,org.apache.poi.EncryptedDocumentException: Default password is invalid for docId/saltData/saltHash
2005183952,http://freeweb.co.uk:80/show_nw.php?ref=258target=Bshow=affPHPSESSID=a150a130c58fcea048866fb965ef7dfb,232436,text/html; charset=iso-8859-1,org.apache.tika.sax.SecureContentHandler$SecureSAXException: Suspected zip bomb: 100 levels of XML element nesting
[jira] [Commented] (TIKA-1232) Add PDF version to PDFParser output
[ https://issues.apache.org/jira/browse/TIKA-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14125384#comment-14125384 ] Andrew Jackson commented on TIKA-1232: -- Looks like this is fixed and in the 1.6 release - thank you. Can the 'Fix version' on this ticket be updated accordingly? Add PDF version to PDFParser output --- Key: TIKA-1232 URL: https://issues.apache.org/jira/browse/TIKA-1232 Project: Tika Issue Type: Improvement Components: parser Affects Versions: 1.5 Environment: JDK6 Reporter: William Palmer Assignee: Tim Allison Priority: Minor Attachments: Sample 10.x.pdf, Sample 11.x PDFA-1b.pdf, Sample 4.x.pdf, Sample 5.x.pdf, Sample 6.x.pdf, Sample 7.x.pdf, Sample 8.x.pdf, Sample 9.x.pdf, TIKA-1232v1.patch, TIKA-1232v2.patch, pdfversion.patch, testComment.pdf I'd like to identify the PDF version of files, this is not currently reported by the PDFParser although the information is available via PDFBox. I have attached a patch that adds the format version to the Metadata object. However, I am not familiar enough with the Tika source to know if an alternative metadata key should be used, or this new one added. Comments welcome. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1232) Add PDF version to PDFParser output
[ https://issues.apache.org/jira/browse/TIKA-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13920698#comment-13920698 ] Andrew Jackson commented on TIKA-1232: -- Does anyone have a copy of Acrobat 9.1? That version uses Adobe Extension Level 5, so we'd need that to get the full set of recent versions. I'll have a dig around for suitable files for the versions that aren't covered yet, but most of the stuff I have access to is not re-licensable. Add PDF version to PDFParser output --- Key: TIKA-1232 URL: https://issues.apache.org/jira/browse/TIKA-1232 Project: Tika Issue Type: Improvement Components: parser Affects Versions: 1.5 Environment: JDK6 Reporter: William Palmer Assignee: Tim Allison Priority: Minor Attachments: Sample 10.x.pdf, Sample 11.x PDFA-1b.pdf, Sample 4.x.pdf, Sample 5.x.pdf, Sample 6.x.pdf, Sample 7.x.pdf, Sample 8.x.pdf, Sample 9.x.pdf, TIKA-1232v1.patch, TIKA-1232v2.patch, pdfversion.patch I'd like to identify the PDF version of files, this is not currently reported by the PDFParser although the information is available via PDFBox. I have attached a patch that adds the format version to the Metadata object. However, I am not familiar enough with the Tika source to know if an alternative metadata key should be used, or this new one added. Comments welcome. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (TIKA-1232) Add PDF version to PDFParser output
[ https://issues.apache.org/jira/browse/TIKA-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13908402#comment-13908402 ] Andrew Jackson commented on TIKA-1232: --
Going by my original intention, then I would prefer the one additional dc:format to be of the form:
{code}
application/pdf; version=1.4
application/pdf; version=A-1a
application/pdf; version=1.7 Adobe Extension Level 3
{code}
Add PDF version to PDFParser output --- Key: TIKA-1232 URL: https://issues.apache.org/jira/browse/TIKA-1232 Project: Tika Issue Type: Improvement Components: parser Affects Versions: 1.5 Environment: JDK6 Reporter: William Palmer Assignee: Tim Allison Priority: Minor Attachments: TIKA-1232v1.patch, pdfversion.patch I'd like to identify the PDF version of files, this is not currently reported by the PDFParser although the information is available via PDFBox. I have attached a patch that adds the format version to the Metadata object. However, I am not familiar enough with the Tika source to know if an alternative metadata key should be used, or this new one added. Comments welcome. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (TIKA-1154) Tika hangs on format detection of malformed HTML file.
[ https://issues.apache.org/jira/browse/TIKA-1154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13900156#comment-13900156 ] Andrew Jackson commented on TIKA-1154: -- I've had no response on the metadata-extractor issue I raised. Not sure how to proceed with this, and it's continuing to cause us problems. Tika hangs on format detection of malformed HTML file. -- Key: TIKA-1154 URL: https://issues.apache.org/jira/browse/TIKA-1154 Project: Tika Issue Type: Bug Components: mime Affects Versions: 1.4 Reporter: Andrew Jackson Priority: Minor Attachments: tika-breaker.html We are using Tika on large web archives, which also happen to contain some malformed files. In particular, we found a HTML file with binary characters in the DOCTYPE declaration. This hangs Tika, either embedded or from the command line, during format detection. An example file is attached. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (TIKA-1232) Add PDF version to PDFParser output
[ https://issues.apache.org/jira/browse/TIKA-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13894376#comment-13894376 ] Andrew Jackson commented on TIKA-1232: -- Great! For (1), very happy for that code to go to PDFBox. I'm pretty sure PDFBox doesn't already do anything along those lines, but I am not all that familiar with that codebase so it's worth checking first. As for (2), I've only tested on a fairly small number of PDFs because only the more recent versions of the Adobe tools actually make use of them, and even then, only when necessary. I ran that code against a web archive corpus containing around 2 billion resources, including many millions of PDFs, but because that dataset only ran up to 2010, I found a grand total of eight PDFs that used Adobe Extension Level 3. It worked fine on those! Finally, on the metadata property scheme, I feel the 'right place' is as a parameter on the Content Type, but I accept that may confuse client code (i.e. people assuming type.equals(application/pdf) should always work, even though that would be no good for other types like HTML due to the charset parameter). Note that the parameter approach also allows you to do version detection in Tika's [custom-mimetypes.xml|https://github.com/openplanets/nanite/blob/master/nanite-core/src/main/resources/org/apache/tika/mime/custom-mimetypes.xml#L357], which I find rather handy. Of course, you are also welcome to take any of those signatures if they are of interest. Add PDF version to PDFParser output --- Key: TIKA-1232 URL: https://issues.apache.org/jira/browse/TIKA-1232 Project: Tika Issue Type: Improvement Components: parser Affects Versions: 1.5 Environment: JDK6 Reporter: William Palmer Assignee: Tim Allison Priority: Minor Attachments: pdfversion.patch I'd like to identify the PDF version of files, this is not currently reported by the PDFParser although the information is available via PDFBox. 
I have attached a patch that adds the format version to the Metadata object. However, I am not familiar enough with the Tika source to know if an alternative metadata key should be used, or this new one added. Comments welcome. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
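The worry in the comment above about client code comparing the whole Content-Type string can be sidestepped by splitting off the parameters before comparing. The following is a minimal sketch only: `split_content_type` is a hypothetical helper, not a Tika method, and it deliberately accepts the unquoted, space-containing version values proposed in this ticket rather than enforcing strict MIME parameter syntax.

```python
def split_content_type(value):
    """Split 'type/subtype; name=value; ...' into (base_type, params)."""
    base, _, rest = value.partition(";")
    params = {}
    for part in rest.split(";"):
        name, sep, val = part.partition("=")
        if sep:
            # Lenient: keep spaces inside the value, strip optional quotes.
            params[name.strip().lower()] = val.strip().strip('"')
    return base.strip().lower(), params


base, params = split_content_type("application/pdf; version=1.7 Adobe Extension Level 3")
print(base, params)
print(split_content_type("text/html; charset=iso-8859-1")[0] == "text/html")
```

Comparing on the base type works the same way whether the parameter is a charset or a version, so client code written this way would not be affected by the extra parameter.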
[jira] [Commented] (TIKA-1232) Add PDF version to PDFParser output
[ https://issues.apache.org/jira/browse/TIKA-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13892210#comment-13892210 ] Andrew Jackson commented on TIKA-1232: --
Yes, you can't identify 1.7 PDF or the PDF/A variants unless you do a bit more parsing. In case it helps, here's the code I wrote to do that (and also extract other metadata of interest to me): https://github.com/openplanets/nanite/blob/master/nanite-ext/src/main/java/uk/bl/wa/tika/parser/pdf/pdfbox/PDFParser.java#L253 I couldn't do what I wanted by sub-classing the Tika code, so I copied the PDFParser and augmented it. If there is interest in taking this code into Tika I'd be willing to spend time turning it into a proper patch.
As for how to record the result, this is definitely not the Application-Version. A modern version of Adobe Distiller can output various versions of PDF, because it chooses the version of the format based on the needs of the current document. i.e. if a document only requires PDF 1.4 features, it will output a PDF 1.4 and not just default to the latest version (AFAICT).
My preference would be to use a version parameter on the content type. It's not a formally standardised approach, but has been adopted in a few places (e.g. [Java plugin versions|http://docs.oracle.com/javase/7/docs/technotes/guides/plugin/developer_guide/faq/basics.html#version]) In this case, you'd have something like:
{quote}
application/pdf; version=1.4
application/pdf; version=1.7 Adobe Extension Level 5
etc...
{quote}
although to avoid causing trouble for code that relies on the 'Content-Type' property, I have so far chosen to use a new property for this purpose (called 'Extended-Content-Type').
Add PDF version to PDFParser output --- Key: TIKA-1232 URL: https://issues.apache.org/jira/browse/TIKA-1232 Project: Tika Issue Type: Improvement Components: parser Affects Versions: 1.5 Environment: JDK6 Reporter: William Palmer Assignee: Tim Allison Priority: Minor Attachments: pdfversion.patch I'd like to identify the PDF version of files, this is not currently reported by the PDFParser although the information is available via PDFBox. I have attached a patch that adds the format version to the Metadata object. However, I am not familiar enough with the Tika source to know if an alternative metadata key should be used, or this new one added. Comments welcome. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (TIKA-1170) Insufficiently specific magic for binary image/cgm files
[ https://issues.apache.org/jira/browse/TIKA-1170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13756577#comment-13756577 ]

Andrew Jackson commented on TIKA-1170:

I'm not sure that commit is right. I see this in trunk:

{code}
<match value="BEGMF" type="string" offset="0"/>
<match value="0x0020" mask="0xffe0" type="string" offset="0"/>
<match value="0x0020" mask="0xffe0" type="string" offset="0">
  <match value="0x10220001" type="string" offset="2:64"/>
  <match value="0x10220002" type="string" offset="2:64"/>
  <match value="0x10220003" type="string" offset="2:64"/>
  <match value="0x10220004" type="string" offset="2:64"/>
</match>
{code}

That is exactly what I did *not* wish to have, as files that successfully match using only this line:

{code}
<match value="0x0020" mask="0xffe0" type="string" offset="0"/>
{code}

will lead to the false positives I've been seeing. This is why I wanted to make the magic more specific, using the form:

{code}
<match value="BEGMF" type="string" offset="0"/>
<match value="0x0020" mask="0xffe0" type="string" offset="0">
  <match value="0x10220001" type="string" offset="2:64"/>
  <match value="0x10220002" type="string" offset="2:64"/>
  <match value="0x10220003" type="string" offset="2:64"/>
  <match value="0x10220004" type="string" offset="2:64"/>
</match>
{code}

Could we have that instead, please?

Insufficiently specific magic for binary image/cgm files

Key: TIKA-1170
URL: https://issues.apache.org/jira/browse/TIKA-1170
Project: Tika
Issue Type: Bug
Components: mime
Affects Versions: 1.4
Reporter: Andrew Jackson
Assignee: Ray Gauss II
Priority: Minor
Fix For: 1.5
Attachments: 0001-Added-CGM-test-file-test-and-improved-magic.patch, plotutils-example.cgm

I've been running Tika against a large corpus of web archive files, and I'm seeing a number of false positives for image/cgm. The Tika magic is

{code}
<match value="BEGMF" type="string" offset="0"/>
<match value="0x0020" mask="0xffe0" type="string" offset="0"/>
{code}

The issue seems to be that the second magic matcher is not very specific, e.g. it matches files that start with 0x002a. To be fair, this is only c.700 false matches out of 300 million resources, but it would be nice if this could be tightened up.

Looking at the PRONOM signatures

* http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=1048&strPageToDisplay=signatures
* http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=1049&strPageToDisplay=signatures
* http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=1050&strPageToDisplay=signatures
* http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=1051&strPageToDisplay=signatures

it seems we have a variable position marker that changes slightly for each version. Therefore, a more robust signature should be:

{code}
<match value="BEGMF" type="string" offset="0"/>
<match value="0x0020" mask="0xffe0" type="string" offset="0">
  <match value="0x10220001" type="string" offset="2:64"/>
  <match value="0x10220002" type="string" offset="2:64"/>
  <match value="0x10220003" type="string" offset="2:64"/>
  <match value="0x10220004" type="string" offset="2:64"/>
</match>
{code}

Here I have assumed the filename part of the CGM file will be less than 64 characters long. Could this magic be considered for inclusion?

-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators. For more information on JIRA, see: http://www.atlassian.com/software/jira
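To illustrate why the nested form is tighter, here is a minimal Python sketch of the two rules. It is not Tika's implementation: the function names are hypothetical, and it assumes that a mask is ANDed with both the data and the value before comparison, and that offset 2:64 means the value may start at any offset from 2 to 64 inclusive.

```python
def masked_match(data: bytes, value: bytes, mask: bytes, offset: int = 0) -> bool:
    """True if the bytes at `offset` equal `value` once `mask` is applied to both."""
    window = data[offset:offset + len(value)]
    if len(window) < len(value):
        return False
    return all((b & m) == (v & m) for b, v, m in zip(window, value, mask))


def found_in_range(data: bytes, value: bytes, start: int, end: int) -> bool:
    """True if `value` begins at any offset in [start, end] (inclusive, by assumption)."""
    return any(data.startswith(value, pos) for pos in range(start, end + 1))


# The four version markers from the proposed magic: 0x10220001 .. 0x10220004.
VERSION_MARKERS = [bytes.fromhex("1022000%d" % v) for v in (1, 2, 3, 4)]


def loose_cgm(data: bytes) -> bool:
    """The original magic: BEGMF, or the masked 0x0020 test on its own."""
    return data.startswith(b"BEGMF") or masked_match(data, b"\x00\x20", b"\xff\xe0")


def strict_cgm(data: bytes) -> bool:
    """The proposed magic: the masked test only counts if a version marker follows."""
    if data.startswith(b"BEGMF"):
        return True
    if not masked_match(data, b"\x00\x20", b"\xff\xe0"):
        return False
    return any(found_in_range(data, marker, 2, 64) for marker in VERSION_MARKERS)
```

With these definitions, data starting 0x002a passes loose_cgm (0x2a & 0xe0 is 0x20) but fails strict_cgm unless one of the version markers appears within the following bytes.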
[jira] [Updated] (TIKA-1170) Insufficiently specific magic for binary image/cgm files
[ https://issues.apache.org/jira/browse/TIKA-1170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Jackson updated TIKA-1170:

Attachment: 0002-Added-example-malformed-HTML-file-that-was-being-mis.patch

This additional patch adds a realistic test file and an appropriate test.
[jira] [Commented] (TIKA-1170) Insufficiently specific magic for binary image/cgm files
[ https://issues.apache.org/jira/browse/TIKA-1170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13756981#comment-13756981 ]

Andrew Jackson commented on TIKA-1170:

Thanks, that's great. If you prefer, you should be able to tell SVN to treat a particular file as binary data by setting its SVN MIME type property, as per this Stack Overflow answer: http://stackoverflow.com/a/74017/6689
[jira] [Commented] (TIKA-1170) Insufficiently specific magic for binary image/cgm files
[ https://issues.apache.org/jira/browse/TIKA-1170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13757042#comment-13757042 ]

Andrew Jackson commented on TIKA-1170:

Fair point! Thanks for accepting the changes.
[jira] [Updated] (TIKA-1170) Insufficiently specific magic for binary image/cgm files
[ https://issues.apache.org/jira/browse/TIKA-1170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Jackson updated TIKA-1170:

Summary: Insufficiently specific magic for binary image/cgm files (was: Possibly erroneous magic for image/cgm files)

Changing title now I understand what's going on better.
[jira] [Created] (TIKA-1170) Possibly erroneous magic for image/cgm files
Andrew Jackson created TIKA-1170:

Summary: Possibly erroneous magic for image/cgm files
Key: TIKA-1170
URL: https://issues.apache.org/jira/browse/TIKA-1170
Project: Tika
Issue Type: Bug
Components: mime
Affects Versions: 1.4
Reporter: Andrew Jackson
Priority: Minor

I've been running Tika against a large corpus of web archive files, and I'm seeing a number of false positives for image/cgm. The Tika magic is

{code}
<match value="BEGMF" type="string" offset="0"/>
<match value="0x0020" mask="0xffe0" type="string" offset="0"/>
{code}

The issue seems to be that the second magic matcher is not very specific, e.g. it matches files that start with 0x002a. To be fair, this is only c.700 false matches out of 300 million resources, but it would be nice if this could be tightened up.

Looking at the PRONOM signatures

* http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=1048&strPageToDisplay=signatures
* http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=1049&strPageToDisplay=signatures
* http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=1050&strPageToDisplay=signatures
* http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=1051&strPageToDisplay=signatures

it seems we have a variable position marker that changes slightly for each version. Therefore, a more robust signature should be:

{code}
<match value="BEGMF" type="string" offset="0"/>
<match value="0x0020" mask="0xffe0" type="string" offset="0">
  <match value="0x10220001" type="string" offset="2:64"/>
  <match value="0x10220002" type="string" offset="2:64"/>
  <match value="0x10220003" type="string" offset="2:64"/>
  <match value="0x10220004" type="string" offset="2:64"/>
</match>
{code}

Here I have assumed the filename part of the CGM file will be less than 64 characters long. Could this magic be considered for inclusion?
[jira] [Commented] (TIKA-1170) Insufficiently specific magic for binary image/cgm files
[ https://issues.apache.org/jira/browse/TIKA-1170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13756051#comment-13756051 ]

Andrew Jackson commented on TIKA-1170:

My corpus is a chunk of the Internet Archive, so you can look at the CGMs I'm finding:

* [all copies|http://web.archive.org/web/240100*/http://www.agocg.ac.uk/Graphics/CGM/RALCGM/sample.cgm], or a [specific copy|http://web.archive.org/web/2226055607/http://www.agocg.ac.uk/Graphics/CGM/RALCGM/sample.cgm].
** Those example files now seem to be at http://www.agocg.ac.uk/train/cgm/examples/cgmindex.htm
* or [this specific item|http://web.archive.org/web/20050223100939/http://wwwcms.brookes.ac.uk:80/webmsc2004/p00770/cgms/flyboat.cgm] from [this folder here|http://web.archive.org/web/20050112031156/http://wwwcms.brookes.ac.uk/webmsc2004/p00770/cgms/]
* I also found these, but have not checked whether any are binary: http://www.fileformat.info/format/cgm/sample/index.htm

Unfortunately, the licensing may not be clear in these cases, so these test files may not be suitable. If anyone knows of any software that can write binary CGM files, I'm willing to give it a go.
[jira] [Updated] (TIKA-1170) Insufficiently specific magic for binary image/cgm files
[ https://issues.apache.org/jira/browse/TIKA-1170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Jackson updated TIKA-1170:

Attachment: plotutils-example.cgm

This is an example version 3 binary CGM file, generated using GNU plotutils.
[jira] [Commented] (TIKA-1170) Insufficiently specific magic for binary image/cgm files
[ https://issues.apache.org/jira/browse/TIKA-1170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13756064#comment-13756064 ]

Andrew Jackson commented on TIKA-1170:

I was able to create an example file using [GNU plotutils|http://www.gnu.org/software/plotutils/] ('brew install plotutils'), as per [these instructions|http://www.gnu.org/software/plotutils/manual/en/plotutils.html#graph]:

{code}
graph -T cgm datafile > plot.cgm
{code}

I'll attach an example.
[jira] [Updated] (TIKA-1170) Insufficiently specific magic for binary image/cgm files
[ https://issues.apache.org/jira/browse/TIKA-1170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Jackson updated TIKA-1170:

Attachment: 0001-Added-CGM-test-file-test-and-improved-magic.patch

Patch containing test file, test, and improved magic.
[jira] [Created] (TIKA-1154) Tika hangs on format detection of malformed HTML file.
Andrew Jackson created TIKA-1154: Summary: Tika hangs on format detection of malformed HTML file. Key: TIKA-1154 URL: https://issues.apache.org/jira/browse/TIKA-1154 Project: Tika Issue Type: Bug Components: mime Affects Versions: 1.4 Reporter: Andrew Jackson Priority: Minor We are using Tika on large web archives, which also happen to contain some malformed files. In particular, we found a HTML file with binary characters in the DOCTYPE declaration. This hangs Tika, either embedded or from the command line, during format detection. An example file is attached.
[jira] [Updated] (TIKA-1154) Tika hangs on format detection of malformed HTML file.
[ https://issues.apache.org/jira/browse/TIKA-1154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Jackson updated TIKA-1154:

Attachment: tika-breaker.html

This file makes Tika hang. If you remove both of the binary characters (0x02 0x00), then it starts working again.
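For anyone wanting to find similar files in their own corpus, the following is a rough Python sketch that reports control bytes inside a DOCTYPE declaration. It is a diagnostic only, not part of Tika; the function name, the 8192-byte window, and the choice to allow tab, LF and CR are all assumptions made for illustration.

```python
import re

# Matches a DOCTYPE declaration up to its closing '>' (any bytes in between).
DOCTYPE_RE = re.compile(rb"<!DOCTYPE[^>]*>", re.IGNORECASE)

# Control bytes that are normal in text files and should not be reported.
ALLOWED_CONTROL = {0x09, 0x0A, 0x0D}


def binary_bytes_in_doctype(data: bytes, head: int = 8192):
    """Return (offset, byte) pairs for control bytes inside the first DOCTYPE.

    Only the first `head` bytes are scanned; an empty list means either no
    DOCTYPE was found there or it contains no unexpected control bytes.
    """
    match = DOCTYPE_RE.search(data[:head])
    if match is None:
        return []
    return [
        (match.start() + i, b)
        for i, b in enumerate(match.group())
        if b < 0x20 and b not in ALLOWED_CONTROL
    ]
```

Run against a file like the attachment, this would be expected to list the 0x02 and 0x00 bytes mentioned above; a well-formed DOCTYPE yields an empty list.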
[jira] [Commented] (TIKA-1154) Tika hangs on format detection of malformed HTML file.
[ https://issues.apache.org/jira/browse/TIKA-1154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13719513#comment-13719513 ]

Andrew Jackson commented on TIKA-1154:

Thanks for the stacktrace, which led me to this mailing list entry: http://mail-archives.apache.org/mod_mbox/tika-dev/201011.mbox/%3c5afe4d67-0c49-4947-94ba-f9b1f64ee...@transpac.com%3E which suggests that upgrading Xerces will fix this.
[jira] [Commented] (TIKA-1154) Tika hangs on format detection of malformed HTML file.
[ https://issues.apache.org/jira/browse/TIKA-1154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13719594#comment-13719594 ]

Andrew Jackson commented on TIKA-1154:

We could exclude the package from coming in via the metadata-extractor dependency and include the later version as a top-level dependency, but if there have been significant API changes between 2.8.1 and 2.10.0 then this could cause problems. I can submit an issue at https://code.google.com/p/metadata-extractor/issues/list and see if they're willing to upgrade?
[jira] [Commented] (TIKA-1154) Tika hangs on format detection of malformed HTML file.
[ https://issues.apache.org/jira/browse/TIKA-1154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13719631#comment-13719631 ]

Andrew Jackson commented on TIKA-1154:

Okay, I submitted an issue here: https://code.google.com/p/metadata-extractor/issues/detail?id=85
[jira] [Commented] (TIKA-970) Full identification of the JPEG 2000 family of formats
[ https://issues.apache.org/jira/browse/TIKA-970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13429426#comment-13429426 ]

Andrew Jackson commented on TIKA-970:

Hi, I noticed the updated version includes a bit more information. In particular, the 'image/jpm' format is declared to have an alias of 'video/jpm'. This doesn't appear to be a registered MIME type, and I've not come across it before. Have you got any more information on this video format?

Full identification of the JPEG 2000 family of formats

Key: TIKA-970
URL: https://issues.apache.org/jira/browse/TIKA-970
Project: Tika
Issue Type: New Feature
Components: mime
Affects Versions: 1.3
Reporter: Andrew Jackson
Assignee: Jukka Zitting
Priority: Minor
Fix For: 1.3
Attachments: custom-mimetype.xml

Please find attached a suitable set of magic definitions for allowing Tika to identify JP2 containers, codestreams, and the JP2, JPF, JPM and MJ2 file formats. It is based on the 'file' magic from [here|https://github.com/bitsgalore/jp2kMagic], and has been tested against the example files supplied on that site.
[jira] [Updated] (TIKA-970) Full identification of the JPEG 2000 family of formats
[ https://issues.apache.org/jira/browse/TIKA-970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Jackson updated TIKA-970: Attachment: custom-mimetype.xml
[jira] [Commented] (TIKA-970) Full identification of the JPEG 2000 family of formats
[ https://issues.apache.org/jira/browse/TIKA-970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13428085#comment-13428085 ] Andrew Jackson commented on TIKA-970: - BTW, this set of signatures rather clumsily repeats the overall container signature for each sub-format. I don't know if this can be avoided, but just removing the repeat and expecting the subclass relationship to work out the details did not seem to work reliably.
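The dependency described in the comment above (each sub-format can only be recognised once the container signature has matched) might be sketched as follows. This is a minimal illustration, not the contents of custom-mimetype.xml and not Tika's matcher. The byte values are the JPEG 2000 signature box and the brand field of the file type box, the media type names are those of RFC 3745, and `identify_jpeg2000`, `make_sample` and the "container"/"codestream" labels are hypothetical names used only here.

```python
# Hypothetical sketch: byte values follow the JPEG 2000 file format
# (signature box, then a file type box); this is not Tika's own matcher.
JP2_SIGNATURE_BOX = bytes.fromhex("0000000c6a5020200d0a870a")
CODESTREAM_START = bytes.fromhex("ff4fff51")  # SOC marker followed by SIZ marker

# Brand field of the 'ftyp' box mapped to a media type (RFC 3745 names).
BRAND_TO_TYPE = {
    b"jp2 ": "image/jp2",
    b"jpx ": "image/jpx",  # JPX, commonly stored with a .jpf or .jpx extension
    b"jpm ": "image/jpm",
    b"mjp2": "video/mj2",
}


def identify_jpeg2000(head: bytes):
    """Return a label for the first bytes of a file, or None if not JPEG 2000."""
    if head.startswith(CODESTREAM_START):
        return "codestream"  # raw codestream with no container around it
    if not head.startswith(JP2_SIGNATURE_BOX):
        return None
    # The container matched once; the sub-format check only runs after it.
    if head[16:20] == b"ftyp":
        return BRAND_TO_TYPE.get(head[20:24], "container")
    return "container"


def make_sample(brand: bytes) -> bytes:
    """Build a minimal header: signature box plus a 20-byte 'ftyp' box."""
    ftyp = (20).to_bytes(4, "big") + b"ftyp" + brand + bytes(4) + brand
    return JP2_SIGNATURE_BOX + ftyp


if __name__ == "__main__":
    print(identify_jpeg2000(make_sample(b"jpm ")))
```

In procedural code the container check is written once and the brand lookup is nested under it. A flat list of independent magic rules has no such nesting, which may be why the container bytes end up repeated in every sub-format rule.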
[jira] [Commented] (TIKA-970) Full identification of the JPEG 2000 family of formats
[ https://issues.apache.org/jira/browse/TIKA-970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13428096#comment-13428096 ] Andrew Jackson commented on TIKA-970: - I should be able to sort that out. I know the author and I know that the project the work has been done under defaults to the Apache 2 licence. I've asked him to make the licensing on the magic files clear.
[jira] [Commented] (TIKA-970) Full identification of the JPEG 2000 family of formats
[ https://issues.apache.org/jira/browse/TIKA-970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13428108#comment-13428108 ] Andrew Jackson commented on TIKA-970: - I assume I'll need him to confirm an Apache 2 licence? Or are there compatible licences for derivative works?
[jira] [Commented] (TIKA-970) Full identification of the JPEG 2000 family of formats
[ https://issues.apache.org/jira/browse/TIKA-970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13428116#comment-13428116 ] Andrew Jackson commented on TIKA-970: - He's added the Apache licence here: https://github.com/bitsgalore/jp2kMagic/blob/master/magic/jpeg2000Magic It would still be handy to know if you'd accept similar derivatives of code under other licences in the future. Thanks.
[jira] [Created] (TIKA-900) Tika fails to detect ISO9660 disk images
Andrew Jackson created TIKA-900: --- Summary: Tika fails to detect ISO9660 disk images Key: TIKA-900 URL: https://issues.apache.org/jira/browse/TIKA-900 Project: Tika Issue Type: Bug Components: mime Affects Versions: 1.1 Environment: Any. Reporter: Andrew Jackson Priority: Minor I have been testing Tika's ability to identify ISO9660 disk image file systems, and discovered two problems. Firstly, the offset match matcher was wrong (37633 instead of 37633). Secondly, and more seriously, it was impossible for that signaure to ever match, because the default buffer size was far to small. It is currently set to 8KB, and as this signature is some 36KB into the file, Tika could never find the match. The attached patch fixes the magic, and extends the buffer to 64KB.
[jira] [Updated] (TIKA-900) Tika fails to detect ISO9660 disk images
[ https://issues.apache.org/jira/browse/TIKA-900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Jackson updated TIKA-900: Attachment: iso-image-detection.patch Patch to increase buffer size and fix ISO image detection.
[jira] [Updated] (TIKA-900) Tika fails to detect ISO9660 disk images
[ https://issues.apache.org/jira/browse/TIKA-900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Jackson updated TIKA-900: Attachment: (was: iso-image-detection.patch)
[jira] [Updated] (TIKA-900) Tika fails to detect ISO9660 disk images
[ https://issues.apache.org/jira/browse/TIKA-900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Jackson updated TIKA-900: Attachment: iso-image-detection.patch Patch to fix ISO image magic, and extended the buffer size so that the magic can match.
[jira] [Updated] (TIKA-900) Tika fails to detect ISO9660 disk images
[ https://issues.apache.org/jira/browse/TIKA-900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Jackson updated TIKA-900: Description: I have been testing Tika's ability to identify ISO9660 disk image file systems, and discovered two problems. Firstly, the offset match matcher was wrong (37633 instead of 32769). Secondly, and more seriously, it was impossible for that signaure to ever match, because the default buffer size was far to small. It is currently set to 8KB, and as this signature is some 36KB into the file, Tika could never find the match. The attached patch fixes the magic, and extends the buffer to 64KB. (was: I have been testing Tika's ability to identify ISO9660 disk image file systems, and discovered two problems. Firstly, the offset match matcher was wrong (37633 instead of 37633). Secondly, and more seriously, it was impossible for that signaure to ever match, because the default buffer size was far to small. It is currently set to 8KB, and as this signature is some 36KB into the file, Tika could never find the match. The attached patch fixes the magic, and extends the buffer to 64KB.) Fixing a typo.
[jira] [Commented] (TIKA-900) Tika fails to detect ISO9660 disk images
[ https://issues.apache.org/jira/browse/TIKA-900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13259615#comment-13259615 ] Andrew Jackson commented on TIKA-900: - I re-uploaded the patch as it had an extra format that is not necessary for this patch. Also I noticed a typo in my original issue description. The offset should be 32769, not 37633.
[jira] [Updated] (TIKA-900) Tika fails to detect ISO9660 disk images
[ https://issues.apache.org/jira/browse/TIKA-900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Jackson updated TIKA-900: Description: I have been testing Tika's ability to identify ISO9660 disk image file systems, and discovered two problems. Firstly, the offset match matcher was wrong (37633 instead of 32769). Secondly, and more seriously, it was impossible for that signaure to ever match, because the default buffer size was far too small. It is currently set to 8KB, and as this signature is some 36KB into the file, Tika could never find the match. The attached patch fixes the magic, and extends the buffer to 64KB. (was: I have been testing Tika's ability to identify ISO9660 disk image file systems, and discovered two problems. Firstly, the offset match matcher was wrong (37633 instead of 32769). Secondly, and more seriously, it was impossible for that signaure to ever match, because the default buffer size was far to small. It is currently set to 8KB, and as this signature is some 36KB into the file, Tika could never find the match. The attached patch fixes the magic, and extends the buffer to 64KB.)
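The buffer-size problem in TIKA-900 can be reproduced with a short sketch. It assumes the corrected offset of 32769 from the issue description, where an ISO9660 image carries the standard identifier "CD001" one byte into sector 16 (16 x 2048 = 32768). The functions `detect_iso9660` and `make_fake_image` are hypothetical helpers written for this illustration, not Tika code and not the attached patch.

```python
import io

ISO9660_MAGIC = b"CD001"
ISO9660_OFFSET = 32769  # corrected offset from the issue description


def detect_iso9660(stream, buffer_size: int) -> bool:
    """Look for the ISO9660 identifier using only the first buffer_size bytes."""
    head = stream.read(buffer_size)
    end = ISO9660_OFFSET + len(ISO9660_MAGIC)
    if len(head) < end:
        return False  # the signature lies beyond the bytes that were buffered
    return head[ISO9660_OFFSET:end] == ISO9660_MAGIC


def make_fake_image() -> bytes:
    """Build 40KB of zeros with a volume descriptor header at sector 16."""
    image = bytearray(40 * 1024)
    image[32768] = 1  # descriptor type byte (primary volume descriptor)
    image[ISO9660_OFFSET:ISO9660_OFFSET + len(ISO9660_MAGIC)] = ISO9660_MAGIC
    return bytes(image)


if __name__ == "__main__":
    data = make_fake_image()
    for size in (8 * 1024, 64 * 1024):
        print(size, detect_iso9660(io.BytesIO(data), size))
```

With an 8KB buffer the check can never succeed, however correct the magic definition is, because byte 32769 is never read. A 64KB buffer covers it with room to spare.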