[jira] [Commented] (TIKA-3992) Add common missing mimes based on Common Crawl data

2023-03-29 Thread Andrew Jackson (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17706400#comment-17706400
 ] 

Andrew Jackson commented on TIKA-3992:
--

Sounds interesting! Just wanted to note that Siegfried (and DROID/etc) 
signatures often require end-of-file matches as well as beginning-of-file, so 
if you do truncate the files you'll get the best results by chopping out the 
middle. I'd imagine the first and last few KB should do it.
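
A minimal Python sketch of the head-plus-tail truncation suggested above. The 4 KB default and the name `head_and_tail` are assumptions for illustration only; the comment itself just says "the first and last few KB".

```python
def head_and_tail(data: bytes, keep: int = 4096) -> bytes:
    """Keep the first and last `keep` bytes and drop the middle.

    Files no longer than 2 * keep bytes are returned unchanged, so the
    head and tail never overlap or duplicate content.
    """
    if len(data) <= 2 * keep:
        return data
    return data[:keep] + data[-keep:]
```

Because end-of-file signatures are matched relative to the end of the stream, a sample truncated this way should still expose both the leading and the trailing bytes to the identification tools, although any signature that looks deeper into the file than `keep` bytes would be missed.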

> Add common missing mimes based on Common Crawl data
> ---
>
> Key: TIKA-3992
> URL: https://issues.apache.org/jira/browse/TIKA-3992
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Major
>
> In the latest Common Crawl crawl, there are ~600k 'octet-stream' files as 
> detected by Tika.  It would be useful to extract those (even if truncated) 
> and run 'file' and 'siegfried' against those file types that are unknown to 
> Tika.  We can prioritize the most common file formats as identified by file 
> and siegfried for addition to our mime-types.xml.
> Separately, we might also want to do the same thing for 
> `application/zip`...there are likely zip-based file types that we could do a 
> better job on.
> Thanks to [~snagel] for a dump of stats on the most recent crawl.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-2632) Analyze unknown govdocs files

2018-04-17 Thread Andrew Jackson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16441077#comment-16441077
 ] 

Andrew Jackson commented on TIKA-2632:
--

It would be great to see the old PowerPoint sigs added to Tika, and AFAICT the 
false-positive rate for them is nothing to worry about (every matching file in 
our collection appears to be an old PowerPoint file).

FWIW I think elsewhere (PDF?) we've used version identifiers of the form:

{{application/vnd.ms-excel.sheet; version="2"}}

But that may not be a good idea if it will confuse clients into thinking they 
can parse it using the usual parsers.

> Analyze unknown govdocs files
> -
>
> Key: TIKA-2632
> URL: https://issues.apache.org/jira/browse/TIKA-2632
> Project: Tika
> Issue Type: Improvement
> Reporter: Andreas Meier
> Priority: Minor
>
> I recently started randomly analyzing govdocs1 files that could not be 
> recognized by TIKA properly.
>  
> This ticket should be used to identify problems with old or proprietary files 
> and to extend TIKA step-by-step if needed.
>  
> Stumbled across the following filetypes/files:
>  
> 1. Old PowerPoint files (I expect Version 2.0 or 3.0) are not recognized 
> properly:
> Found some mysterious files starting with 0xeddead0b and 0x0baddeed
> Turned out that someone else already investigated this case a month ago:
> [link 
> http://anjackson.net/2018/03/15/story-of-a-bad-deed/|http://anjackson.net/2018/03/15/story-of-a-bad-deed/]
> The files are old PowerPoint. (PowerPoint 3.0 or 2.0)
> I think these magic strings should be added to tika-mimetypes.xml, along with 
> another PowerPoint mime-type (maybe application/vnd.ms-powerpoint.2 or 
> application/vnd.ms-powerpoint.3?).
> Example files in govdocs1: 
> 144/144504.unk
> 272/272490.unk
> 430/430427.unk
> (several more...)
> 2. Proprietary File Format: SigmaPlot Exchange File .jxf:
> Magic: 0x000c4a5846
> Example file in govdocs1:
> 975/975382.unk
> 975/975383.unk
>  (several more...)
> 3. There are two old Excel file types which are not recognized at the moment 
> (application/vnd.ms-excel.sheet.2):
> 376/376222.unk and 622/62252.unk start with 0x0900040007001000 instead of 
> 0x090004001000
> 224/224485.unk and 615/615187.unk start with  0x0900040002001000 instead of 
> 0x090004001000
> The magic for application/vnd.ms-excel.sheet.2 should be adapted:
> 0x02001000
> and
> 0x07001000
> must be added.
> Furthermore we have to check whether the parser can be adapted to process all 
> the mentioned files.
> (LibreOffice can open all of these files)
> 4. Special Header/Wrapper in front of application/vnd.ms-excel.sheet.3
> In file 611/611703.unk I found a 128-byte header in front of the Excel 
> file.
> Therefore the file could not be recognized correctly by TIKA.
> After I cut the header, the file could be recognized and converted by TIKA.
> 5. SAS Data file
> Example file:
> 020/020505.unk
> 6. AirSar Data (Airborne Synthetic Aperture Radar)
> Example file:
> 348/349489.unk (several more...)
> 7. Advanced Data Format (ADF)
> Used in CGNS (CFD General Notation System .cgns)
> Example file:
> 363/363966.unk
> 8. Unknown Microsoft Word Document
> Example file:
> 202/202718.unk
> (Recognized as Microsoft Word Document by Linux Magic)
> 9. Unknown PowerPoint 3.0 file?
> Example file:
> 388/388212.unk
> 10. Microsoft Compound File Binary File Format?
> Example file
> 857/857353.unk
> Let me know if I should open a separate ticket for case 1. and 3.!
> If there is any better place (except the mailing lists) to publish the 
> analysis results let me know.
>  
> Regards
>  
> Andreas
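
The magic values listed in the quoted report can be checked with a simple prefix test. The sketch below is in Python and is only an illustration: the byte patterns are taken from the report, but the labels, the `sniff` function and its `skip` parameter are assumptions and do not reflect how Tika's detector is implemented.

```python
# Magic prefixes copied from the report above; the labels are descriptive only.
MAGICS = [
    (bytes.fromhex("eddead0b"), "old PowerPoint (2.0 or 3.0)"),
    (bytes.fromhex("0baddeed"), "old PowerPoint (2.0 or 3.0)"),
    (bytes.fromhex("000c4a5846"), "SigmaPlot Exchange File (.jxf)"),
    (bytes.fromhex("0900040002001000"), "application/vnd.ms-excel.sheet.2"),
    (bytes.fromhex("0900040007001000"), "application/vnd.ms-excel.sheet.2"),
]


def sniff(header: bytes, skip: int = 0):
    """Return the label of the first magic that matches at offset `skip`.

    `skip` is a hypothetical knob for wrapped files, such as the 128-byte
    header described in case 4. Returns None when nothing matches.
    """
    body = header[skip:]
    for magic, label in MAGICS:
        if body.startswith(magic):
            return label
    return None
```

For example, `sniff(data)` on a file starting with 0x0baddeed returns the PowerPoint label, and `sniff(data, skip=128)` looks past a 128-byte wrapper before matching.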



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TIKA-1678) PDF metadata extraction fails to spot UTF-16 encoded title

2015-07-21 Thread Andrew Jackson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14635032#comment-14635032
 ] 

Andrew Jackson commented on TIKA-1678:
--

Sorry for the delay. Here are the results:

* title starts with \376\377: 252,903 out of 21,204,500 PDFs.
* title starts with \377: 0 out of 21,204,500 PDFs.
* title starts with \357: 0 out of 21,204,500 PDFs.

There is a tiny handful of mixed-up oddities, that look like this:

{code}
{
"url":"http://www.praksis.gr/assets/files/h_PRAKSIS_sto_big_march.pdf",
"wayback_date":20141205021311,
"title":"(Microsoft Word - 
\\323\\365\\354\\354\\345\\364\\357\\367\\336 \\364\\347\\362 PRAKSIS 
\\363\\364\\357 BIG MARCH _1_)",
"generator":["PScript5.dll Version 5.2.2",
  "GPL Ghostscript 8.15"]},
{code}

(see the original here: 
http://web.archive.org/web/20150721122710/http://www.praksis.gr/assets/files/h_PRAKSIS_sto_big_march.pdf)

But these are such minor exceptions I don't think it's worth pursuing. 

> PDF metadata extraction fails to spot UTF-16 encoded title
> --
>
> Key: TIKA-1678
> URL: https://issues.apache.org/jira/browse/TIKA-1678
> Project: Tika
> Issue Type: Bug
> Components: metadata
> Affects Versions: 1.9
> Reporter: Andrew Jackson
> Priority: Minor


[jira] [Commented] (TIKA-1678) PDF metadata extraction fails to spot UTF-16 encoded title

2015-07-15 Thread Andrew Jackson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14627960#comment-14627960
 ] 

Andrew Jackson commented on TIKA-1678:
--

As far as I can tell, the PDF spec seems to imply that when you encode UTF-16 
data into text strings in the PDF document, you escape non-ASCII characters 
into octal, so \U0042 is treated as two separate bytes, and encoded as \000 and 
then B. Ick.
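
To make the encoding described above concrete, here is a rough Python sketch that reverses it: it undoes the backslash escapes of a PDF literal string and, if the result starts with the \376\377 byte-order mark, decodes the rest as big-endian UTF-16. The function names are hypothetical, line-continuation escapes are not handled, and Latin-1 is used as a stand-in for PDFDocEncoding, so treat this as an illustration rather than a faithful parser.

```python
import re

_OCTAL = re.compile(rb"[0-7]{1,3}")
# Single-character escapes allowed in PDF literal strings.
_SIMPLE = {
    ord("n"): 0x0A, ord("r"): 0x0D, ord("t"): 0x09,
    ord("b"): 0x08, ord("f"): 0x0C,
    ord("("): 0x28, ord(")"): 0x29, ord("\\"): 0x5C,
}


def unescape_pdf_literal(raw: bytes) -> bytes:
    """Undo backslash escapes in the body of a PDF literal string."""
    out = bytearray()
    i = 0
    while i < len(raw):
        if raw[i] != 0x5C:  # not a backslash: copy the byte through
            out.append(raw[i])
            i += 1
            continue
        i += 1
        if i >= len(raw):  # dangling backslash at the end: drop it
            break
        m = _OCTAL.match(raw, i)
        if m:  # \ddd octal escape, one to three digits
            out.append(int(m.group(), 8) & 0xFF)
            i = m.end()
        else:  # named escape, or an unknown one (backslash is ignored)
            out.append(_SIMPLE.get(raw[i], raw[i]))
            i += 1
    return bytes(out)


def decode_pdf_text(raw: bytes) -> str:
    """Decode a PDF text string: UTF-16BE if it has the FE FF mark."""
    data = unescape_pdf_literal(raw)
    if data.startswith(b"\xfe\xff"):
        return data[2:].decode("utf-16-be")
    return data.decode("latin-1")  # assumption: stand-in for PDFDocEncoding
```

With this, the author field from the example document, `rb"\376\377\000T\000e\000t\000t\000i"`, decodes to `Tetti`.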

> PDF metadata extraction fails to spot UTF-16 encoded title
> --
>
> Key: TIKA-1678
> URL: https://issues.apache.org/jira/browse/TIKA-1678
> Project: Tika
> Issue Type: Bug
> Components: metadata
> Affects Versions: 1.9
> Reporter: Andrew Jackson
> Priority: Minor


[jira] [Created] (TIKA-1678) PDF metadata extraction fails to spot UTF-16 encoded data

2015-07-14 Thread Andrew Jackson (JIRA)
Andrew Jackson created TIKA-1678:


 Summary: PDF metadata extraction fails to spot UTF-16 encoded data
 Key: TIKA-1678
 URL: https://issues.apache.org/jira/browse/TIKA-1678
 Project: Tika
  Issue Type: Bug
  Components: metadata
Affects Versions: 1.9
Reporter: Andrew Jackson
Priority: Minor


When extracting metadata from PDFs, we see some odd behaviour for a minority of 
the documents. The PDF metadata can be encoded as UTF-16 octets, but is not 
always being decoded as such.

A specific example is here: 
http://mqug.org.uk/downloads/201207/201207%20-%20TEC02%20-%20Introduction%20to%20Worklight.pdf

Which contains this (literal file content):

{noformat}
443 0 obj
<</Type/Metadata
/Subtype/XML/Length 1978>>stream
<?xpacket begin='' id='W5M0MpCehiHzreSzNTczkc9d'?>
<?adobe-xap-filters esc="CRLF"?>
<x:xmpmeta xmlns:x='adobe:ns:meta/' x:xmptk='XMP toolkit 2.9.1-13, framework 
1.6'>
<rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#' 
xmlns:iX='http://ns.adobe.com/iX/1.0/'>
<rdf:Description rdf:about='ac9f232e-d341-11e1--ba905bfc4694' 
xmlns:pdf='http://ns.adobe.com/pdf/1.3/' 
pdf:Producer='\376\377\000B\000u\000l\000l\000z\000i\000p\000 
\000P\000D\000F\000 \000P\000r\000i\000n\000t\000e\000r\000 \000/\000 
\000w\000w\000w\000.\000b\000u\000l\000l\000z\000i\000p\000.\000c\000o\000m\000 
\000/\000 \000F\000r\000e\000e\000w\000a\000r\000e\000 
\000E\000d\000i\000t\000i\000o\000n'/>
<rdf:Description rdf:about='ac9f232e-d341-11e1--ba905bfc4694' 
xmlns:xmp='http://ns.adobe.com/xap/1.0/'><xmp:ModifyDate>2012-07-18T15:38:01+01:00</xmp:ModifyDate>
<xmp:CreateDate>2012-07-18T15:38:01+01:00</xmp:CreateDate>
<xmp:CreatorTool>UnknownApplication</xmp:CreatorTool></rdf:Description>
<rdf:Description rdf:about='ac9f232e-d341-11e1--ba905bfc4694' 
xmlns:xapMM='http://ns.adobe.com/xap/1.0/mm/' 
xapMM:DocumentID='ac9f232e-d341-11e1--ba905bfc4694'/>
<rdf:Description rdf:about='ac9f232e-d341-11e1--ba905bfc4694' 
xmlns:dc='http://purl.org/dc/elements/1.1/' 
dc:format='application/pdf'><dc:title><rdf:Alt><rdf:li 
xml:lang='x-default'>\376\377\000M\000i\000c\000r\000o\000s\000o\000f\000t\000 
\000P\000o\000w\000e\000r\000P\000o\000i\000n\000t\000 \000-\000 
\000I\000n\000t\000r\000o\000d\000u\000c\000t\000i\000o\000n\000 \000t\000o\000 
\000W\000o\000r\000k\000l\000i\000g\000h\000t\000 
\000\(\000S\000R\000D\000\)\000.\000p\000p\000t\000x</rdf:li></rdf:Alt></dc:title><dc:creator><rdf:Seq><rdf:li>\376\377\000T\000e\000t\000t\000i</rdf:li></rdf:Seq></dc:creator></rdf:Description>
</rdf:RDF>
</x:xmpmeta>


<?xpacket end='w'?>
endstream
endobj
2 0 obj
<</Producer(\376\377\000B\000u\000l\000l\000z\000i\000p\000 \000P\000D\000F\000 
\000P\000r\000i\000n\000t\000e\000r\000 \000/\000 
\000w\000w\000w\000.\000b\000u\000l\000l\000z\000i\000p\000.\000c\000o\000m\000 
\000/\000 \000F\000r\000e\000e\000w\000a\000r\000e\000 
\000E\000d\000i\000t\000i\000o\000n)
/CreationDate(D:20120718153801+01'00')
/ModDate(D:20120718153801+01'00')
/Title(\376\377\000M\000i\000c\000r\000o\000s\000o\000f\000t\000 
\000P\000o\000w\000e\000r\000P\000o\000i\000n\000t\000 \000-\000 
\000I\000n\000t\000r\000o\000d\000u\000c\000t\000i\000o\000n\000 \000t\000o\000 
\000W\000o\000r\000k\000l\000i\000g\000h\000t\000 
\000\(\000S\000R\000D\000\)\000.\000p\000p\000t\000x)
/Author(\376\377\000T\000e\000t\000t\000i)>>endobj
{noformat} 

Presumably, embedding these UTF-16 octet sequences in the XMP RDF is an error, 
but the ones encoded in the actual PDF metadata fields should be extracted 
accurately.

When extracted, we get:
{noformat}
...
dc:title: \376\377\000M\000i\000c\000r\000o\000s\000o\000f\000t\000 
\000P\000o\000w\000e\000r\000P\000o\000i\000n\000t\000 \000-\000 
\000I\000n\000t\000r\000o\000d\000u\000c\000t\000i\000o\000n\000 \000t\000o\000 
\000W\000o\000r\000k\000l\000i\000g\000h\000t\000 
\000\(\000S\000R\000D\000\)\000.\000p\000p\000t\000x
title: \376\377\000M\000i\000c\000r\000o\000s\000o\000f\000t\000 
\000P\000o\000w\000e\000r\000P\000o\000i\000n\000t\000 \000-\000 
\000I\000n\000t\000r\000o\000d\000u\000c\000t\000i\000o\000n\000 \000t\000o\000 
\000W\000o\000r\000k\000l\000i\000g\000h\000t\000 
\000\(\000S\000R\000D\000\)\000.\000p\000p\000t\000x
meta:author: \376\377\000T\000e\000t\000t\000i
meta:author: Tetti
...
{noformat}

So, the author appears to be decoded correctly once, but the title is not. Is 
the XML dc:title being used to override the PDF title field? Or is one of the 
title fields being decoded incorrectly?

(I accept that although this is a real PDF document from the web, it is also a 
malformed one, so maybe there is not much to be done here.)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TIKA-1678) PDF metadata extraction fails to spot UTF-16 encoded title

2015-07-14 Thread Andrew Jackson (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Jackson updated TIKA-1678:
-
Summary: PDF metadata extraction fails to spot UTF-16 encoded title  (was: 
PDF metadata extraction fails to spot UTF-16 encoded data)

> PDF metadata extraction fails to spot UTF-16 encoded title
> --
>
> Key: TIKA-1678
> URL: https://issues.apache.org/jira/browse/TIKA-1678
> Project: Tika
> Issue Type: Bug
> Components: metadata
> Affects Versions: 1.9
> Reporter: Andrew Jackson
> Priority: Minor


[jira] [Commented] (TIKA-1154) Tika hangs on format detection of malformed HTML file.

2015-03-19 Thread Andrew Jackson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14368858#comment-14368858
 ] 

Andrew Jackson commented on TIKA-1154:
--

Yes, thanks - that's the behaviour I'd hoped for.

> Tika hangs on format detection of malformed HTML file.
> --
>
> Key: TIKA-1154
> URL: https://issues.apache.org/jira/browse/TIKA-1154
> Project: Tika
> Issue Type: Bug
> Components: mime
> Affects Versions: 1.4
> Reporter: Andrew Jackson
> Priority: Minor
> Attachments: tika-breaker.html
>
>
> We are using Tika on large web archives, which also happen to contain some 
> malformed files. In particular, we found a HTML file with binary characters 
> in the DOCTYPE declaration. This hangs Tika, either embedded or from the 
> command line, during format detection.
> An example file is attached.





[jira] [Updated] (TIKA-1486) Minor issues with the Tika MIME type magic file

2014-11-27 Thread Andrew Jackson (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Jackson updated TIKA-1486:
-
Attachment: tika-mime-info-extensions-namespace.patch

The attached patch adds a namespace declaration for the Tika extensions to the 
MIME Info tags. 

> Minor issues with the Tika MIME type magic file
> ---
>
> Key: TIKA-1486
> URL: https://issues.apache.org/jira/browse/TIKA-1486
> Project: Tika
> Issue Type: Improvement
> Components: detector
> Affects Versions: 1.6
> Reporter: Andrew Jackson
> Priority: Minor
> Attachments: tika-mime-info-extensions-namespace.patch
>
>
> I've started running some routine tests on format information held in a 
> number of tools, including 
> [Tika|http://www.digipres.org/formats/sources/tika/issues/]. This uncovered a 
> number of minor issues when working with the tika-mimetypes.xml file:
> * Duplicate MIME type application/gzip-compressed for type application/gzip.
> * Duplicate MIME type image/vnd.dwg for type image/vnd.dwg.
> * Error when parsing XML: Namespace prefix tika on link is not defined, line 
> 169, column 15
> * Format application/dita+xml;format=task has itself as a supertype!
> * Glob '^owl$' for entry application/rdf+xml does not appear to be a valid 
> filename specification.
> * Glob '^rdf$' for entry application/rdf+xml does not appear to be a valid 
> filename specification.
> With the last two, it's really a matter of consistency. The other 
> full-filename globs do *not* use the ^ and $ start and end markers, but owl 
> and rdf do.





[jira] [Commented] (TIKA-1302) Let's run Tika against a large batch of docs nightly

2014-11-26 Thread Andrew Jackson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14226415#comment-14226415
 ] 

Andrew Jackson commented on TIKA-1302:
--

We have two more sets of data. One is the same as the 1996-2010 stuff, but from 
2010 to April 2013, and for each item a copy can generally be accessed via the 
Internet Archive. We are planning to extend our indexing to the entire 
1996-2013 dataset soon, but in reality it's going to be a few months yet due to 
technical difficulties and other priorities. The second set of data runs from 
2013 onwards, and due to the legal constraints on that material cannot be made 
available. However, for the next year or two, most of it will still be 
available on the live web, so that's the fallback option. That material has 
been indexed (although with an older Tika version), but we're going to re-index 
that too shortly, so we should also be able to make that available. (n.b. 
'shortly' still means weeks or months!)

Both of these data sets are large and contain more large files. There were c. 2 
billion resources in the 1996-2010 chunk, and there are 1.5-2 billion in the 
2010-2013 chunk, and over 2 billion per year since then, and in contrast to the 
early material, we do not limit the size per resource. So that should be 
interesting.

However, it would be good to run against a broader range of material, given 
that I stop Tika from recursively processing ZIPs etc. and that web archives 
are rather weak on A/V files, systems files, software, etc. I'm not aware of a 
good A/V corpus, but on the systems and software side, there are the system 
images [also held at digitalcorpora.org|http://digitalcorpora.org/] and the 
[various files used by a RedHat dev to regression test the 'file' 
command|https://fedorahosted.org/file-tests/]. There is also [this small corpus 
of example files|https://github.com/openpreserve/format-corpus] that I have 
been contributing to lately, the [evolt browser 
archive|http://browsers.evolt.org/] and the [disktype filesystem image 
samples|http://disktype.cvs.sourceforge.net/viewvc/disktype/file-system-sampler/].

> Let's run Tika against a large batch of docs nightly
> --
>
> Key: TIKA-1302
> URL: https://issues.apache.org/jira/browse/TIKA-1302
> Project: Tika
> Issue Type: Improvement
> Components: cli, general, server
> Reporter: Tim Allison
> Attachments: wayback_exception_summaries.xlsx
>
>
> Many thanks to [~lewismc] for TIKA-1301!  Once we get nightly builds up and 
> running again, it might be fun to run Tika regularly against a large set of 
> docs and report metrics.
> One excellent candidate corpus is govdocs1: 
> http://digitalcorpora.org/corpora/files.
> Any other candidate corpora?  
> [~willp-bl], have anything handy you'd like to contribute? 
> [http://www.openplanetsfoundation.org/blogs/2014-03-21-tika-ride-characterising-web-content-nanite]
>  ;) 





[jira] [Created] (TIKA-1486) Minor issues with the Tika MIME type magic file

2014-11-25 Thread Andrew Jackson (JIRA)
Andrew Jackson created TIKA-1486:


 Summary: Minor issues with the Tika MIME type magic file
 Key: TIKA-1486
 URL: https://issues.apache.org/jira/browse/TIKA-1486
 Project: Tika
  Issue Type: Improvement
  Components: detector
Affects Versions: 1.6
Reporter: Andrew Jackson
Priority: Minor


I've started running some routine tests on format information held in a number 
of tools, including 
[Tika|http://www.digipres.org/formats/sources/tika/issues/]. This uncovered a 
number of minor issues when working with the tika-mimetypes.xml file:

* Duplicate MIME type application/gzip-compressed for type application/gzip.
* Duplicate MIME type image/vnd.dwg for type image/vnd.dwg.
* Error when parsing XML: Namespace prefix tika on link is not defined, line 
169, column 15
* Format application/dita+xml;format=task has itself as a supertype!
* Glob '^owl$' for entry application/rdf+xml does not appear to be a valid 
filename specification.
* Glob '^rdf$' for entry application/rdf+xml does not appear to be a valid 
filename specification.

With the last two, it's really a matter of consistency. The other full-filename 
globs do *not* use the ^ and $ start and end markers, but owl and rdf do.
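
For illustration, checks of the kind listed above could be sketched in Python as follows. This is not the actual test code behind the digipres.org report: the `lint` function, the message wording and the inline `SAMPLE` document are all hypothetical, and the sample only mimics the reported problems using the element names found in a MIME-info style file (`mime-type`, `alias`, `sub-class-of`, `glob`).

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample that reproduces the kinds of issue reported above.
SAMPLE = """<mime-info>
  <mime-type type="application/gzip">
    <alias type="application/gzip-compressed"/>
    <alias type="application/gzip-compressed"/>
  </mime-type>
  <mime-type type="image/vnd.dwg">
    <alias type="image/vnd.dwg"/>
  </mime-type>
  <mime-type type="application/dita+xml;format=task">
    <sub-class-of type="application/dita+xml;format=task"/>
  </mime-type>
  <mime-type type="application/rdf+xml">
    <glob pattern="^rdf$"/>
    <glob pattern="*.rdf"/>
  </mime-type>
</mime-info>"""


def lint(xml_text: str) -> list:
    """Report duplicate type names, self-supertypes and anchored globs."""
    problems = []
    seen = Counter()
    root = ET.fromstring(xml_text)
    for mt in root.iter("mime-type"):
        name = mt.get("type")
        # Count the type's own name together with all of its aliases.
        seen.update([name] + [a.get("type") for a in mt.findall("alias")])
        for sup in mt.findall("sub-class-of"):
            if sup.get("type") == name:
                problems.append(f"{name} has itself as a supertype")
        for glob in mt.findall("glob"):
            pattern = glob.get("pattern", "")
            if pattern.startswith("^") or pattern.endswith("$"):
                problems.append(f"glob {pattern!r} for {name} uses regex anchors")
    for type_name, count in seen.items():
        if count > 1:
            problems.append(f"duplicate MIME type {type_name}")
    return problems
```

Calling `lint(SAMPLE)` returns one message per problem planted in the sample; running it against the real tika-mimetypes.xml would first require the file to parse, which is where the undeclared `tika` prefix gets in the way.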





[jira] [Commented] (TIKA-1486) Minor issues with the Tika MIME type magic file

2014-11-25 Thread Andrew Jackson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14224714#comment-14224714
 ] 

Andrew Jackson commented on TIKA-1486:
--

There's no problem with adding an XML namespace in principle - I'm not using a 
MIME-info specific parser or anything. It's just that because the namespace is 
not declared, the document is not 
[namespace-well-formed|http://stackoverflow.com/questions/14871752/is-xml-document-with-undeclared-prefix-well-formed],
 and this upsets some parsers. It's not critical - it just makes it harder to 
parse the document with an off-the-shelf XML parser configuration.
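
To make the well-formedness point concrete, here is a minimal Python sketch using the standard library's namespace-aware ElementTree parser. Both XML fragments are made up for illustration, and the namespace URI in the second one is a placeholder rather than anything Tika defines.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: the "tika" prefix is used but never declared.
UNDECLARED = "<mime-info><tika:link>http://example.org/</tika:link></mime-info>"

# The same fragment with an xmlns:tika declaration (placeholder URI).
DECLARED = (
    '<mime-info xmlns:tika="http://example.org/tika-ns">'
    "<tika:link>http://example.org/</tika:link></mime-info>"
)


def parses_ok(xml_text):
    """Return True if the namespace-aware parser accepts the document."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


print(parses_ok(UNDECLARED))
print(parses_ok(DECLARED))
```

Only the declaration differs between the two inputs; the element content is identical.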

On the globs, is there a functional difference between the ^rdf$ and rdf 
globs? If not, I'll just configure my analyser to strip out the ^ and $.



[jira] [Commented] (TIKA-1486) Minor issues with the Tika MIME type magic file

2014-11-25 Thread Andrew Jackson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14224745#comment-14224745
 ] 

Andrew Jackson commented on TIKA-1486:
--

A-ha! I didn't notice the {{isregex=true}} attribute - thank you! I'll modify 
my parser accordingly.
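
A minimal sketch of how an analyser might honour that attribute, written in Python. The function name `glob_matches` is hypothetical, and this is only one plausible reading of the flag, not Tika's own matching code.

```python
import fnmatch
import re


def glob_matches(filename, pattern, isregex=False):
    """Match a filename against a glob entry, honouring an isregex flag."""
    if isregex:
        # Treat the pattern as a regular expression, e.g. "^owl$".
        return re.search(pattern, filename) is not None
    # Otherwise treat it as a shell-style glob, e.g. "*.rdf".
    return fnmatch.fnmatchcase(filename, pattern)


print(glob_matches("owl", "^owl$", isregex=True))
print(glob_matches("data.rdf", "*.rdf"))
```

With the flag set, `^owl$` matches only a file named exactly `owl`; without it, the same string is treated as a literal glob and does not match.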

FWIW, you don't need to make a schema to use a namespace, and it does not need 
to resolve to anything. But as I say, it's not crucial - I suppose all XML 
parsers can be configured to ignore the issue.

Thanks again.



[jira] [Commented] (TIKA-1302) Let's run Tika against a large batch of docs nightly

2014-11-13 Thread Andrew Jackson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14209757#comment-14209757
 ] 

Andrew Jackson commented on TIKA-1302:
--

[~talli...@apache.org] I've created a download folder on our own site, and 
included a dump of about 1/8th of the SAX errors, here: 
http://www.webarchive.org.uk/datasets/ukwa.ds.2/for-tika/

Looking through the SAX exceptions, they do seem to be from resources that are 
identified as XML (application/*xml) by Tika. i.e. the exceptions do *not* seem 
to be coming from malformed HTML, which is consistent with the standard Tika 
configuration you described above (which I can confirm is what we ran with).

Unfortunately, I can't recover the full stack traces from that run, and it's 
not clear if we'll be able to do that in the future because of the way we're 
doing the indexing, but we'll look at it and hopefully be able to record the 
full error in the future. For now, you'll have to re-run the source item 
through Tika to reproduce the error - sorry about that.



[jira] [Comment Edited] (TIKA-1302) Let's run Tika against a large batch of docs nightly

2014-11-13 Thread Andrew Jackson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14209757#comment-14209757
 ] 

Andrew Jackson edited comment on TIKA-1302 at 11/13/14 1:42 PM:


[~talli...@apache.org] I've created a download folder on our own site, and 
included a dump of about 1/8th of the SAX errors, here: 
http://www.webarchive.org.uk/datasets/ukwa.ds.2/for-tika/

Looking through the SAX exceptions, they do seem to be from resources that are 
identified as XML (application/\*xml) by Tika. i.e. the exceptions do *not* 
seem to be coming from malformed HTML, which is consistent with the standard 
Tika configuration you described above (which I can confirm is what we ran 
with).

Unfortunately, I can't recover the full stack traces from that run, and it's 
not clear if we'll be able to do that in the future because of the way we're 
doing the indexing, but we'll look at it and hopefully be able to record the 
full error in the future. For now, you'll have to re-run the source item 
through Tika to reproduce the error - sorry about that.




[jira] [Commented] (TIKA-1302) Let's run Tika against a large batch of docs nightly

2014-10-28 Thread Andrew Jackson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14186718#comment-14186718
 ] 

Andrew Jackson commented on TIKA-1302:
--

Shall I go ahead and extract the XML errors? Or would you rather I waited until 
we've re-run with the new version that will catch the permanent hangs and 
regenerate all the data?



[jira] [Commented] (TIKA-1302) Let's run Tika against a large batch of docs nightly

2014-10-21 Thread Andrew Jackson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14178361#comment-14178361
 ] 

Andrew Jackson commented on TIKA-1302:
--

Okay, so the c.300,000 exceptions are here: 
https://www.dropbox.com/s/ka19fguaxflp725/parse_errors.csv.gz?dl=0 - let me 
know if you'd like it placed elsewhere (it's 14MB of compressed CSV).

This conversation has helped me spot a gap in our code. We currently do a 
Tika.detect() before we do a Tika.parse(), and only do the latter if the former 
succeeded. Sadly, the version of the code that I used to generate this data did 
not record the Tika exception for the .detect() step, only the .parse() step. 
This will explain why there are no hung-thread events in this result set - the 
interrupted .detect() was not recorded properly.  We'll be re-running this scan 
soonish, so I'll make sure the next version records all the exceptions. IIRC, 
from looking at the MIME types, the permanent hangs were mostly ZIPs, Office 
documents, and maybe some PDFs.
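
The gap described above can be sketched as follows, in Python for brevity. This is a minimal illustration of recording failures from both stages, not the indexing code itself: `process`, `fake_detect` and `fake_parse` are hypothetical names standing in for the real detection and parsing calls.

```python
def process(resource, detect, parse):
    """Run detect() then parse(), recording which stage failed, if any."""
    record = {"content_type": None, "stage": None, "error": None}
    try:
        record["content_type"] = detect(resource)
    except Exception as exc:
        # Record detection failures too, instead of silently skipping parse.
        record["stage"] = "detect"
        record["error"] = f"{type(exc).__name__}: {exc}"
        return record
    try:
        parse(resource)
    except Exception as exc:
        record["stage"] = "parse"
        record["error"] = f"{type(exc).__name__}: {exc}"
    return record


# Hypothetical stand-ins for the real detection and parsing calls.
def fake_detect(resource):
    if resource == b"":
        raise TimeoutError("detection interrupted")
    return "application/pdf"


def fake_parse(resource):
    if resource.startswith(b"bad"):
        raise ValueError("cannot parse")


print(process(b"", fake_detect, fake_parse))
print(process(b"bad bytes", fake_detect, fake_parse))
```

Keeping the stage alongside the error means an interrupted detection shows up in the results rather than vanishing from them.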

Note that the CSV includes the Content-Type from the .detect() step, and this 
should indicate which module was run on the resource (i.e. whatever the Tika 
1.5 mapping was for that MIME type). I don't think we changed the parse 
configuration significantly, so it seems HTML and XHTML and XML should all have 
gone through the HtmlParser (I'm not 100% sure about this, and will try to 
check).

I'm not sure it's worth giving you all the SAX exceptions, as there are a lot 
of repeats of the same problems. I think a random sample of about 50,000 should 
be plenty. Does that sound okay to you?



[jira] [Comment Edited] (TIKA-1302) Let's run Tika against a large batch of docs nightly

2014-10-21 Thread Andrew Jackson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14178361#comment-14178361
 ] 

Andrew Jackson edited comment on TIKA-1302 at 10/21/14 12:59 PM:
-

Okay, so the c.300,000 exceptions are here: 
https://www.dropbox.com/s/ka19fguaxflp725/parse_errors.csv.gz?dl=0 - let me 
know if you'd like it placed elsewhere (it's 14MB of compressed CSV).

This conversation has helped me spot a gap in our code. We currently do a 
Tika.detect() before we do a Tika.parse(), and only do the latter if the former 
succeeded. Sadly, the version of the code that I used to generate this data did 
not record the Tika exception for the .detect() step, only the .parse() step. 
This will explain why there are no hung-thread events in this result set - the 
interrupted .detect() was not recorded properly.  We'll be re-running this scan 
soonish, so I'll make sure the next version records all the exceptions. IIRC, 
from looking at the MIME types, the permanent hangs were mostly ZIPs, Office 
documents, and maybe some PDFs.

Note that the CSV includes the Content-Type from the .detect() step, and this 
should indicate which module was run on the resource (i.e. whatever the Tika 
1.5 mapping was for that MIME type). I don't think we changed the parse 
configuration significantly, so it seems HTML and XHTML and XML should all have 
gone through the HtmlParser (I'm not 100% sure about this, and will try to 
check).

I'm not sure it's worth giving you all the SAX exceptions, as there are a lot 
of repeats of the same problems. I think a random sample of about 50,000 should 
be plenty. Does that sound okay to you?

EDIT: Oh, and I meant to say, I'm glad to hear about [~gostep] and 
[~talli...@apache.org]'s efforts to run this on GovDocs, and would be 
interested in comparing results. We already publish format profile data about 
web archives, and would love to have more data to refer to.




[jira] [Commented] (TIKA-1302) Let's run Tika against a large batch of docs nightly

2014-10-20 Thread Andrew Jackson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14176892#comment-14176892
 ] 

Andrew Jackson commented on TIKA-1302:
--

At the UK Web Archive we run Apache Tika over all our collections (it's been 
run over about 4 billion resources so far). We record the results in Apache 
Solr, to act as a search facet, and we also collect the Exceptions that are 
thrown when Tika fails. We can't make the content available to you directly, 
but perhaps there are datasets we can produce that would be useful to you? e.g. 
would a list of the exceptions that we've seen (along with the URL to the 
resource that caused the exception) be of interest?



[jira] [Commented] (TIKA-1302) Let's run Tika against a large batch of docs nightly

2014-10-20 Thread Andrew Jackson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14176934#comment-14176934
 ] 

Andrew Jackson commented on TIKA-1302:
--

I have 2,358,167 errors from one collection (2 billion resources), but the 
majority are SAXParseExceptions. It's made up of UK web archive content from 
1996-2010, so there's lots of broken HTML/XML in there. If I strip out the 
SAXParseExceptions, there are just 317,548 miscellaneous errors, which are 
perhaps more interesting. 
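
For anyone wanting to reproduce that kind of split, here is a minimal Python sketch that tallies exception classes from a CSV in the five-column layout used below. The three rows are made up for illustration, `tally_exceptions` is a hypothetical helper, and the sketch assumes commas inside fields are escaped as `\,`.

```python
import csv
import io
from collections import Counter

# Made-up rows in the five-column layout named by the header line.
SAMPLE = io.StringIO(
    "wayback_date,url,content_length,content_type_tika,parse_error\n"
    "20100101000000,http://example.org/a,10,application/xml,"
    "org.xml.sax.SAXParseException: The entity nbsp was referenced\\, but not declared.\n"
    "20100101000001,http://example.org/b,20,application/pdf,"
    "java.io.IOException: Error: Expected a long type\n"
    "20100101000002,http://example.org/c,30,text/html,"
    "org.xml.sax.SAXParseException: Example message.\n"
)


def tally_exceptions(handle):
    """Count rows per exception class (the text before the first colon)."""
    reader = csv.DictReader(handle, escapechar="\\", quoting=csv.QUOTE_NONE)
    counts = Counter()
    for row in reader:
        exc_class = row["parse_error"].split(":", 1)[0].strip()
        counts[exc_class] += 1
    return counts


counts = tally_exceptions(SAMPLE)
# Everything that is not a SAXParseException.
other = sum(n for cls, n in counts.items() if not cls.endswith("SAXParseException"))
print(counts.most_common())
print("non-SAX errors:", other)
```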

Here's an example including the SAX exceptions:
{code:none}
wayback_date,url,content_length,content_type_tika,parse_error
20100713041445,http://www.expedia.co.uk:80/pub/agent.dll/qscr=dspv/nojs=1/htid=2737187,org.xml.sax.SAXParseException: The markup in the document following the root element must be well-formed.
20091017141202,http://www.expedia.co.uk:80/pub/agent.dll/qscr=dspv/nojs=1/htid=34830/crti=4/hotel-pictures,org.xml.sax.SAXParseException: Open quote is expected for attribute ID associated with an  element type  COMMENT.
20091017143741,http://www.madfun.co.uk:80/-10?ref=31,org.xml.sax.SAXParseException: The markup in the document following the root element must be well-formed.
20061020021825,http://reservations.talkingcities.co.uk:80/nexres/hotels/map_hotels.cgi?hid=10055548map_only=yestype=overview,org.xml.sax.SAXParseException: The markup in the document following the root element must be well-formed.
2006102004,http://www.ravensportal.co.uk:80/forum/index.php?PHPSESSID=1688184d9bb881cfc73600b1670ecaf5amp;type=rss;action=.xml,org.xml.sax.SAXParseException: The character reference must end with the ';' delimiter.
20101227142905,http://www.etc-online.co.uk:80/style4.asp?pn=coursessn=26,org.xml.sax.SAXParseException: The markup in the document following the root element must be well-formed.
20060926015856,http://www.qca.org.uk/4412.html,org.xml.sax.SAXParseException: The entity nbsp was referenced\, but not declared.
20040827075658,http://users.ox.ac.uk:80/~sedm1731/Work/Ex%20parte%20St%20Germain.doc,java.lang.ArrayIndexOutOfBoundsException: -1
20030124193820,http://www.mgcars.org.uk:80/cgi-bin/gen5?runprog=portercov=mode=buyo=4854130936code=9123cu=,org.xml.sax.SAXParseException: The element type META must be terminated by the matching end-tag </META>.
20100121205831,http://www.epupz.co.uk:80/clas/viewdetails.asp?view=307389,org.xml.sax.SAXParseException: The entity name must immediately follow the '&' in the entity reference.
{code}
...and for the others...
{code:none}
wayback_date,url,content_length,content_type_tika,parse_error
20100928070438,http://redtyger.co.uk/discuss/projectexternal.php,7524,application/rss+xml,java.lang.NullPointerException: null
20040827075658,http://users.ox.ac.uk:80/~sedm1731/Work/Ex%20parte%20St%20Germain.doc,44997,application/msword,java.lang.ArrayIndexOutOfBoundsException: -1
20060303154606,http://www.dfes.gov.uk:80/rsgateway/DB/SFR/s000286/sfr37-2001.doc,562004,application/msword,java.lang.IllegalArgumentException: Position 698368 past the end of the file
20041225033311,http://members.lycos.co.uk:80/worldofradio/distance.pdf,57891,application/pdf,org.apache.pdfbox.exceptions.CryptographyException: Error: The supplied password does not match either the owner or user password in the document.
20041121095540,http://scom.hud.ac.uk:80/scomzl/conference/chenhua/040528_01E/PDP2148.pdf,191115,application/pdf,java.io.IOException: Error: Expected a long type\, actual='25#0/'
20041121095849,http://scom.hud.ac.uk:80/scomzl/conference/chenhua/040528_01E/SER2549.pdf,157148,application/pdf,java.util.zip.DataFormatException: oversubscribed literal/length tree
2004112115,http://scom.hud.ac.uk:80/scomzl/conference/chenhua/040528_01E/MSV_Foreword.pdf,12773,application/pdf,java.util.zip.DataFormatException: oversubscribed dynamic bit lengths tree
20060925090249,http://www2.rgu.ac.uk/library_edocs/resource/exam/0405engineering/EN3581%20OFFSHORE%20ENGINEERING.pdf,1684742,application/pdf,org.apache.pdfbox.exceptions.CryptographyException: Error: The supplied password does not match either the owner or user password in the document.
20060925091406,http://www2.rgu.ac.uk/library_edocs/resource/exam/0304engineering/EE31060304s1.pdf,149238,application/pdf,org.apache.pdfbox.exceptions.CryptographyException: Error: The supplied password does not match either the owner or user password in the document.
20040612212128,http://www.swhst.org.uk:80/Linked%20Files/spr%20contact%20addresses.xls,23040,application/vnd.ms-excel,org.apache.poi.EncryptedDocumentException: Default password is invalid for docId/saltData/saltHash
2005183952,http://freeweb.co.uk:80/show_nw.php?ref=258target=Bshow=affPHPSESSID=a150a130c58fcea048866fb965ef7dfb,232436,text/html; charset=iso-8859-1,org.apache.tika.sax.SecureContentHandler$SecureSAXException: Suspected zip bomb: 100 levels of XML element nesting
{code}

[jira] [Commented] (TIKA-1232) Add PDF version to PDFParser output

2014-09-08 Thread Andrew Jackson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14125384#comment-14125384
 ] 

Andrew Jackson commented on TIKA-1232:
--

Looks like this is fixed and in the 1.6 release - thank you. Can the 'Fix 
version' on this ticket be updated accordingly?

 Add PDF version to PDFParser output
 ---

 Key: TIKA-1232
 URL: https://issues.apache.org/jira/browse/TIKA-1232
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.5
 Environment: JDK6
Reporter: William Palmer
Assignee: Tim Allison
Priority: Minor
 Attachments: Sample 10.x.pdf, Sample 11.x PDFA-1b.pdf, Sample 
 4.x.pdf, Sample 5.x.pdf, Sample 6.x.pdf, Sample 7.x.pdf, Sample 8.x.pdf, 
 Sample 9.x.pdf, TIKA-1232v1.patch, TIKA-1232v2.patch, pdfversion.patch, 
 testComment.pdf


 I'd like to identify the PDF version of files, this is not currently reported 
 by the PDFParser although the information is available via PDFBox.  I have 
 attached a patch that adds the format version to the Metadata object.
 However, I am not familiar enough with the Tika source to know if an 
 alternative metadata key should be used, or this new one added.
 Comments welcome.





[jira] [Commented] (TIKA-1232) Add PDF version to PDFParser output

2014-03-05 Thread Andrew Jackson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13920698#comment-13920698
 ] 

Andrew Jackson commented on TIKA-1232:
--

Does anyone have a copy of Acrobat 9.1? That version uses Adobe Extension Level 
5, so we'd need that to get the full set of recent versions. I'll have a dig 
around for suitable files for the versions that aren't covered yet, but most of 
the stuff I have access to is not re-licensable.



[jira] [Commented] (TIKA-1232) Add PDF version to PDFParser output

2014-02-21 Thread Andrew Jackson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13908402#comment-13908402
 ] 

Andrew Jackson commented on TIKA-1232:
--

Going by my original intention, I would prefer the one additional 
dc:format to be of the form:

{code}
application/pdf; version=1.4
application/pdf; version=A-1a
application/pdf; version=1.7 Adobe Extension Level 3
{code}
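
For client code that needs to cope with values of this form, a minimal Python sketch of splitting the base type from its parameters might look like this. The helper name `split_media_type` is hypothetical, and the parsing is deliberately simple: it does not handle quoted parameter values.

```python
def split_media_type(value):
    """Split 'type/subtype; key=value' into (base_type, params)."""
    base, _, rest = value.partition(";")
    params = {}
    for part in rest.split(";"):
        if "=" in part:
            key, _, val = part.partition("=")
            params[key.strip().lower()] = val.strip()
    return base.strip().lower(), params


for example in (
    "application/pdf; version=1.4",
    "application/pdf; version=A-1a",
    "application/pdf; version=1.7 Adobe Extension Level 3",
):
    print(split_media_type(example))
```

Comparing on the base type alone keeps existing checks working, while the version stays available to callers that want it.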



[jira] [Commented] (TIKA-1154) Tika hangs on format detection of malformed HTML file.

2014-02-13 Thread Andrew Jackson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13900156#comment-13900156
 ] 

Andrew Jackson commented on TIKA-1154:
--

I've had no response on the metadata-extractor issue I raised. Not sure how to 
proceed with this, and it's continuing to cause us problems.

 Tika hangs on format detection of malformed HTML file.
 --

 Key: TIKA-1154
 URL: https://issues.apache.org/jira/browse/TIKA-1154
 Project: Tika
  Issue Type: Bug
  Components: mime
Affects Versions: 1.4
Reporter: Andrew Jackson
Priority: Minor
 Attachments: tika-breaker.html


 We are using Tika on large web archives, which also happen to contain some 
 malformed files. In particular, we found a HTML file with binary characters 
 in the DOCTYPE declaration. This hangs Tika, either embedded or from the 
 command line, during format detection.
 An example file is attached.





[jira] [Commented] (TIKA-1232) Add PDF version to PDFParser output

2014-02-07 Thread Andrew Jackson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13894376#comment-13894376
 ] 

Andrew Jackson commented on TIKA-1232:
--

Great!

For (1), very happy for that code to go to PDFBox. I'm pretty sure PDFBox 
doesn't already do anything along those lines, but I am not all that familiar 
with that codebase so it's worth checking first.

As for (2), I've only tested on a fairly small number of PDFs because only the 
more recent versions of the Adobe tools actually make use of extension levels, 
and even then, only when necessary. I ran that code against a web archive corpus 
containing around 2 billion resources, including many millions of PDFs, but 
because that dataset only ran up to 2010, I found a grand total of eight PDFs 
that used Adobe Extension Level 3. It worked fine on those!

Finally, on the metadata property scheme, I feel the 'right place' is as a 
parameter on the Content Type, but I accept that may confuse client code (i.e. 
people assuming type.equals(application/pdf) should always work, even though 
that would be no good for other types like HTML due to the charset parameter). 

Note that the parameter approach also allows you to do version detection in 
Tika's 
[custom-mimetypes.xml|https://github.com/openplanets/nanite/blob/master/nanite-core/src/main/resources/org/apache/tika/mime/custom-mimetypes.xml#L357],
 which I find rather handy. Of course, you are also welcome to take any of 
those signatures if they are of interest.



[jira] [Commented] (TIKA-1232) Add PDF version to PDFParser output

2014-02-05 Thread Andrew Jackson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13892210#comment-13892210
 ] 

Andrew Jackson commented on TIKA-1232:
--

Yes, you can't identify PDF versions beyond 1.7 or the PDF/A variants unless 
you do a bit more parsing. In case it helps, here's the code I wrote to do that 
(and also extract other metadata of interest to me):

https://github.com/openplanets/nanite/blob/master/nanite-ext/src/main/java/uk/bl/wa/tika/parser/pdf/pdfbox/PDFParser.java#L253

I couldn't do what I wanted by sub-classing the Tika code, so I copied the 
PDFParser and augmented it. If there is interest in taking this code into Tika 
I'd be willing to spend time turning it into a proper patch.
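
For context, the version declared in the file header is the easy part. The sketch below is Python, not the Java code linked above, and the function name `pdf_header_version` is hypothetical. It reads only the `%PDF-x.y` header, which is why it cannot see extension levels or PDF/A conformance: as far as I know those are recorded elsewhere in the file, so they need a real parser. Searching the first 1024 bytes rather than only offset 0 is an assumption about how lenient a reader should be.

```python
import re

# Matches the "%PDF-x.y" header token, e.g. "%PDF-1.4".
HEADER_RE = re.compile(rb"%PDF-(\d)\.(\d)")


def pdf_header_version(data):
    """Return the version declared in the %PDF-x.y header, or None."""
    match = HEADER_RE.search(data[:1024])
    if match is None:
        return None
    major = match.group(1).decode("ascii")
    minor = match.group(2).decode("ascii")
    return "{}.{}".format(major, minor)
```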

As for how to record the result, this is definitely not the 
Application-Version. A modern version of Adobe Distiller can output various 
versions of PDF, because it chooses the version of the format based on the 
needs of the current document. i.e. if a document only requires PDF 1.4 
features, it will output a PDF 1.4 and not just default to the latest version 
(AFAICT).

My preference would be to use a version parameter on the content type. It's not 
a formally standardised approach, but has been adopted in a few places (e.g. 
[Java plugin 
versions|http://docs.oracle.com/javase/7/docs/technotes/guides/plugin/developer_guide/faq/basics.html#version])

In this case, you'd have something like:

{quote}
application/pdf; version=1.4
application/pdf; version=1.7 Adobe Extension Level 5
etc...
{quote}

although to avoid causing trouble for code that relies on the 'Content-Type' 
property, I have so far chosen to use a new property for this purpose (called 
'Extended-Content-Type'). 
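
A minimal sketch of that arrangement, in Python with a plain dict standing in for the metadata object (the function names are hypothetical, and this is not the Tika Metadata API): the existing 'Content-Type' value is left untouched, and the versioned form goes into the separate 'Extended-Content-Type' property.

```python
def extended_content_type(base_type, version=None):
    """Build a content type string with an optional version parameter."""
    if version is None:
        return base_type
    return "{}; version={}".format(base_type, version)


def record_version(metadata, base_type, version):
    """Store the plain type and the versioned type under separate keys."""
    metadata["Content-Type"] = base_type
    metadata["Extended-Content-Type"] = extended_content_type(base_type, version)
    return metadata
```

Code that compares 'Content-Type' against "application/pdf" keeps working, while callers that care about the version can read the extended property.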



[jira] [Commented] (TIKA-1170) Insufficiently specific magic for binary image/cgm files

2013-09-03 Thread Andrew Jackson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13756577#comment-13756577
 ] 

Andrew Jackson commented on TIKA-1170:
--

I'm not sure that commit is right.  I see this in trunk:

{code}
  <match value="BEGMF" type="string" offset="0"/>
  <match value="0x0020" mask="0xffe0" type="string" offset="0"/>
  <match value="0x0020" mask="0xffe0" type="string" offset="0">
    <match value="0x10220001" type="string" offset="2:64"/>
    <match value="0x10220002" type="string" offset="2:64"/>
    <match value="0x10220003" type="string" offset="2:64"/>
    <match value="0x10220004" type="string" offset="2:64"/>
  </match>
{code}

That is exactly what I did *not* wish to have, as files that successfully match 
using only this line:

{code}
  <match value="0x0020" mask="0xffe0" type="string" offset="0"/>
{code}

will lead to the false-positives I've been seeing. This is why I wanted to make 
the magic more specific, using the form:

{code}
  <match value="BEGMF" type="string" offset="0"/>
  <match value="0x0020" mask="0xffe0" type="string" offset="0">
    <match value="0x10220001" type="string" offset="2:64"/>
    <match value="0x10220002" type="string" offset="2:64"/>
    <match value="0x10220003" type="string" offset="2:64"/>
    <match value="0x10220004" type="string" offset="2:64"/>
  </match>
{code}

Could we have that instead, please?
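
To make the difference between the two forms concrete, here is a rough Python sketch of the matching logic (an illustration, not Tika's matcher). It assumes a mask match means `(bytes & mask) == value` and that `offset=2:64` means the nested value may start at any offset from 2 to 64 inclusive; the function names are hypothetical.

```python
# The four version markers used by the nested clauses: 0x10220001..0x10220004.
VERSION_MARKERS = [bytes.fromhex("1022000" + str(v)) for v in (1, 2, 3, 4)]


def masked_prefix_matches(data):
    """True if the first two bytes, masked with 0xffe0, equal 0x0020."""
    if len(data) < 2:
        return False
    return (int.from_bytes(data[:2], "big") & 0xFFE0) == 0x0020


def loose_match(data):
    """The form in trunk: the masked clause is enough on its own."""
    return data.startswith(b"BEGMF") or masked_prefix_matches(data)


def strict_match(data):
    """The requested form: the masked clause also needs a version marker."""
    if data.startswith(b"BEGMF"):
        return True
    if not masked_prefix_matches(data):
        return False
    # A 4-byte marker starting at offsets 2..64 lies within bytes 2..67.
    window = data[2:68]
    return any(marker in window for marker in VERSION_MARKERS)
```

A file that merely starts with 0x002a passes `loose_match` but fails `strict_match`, which is the false-positive case described above.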

 Insufficiently specific magic for binary image/cgm files
 

 Key: TIKA-1170
 URL: https://issues.apache.org/jira/browse/TIKA-1170
 Project: Tika
  Issue Type: Bug
  Components: mime
Affects Versions: 1.4
Reporter: Andrew Jackson
Assignee: Ray Gauss II
Priority: Minor
 Fix For: 1.5

 Attachments: 0001-Added-CGM-test-file-test-and-improved-magic.patch, 
 plotutils-example.cgm


 I've been running Tika against a large corpus of web archive files, and I'm 
 seeing a number of false positives for image/cgm. The Tika magic is
 {code}
   <match value="BEGMF" type="string" offset="0"/>
   <match value="0x0020" mask="0xffe0" type="string" offset="0"/>
 {code}
 The issue seems to be that the second magic matcher is not very specific, 
 e.g. matching files that start 0x002a. To be fair, this is only c.700 false 
 matches out of 300 million resources, but it would be nice if this could be 
 tightened up. 
 Looking at the PRONOM signatures
 * http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=1048&strPageToDisplay=signatures
 * http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=1049&strPageToDisplay=signatures
 * http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=1050&strPageToDisplay=signatures
 * http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=1051&strPageToDisplay=signatures
 it seems we have a variable position marker that changes slightly for each 
 version. Therefore, a more robust signature should be:
 {code}
   <match value="BEGMF" type="string" offset="0"/>
   <match value="0x0020" mask="0xffe0" type="string" offset="0">
     <match value="0x10220001" type="string" offset="2:64"/>
     <match value="0x10220002" type="string" offset="2:64"/>
     <match value="0x10220003" type="string" offset="2:64"/>
     <match value="0x10220004" type="string" offset="2:64"/>
   </match>
 {code}
 Where I have assumed the filename part of the CGM file will be less than 64 
 characters long.
 Could this magic be considered for inclusion?
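
As background on why the marker moves, here is a hedged Python sketch that reads the version directly instead of scanning a window. It rests on my understanding of the binary CGM encoding, not on anything stated in this thread: each command starts with a 16-bit big-endian header holding a 4-bit class, a 7-bit element id and a 5-bit parameter length, with parameters padded to an even byte count. Under that reading, 0x0020 masked with 0xffe0 is BEGIN METAFILE with any name length, and 0x1022 is METAFILE VERSION with a 2-byte parameter. The sketch also assumes METAFILE VERSION immediately follows BEGIN METAFILE and skips the long-form length case; the function names are hypothetical.

```python
import struct


def decode_command_header(word):
    """Split a 16-bit command header into (class, element id, length)."""
    element_class = (word >> 12) & 0xF
    element_id = (word >> 5) & 0x7F
    param_length = word & 0x1F
    return element_class, element_id, param_length


def binary_cgm_version(data):
    """Return the metafile version of a binary CGM, or None if not found."""
    if len(data) < 2:
        return None
    (first,) = struct.unpack(">H", data[:2])
    element_class, element_id, length = decode_command_header(first)
    if (element_class, element_id) != (0, 1):
        return None  # does not start with BEGIN METAFILE
    if length == 31:
        return None  # long-form parameter list, not handled in this sketch
    # Skip the metafile name, padded to an even number of bytes.
    next_pos = 2 + length + (length % 2)
    if len(data) < next_pos + 4:
        return None
    header, version = struct.unpack(">HH", data[next_pos:next_pos + 4])
    if header != 0x1022:
        return None  # next element is not METAFILE VERSION
    return version
```

A file starting 0x002a decodes as BEGIN METAFILE with a 10-byte name, but unless 0x1022 follows that name the function returns None, which matches the intent of the tightened magic.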

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (TIKA-1170) Insufficiently specific magic for binary image/cgm files

2013-09-03 Thread Andrew Jackson (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Jackson updated TIKA-1170:
-

Attachment: 0002-Added-example-malformed-HTML-file-that-was-being-mis.patch

This additional patch adds a realistic test file and an appropriate test.

 Insufficiently specific magic for binary image/cgm files
 

 Key: TIKA-1170
 URL: https://issues.apache.org/jira/browse/TIKA-1170
 Project: Tika
  Issue Type: Bug
  Components: mime
Affects Versions: 1.4
Reporter: Andrew Jackson
Assignee: Ray Gauss II
Priority: Minor
 Fix For: 1.5

 Attachments: 0001-Added-CGM-test-file-test-and-improved-magic.patch, 
 0002-Added-example-malformed-HTML-file-that-was-being-mis.patch, 
 plotutils-example.cgm




[jira] [Commented] (TIKA-1170) Insufficiently specific magic for binary image/cgm files

2013-09-03 Thread Andrew Jackson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13756981#comment-13756981
 ] 

Andrew Jackson commented on TIKA-1170:
--

Thanks, that's great. If you prefer, you should be able to tell SVN to treat a 
particular file as binary data by setting its svn:mime-type property, as per 
this Stack Overflow answer: http://stackoverflow.com/a/74017/6689

 Insufficiently specific magic for binary image/cgm files
 

 Key: TIKA-1170
 URL: https://issues.apache.org/jira/browse/TIKA-1170
 Project: Tika
  Issue Type: Bug
  Components: mime
Affects Versions: 1.4
Reporter: Andrew Jackson
Assignee: Ray Gauss II
Priority: Minor
 Fix For: 1.5

 Attachments: 0001-Added-CGM-test-file-test-and-improved-magic.patch, 
 0002-Added-example-malformed-HTML-file-that-was-being-mis.patch, 
 plotutils-example.cgm




[jira] [Commented] (TIKA-1170) Insufficiently specific magic for binary image/cgm files

2013-09-03 Thread Andrew Jackson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13757042#comment-13757042
 ] 

Andrew Jackson commented on TIKA-1170:
--

Fair point! Thanks for accepting the changes.

 Insufficiently specific magic for binary image/cgm files
 

 Key: TIKA-1170
 URL: https://issues.apache.org/jira/browse/TIKA-1170
 Project: Tika
  Issue Type: Bug
  Components: mime
Affects Versions: 1.4
Reporter: Andrew Jackson
Assignee: Ray Gauss II
Priority: Minor
 Fix For: 1.5

 Attachments: 0001-Added-CGM-test-file-test-and-improved-magic.patch, 
 0002-Added-example-malformed-HTML-file-that-was-being-mis.patch, 
 plotutils-example.cgm




[jira] [Updated] (TIKA-1170) Insufficiently specific magic for binary image/cgm files

2013-09-02 Thread Andrew Jackson (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Jackson updated TIKA-1170:
-

Summary: Insufficiently specific magic for binary image/cgm files  (was: 
Possibly erroneous magic for image/cgm files)

Changing title now I understand what's going on better.

 Insufficiently specific magic for binary image/cgm files
 

 Key: TIKA-1170
 URL: https://issues.apache.org/jira/browse/TIKA-1170
 Project: Tika
  Issue Type: Bug
  Components: mime
Affects Versions: 1.4
Reporter: Andrew Jackson
Priority: Minor



[jira] [Created] (TIKA-1170) Possibly erroneous magic for image/cgm files

2013-09-02 Thread Andrew Jackson (JIRA)
Andrew Jackson created TIKA-1170:


 Summary: Possibly erroneous magic for image/cgm files
 Key: TIKA-1170
 URL: https://issues.apache.org/jira/browse/TIKA-1170
 Project: Tika
  Issue Type: Bug
  Components: mime
Affects Versions: 1.4
Reporter: Andrew Jackson
Priority: Minor


I've been running Tika against a large corpus of web archive files, and I'm 
seeing a number of false positives for image/cgm. The Tika magic is
{code}
  <match value="BEGMF" type="string" offset="0"/>
  <match value="0x0020" mask="0xffe0" type="string" offset="0"/>
{code}
The issue seems to be that the second magic matcher is not very specific, e.g. 
matching files that start 0x002a. To be fair, this is only c.700 false matches 
out of 300 million resources, but it would be nice if this could be tightened 
up. 

Looking at the PRONOM signatures
* http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=1048&strPageToDisplay=signatures
* http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=1049&strPageToDisplay=signatures
* http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=1050&strPageToDisplay=signatures
* http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=1051&strPageToDisplay=signatures
it seems we have a variable position marker that changes slightly for each 
version. Therefore, a more robust signature should be:

{code}
  <match value="BEGMF" type="string" offset="0"/>
  <match value="0x0020" mask="0xffe0" type="string" offset="0">
    <match value="0x10220001" type="string" offset="2:64"/>
    <match value="0x10220002" type="string" offset="2:64"/>
    <match value="0x10220003" type="string" offset="2:64"/>
    <match value="0x10220004" type="string" offset="2:64"/>
  </match>
{code}

Where I have assumed the filename part of the CGM file will be less than 64 
characters long.

Could this magic be considered for inclusion?



[jira] [Commented] (TIKA-1170) Insufficiently specific magic for binary image/cgm files

2013-09-02 Thread Andrew Jackson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13756051#comment-13756051
 ] 

Andrew Jackson commented on TIKA-1170:
--

My corpus is a chunk of the Internet Archive, so you can look at the CGMs I'm 
finding:

* [all 
copies|http://web.archive.org/web/240100*/http://www.agocg.ac.uk/Graphics/CGM/RALCGM/sample.cgm],
 or a [specific copy| 
http://web.archive.org/web/2226055607/http://www.agocg.ac.uk/Graphics/CGM/RALCGM/sample.cgm].
** Those example files now seem to be at 
http://www.agocg.ac.uk/train/cgm/examples/cgmindex.htm
* or [this specific 
item|http://web.archive.org/web/20050223100939/http://wwwcms.brookes.ac.uk:80/webmsc2004/p00770/cgms/flyboat.cgm]
 from [this folder 
here|http://web.archive.org/web/20050112031156/http://wwwcms.brookes.ac.uk/webmsc2004/p00770/cgms/]
* I also found these, but have not checked if any are binary 
http://www.fileformat.info/format/cgm/sample/index.htm

Unfortunately, the licensing may not be clear in these cases, so these test 
files may not be suitable. If anyone knows of any software that can write 
binary CGM files, I'm willing to give it a go.

 Insufficiently specific magic for binary image/cgm files
 

 Key: TIKA-1170
 URL: https://issues.apache.org/jira/browse/TIKA-1170
 Project: Tika
  Issue Type: Bug
  Components: mime
Affects Versions: 1.4
Reporter: Andrew Jackson
Priority: Minor



[jira] [Updated] (TIKA-1170) Insufficiently specific magic for binary image/cgm files

2013-09-02 Thread Andrew Jackson (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Jackson updated TIKA-1170:
-

Attachment: plotutils-example.cgm

This is an example version 3 binary CGM file, generated using GNU plotutils.

 Insufficiently specific magic for binary image/cgm files
 

 Key: TIKA-1170
 URL: https://issues.apache.org/jira/browse/TIKA-1170
 Project: Tika
  Issue Type: Bug
  Components: mime
Affects Versions: 1.4
Reporter: Andrew Jackson
Priority: Minor
 Attachments: plotutils-example.cgm




[jira] [Commented] (TIKA-1170) Insufficiently specific magic for binary image/cgm files

2013-09-02 Thread Andrew Jackson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13756064#comment-13756064
 ] 

Andrew Jackson commented on TIKA-1170:
--

I was able to create an example file, using [GNU 
plotutils|http://www.gnu.org/software/plotutils/] ('brew install plotutils'), 
as per [these 
instructions|http://www.gnu.org/software/plotutils/manual/en/plotutils.html#graph]
{code}
graph -T cgm < datafile > plot.cgm
{code}
I'll attach an example.

 Insufficiently specific magic for binary image/cgm files
 

 Key: TIKA-1170
 URL: https://issues.apache.org/jira/browse/TIKA-1170
 Project: Tika
  Issue Type: Bug
  Components: mime
Affects Versions: 1.4
Reporter: Andrew Jackson
Priority: Minor
 Attachments: plotutils-example.cgm




[jira] [Updated] (TIKA-1170) Insufficiently specific magic for binary image/cgm files

2013-09-02 Thread Andrew Jackson (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Jackson updated TIKA-1170:
-

Attachment: 0001-Added-CGM-test-file-test-and-improved-magic.patch

Patch containing test file, test, and improved magic.

 Insufficiently specific magic for binary image/cgm files
 

 Key: TIKA-1170
 URL: https://issues.apache.org/jira/browse/TIKA-1170
 Project: Tika
  Issue Type: Bug
  Components: mime
Affects Versions: 1.4
Reporter: Andrew Jackson
Priority: Minor
 Attachments: 0001-Added-CGM-test-file-test-and-improved-magic.patch, 
 plotutils-example.cgm




[jira] [Created] (TIKA-1154) Tika hangs on format detection of malformed HTML file.

2013-07-25 Thread Andrew Jackson (JIRA)
Andrew Jackson created TIKA-1154:


 Summary: Tika hangs on format detection of malformed HTML file.
 Key: TIKA-1154
 URL: https://issues.apache.org/jira/browse/TIKA-1154
 Project: Tika
  Issue Type: Bug
  Components: mime
Affects Versions: 1.4
Reporter: Andrew Jackson
Priority: Minor


We are using Tika on large web archives, which also happen to contain some 
malformed files. In particular, we found an HTML file with binary characters in 
the DOCTYPE declaration. This hangs Tika, either embedded or from the command 
line, during format detection.

An example file is attached.




[jira] [Updated] (TIKA-1154) Tika hangs on format detection of malformed HTML file.

2013-07-25 Thread Andrew Jackson (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Jackson updated TIKA-1154:
-

Attachment: tika-breaker.html

This file makes Tika hang. If you remove both of the binary characters (0x02 
0x00), then it starts working again.
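
For anyone wanting to find similar files in a corpus, here is a small Python sketch (the function names and the sample bytes in the usage note are made up for illustration; the attached file is the real test case). It lists control bytes inside the DOCTYPE declaration and offers a blunt workaround that strips the two offending byte values from the whole input, which may not be appropriate for files where those bytes are legitimate.

```python
import re

# Match a DOCTYPE declaration up to its closing '>', across any bytes.
DOCTYPE_RE = re.compile(rb"<!DOCTYPE[^>]*>", re.IGNORECASE)
# Tab, line feed and carriage return are ordinary whitespace.
ALLOWED_CONTROLS = {0x09, 0x0A, 0x0D}


def control_bytes_in_doctype(data):
    """Return the control byte values found inside the DOCTYPE, if any."""
    match = DOCTYPE_RE.search(data)
    if match is None:
        return []
    return [b for b in match.group(0) if b < 0x20 and b not in ALLOWED_CONTROLS]


def strip_bytes(data, unwanted=(0x02, 0x00)):
    """Remove every occurrence of the unwanted byte values."""
    return bytes(b for b in data if b not in unwanted)
```

For a made-up input such as `b'<!DOCTYPE html PUBLIC "-//W3C\x02\x00//DTD">'`, the first function reports the two bytes and reports none after `strip_bytes` has been applied.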

 Tika hangs on format detection of malformed HTML file.
 --

 Key: TIKA-1154
 URL: https://issues.apache.org/jira/browse/TIKA-1154
 Project: Tika
  Issue Type: Bug
  Components: mime
Affects Versions: 1.4
Reporter: Andrew Jackson
Priority: Minor
 Attachments: tika-breaker.html


 We are using Tika on large web archives, which also happen to contain some 
 malformed files. In particular, we found a HTML file with binary characters 
 in the DOCTYPE declaration. This hangs Tika, either embedded or from the 
 command line, during format detection.
 An example file is attached.



[jira] [Commented] (TIKA-1154) Tika hangs on format detection of malformed HTML file.

2013-07-25 Thread Andrew Jackson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13719513#comment-13719513
 ] 

Andrew Jackson commented on TIKA-1154:
--

Thanks for the stack trace, which led me to this mailing list entry:

http://mail-archives.apache.org/mod_mbox/tika-dev/201011.mbox/%3c5afe4d67-0c49-4947-94ba-f9b1f64ee...@transpac.com%3E

which suggests that upgrading Xerces will fix this.

 Tika hangs on format detection of malformed HTML file.
 --

 Key: TIKA-1154
 URL: https://issues.apache.org/jira/browse/TIKA-1154
 Project: Tika
  Issue Type: Bug
  Components: mime
Affects Versions: 1.4
Reporter: Andrew Jackson
Priority: Minor
 Attachments: tika-breaker.html


 We are using Tika on large web archives, which also happen to contain some 
 malformed files. In particular, we found a HTML file with binary characters 
 in the DOCTYPE declaration. This hangs Tika, either embedded or from the 
 command line, during format detection.
 An example file is attached.



[jira] [Commented] (TIKA-1154) Tika hangs on format detection of malformed HTML file.

2013-07-25 Thread Andrew Jackson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13719594#comment-13719594
 ] 

Andrew Jackson commented on TIKA-1154:
--

We could exclude the package from coming in via the metadata-extractor 
dependency and include the later version as a top-level dependency, but if 
there have been significant API changes between 2.8.1 and 2.10.0 then this 
could cause problems.

I can submit an issue at 
https://code.google.com/p/metadata-extractor/issues/list and see if they're 
willing to upgrade?

 Tika hangs on format detection of malformed HTML file.
 --

 Key: TIKA-1154
 URL: https://issues.apache.org/jira/browse/TIKA-1154
 Project: Tika
  Issue Type: Bug
  Components: mime
Affects Versions: 1.4
Reporter: Andrew Jackson
Priority: Minor
 Attachments: tika-breaker.html


 We are using Tika on large web archives, which also happen to contain some 
 malformed files. In particular, we found a HTML file with binary characters 
 in the DOCTYPE declaration. This hangs Tika, either embedded or from the 
 command line, during format detection.
 An example file is attached.



[jira] [Commented] (TIKA-1154) Tika hangs on format detection of malformed HTML file.

2013-07-25 Thread Andrew Jackson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13719631#comment-13719631
 ] 

Andrew Jackson commented on TIKA-1154:
--

Okay, I submitted an issue here:

https://code.google.com/p/metadata-extractor/issues/detail?id=85

 Tika hangs on format detection of malformed HTML file.
 --

 Key: TIKA-1154
 URL: https://issues.apache.org/jira/browse/TIKA-1154
 Project: Tika
  Issue Type: Bug
  Components: mime
Affects Versions: 1.4
Reporter: Andrew Jackson
Priority: Minor
 Attachments: tika-breaker.html


 We are using Tika on large web archives, which also happen to contain some 
 malformed files. In particular, we found a HTML file with binary characters 
 in the DOCTYPE declaration. This hangs Tika, either embedded or from the 
 command line, during format detection.
 An example file is attached.



[jira] [Commented] (TIKA-970) Full identification of the JPEG 2000 family of formats

2012-08-06 Thread Andrew Jackson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13429426#comment-13429426
 ] 

Andrew Jackson commented on TIKA-970:
-

Hi, I noticed the updated version includes a bit more information. In 
particular, the 'image/jpm' format is declared to have an alias of 'video/jpm'. 
This doesn't appear to be a registered MIME type, and I've not come across it 
before. Have you got any more information on this video format?

 Full identification of the JPEG 2000 family of formats
 --

 Key: TIKA-970
 URL: https://issues.apache.org/jira/browse/TIKA-970
 Project: Tika
  Issue Type: New Feature
  Components: mime
Affects Versions: 1.3
Reporter: Andrew Jackson
Assignee: Jukka Zitting
Priority: Minor
 Fix For: 1.3

 Attachments: custom-mimetype.xml


 Please find attached a suitable set of magic definitions for allowing Tika to 
 identify JP2 containers, codestreams, and the JP2, JPF, JPM and MJ2 file 
 formats. It is based on the 'file' magic from 
 [here|https://github.com/bitsgalore/jp2kMagic], and has been tested against 
 the example files supplied on that site.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (TIKA-970) Full identification of the JPEG 2000 family of formats

2012-08-03 Thread Andrew Jackson (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Jackson updated TIKA-970:


Attachment: custom-mimetype.xml

 Full identification of the JPEG 2000 family of formats
 --

 Key: TIKA-970
 URL: https://issues.apache.org/jira/browse/TIKA-970
 Project: Tika
  Issue Type: New Feature
  Components: mime
Affects Versions: 1.3
Reporter: Andrew Jackson
Priority: Minor
 Attachments: custom-mimetype.xml


 Please find attached a suitable set of magic definitions for allowing Tika to 
 identify JP2 containers, codestreams, and the JP2, JPF, JPM and MJ2 file 
 formats. It is based on the 'file' magic from 
 [here|https://github.com/bitsgalore/jp2kMagic], and has been tested against 
 the example files supplied on that site.





[jira] [Commented] (TIKA-970) Full identification of the JPEG 2000 family of formats

2012-08-03 Thread Andrew Jackson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13428085#comment-13428085
 ] 

Andrew Jackson commented on TIKA-970:
-

BTW, this set of signatures rather clumsily repeats the overall container 
signature for each sub-format. I don't know if this can be avoided, but just 
removing the repeat and expecting the subclass relationship to work out the 
details did not seem to work reliably.
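
For readers unfamiliar with the layout, the following is a rough sketch of the alternative described above: check the shared container signature once, then branch on the brand in the file type box. It is not the attached custom-mimetype.xml and not how Tika evaluates magic; the function name and the two non-MIME labels are placeholders, and it assumes the 'ftyp' box immediately follows the 12-byte signature box.

```python
# 12-byte JPEG 2000 signature box shared by JP2, JPX/JPF, JPM and MJ2 files.
JP2_SIGNATURE_BOX = bytes.fromhex("0000000c6a5020200d0a870a")
# A raw codestream starts with the SOC marker followed by the SIZ marker.
J2K_CODESTREAM = bytes.fromhex("ff4fff51")

# Brand in the file type box mapped to the registered media type.
BRANDS = {
    b"jp2 ": "image/jp2",
    b"jpx ": "image/jpx",
    b"jpm ": "image/jpm",
    b"mjp2": "video/mj2",
}


def identify_jpeg2000(head):
    """Classify the first bytes of a file; returns None if not JPEG 2000."""
    if head.startswith(J2K_CODESTREAM):
        return "jp2-codestream"   # placeholder label
    if not head.startswith(JP2_SIGNATURE_BOX):
        return None
    # Box layout after the signature: 4-byte length, 4-byte type, 4-byte brand.
    if head[16:20] == b"ftyp":
        return BRANDS.get(head[20:24], "jp2-container")
    return "jp2-container"        # placeholder label


# Hypothetical header for a JPM file: signature box, then a 20-byte ftyp box.
jpm_head = JP2_SIGNATURE_BOX + (20).to_bytes(4, "big") + b"ftyp" + b"jpm " + bytes(8)
print(identify_jpeg2000(jpm_head))
```

In a flat list of magic rules, each sub-format has to restate the first twelve bytes because a rule cannot rely on another rule having matched first, which is the repetition noted above.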

 Full identification of the JPEG 2000 family of formats
 --

 Key: TIKA-970
 URL: https://issues.apache.org/jira/browse/TIKA-970
 Project: Tika
  Issue Type: New Feature
  Components: mime
Affects Versions: 1.3
Reporter: Andrew Jackson
Priority: Minor
 Attachments: custom-mimetype.xml


 Please find attached a suitable set of magic definitions for allowing Tika to 
 identify JP2 containers, codestreams, and the JP2, JPF, JPM and MJ2 file 
 formats. It is based on the 'file' magic from 
 [here|https://github.com/bitsgalore/jp2kMagic], and has been tested against 
 the example files supplied on that site.





[jira] [Commented] (TIKA-970) Full identification of the JPEG 2000 family of formats

2012-08-03 Thread Andrew Jackson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13428096#comment-13428096
 ] 

Andrew Jackson commented on TIKA-970:
-

I should be able to sort that out. I know the author and I know that the 
project the work has been done under defaults to the Apache 2 licence. I've 
asked him to make the licensing on the magic files clear.

 Full identification of the JPEG 2000 family of formats
 --

 Key: TIKA-970
 URL: https://issues.apache.org/jira/browse/TIKA-970
 Project: Tika
  Issue Type: New Feature
  Components: mime
Affects Versions: 1.3
Reporter: Andrew Jackson
Priority: Minor
 Attachments: custom-mimetype.xml


 Please find attached a suitable set of magic definitions for allowing Tika to 
 identify JP2 containers, codestreams, and the JP2, JPF, JPM and MJ2 file 
 formats. It is based on the 'file' magic from 
 [here|https://github.com/bitsgalore/jp2kMagic], and has been tested against 
 the example files supplied on that site.





[jira] [Commented] (TIKA-970) Full identification of the JPEG 2000 family of formats

2012-08-03 Thread Andrew Jackson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13428108#comment-13428108
 ] 

Andrew Jackson commented on TIKA-970:
-

I assume I'll need him to confirm an Apache 2 licence? Or are there compatible 
licences for derivative works?

 Full identification of the JPEG 2000 family of formats
 --

 Key: TIKA-970
 URL: https://issues.apache.org/jira/browse/TIKA-970
 Project: Tika
  Issue Type: New Feature
  Components: mime
Affects Versions: 1.3
Reporter: Andrew Jackson
Priority: Minor
 Attachments: custom-mimetype.xml


 Please find attached a suitable set of magic definitions for allowing Tika to 
 identify JP2 containers, codestreams, and the JP2, JPF, JPM and MJ2 file 
 formats. It is based on the 'file' magic from 
 [here|https://github.com/bitsgalore/jp2kMagic], and has been tested against 
 the example files supplied on that site.





[jira] [Commented] (TIKA-970) Full identification of the JPEG 2000 family of formats

2012-08-03 Thread Andrew Jackson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13428116#comment-13428116
 ] 

Andrew Jackson commented on TIKA-970:
-

He's added the Apache licence here: 
https://github.com/bitsgalore/jp2kMagic/blob/master/magic/jpeg2000Magic

It would still be handy to know if you'd accept similar derivatives of code 
under other licences in the future.

Thanks.

 Full identification of the JPEG 2000 family of formats
 --

 Key: TIKA-970
 URL: https://issues.apache.org/jira/browse/TIKA-970
 Project: Tika
  Issue Type: New Feature
  Components: mime
Affects Versions: 1.3
Reporter: Andrew Jackson
Priority: Minor
 Attachments: custom-mimetype.xml


 Please find attached a suitable set of magic definitions for allowing Tika to 
 identify JP2 containers, codestreams, and the JP2, JPF, JPM and MJ2 file 
 formats. It is based on the 'file' magic from 
 [here|https://github.com/bitsgalore/jp2kMagic], and has been tested against 
 the example files supplied on that site.





[jira] [Created] (TIKA-900) Tika fails to detect ISO9660 disk images

2012-04-23 Thread Andrew Jackson (JIRA)
Andrew Jackson created TIKA-900:
---

 Summary: Tika fails to detect ISO9660 disk images
 Key: TIKA-900
 URL: https://issues.apache.org/jira/browse/TIKA-900
 Project: Tika
  Issue Type: Bug
  Components: mime
Affects Versions: 1.1
 Environment: Any.
Reporter: Andrew Jackson
Priority: Minor


I have been testing Tika's ability to identify ISO9660 disk image file systems, 
and discovered two problems. Firstly, the offset match matcher was wrong (37633 
instead of 37633). Secondly, and more seriously, it was impossible for that 
signature to ever match, because the default buffer size was far to small. It is 
currently set to 8KB, and as this signature is some 36KB into the file, Tika 
could never find the match. The attached patch fixes the magic, and extends the 
buffer to 64KB.
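
The buffer-size problem can be shown with a small sketch. It assumes the corrected offset of 32769 given later in this thread (byte 1 of the volume descriptor at sector 16, with 2048-byte sectors) and the standard "CD001" identifier. The helper names and the in-memory test image are made up, and this is not Tika's detection code.

```python
import io

SECTOR_SIZE = 2048                     # ISO 9660 logical sector size
MAGIC_OFFSET = 16 * SECTOR_SIZE + 1    # 32769: byte 1 of the first volume descriptor
MAGIC = b"CD001"                       # standard identifier


def looks_like_iso9660(stream, buffer_size):
    """True if the magic is visible within the first buffer_size bytes."""
    head = stream.read(buffer_size)
    return head[MAGIC_OFFSET:MAGIC_OFFSET + len(MAGIC)] == MAGIC


def make_fake_iso():
    """Build a tiny in-memory image: 16 empty sectors, then one descriptor sector."""
    descriptor = b"\x01" + MAGIC + b"\x01" + b"\x00" * (SECTOR_SIZE - 7)
    return b"\x00" * (16 * SECTOR_SIZE) + descriptor


image = make_fake_iso()
print(looks_like_iso9660(io.BytesIO(image), 8 * 1024))    # magic lies beyond an 8KB buffer
print(looks_like_iso9660(io.BytesIO(image), 64 * 1024))   # a 64KB buffer reaches it
```

Any detector that only inspects a fixed-size prefix has this limitation: a signature at offset N can match only if the prefix is at least N plus the signature length.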





[jira] [Updated] (TIKA-900) Tika fails to detect ISO9660 disk images

2012-04-23 Thread Andrew Jackson (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Jackson updated TIKA-900:


Attachment: iso-image-detection.patch

Patch to increase buffer size and fix ISO image detection.

 Tika fails to detect ISO9660 disk images
 

 Key: TIKA-900
 URL: https://issues.apache.org/jira/browse/TIKA-900
 Project: Tika
  Issue Type: Bug
  Components: mime
Affects Versions: 1.1
 Environment: Any.
Reporter: Andrew Jackson
Priority: Minor
 Attachments: iso-image-detection.patch


 I have been testing Tika's ability to identify ISO9660 disk image file 
 systems, and discovered two problems. Firstly, the offset match matcher was 
 wrong (37633 instead of 37633). Secondly, and more seriously, it was 
 impossible for that signature to ever match, because the default buffer size 
 was far to small. It is currently set to 8KB, and as this signature is some 
 36KB into the file, Tika could never find the match. The attached patch fixes 
 the magic, and extends the buffer to 64KB.





[jira] [Updated] (TIKA-900) Tika fails to detect ISO9660 disk images

2012-04-23 Thread Andrew Jackson (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Jackson updated TIKA-900:


Attachment: (was: iso-image-detection.patch)

 Tika fails to detect ISO9660 disk images
 

 Key: TIKA-900
 URL: https://issues.apache.org/jira/browse/TIKA-900
 Project: Tika
  Issue Type: Bug
  Components: mime
Affects Versions: 1.1
 Environment: Any.
Reporter: Andrew Jackson
Priority: Minor
 Attachments: iso-image-detection.patch


 I have been testing Tika's ability to identify ISO9660 disk image file 
 systems, and discovered two problems. Firstly, the offset match matcher was 
 wrong (37633 instead of 37633). Secondly, and more seriously, it was 
 impossible for that signature to ever match, because the default buffer size 
 was far to small. It is currently set to 8KB, and as this signature is some 
 36KB into the file, Tika could never find the match. The attached patch fixes 
 the magic, and extends the buffer to 64KB.





[jira] [Updated] (TIKA-900) Tika fails to detect ISO9660 disk images

2012-04-23 Thread Andrew Jackson (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Jackson updated TIKA-900:


Attachment: iso-image-detection.patch

Patch to fix the ISO image magic and extend the buffer size so that the magic 
can match.

 Tika fails to detect ISO9660 disk images
 

 Key: TIKA-900
 URL: https://issues.apache.org/jira/browse/TIKA-900
 Project: Tika
  Issue Type: Bug
  Components: mime
Affects Versions: 1.1
 Environment: Any.
Reporter: Andrew Jackson
Priority: Minor
 Attachments: iso-image-detection.patch


 I have been testing Tika's ability to identify ISO9660 disk image file 
 systems, and discovered two problems. Firstly, the offset match matcher was 
 wrong (37633 instead of 37633). Secondly, and more seriously, it was 
 impossible for that signature to ever match, because the default buffer size 
 was far to small. It is currently set to 8KB, and as this signature is some 
 36KB into the file, Tika could never find the match. The attached patch fixes 
 the magic, and extends the buffer to 64KB.





[jira] [Updated] (TIKA-900) Tika fails to detect ISO9660 disk images

2012-04-23 Thread Andrew Jackson (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Jackson updated TIKA-900:


Description: I have been testing Tika's ability to identify ISO9660 disk 
image file systems, and discovered two problems. Firstly, the offset match 
matcher was wrong (37633 instead of 32769). Secondly, and more seriously, it 
was impossible for that signature to ever match, because the default buffer size 
was far to small. It is currently set to 8KB, and as this signature is some 
36KB into the file, Tika could never find the match. The attached patch fixes 
the magic, and extends the buffer to 64KB.  (was: I have been testing Tika's 
ability to identify ISO9660 disk image file systems, and discovered two 
problems. Firstly, the offset match matcher was wrong (37633 instead of 37633). 
Secondly, and more seriously, it was impossible for that signature to ever 
match, because the default buffer size was far to small. It is currently set to 
8KB, and as this signature is some 36KB into the file, Tika could never find 
the match. The attached patch fixes the magic, and extends the buffer to 64KB.)

Fixing a typo.

 Tika fails to detect ISO9660 disk images
 

 Key: TIKA-900
 URL: https://issues.apache.org/jira/browse/TIKA-900
 Project: Tika
  Issue Type: Bug
  Components: mime
Affects Versions: 1.1
 Environment: Any.
Reporter: Andrew Jackson
Priority: Minor
 Attachments: iso-image-detection.patch


 I have been testing Tika's ability to identify ISO9660 disk image file 
 systems, and discovered two problems. Firstly, the offset match matcher was 
 wrong (37633 instead of 32769). Secondly, and more seriously, it was 
 impossible for that signature to ever match, because the default buffer size 
 was far to small. It is currently set to 8KB, and as this signature is some 
 36KB into the file, Tika could never find the match. The attached patch fixes 
 the magic, and extends the buffer to 64KB.





[jira] [Commented] (TIKA-900) Tika fails to detect ISO9660 disk images

2012-04-23 Thread Andrew Jackson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13259615#comment-13259615
 ] 

Andrew Jackson commented on TIKA-900:
-

I re-uploaded the patch as it had an extra format that is not necessary for 
this patch. Also I noticed a typo in my original issue description. The offset 
should be 32769, not 37633.

 Tika fails to detect ISO9660 disk images
 

 Key: TIKA-900
 URL: https://issues.apache.org/jira/browse/TIKA-900
 Project: Tika
  Issue Type: Bug
  Components: mime
Affects Versions: 1.1
 Environment: Any.
Reporter: Andrew Jackson
Priority: Minor
 Attachments: iso-image-detection.patch


 I have been testing Tika's ability to identify ISO9660 disk image file 
 systems, and discovered two problems. Firstly, the offset match matcher was 
 wrong (37633 instead of 37633). Secondly, and more seriously, it was 
 impossible for that signature to ever match, because the default buffer size 
 was far to small. It is currently set to 8KB, and as this signature is some 
 36KB into the file, Tika could never find the match. The attached patch fixes 
 the magic, and extends the buffer to 64KB.





[jira] [Updated] (TIKA-900) Tika fails to detect ISO9660 disk images

2012-04-23 Thread Andrew Jackson (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Jackson updated TIKA-900:


Description: I have been testing Tika's ability to identify ISO9660 disk 
image file systems, and discovered two problems. Firstly, the offset match 
matcher was wrong (37633 instead of 32769). Secondly, and more seriously, it 
was impossible for that signature to ever match, because the default buffer size 
was far too small. It is currently set to 8KB, and as this signature is some 
36KB into the file, Tika could never find the match. The attached patch fixes 
the magic, and extends the buffer to 64KB.  (was: I have been testing Tika's 
ability to identify ISO9660 disk image file systems, and discovered two 
problems. Firstly, the offset match matcher was wrong (37633 instead of 32769). 
Secondly, and more seriously, it was impossible for that signature to ever 
match, because the default buffer size was far to small. It is currently set to 
8KB, and as this signature is some 36KB into the file, Tika could never find 
the match. The attached patch fixes the magic, and extends the buffer to 64KB.)

 Tika fails to detect ISO9660 disk images
 

 Key: TIKA-900
 URL: https://issues.apache.org/jira/browse/TIKA-900
 Project: Tika
  Issue Type: Bug
  Components: mime
Affects Versions: 1.1
 Environment: Any.
Reporter: Andrew Jackson
Priority: Minor
 Attachments: iso-image-detection.patch


 I have been testing Tika's ability to identify ISO9660 disk image file 
 systems, and discovered two problems. Firstly, the offset match matcher was 
 wrong (37633 instead of 32769). Secondly, and more seriously, it was 
 impossible for that signature to ever match, because the default buffer size 
 was far too small. It is currently set to 8KB, and as this signature is some 
 36KB into the file, Tika could never find the match. The attached patch fixes 
 the magic, and extends the buffer to 64KB.
