RE: [COMPRESS] TIFF file identified as TAR

2018-02-28 Thread Allison, Timothy B.
As always, thank you, Stefan!

We might add a kluge at the Tika level to check for TIFF first...unless you'd 
like that kluge in your code? 😉

The reporter suggested one option: a conditional that checks the tarHeader 
variable to see whether it starts with one of the TIFF magic numbers 
(little-endian "II": 49 49 2A 00, or big-endian "MM": 4D 4D 00 2A).
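A minimal sketch of such a check in plain Java (class and method names here are illustrative, not Compress's or Tika's actual code):

```java
public class TiffSniffer {
    // TIFF magic numbers: little-endian "II" + 0x2A 0x00, big-endian "MM" + 0x00 0x2A
    private static final byte[] TIFF_LE = {0x49, 0x49, 0x2A, 0x00};
    private static final byte[] TIFF_BE = {0x4D, 0x4D, 0x00, 0x2A};

    private static boolean startsWith(byte[] header, byte[] magic) {
        if (header.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (header[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }

    /** Returns true if the candidate header begins with a TIFF magic number. */
    public static boolean isTiff(byte[] tarHeader) {
        return startsWith(tarHeader, TIFF_LE) || startsWith(tarHeader, TIFF_BE);
    }
}
```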



-----Original Message-----
From: Stefan Bodewig [mailto:bode...@apache.org] 
Sent: Tuesday, February 27, 2018 3:46 PM
To: Stefan Bodewig 
Cc: Allison, Timothy B. ; Commons Developers List 

Subject: Re: [COMPRESS] TIFF file identified as TAR

On 2018-02-27, Stefan Bodewig wrote:

> On 2018-02-27, Allison, Timothy B. wrote:

>>On TIKA-2591[0], a user reports that a specific type of TIFF is
>>being identified as a TAR file.  Is this something we should try to
>>fix at the Tika level, or is this something that would be better
>>fixed in COMPRESS?

> TAR auto-detection is, erm, clumsy. But that is due to the format not
> being designed with detection in mind.

> This is how it works right now:

> * read the first candidate header of 512 bytes

> * look at the eight bytes that contain the "ustar" string and the
>   version and verify they look like something we support.

> * verify the checksum of the candidate tar header

Actually, I was misreading the code. It is either "ustar and version look good" 
or "parses as a tar header with a correct checksum". So the chance of false 
positives is higher.

Unfortunately this has proven necessary to detect all valid TAR
archives: https://issues.apache.org/jira/browse/COMPRESS-117
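For illustration, the checksum step described above can be sketched roughly like this (a simplified reading of the ustar header rules, with field offsets taken from the tar format; this is not Compress's actual implementation):

```java
import java.nio.charset.StandardCharsets;

public class TarHeaderCheck {
    private static final int CHKSUM_OFFSET = 148;
    private static final int CHKSUM_LEN = 8;

    /** Sum of all 512 header bytes with the checksum field counted as spaces. */
    private static long computeSum(byte[] header) {
        long sum = 0;
        for (int i = 0; i < 512; i++) {
            boolean inChksum = i >= CHKSUM_OFFSET && i < CHKSUM_OFFSET + CHKSUM_LEN;
            sum += (inChksum ? (byte) ' ' : header[i]) & 0xFF;
        }
        return sum;
    }

    /** Parse the octal checksum field, skipping NUL/space padding. */
    private static long parseOctal(byte[] header) {
        long value = 0;
        for (int i = CHKSUM_OFFSET; i < CHKSUM_OFFSET + CHKSUM_LEN; i++) {
            byte b = header[i];
            if (b == 0 || b == ' ') {
                continue;
            }
            if (b < '0' || b > '7') {
                return -1; // not a valid octal digit
            }
            value = (value << 3) + (b - '0');
        }
        return value;
    }

    public static boolean checksumMatches(byte[] header) {
        return header.length >= 512 && computeSum(header) == parseOctal(header);
    }

    /** Helper to build a header with a valid checksum, for demonstration. */
    public static void storeChecksum(byte[] header) {
        byte[] field = String.format("%06o\0 ", computeSum(header))
                .getBytes(StandardCharsets.US_ASCII);
        System.arraycopy(field, 0, header, CHKSUM_OFFSET, CHKSUM_LEN);
    }
}
```

A corrupt or non-tar 512-byte block can still pass this check by accident, which is exactly why false positives like the TIFF case are possible.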

Stefan

-
To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
For additional commands, e-mail: dev-h...@commons.apache.org




[COMPRESS] TIFF file identified as TAR

2018-02-27 Thread Allison, Timothy B.
COMPRESS colleagues,
   On TIKA-2591[0], a user reports that a specific type of TIFF is being 
identified as a TAR file.  Is this something we should try to fix at the Tika 
level, or is this something that would be better fixed in COMPRESS?
   Thank you!

   Best,

   Tim

[0] https://issues.apache.org/jira/browse/TIKA-2591



[compress] differences in implementation of Zip ibm vs. oracle?

2017-07-10 Thread Allison, Timothy B.
Compress colleagues,

  Over on https://bz.apache.org/bugzilla/show_bug.cgi?id=61275, a user 
submitted two .xlsx files generated with Apache POI, one under IBM's JVM and one 
under Oracle's JVM.  The file generated with Oracle's JVM opens without issue; 
MSOffice complains about the file generated with IBM's JVM but can repair it.  
Winzip opens both without complaint.

  Does this ring a bell?  Have you seen this before?  Anything we can do on our 
(POI's) side to fix this?

   Thank you.

 Best,

   Tim


[compress] FW: Tika content detection and crawled "remote" content

2017-07-05 Thread Allison, Timothy B.
Fellow file-philes on [compress],
  
Sebastian Nagel has added file type identification via Apache Tika to Common 
Crawl.  While Tika is not 100% accurate, this gives us far better clarity on 
mime type than relying on the HTTP header plus file suffix.  So, for testing 
purposes, you (or we over on Tika) can much more easily gather a small test 
corpus of files by mime type.

Many, many thanks to Sebastian and Common Crawl!

  Cheers,

  Tim

-----Original Message-----
From: Sebastian Nagel [mailto:wastl.na...@googlemail.com] 
Sent: Tuesday, July 4, 2017 6:18 AM
To: u...@tika.apache.org
Subject: Tika content detection and crawled "remote" content

Hi,

recently I've plugged Tika's content detection into Common Crawl's crawler 
(modified Nutch) with the goal of getting clean and correct MIME types - the 
HTTP Content-Type may contain garbage and isn't always correct [1].

For the June 2017 crawl I've prepared a comparison of the content types sent by 
the server in the HTTP header and those detected by Tika 1.15 [2].  It shows 
that the content types from Tika are definitely cleaner
(1,400 different content types vs. more than 6,000 content-type "strings" from 
HTTP headers).

A look at the "confusions" where Content-Type and Tika differ shows a mixed 
picture: some pairs are plausible, e.g., when Tika changes the type to a more 
precise subtype or detects the MIME type at all:

     count  Tika-1.15              HTTP-Content-Type
1001968023  application/xhtml+xml  text/html
   2298146  application/rss+xml    text/xml
    617435  application/rss+xml    application/xml
    613525  text/html              unk
    361525  application/xhtml+xml  unk
    297707  application/rdf+xml    application/xml


However, there are a few dubious decisions, esp. the group of web server-side 
scripting languages (ASP, JSP, PHP, ColdFusion, etc.):

  count  Tika-1.15          HTTP-Content-Type
2047739  text/x-php         text/html
 681629  text/asp           text/html
 193095  text/x-coldfusion  text/html
 172318  text/aspdotnet     text/html
 139033  text/x-jsp         text/html
  38415  text/x-cgi         text/html
  32092  text/x-php         text/xml
  18021  text/x-perl        text/html

Of course, due to misconfiguration some servers may deliver the script files 
unmodified, but in general I wouldn't expect that to happen for millions of 
pages.  I've checked some of the affected URLs:

- HTML fragment (no doctype declaration or opening <html> tag)

https://www.projectmanagement.com/profile/profile_contributions.cfm?profileID=46773580&popup=&c_b=0&c_mb=0&c_q=0&c_a=2&c_r=1&c_bc=1&c_wc=0&c_we=0&c_ar=0&c_ack=0&c_v=0&c_d=0&c_ra=2&c_p=0
http://www.privi.com/product-details.asp?cno=C10910011
http://mental-ray.de/Root_alt/Default.asp
http://ekyrs.org/support/index.php?action=profile
http://cwmorse.eu5.org/lineal/mostrar.php?contador=200

- (overlong) comment block at start of HTML which "masks" the HTML declaration
http://www.mannheim-virtuell.de/index.php?branchenID=2&rubrikID=24

http://www.exoduschurch.org/bbs/view.php?id=sunday_school&page=1&sn1=&divpage=1&sn=off&ss=on&sc=on&select_arrange=headnum&desc=asc&no=6

https://www.preventiongenetics.com/About/Resources/disease/MarfansSyndrome.php
https://de.e-stories.org/categories.php?&lan=nl&art=p

- HTML with some scripting fragments present:
http://www.eco-ani-yao.org/shien/

- others are clearly HTML (this looks more like a bug; at least there is no 
simple explanation)
http://www.proedinc.com/customer/content.aspx?redid=9

http://cball.dyndns.org/wbb2/board.php?boardid=8&sid=bf3b7971faa23413fa1164be0c068f79
http://eusoma.org/Engx/Info/ContactUs.aspx?cont=contact
http://cball.dyndns.org/wbb2/map.php?sid=bf3b7971faa23413fa1164be0c068f79


Obviously certain file suffixes (.php, .aspx) should get less weight compared 
to the Content-Type sent by the responding server.
Now my question: where's the best place to fix this - in the crawler [3] or in 
Tika?
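One possible shape for such a fix, wherever it lands, is to score the candidate types so that magic-byte detection outranks the HTTP header, which in turn outranks the suffix. A toy sketch of that idea (class name, weights, and structure are my own assumptions, not Tika's or Nutch's actual logic):

```java
import java.util.HashMap;
import java.util.Map;

public class TypeVoter {
    // Illustrative weights: magic bytes > HTTP Content-Type > file suffix.
    public static String vote(String magicType, String headerType, String suffixType) {
        Map<String, Integer> scores = new HashMap<>();
        if (magicType != null)  scores.merge(magicType, 4, Integer::sum);
        if (headerType != null) scores.merge(headerType, 2, Integer::sum);
        if (suffixType != null) scores.merge(suffixType, 1, Integer::sum);

        // Pick the highest-scoring candidate; fall back to the generic binary type.
        String best = "application/octet-stream";
        int bestScore = 0;
        for (Map.Entry<String, Integer> e : scores.entrySet()) {
            if (e.getValue() > bestScore) {
                best = e.getKey();
                bestScore = e.getValue();
            }
        }
        return best;
    }
}
```

With weights like these, a .php URL whose bytes and Content-Type both say text/html would resolve to text/html rather than text/x-php.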

If anyone is interested in using the detected MIME types or anything else from 
Common Crawl - I'm happy to help!  The URL index [4] now contains a new field 
"mime-detected" which makes it easy to search or grep for confusion pairs.


Thanks and best,
Sebastian


[1] https://github.com/commoncrawl/nutch/issues/3
[2] 
s3://commoncrawl-dev/tika-content-type-detection/content-type-diff-tika-1.15-cc-main-2017-26.txt.xz

https://commoncrawl-dev.s3.amazonaws.com/tika-content-type-detection/content-type-diff-tika-1.15-cc-main-2017-26.txt.xz
[3] 
https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/util/MimeUtil.java#L152
[4] http://commoncrawl.org/2015/04/announcing-the-common-crawl-index/



RE: [COMPRESS] zip-bomb prevention for Z?

2017-04-14 Thread Allison, Timothy B.
>enum wouldn't work for formats added via ServiceLoader. LZO supports a couple 
>of names of its own and you couldn't inject them into the enum.

Doh!  Got it.  New code base...Sorry.




RE: [COMPRESS] zip-bomb prevention for Z?

2017-04-14 Thread Allison, Timothy B.
>> If there is anything COMPRESS can do to detect and avoid the situation, then 
>> please open an issue over here.

Done: COMPRESS-385, PR submitted

>> If we wanted to add such a method, what would the return value be? One of 
>> the String constants contained inside the *Factory classes, likely. Tika 
>> would have to be prepared for new strings popping up when using a newer 
>> version of Compress (1.14 will add "lz4-framed" for example).

Y, I'm ok with a String...perhaps longer term or for 2.0, move to an enum?  
Thank you for the heads-up!

I opened COMPRESS-386 to discuss adding a threshold for the table size.

As always, thank you!

Cheers,

  Tim





[COMPRESS] zip-bomb prevention for Z?

2017-04-13 Thread Allison, Timothy B.
On TIKA-1631 [1], users have observed that a corrupt Z file can cause an OOM at 
Internal_.InternalLZWStream.initializeTable.  Should we try to protect against 
this at the Tika level, or should we open an issue on commons-compress's JIRA?  

A second question: we're creating a stream with the CompressorStreamFactory 
when all we want to do is detect.  Is there a recommended way to detect the 
type of compressor without creating a stream?
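For illustration only, a stream-free detection could be approximated by peeking at the leading magic bytes of a mark-supporting stream. The class name and the small format subset below are my own assumptions, not Compress's API:

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

public class CompressorSniffer {
    // A few well-known compressor magic numbers (only a subset of what
    // CompressorStreamFactory actually supports).
    private static final Map<String, byte[]> MAGICS = new LinkedHashMap<>();
    static {
        MAGICS.put("gz",    new byte[]{(byte) 0x1F, (byte) 0x8B});
        MAGICS.put("bzip2", new byte[]{'B', 'Z', 'h'});
        MAGICS.put("xz",    new byte[]{(byte) 0xFD, '7', 'z', 'X', 'Z', 0});
        MAGICS.put("z",     new byte[]{(byte) 0x1F, (byte) 0x9D});
    }

    /** Peek at the head of a mark-supporting stream; return a format name or null. */
    public static String detect(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("mark/reset support required");
        }
        byte[] head = new byte[8];
        in.mark(head.length);
        int n = in.read(head);
        in.reset();
        for (Map.Entry<String, byte[]> e : MAGICS.entrySet()) {
            byte[] magic = e.getValue();
            if (n >= magic.length
                    && Arrays.equals(Arrays.copyOf(head, magic.length), magic)) {
                return e.getKey();
            }
        }
        return null;
    }
}
```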

Thank you!

Best,

 Tim

[1] https://issues.apache.org/jira/browse/TIKA-1631


[COMPRESS and others] FW: Any interest in running Apache Tika as part of CommonCrawl?

2015-04-07 Thread Allison, Timothy B.
All,

  We just heard back from a very active member of Common Crawl.  I don’t want 
to clog up our dev lists with this discussion (more than I have!), but I do 
want to invite all to participate in the discussion, planning and potential 
patches.

  If you’d like to participate, please join us here: 
https://groups.google.com/forum/#!topic/common-crawl/Cv21VRQjGN0

  I’ve tried to follow Commons’ vernacular, and I’ve added [COMPRESS] to the 
Subject line.  Please invite others who might have an interest in this work.

 Best,

 Tim

From: Allison, Timothy B.
Sent: Tuesday, April 07, 2015 8:39 AM
To: 'Stephen Merity'; common-cr...@googlegroups.com
Subject: RE: Any interest in running Apache Tika as part of CommonCrawl?

Stephen,

  Thank you very much for responding so quickly and for all of your work on 
Common Crawl.  I don’t want to speak for all of us, but given the feedback I’ve 
gotten so far from some of the dev communities, I think we would very much 
appreciate the chance to be tested on a monthly basis as part of the regular 
Common Crawl process.

   I think we’ll still want to run more often in our own sandbox(es) on the 
slice of CommonCrawl we have, but the monthly testing against new data, from my 
perspective at least, would be a huge win for all of us.

   In addition to parsing binaries and extracting text, Tika (via PDFBox, POI 
and many others) can also offer metadata (e.g. exif from images), which users 
of CommonCrawl might find of use.

  I’ll forward this to some of the relevant dev lists to invite others to 
participate in the discussion on the common-crawl list.


  Thank you, again.  I very much look forward to collaborating.

 Best,

 Tim

From: Stephen Merity [mailto:step...@commoncrawl.org]
Sent: Tuesday, April 07, 2015 3:57 AM
To: common-cr...@googlegroups.com
Cc: mattm...@apache.org; talli...@apache.org; dmei...@apache.org; 
til...@apache.org; n...@apache.org
Subject: Re: Any interest in running Apache Tika as part of CommonCrawl?

Hi Tika team!

We'd certainly be interested in working with Apache Tika on such an 
undertaking. At the very least, we're glad that Julien has provided you with 
content to battle test Tika with!

As you've noted, the text extraction performed to produce WET files is focused 
primarily on HTML files, leaving many other file types uncovered. The existing 
text extraction is quite efficient and part of the same process that generates 
the WAT file, meaning there's next to no overhead. Performing extraction with 
Tika at the scale of Common Crawl would be an interesting challenge. Running it 
as a one-off likely wouldn't be too much of a challenge and would also give 
Tika the benefit of a wider variety of documents (both well formed and 
malformed) to test against. Running it on a frequent basis or as part of the 
crawl pipeline would be more challenging but something we can certainly 
discuss, especially if there's strong community desire for it!

On Fri, Apr 3, 2015 at 5:23 AM, tallison314...@gmail.com wrote:
CommonCrawl currently has the WET format that extracts plain text from web 
pages.  My guess is that this is text stripping from text-y formats.  Let me 
know if I'm wrong!

Would there be any interest in adding another format, WETT (WET-Tika), or 
supplementing the current WET by using Tika to extract contents from binary 
formats too: PDF, MSWord, etc.?

Julien Nioche kindly carved out 220 GB for us to experiment with on TIKA-1302 
(https://issues.apache.org/jira/browse/TIKA-1302) on a Rackspace vm.  
But, I'm wondering now if it would make more sense to have CommonCrawl run Tika 
as part of its regular process and make the output available in one of your 
standard formats.

CommonCrawl consumers would get Tika output, and the Tika dev community 
(including its dependencies, PDFBox, POI, etc.) could get the stacktraces to 
help prioritize bug fixes.

Cheers,

  Tim
--
You received this message because you are subscribed to the Google Groups 
"Common Crawl" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to common-crawl+unsubscr...@googlegroups.com.
To post to this group, send email to common-cr...@googlegroups.com.
Visit this group at http://groups.google.com/group/common-crawl.
For more options, visit https://groups.google.com/d/optout.



--
Regards,
Stephen Merity
Data Scientist @ Common Crawl