FW: Tika content detection and crawled "remote" content

Allison, Timothy B. Wed, 05 Jul 2017 05:19:52 -0700

All,

> If anyone is interested in using the detected MIME types or anything else 
> from Common Crawl - I'm happy to help!  The URL index [4] contains now a new 
> field "mime-detected" which makes it easy to search or grep for confusion 
> pairs.


This is an amazing step forward for sampling PDF files from Common Crawl.  I 
used to rely on the http-headers and/or file suffix, but now we also have 
Tika's judgment on every file in Common Crawl.

We still have to deal with the 1MB truncation (I think), but this is an amazing 
development.  Thank you, Sebastian!

Cheers,

             Tim

-----Original Message-----
From: Sebastian Nagel [mailto:wastl.na...@googlemail.com] 
Sent: Tuesday, July 4, 2017 6:18 AM
To: u...@tika.apache.org
Subject: Tika content detection and crawled "remote" content

Hi,

Recently I've plugged Tika's content detection into Common Crawl's crawler 
(modified Nutch) with the goal of getting clean and correct MIME types - the 
HTTP Content-Type may contain garbage and isn't always correct [1].

For the June 2017 crawl I've prepared a comparison of the content types sent 
by the server in the HTTP header and those detected by Tika 1.15 [2].  It shows 
that the content types from Tika are definitely clean (1,400 different content 
types vs. more than 6,000 content type "strings" from HTTP headers).

A look at the "confusions", where Content-Type and Tika differ, shows a mixed 
picture: some pairs are plausible, e.g., if Tika changes the type to a more 
precise subtype or detects a MIME type at all:

            Tika-1.15                HTTP-Content-Type
1001968023  application/xhtml+xml    text/html
   2298146  application/rss+xml      text/xml
    617435  application/rss+xml      application/xml
    613525  text/html                unk
    361525  application/xhtml+xml    unk
    297707  application/rdf+xml      application/xml


However, there are a few dubious decisions, esp. the group of web server-side 
scripting languages (ASP, JSP, PHP, ColdFusion, etc.):

         Tika-1.15         HTTP-Content-Type
2047739  text/x-php        text/html
 681629  text/asp          text/html
 193095  text/x-coldfusion text/html
 172318  text/aspdotnet    text/html
 139033  text/x-jsp        text/html
  38415  text/x-cgi        text/html
  32092  text/x-php        text/xml
  18021  text/x-perl       text/html

Of course, due to misconfigurations some servers may deliver the script files 
unmodified, but in general I wouldn't expect this to happen for millions of 
pages.  I've checked some of the affected URLs:

- HTML fragment (no declaration of <!DOCTYPE...> or <html> opening tag)
    https://www.projectmanagement.com/profile/profile_contributions.cfm?profileID=46773580&popup=&c_b=0&c_mb=0&c_q=0&c_a=2&c_r=1&c_bc=1&c_wc=0&c_we=0&c_ar=0&c_ack=0&c_v=0&c_d=0&c_ra=2&c_p=0
    http://www.privi.com/product-details.asp?cno=C10910011
    http://mental-ray.de/Root_alt/Default.asp
    http://ekyrs.org/support/index.php?action=profile
    http://cwmorse.eu5.org/lineal/mostrar.php?contador=200

- (overlong) comment block at start of HTML which "masks" the HTML declaration
    http://www.mannheim-virtuell.de/index.php?branchenID=2&rubrikID=24
    http://www.exoduschurch.org/bbs/view.php?id=sunday_school&page=1&sn1=&divpage=1&sn=off&ss=on&sc=on&select_arrange=headnum&desc=asc&no=6
    https://www.preventiongenetics.com/About/Resources/disease/MarfansSyndrome.php
    https://de.e-stories.org/categories.php?&lan=nl&art=p

- HTML with some scripting fragments ("<?php?>") present:
    http://www.eco-ani-yao.org/shien/

- others are clearly HTML (looks more like a bug, at least, there is no simple 
  explanation)
    http://www.proedinc.com/customer/content.aspx?redid=9
    http://cball.dyndns.org/wbb2/board.php?boardid=8&sid=bf3b7971faa23413fa1164be0c068f79
    http://eusoma.org/Engx/Info/ContactUs.aspx?cont=contact
    http://cball.dyndns.org/wbb2/map.php?sid=bf3b7971faa23413fa1164be0c068f79


Obviously certain file suffixes (.php, .aspx) should get less weight than the 
Content-Type sent by the responding server.
Now my question: where's the best place to fix this: in the crawler [3] or in 
Tika?
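
Independent of where the fix ends up, the intended behaviour can be sketched
as a post-detection rule: if detection yields a server-side scripting type but
the server declared a markup type, trust the server. This is only an
illustrative sketch and neither Nutch's nor Tika's actual logic; the function
name `reconcile` and the choice of markup types are assumptions, and the
script types are taken from the table above.

```python
# Server-side scripting types that showed up as detected types for pages
# served as markup (taken from the table in this message).
SCRIPT_TYPES = {
    "text/x-php",
    "text/asp",
    "text/x-coldfusion",
    "text/aspdotnet",
    "text/x-jsp",
    "text/x-cgi",
    "text/x-perl",
}

# Assumption: HTTP content types considered trustworthy enough to override
# a scripting-language detection.
MARKUP_TYPES = {
    "text/html",
    "application/xhtml+xml",
    "text/xml",
    "application/xml",
}


def reconcile(detected, http_type):
    """Pick the final MIME type from the detected type and the HTTP type.

    Hypothetical rule: a scripting-language detection is most likely driven
    by the URL suffix (.php, .aspx, ...), so the server's markup type wins.
    In every other case the detected type is kept.
    """
    if detected in SCRIPT_TYPES and http_type in MARKUP_TYPES:
        return http_type
    return detected


print(reconcile("text/x-php", "text/html"))
print(reconcile("application/rss+xml", "text/xml"))
```

Plausible refinements such as text/xml to application/rss+xml are left alone,
because only the scripting types are overridden.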

If anyone is interested in using the detected MIME types or anything else from 
Common Crawl - I'm happy to help!  The URL index [4] contains now a new field 
"mime-detected" which makes it easy to search or grep for confusion pairs.
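
Searching for a confusion pair could look roughly like the sketch below. It
rests on assumptions not stated in this message: that each index line consists
of a URL key, a timestamp and a JSON object separated by spaces, and that the
JSON carries the HTTP Content-Type in a field named "mime" next to
"mime-detected". The function name and the sample lines are made up.

```python
import json


def find_confusions(lines, detected, http_type):
    """Yield URLs whose detected type and HTTP type match the given pair."""
    for line in lines:
        # Assumed layout: "<url key> <timestamp> <JSON object>".
        parts = line.strip().split(" ", 2)
        if len(parts) != 3:
            continue
        try:
            record = json.loads(parts[2])
        except json.JSONDecodeError:
            continue  # skip lines that do not end in valid JSON
        if (record.get("mime-detected") == detected
                and record.get("mime") == http_type):
            yield record.get("url")


# Made-up index lines, for illustration only.
sample_lines = [
    'com,example)/index.php 20170623000000 '
    '{"url": "http://example.com/index.php", "mime": "text/html", '
    '"mime-detected": "text/x-php"}',
    'com,example)/ 20170623000001 '
    '{"url": "http://example.com/", "mime": "text/html", '
    '"mime-detected": "text/html"}',
]

for url in find_confusions(sample_lines, "text/x-php", "text/html"):
    print(url)
```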


Thanks and best,
Sebastian


[1] https://github.com/commoncrawl/nutch/issues/3
[2] s3://commoncrawl-dev/tika-content-type-detection/content-type-diff-tika-1.15-cc-main-2017-26.txt.xz
    https://commoncrawl-dev.s3.amazonaws.com/tika-content-type-detection/content-type-diff-tika-1.15-cc-main-2017-26.txt.xz
[3] https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/util/MimeUtil.java#L152
[4] http://commoncrawl.org/2015/04/announcing-the-common-crawl-index/


