All,

> If anyone is interested in using the detected MIME types or anything else
> from Common Crawl - I'm happy to help! The URL index [4] contains now a new
> field "mime-detected" which makes it easy to search or grep for confusion
> pairs.

This is an amazing step forward for sampling PDF files from Common Crawl. I
used to rely on the http-headers and/or file suffix, but now we also have
Tika's judgment on every file in Common Crawl. We still have to deal with the
1MB truncation (I think), but this is an amazing development.

Thank you, Sebastian!

Cheers,

Tim

-----Original Message-----
From: Sebastian Nagel [mailto:wastl.na...@googlemail.com]
Sent: Tuesday, July 4, 2017 6:18 AM
To: u...@tika.apache.org
Subject: Tika content detection and crawled "remote" content

Hi,

recently I've plugged in Tika's content detection into Common Crawl's crawler
(modified Nutch) with the goal of getting clean and correct MIME types - the
HTTP Content-Type may contain garbage and isn't always correct [1].

For the June 2017 crawl I've prepared a comparison of content types sent by
the server in the HTTP header and as detected by Tika 1.15 [2]. It shows that
content types by Tika are definitely clean (1,400 different content types vs.
more than 6,000 content type "strings" from HTTP headers).

A look at the "confusions" where Content-Type and Tika differ shows a mixed
picture: some pairs are plausible, e.g., if Tika changes the type to a more
precise subtype or detects a MIME type at all:

             Tika-1.15                HTTP-Content-Type
  1001968023 application/xhtml+xml    text/html
     2298146 application/rss+xml      text/xml
      617435 application/rss+xml      application/xml
      613525 text/html                unk
      361525 application/xhtml+xml    unk
      297707 application/rdf+xml      application/xml

However, there are a few dubious decisions, esp.
the group of web server-side scripting languages (ASP, JSP, PHP, ColdFusion,
etc.):

             Tika-1.15                HTTP-Content-Type
     2047739 text/x-php               text/html
      681629 text/asp                 text/html
      193095 text/x-coldfusion        text/html
      172318 text/aspdotnet           text/html
      139033 text/x-jsp               text/html
       38415 text/x-cgi               text/html
       32092 text/x-php               text/xml
       18021 text/x-perl              text/html

Of course, due to misconfigurations some servers may deliver the script files
unmodified but in general I wouldn't expect that this happens for millions of
pages.

I've checked some of the affected URLs:

- HTML fragment (no declaration of <!DOCTYPE...> or <html> opening tag)
  https://www.projectmanagement.com/profile/profile_contributions.cfm?profileID=46773580&popup=&c_b=0&c_mb=0&c_q=0&c_a=2&c_r=1&c_bc=1&c_wc=0&c_we=0&c_ar=0&c_ack=0&c_v=0&c_d=0&c_ra=2&c_p=0
  http://www.privi.com/product-details.asp?cno=C10910011
  http://mental-ray.de/Root_alt/Default.asp
  http://ekyrs.org/support/index.php?action=profile
  http://cwmorse.eu5.org/lineal/mostrar.php?contador=200

- (overlong) comment block at start of HTML which "masks" the HTML declaration
  http://www.mannheim-virtuell.de/index.php?branchenID=2&rubrikID=24
  http://www.exoduschurch.org/bbs/view.php?id=sunday_school&page=1&sn1=&divpage=1&sn=off&ss=on&sc=on&select_arrange=headnum&desc=asc&no=6
  https://www.preventiongenetics.com/About/Resources/disease/MarfansSyndrome.php
  https://de.e-stories.org/categories.php?&lan=nl&art=p

- HTML with some scripting fragments ("<?php?>") present:
  http://www.eco-ani-yao.org/shien/

- others are clearly HTML (looks more like a bug, at least, there is no simple
  explanation)
  http://www.proedinc.com/customer/content.aspx?redid=9
  http://cball.dyndns.org/wbb2/board.php?boardid=8&sid=bf3b7971faa23413fa1164be0c068f79
  http://eusoma.org/Engx/Info/ContactUs.aspx?cont=contact
  http://cball.dyndns.org/wbb2/map.php?sid=bf3b7971faa23413fa1164be0c068f79

Obviously certain file suffixes (.php, .aspx) should get less weight compared
to Content-Type sent from the responding
server.

Now my question: where's the best place to fix this: in the crawler [3] or in
Tika?

If anyone is interested in using the detected MIME types or anything else
from Common Crawl - I'm happy to help! The URL index [4] contains now a new
field "mime-detected" which makes it easy to search or grep for confusion
pairs.

Thanks and best,
Sebastian

[1] https://github.com/commoncrawl/nutch/issues/3
[2] s3://commoncrawl-dev/tika-content-type-detection/content-type-diff-tika-1.15-cc-main-2017-26.txt.xz
    https://commoncrawl-dev.s3.amazonaws.com/tika-content-type-detection/content-type-diff-tika-1.15-cc-main-2017-26.txt.xz
[3] https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/util/MimeUtil.java#L152
[4] http://commoncrawl.org/2015/04/announcing-the-common-crawl-index/
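
For anyone who wants to look for confusion pairs in the URL index, here is a minimal Python sketch. It assumes each index line has the layout "SURT key, timestamp, JSON record" and that the JSON record carries the HTTP type in a "mime" field next to the new "mime-detected" field. The sample lines, the "unk" fallback for a missing field, and the function names are all made up for illustration; they are not taken from the crawl or from any Common Crawl tooling.

```python
import json
from collections import Counter

# Hypothetical index lines (SURT key, timestamp, JSON record).
# The URLs and values are invented for this example.
SAMPLE_LINES = [
    'org,example)/index.php 20170623000000 {"url": "http://example.org/index.php", "mime": "text/html", "mime-detected": "text/x-php"}',
    'org,example)/feed 20170623000001 {"url": "http://example.org/feed", "mime": "text/xml", "mime-detected": "application/rss+xml"}',
    'org,example)/ 20170623000002 {"url": "http://example.org/", "mime": "text/html", "mime-detected": "text/html"}',
    'org,example)/a.php 20170623000003 {"url": "http://example.org/a.php", "mime": "text/html", "mime-detected": "text/x-php"}',
]

# Server-side scripting types listed in the second table of the message.
SCRIPT_TYPES = {
    "text/x-php", "text/asp", "text/x-coldfusion", "text/aspdotnet",
    "text/x-jsp", "text/x-cgi", "text/x-perl",
}


def parse_index_line(line):
    """Return the JSON record of one index line (assumed three-part layout)."""
    _key, _timestamp, payload = line.split(" ", 2)
    return json.loads(payload)


def confusion_pairs(lines):
    """Count (detected type, HTTP type) pairs where the two values differ."""
    pairs = Counter()
    for line in lines:
        record = parse_index_line(line)
        # "unk" as a fallback for a missing field is an assumption.
        http_type = record.get("mime", "unk")
        detected = record.get("mime-detected", "unk")
        if detected != http_type:
            pairs[(detected, http_type)] += 1
    return pairs


def script_vs_html(pairs):
    """Keep only pairs where a scripting type was detected for text/html."""
    return {
        pair: count
        for pair, count in pairs.items()
        if pair[0] in SCRIPT_TYPES and pair[1] == "text/html"
    }


if __name__ == "__main__":
    # Print in the same column order as the tables: count, Tika, HTTP.
    for (detected, http_type), count in confusion_pairs(SAMPLE_LINES).most_common():
        print(f"{count:>12} {detected:<24} {http_type}")
```

To run this on real data, the sample list would be replaced by lines read from an index file. The script_vs_html filter narrows the result to the dubious group discussed above.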