> The initial intention is, of course, to help to improve the MIME detection in
> Tika core.
Absolutely agree.
> Yes, you'll get a few tens of thousands more (MS)Office documents thanks to Tika:
>  Count  Tika-1.15                    HTTP-Content-Type
>  12520  application/x-tika-msoffice  application/octet-stream
>   6681  application/x-tika-ooxml     application/octet-stream
>   3793  application/x-tika-msoffice  text/plain
Agreed, as I look at the numbers they aren't huge, but the improvement for our
test corpus development is fantastic. Even a few thousand extra docx, for
example, will help.
My guess is that the x-tika-ooxml and x-tika-msoffice are truncated files.
Common Crawl is truncating at 1MB, right?
Again, WOW!!! Thank you!!!
Cheers,
Tim
-----Original Message-----
From: Sebastian Nagel [mailto:[email protected]]
Sent: Wednesday, July 5, 2017 8:43 AM
To: Allison, Timothy B. <[email protected]>
Cc: [email protected]; POI Developers List ([email protected])
<[email protected]>
Subject: Re: FW: Tika content detection and crawled "remote" content
Yes, you'll get a few tens of thousands more (MS)Office documents thanks to Tika:
 Count  Tika-1.15                    HTTP-Content-Type
 12520  application/x-tika-msoffice  application/octet-stream
  6681  application/x-tika-ooxml     application/octet-stream
  3793  application/x-tika-msoffice  text/plain
  3515  application/x-tika-msoffice  application/force-download
  2259  application/x-tika-ooxml     application/msword
  1911  application/x-tika-msoffice  unk
  1314  application/x-tika-msoffice  application/download
  1259  application/x-tika-ooxml     unk
  1068  application/x-tika-ooxml     application/force-download
   711  application/x-tika-msoffice  file/unknown
   ...
The initial intention is, of course, to help to improve the MIME detection in
Tika core.
Among the detected office formats there is one conspicuous pair:
127 application/msword text/vnd.graphviz
Looks like *.dot is taken as indicator only for MSWord documents.
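One way the *.dot ambiguity could be resolved is to look at the leading bytes
instead of the suffix: legacy Word templates are OLE2 compound files that start
with a fixed 8-byte signature, while Graphviz files are plain text. A minimal
Python sketch of that idea follows. It is not Tika's actual logic; the function
name and the keyword heuristic are illustrative assumptions.

```python
import re

# OLE2 compound file signature (container of legacy .doc/.dot/.xls/.ppt files)
OLE2_MAGIC = bytes.fromhex("d0cf11e0a1b11ae1")

# Graphviz files usually open with an optional "strict", then "graph" or
# "digraph". Heuristic only: leading comments or a BOM are not handled.
GRAPHVIZ_START = re.compile(rb"\s*(strict\s+)?(di)?graph\b", re.IGNORECASE)


def guess_dot_type(head: bytes) -> str:
    """Guess the MIME type of a *.dot file from its first bytes (hypothetical helper)."""
    if head.startswith(OLE2_MAGIC):
        return "application/msword"
    if GRAPHVIZ_START.match(head):
        return "text/vnd.graphviz"
    # Neither pattern matched: make no claim about the type
    return "application/octet-stream"
```

Only the first few dozen bytes of the payload are needed, so the check also
works on truncated content.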
Let me know if I can help to extract any data sets!
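As a rough sketch of how such a data set could be pulled from the URL index:
the Python below assumes each index line has the form "<urlkey> <timestamp>
<json>" and that the JSON carries the HTTP Content-Type in a field named "mime"
next to the new "mime-detected" field. The "mime" and "url" field names and all
function names are assumptions for illustration.

```python
import json
from collections import Counter

# Tika's generic container types for OLE2 and OOXML office documents
OFFICE_TYPES = {"application/x-tika-msoffice", "application/x-tika-ooxml"}


def parse_index_line(line: str) -> dict:
    """Return the JSON part of an index line: '<urlkey> <timestamp> <json>'."""
    return json.loads(line.split(" ", 2)[2])


def confusion_pairs(lines):
    """Count (detected, served) MIME pairs where the two values differ."""
    pairs = Counter()
    for line in lines:
        rec = parse_index_line(line)
        detected = rec.get("mime-detected", "unk")
        served = rec.get("mime", "unk")  # assumed name of the Content-Type field
        if detected != served:
            pairs[(detected, served)] += 1
    return pairs


def office_urls(lines):
    """Yield URLs whose detected type is one of the office container types."""
    for line in lines:
        rec = parse_index_line(line)
        if rec.get("mime-detected") in OFFICE_TYPES:
            yield rec["url"]
```

Both functions take any iterable of lines, so they can be fed from an open
index file or from a pipe.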
Thanks,
Sebastian
On 07/05/2017 01:42 PM, Allison, Timothy B. wrote:
> Dominik,
> Thanks to Sebastian and CommonCrawl, this means that we can now have far
> better precision and recall in selecting only MSOffice docs for our
> regression tests!!!
>
>
> -----Original Message-----
> From: Sebastian Nagel [mailto:[email protected]]
> Sent: Tuesday, July 4, 2017 6:18 AM
> To: [email protected]
> Subject: Tika content detection and crawled "remote" content
>
> Hi,
>
> recently I've plugged Tika's content detection into Common Crawl's crawler
> (a modified Nutch) with the goal of getting clean and correct MIME types - the
> HTTP Content-Type may contain garbage and isn't always correct [1].
>
> For the June 2017 crawl I've prepared a comparison of content types
> sent by the server in the HTTP header and as detected by Tika 1.15
> [2]. It shows that content types by Tika are definitely clean
> (1,400 different content types vs. more than 6,000 content type "strings"
> from HTTP headers).
>
> A look at the "confusions" where Content-Type and Tika differ shows a mixed
> picture: some pairs are plausible, e.g., if Tika changes the type to a more
> precise subtype or detects a MIME type at all:
>
>      Count  Tika-1.15              HTTP-Content-Type
> 1001968023  application/xhtml+xml  text/html
>    2298146  application/rss+xml    text/xml
>     617435  application/rss+xml    application/xml
>     613525  text/html              unk
>     361525  application/xhtml+xml  unk
>     297707  application/rdf+xml    application/xml
>
>
> However, there are a few dubious decisions, esp. the group of web server-side
> scripting languages (ASP, JSP, PHP, ColdFusion, etc.):
>
>   Count  Tika-1.15          HTTP-Content-Type
> 2047739  text/x-php         text/html
>  681629  text/asp           text/html
>  193095  text/x-coldfusion  text/html
>  172318  text/aspdotnet     text/html
>  139033  text/x-jsp         text/html
>   38415  text/x-cgi         text/html
>   32092  text/x-php         text/xml
>   18021  text/x-perl        text/html
>
> Of course, due to misconfigurations some servers may deliver the script files
> unmodified but in general I wouldn't expect that this happens for millions of
> pages. I've checked some of the affected URLs:
>
> - HTML fragment (no declaration of <!DOCTYPE...> or <html> opening
> tag)
>
> https://www.projectmanagement.com/profile/profile_contributions.cfm?profileID=46773580&popup=&c_b=0&c_mb=0&c_q=0&c_a=2&c_r=1&c_bc=1&c_wc=0&c_we=0&c_ar=0&c_ack=0&c_v=0&c_d=0&c_ra=2&c_p=0
> http://www.privi.com/product-details.asp?cno=C10910011
> http://mental-ray.de/Root_alt/Default.asp
> http://ekyrs.org/support/index.php?action=profile
> http://cwmorse.eu5.org/lineal/mostrar.php?contador=200
>
> - (overlong) comment block at start of HTML which "masks" the HTML declaration
> http://www.mannheim-virtuell.de/index.php?branchenID=2&rubrikID=24
>
> http://www.exoduschurch.org/bbs/view.php?id=sunday_school&page=1&sn1=&divpage=1&sn=off&ss=on&sc=on&select_arrange=headnum&desc=asc&no=6
>
> https://www.preventiongenetics.com/About/Resources/disease/MarfansSyndrome.php
> https://de.e-stories.org/categories.php?&lan=nl&art=p
>
> - HTML with some scripting fragments ("<?php?>") present:
> http://www.eco-ani-yao.org/shien/
>
> - others are clearly HTML (looks more like a bug, at least, there is no
> simple explanation)
> http://www.proedinc.com/customer/content.aspx?redid=9
>
> http://cball.dyndns.org/wbb2/board.php?boardid=8&sid=bf3b7971faa23413fa1164be0c068f79
> http://eusoma.org/Engx/Info/ContactUs.aspx?cont=contact
>
> http://cball.dyndns.org/wbb2/map.php?sid=bf3b7971faa23413fa1164be0c068f79
>
>
> Obviously certain file suffixes (.php, .aspx) should get less weight compared
> to Content-Type sent from the responding server.
> Now my question: where's the best place to fix this: in the crawler [3] or in
> Tika?
>
> If anyone is interested in using the detected MIME types or anything else
> from Common Crawl - I'm happy to help! The URL index [4] now contains a new
> field "mime-detected" which makes it easy to search or grep for confusion
> pairs.
>
>
> Thanks and best,
> Sebastian
>
>
> [1] https://github.com/commoncrawl/nutch/issues/3
> [2] s3://commoncrawl-dev/tika-content-type-detection/content-type-diff-tika-1.15-cc-main-2017-26.txt.xz
>     https://commoncrawl-dev.s3.amazonaws.com/tika-content-type-detection/content-type-diff-tika-1.15-cc-main-2017-26.txt.xz
> [3] https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/util/MimeUtil.java#L152
> [4] http://commoncrawl.org/2015/04/announcing-the-common-crawl-index/
>
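On the question in the original message of giving suffixes such as .php or
.aspx less weight than the Content-Type: one possible crawler-side rule is to
fall back to the server's value whenever detection yields a server-side
scripting type while the server claims HTML or XML. The Python sketch below
only illustrates that rule. It is not how Nutch's MimeUtil or Tika behaves, and
the function and set names are hypothetical.

```python
from typing import Optional

# Scripting types taken from the confusion table in the original message
SCRIPT_TYPES = {
    "text/x-php", "text/asp", "text/x-coldfusion", "text/aspdotnet",
    "text/x-jsp", "text/x-cgi", "text/x-perl",
}

# Markup types a server plausibly sends for rendered script output (assumption)
MARKUP_TYPES = {"text/html", "application/xhtml+xml", "text/xml", "application/xml"}


def resolve_mime(detected: str, content_type: Optional[str]) -> str:
    """Pick between the detected type and the HTTP Content-Type header value."""
    if not content_type:
        return detected
    # Drop parameters such as "; charset=UTF-8" and normalize case
    served = content_type.split(";", 1)[0].strip().lower()
    if detected in SCRIPT_TYPES and served in MARKUP_TYPES:
        # Detection was most likely driven by the URL suffix: trust the server
        return served
    return detected
```

A rule like this leaves the plausible refinements (for example text/xml to
application/rss+xml) untouched and only overrides the scripting-language group.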