Hi,

These aggregations of large real-world sets are always interesting to look through, especially because they are bound to contain a lot of garbage and peculiarities. There are probably some badly chosen key names, and very likely many programming errors.
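As an aside, here is a minimal sketch of how spelling variants of one key could be grouped when looking through such a list. The normalization rules (case folding, stripping quote characters and backslashes, treating "_" as "-") and the names normalize_key and group_variants are my own assumptions, not part of how the list was produced. The sample counts are the Content-Type ones from the overall listing.

```python
from collections import defaultdict

# Characters treated as quoting noise around a key (an assumption).
QUOTE_CHARS = '\\"\'“”'


def normalize_key(key: str) -> str:
    """Collapse case, quoting and '_' vs '-' differences (hypothetical rules)."""
    cleaned = key.strip().strip(QUOTE_CHARS)
    return cleaned.casefold().replace("_", "-")


def group_variants(counts):
    """Map each normalized key to the raw spellings and counts seen for it."""
    groups = defaultdict(dict)
    for key, count in counts:
        groups[normalize_key(key)][key] = count
    return dict(groups)


# Four of the Content-Type spellings, with their counts from the listing.
sample = [
    ("Content-Type", 6612729),
    ("content_type", 14),
    ('\\"Content-Type\\"', 9),
    ('\\"content-type\\"', 5),
]
groups = group_variants(sample)
for normalized, variants in groups.items():
    print(normalized, len(variants), sum(variants.values()))
```

With rules like these the four spellings fall into a single group. More aggressive rules (for example, treating "." and ":" as the same separator for the dcterms keys) would merge more, at the risk of merging keys that are genuinely different.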
Some interesting examples:

what is this (Russian, roughly "Choose an extension for packing"):

    Выберите_расширение_для_паковки

the usual mixing of double-colon variants; there are also many escaped quotes:

    ”keywords” and \"keywords\"

these two are identical, but given a large enough set, they might not be:

    height 512205
    width 512205

mboxparser spews out a lot of garbage, incredible:

    MboxParser- $b!zf|!!;~![#1#07n#2#2f|!j6b!k#1#4;~h>!a#1#7;~h>!r$=$n8e 3
    MboxParser- $b"($3$n%a!<%k$o!"4x@>#i#t6&f1bn$x$4;22cd 3
    MboxParser- $b"(?=$79~$_!&%"%/%;%9ey!">\ 3

really, it does:

    MboxParser-_blank">http 3
    MboxParser-a-aa-azzzzzzz-azzzzazzzzz-azzzzzzzazzzzzzzzzazzzzzazzzzzzzzzaz 3
    MboxParser-a-aa-azzzzzzz-azzzzzzzz-azzzzazzzzzazzzzzzazzzzzzz 3

non-Latin scripts are expected, this is simplified Chinese:

    if:头像和分页采用圆形样式
    (translation: Avatars and pagination in a circular style (?))

perhaps shortest possible key name:

    T 4

mboxparser, again, this time with XML tags:

    MboxParser-ype>state</span></font></st1:placetype></st1 4
    MboxParser-ype>university</span></font></st1:placetype></st1:place></st1 4

the set seems to contain stuff from adult sites:

    xhamster-site-verification

for some reason, the Dutch government always pops up in large sets:

    custom:OVERHEID.Informatietype/DC.type 13
    custom:OVERHEID.Organisatietype/OVERHEID.organisationType 13

there are 18 different ways to spell/use Content-Type, of which four are, of course, with mboxparser:

    Content-Type 6612729
    content_type 14
    \"Content-Type\" 9
    \"content-type\" 5

the inevitable encoding error:

    pdf:docinfo:custom:-ý§ Q 10
    pagerankâ„¢ 50

what.is.this:

    Laisv371DiskusijuIrK363rybosForumas 4

hey, another contender for the shortest key name:

    M 4

there are 67 unique dcterms key names, but their counts are not very high:

    DCTERMS.title 44
    dcterms.title 26
    dcterms:title 13
    dcterms.Title 3

there is also a Content-Type in Russian:

    Тип-содержимое 3

someone wants to remove your dust:

    Dust_Removal_Data 339

there are 908 unique unknown tags, no idea what that is:

    Exif_IFD0:Unknown_tag_(0x8482) 36
    Unknown_tag_(0x00bf) 36
    Exif_SubIFD:Unknown_tag_(0x9009) 35
    Unknown_tag_(0x00a0) 35
    Unknown_tag_(0x050e) 35

ah, the winner of the shortest key name (line 2235):

    71

longest key, guess who:

    MboxParser- http://www.facebook.com/donnakuhnarthttps://www.flickr.com/photos/donnakuhnhttp://picassogirl.tumblr.comhttps://twitter.com/digitalaardvarkhttps://plus.google.com/+digitalaardvarkshttps://www.linkedin.com/in/donnakuhnhttp://www.saatchionline.com/donnakuhnhttp://pinterest.com/sarcasthttps 3

Besides Latin, Japanese and Chinese, Cyrillic is also present. But the six most frequently used Arabic symbols are not present. I wonder why. There is an RTL script present, though: Hebrew. It is always strange to meet terms/words of RTL scripts in an otherwise general LTR world.

I was a bit disappointed not to find any obscene terms. The set seemed to be large enough for at least some general curse words.

MboxParser is the real winner with 1763 unique keys, this is really absurd!

Thanks, this was fun!
Markus

On Mon, 3 Oct 2022 at 15:26, Tim Allison <talli...@apache.org> wrote:

> All,
>
> I recently extracted metadata keys from 1 million files in our
> regression corpus and did a group by. This allows insight into common
> metadata keys.
>
> I've included two views, one looks at overall counts, and the other
> breaks down metadata keys by mime type.
>
> Please let us know if you find anything interesting or have any
> questions.
>
> https://corpora.tika.apache.org/base/share/metadata-keys-overall-1m.txt.gz
> https://corpora.tika.apache.org/base/share/metadata-keys-by-mime-1m.txt.gz
>
> Best,
>
> Tim
>