Hi,

Aggregations of large real-world sets like these are always interesting to
look through, especially because they are bound to contain a lot of garbage
and peculiarities. There are probably some badly chosen key names, and very
likely many programming errors.

Some interesting examples:

what is this (it is Russian, roughly "Choose an extension for packing"):
Выберите_расширение_для_паковки

the usual mixing of double-quote variants (curly and straight), there are
also many escaped quotes:
”keywords” and \"keywords\"

these two have identical counts, but given a large enough set, they might not:
height 512205
width 512205

mboxparser spews out a lot of garbage, incredible:
MboxParser- $b!zf|!!;~![#1#07n#2#2f|!j6b!k#1#4;~h>!a#1#7;~h>!r$=$n8e 3
MboxParser- $b"($3$n%a!<%k$o!"4x@>#i#t6&f1bn$x$4;22cd 3
MboxParser- $b"(?=$79~$_!&%"%/%;%9ey!">\ 3

really, it does:
MboxParser-_blank">http 3
MboxParser-a-aa-azzzzzzz-azzzzazzzzz-azzzzzzzazzzzzzzzzazzzzzazzzzzzzzzaz 3
MboxParser-a-aa-azzzzzzz-azzzzzzzz-azzzzazzzzzazzzzzzazzzzzzz 3

non-Latin scripts are expected, this is simplified Chinese:
if:头像和分页采用圆形样式 (translation: Avatars and pagination in a circular style (?))

perhaps shortest possible key name:
T 4

mboxparser, again, this time with XML tags:
MboxParser-ype>state</span></font></st1:placetype></st1 4
MboxParser-ype>university</span></font></st1:placetype></st1:place></st1 4

the set seems to contain stuff from adult sites:
xhamster-site-verification

for some reason, the Dutch government always pops up in large sets:
custom:OVERHEID.Informatietype/DC.type  13
custom:OVERHEID.Organisatietype/OVERHEID.organisationType       13

there are 18 different ways to spell/use Content-Type, of which four are,
of course, with mboxparser:
Content-Type    6612729
content_type    14
\"Content-Type\"        9
\"content-type\"        5

the inevitable encoding error:
pdf:docinfo:custom:-ý§ Q 10
pagerankâ„¢ 50
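
The second one looks like the classic case of UTF-8 bytes being read as
Windows-1252. A minimal sketch of how one might try to undo that, assuming
that really is what happened here (the helper name repair_mojibake is my own,
and I have no idea which component produced the key):

```python
def repair_mojibake(text: str) -> str:
    """Try to undo 'UTF-8 bytes decoded as cp1252'; return the input if that fails."""
    try:
        # Re-encode to the bytes that were (presumably) misread, then decode properly.
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not representable in cp1252, or not valid UTF-8 afterwards: leave it alone.
        return text


# The key from the list above, written with escapes to survive mail clients.
broken = "pagerank\u00e2\u201e\u00a2"
fixed = repair_mojibake(broken)  # "pagerank" followed by U+2122 (trade mark sign)
```

The first key (the pdf:docinfo:custom one) does not survive this round trip,
so the sketch leaves it unchanged; it was probably damaged in some other way.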

what.is.this:
Laisv371DiskusijuIrK363rybosForumas 4

hey, another contender for the shortest key name:
M 4

there are 67 unique dcterms key names, but their counts are not very high:
DCTERMS.title   44
dcterms.title   26
dcterms:title   13
dcterms.Title   3
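
To see how many keys are really the same key in a different spelling, one
could fold them to a canonical form and sum the counts. A rough sketch; the
folding rules in normalize_key are my own assumptions (nothing Tika does), and
treating ':' like '.' is certainly too aggressive for some keys:

```python
import re
from collections import defaultdict


def normalize_key(key: str) -> str:
    """Fold a metadata key to a canonical spelling (hypothetical rule set)."""
    key = key.replace('\\"', "")                # drop escaped quotes: \"
    key = key.strip().strip('"\u201c\u201d')    # drop straight and curly double quotes
    key = re.sub(r"[\s_]+", "-", key)           # whitespace and underscores become hyphens
    key = key.replace(":", ".")                 # treat ':' and '.' as the same separator
    return key.lower()


def group_variants(counts: dict) -> dict:
    """Map each canonical key to the original spellings and their counts."""
    groups = defaultdict(dict)
    for key, n in counts.items():
        groups[normalize_key(key)][key] = n
    return dict(groups)


# Counts copied from the examples in this mail.
sample = {
    "Content-Type": 6612729,
    "content_type": 14,
    '\\"Content-Type\\"': 9,
    '\\"content-type\\"': 5,
    "DCTERMS.title": 44,
    "dcterms.title": 26,
    "dcterms:title": 13,
    "dcterms.Title": 3,
}
variants = group_variants(sample)
totals = {canon: sum(v.values()) for canon, v in variants.items()}
```

With these rules the eight sample spellings collapse into two groups,
content-type and dcterms.title.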

there is also a Content-Type in Russian:
Тип-содержимое 3

someone wants to remove your dust:
Dust_Removal_Data 339

there are 908 unique unknown tags, no idea what that is:
Exif_IFD0:Unknown_tag_(0x8482)  36
Unknown_tag_(0x00bf)    36
Exif_SubIFD:Unknown_tag_(0x9009)        35
Unknown_tag_(0x00a0)    35
Unknown_tag_(0x050e)    35

ah, the winner of the shortest key name (line 2235):
71

longest key, guess who:
MboxParser-
http://www.facebook.com/donnakuhnarthttps://www.flickr.com/photos/donnakuhnhttp://picassogirl.tumblr.comhttps://twitter.com/digitalaardvarkhttps://plus.google.com/+digitalaardvarkshttps://www.linkedin.com/in/donnakuhnhttp://www.saatchionline.com/donnakuhnhttp://pinterest.com/sarcasthttps
       3

Besides Latin, Japanese and Chinese, Cyrillic is also present. But the six
most frequently used Arabic symbols are not present. I wonder why. But
there is an RTL-script present, Hebrew. It is always strange to meet
terms/words of RTL-scripts in an otherwise general LTR-world.
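
For anyone who wants to repeat this kind of check, here is a crude sketch
using only Python's unicodedata module. It takes the first word of each
letter's Unicode character name as a rough script label, which is my own
shortcut and not a proper script lookup; the Hebrew sample word is made up
for illustration and does not come from the key list:

```python
import unicodedata


def scripts_in(key: str) -> set:
    """Rough script labels for the letters in a key, e.g. {'LATIN', 'CJK'}."""
    scripts = set()
    for ch in key:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                # e.g. 'CYRILLIC SMALL LETTER I' -> 'CYRILLIC'
                scripts.add(name.split()[0])
    return scripts


def has_rtl(key: str) -> bool:
    """True if any character has a right-to-left bidi class (R or AL)."""
    return any(unicodedata.bidirectional(ch) in ("R", "AL") for ch in key)


cyrillic_key = "\u0422\u0438\u043f-\u0441\u043e\u0434\u0435\u0440\u0436\u0438\u043c\u043e\u0435"
chinese_key = "if:\u5934\u50cf\u548c\u5206\u9875"
hebrew_word = "\u05e9\u05dc\u05d5\u05dd"  # hypothetical sample, not from the data

print(scripts_in(cyrillic_key), scripts_in(chinese_key), has_rtl(hebrew_word))
```

Note that Han characters all come out as CJK, so this cannot tell Chinese
from Japanese kanji; only kana show up as HIRAGANA or KATAKANA.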

I was a bit disappointed not to find any obscene terms. The set seemed to
be large enough for at least some general curse words.

MboxParser is the real winner with 1763 unique keys, this is really absurd!
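
A small sketch of how such a count can be reproduced. I am assuming here that
every line of the overall file is a key followed by whitespace and a count;
the function names are mine. Splitting from the right matters because keys
can contain spaces themselves:

```python
def parse_line(line: str):
    """Split 'key <whitespace> count' from the right; return None if it does not fit."""
    parts = line.rstrip().rsplit(None, 1)
    if len(parts) == 2 and parts[1].isdigit():
        return parts[0], int(parts[1])
    return None


def unique_keys_with_prefix(lines, prefix: str) -> int:
    """Count distinct keys that start with the given prefix."""
    keys = set()
    for line in lines:
        parsed = parse_line(line)
        if parsed and parsed[0].startswith(prefix):
            keys.add(parsed[0])
    return len(keys)


# A few lines copied from the examples in this mail.
sample_lines = [
    'MboxParser-_blank">http 3',
    "MboxParser-ype>state</span></font></st1:placetype></st1 4",
    "Dust_Removal_Data 339",
    "content_type    14",
    "T 4",
]
mbox_count = unique_keys_with_prefix(sample_lines, "MboxParser-")
```

To run it on the real data, the lines could come from gzip.open(path, "rt")
on the downloaded file instead of the sample list.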

Thanks, this was fun!
Markus

On Mon, 3 Oct 2022 at 15:26, Tim Allison <talli...@apache.org> wrote:

> All,
>
>   I recently extracted metadata keys from 1 million files in our
> regression corpus and did a group by.  This allows insight into common
> metadata keys.
>
>   I've included two views, one looks at overall counts, and the other
> breaks down metadata keys by mime type.
>
>   Please let us know if you find anything interesting or have any
> questions.
>
> https://corpora.tika.apache.org/base/share/metadata-keys-overall-1m.txt.gz
> https://corpora.tika.apache.org/base/share/metadata-keys-by-mime-1m.txt.gz
>
>    Best,
>
>             Tim
>
