At some point I extracted all EMFs from our corpus. I'll see if that data is still around and/or re-extract. I'll probably have time tomorrow/Wednesday.
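For a re-extraction, one way to pick EMFs out of a mixed pile of files, independent of extension or reported mime-type, is to check the header record. The following is only a minimal sketch, not the tooling actually used on the corpus: the names `looks_like_emf` and `find_emfs` are made up for illustration. It relies on the EMF header layout, where the first record has type 1 (EMR_HEADER) and the signature 0x464D4520 (the bytes " EMF") sits at byte offset 40.

```python
import struct
from pathlib import Path

EMR_HEADER = 1              # record type of the first record in an EMF
EMF_SIGNATURE = 0x464D4520  # bytes b" EMF" when stored little-endian
SIGNATURE_OFFSET = 40       # type(4) + size(4) + bounds(16) + frame(16)


def looks_like_emf(data: bytes) -> bool:
    """Return True if data starts with a plausible EMF header record."""
    if len(data) < SIGNATURE_OFFSET + 4:
        return False
    (record_type,) = struct.unpack_from("<I", data, 0)
    (signature,) = struct.unpack_from("<I", data, SIGNATURE_OFFSET)
    return record_type == EMR_HEADER and signature == EMF_SIGNATURE


def find_emfs(root: str):
    """Yield paths under root whose first bytes look like an EMF header."""
    for path in Path(root).rglob("*"):
        if path.is_file():
            with path.open("rb") as handle:
                if looks_like_emf(handle.read(SIGNATURE_OFFSET + 4)):
                    yield path
```

This only finds standalone files. EMFs embedded in office documents would first have to be pulled out of their containers, and a header check says nothing about whether the rest of the file parses.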
On Sun, Oct 7, 2018 at 5:01 PM Dominik Stadler <dominik.stad...@gmx.at> wrote:

> Hi Andi
>
> It is easy to change CommonCrawlDocumentDownload to fetch other mime-types,
> see
> https://github.com/centic9/CommonCrawlDocumentDownload/tree/download_emf
>
> However .emf files don't appear in the top-100 mimetypes of the crawls and
> thus are likely very rarely included if at all. I started a download-run,
> but the first two of the 300 index-files do not contain any matching
> extension or mime-type.
>
> See https://commoncrawl.github.io/cc-crawl-statistics/plots/mimetypes for
> mimetype-statistics in the crawl.
>
> Dominik.
>
> On Sat, Oct 6, 2018 at 8:14 PM Andreas Beeker <kiwiwi...@apache.org>
> wrote:
>
> > Hi Tim / Dominik,
> >
> > please give me a few pointers, how I could access a pool of EMF files,
> > e.g. (not only) within the common crawl corpus. My focus is currently on
> > rendering, but as I extend the supported records, I also like to validate
> > the parsing.
> > As the EMF parsing is relatively new, you still might have a corpus for
> > it, Tim?
> >
> > I have a few old mails about the common crawl corpus [2], but I guess
> > there has been some restructuring taken place and there might be an
> > easier option than downloading the whole index.
> >
> > Of course office files which I parse for embedded EMFs are also ok.
> >
> > I have to admit, that I haven't yet tested Dominiks tool [1].
> >
> > Alternatively I can use the govdocs1 corpus [3]
> >
> > Best wishes,
> > Andi
> >
> >
> > [1] https://github.com/centic9/CommonCrawlDocumentDownload
> >
> > [2] http://apache-poi.1045710.n5.nabble.com/Using-CommonCrawl-for-POI-regression-mass-testing-td5721585.html
> >
> > [3] http://downloads.digitalcorpora.org/corpora/files/govdocs1/by_type/