At some point I extracted all EMFs from our corpus. I’ll see if that data
is still around and/or re-extract. I’ll probably have time tomorrow or Wednesday.

On Sun, Oct 7, 2018 at 5:01 PM Dominik Stadler <dominik.stad...@gmx.at>
wrote:

> Hi Andi
>
> It is easy to change CommonCrawlDocumentDownload to fetch other mime-types,
> see
> https://github.com/centic9/CommonCrawlDocumentDownload/tree/download_emf
>
> However, .emf files don't appear in the top-100 mimetypes of the crawls and
> thus are likely included very rarely, if at all. I started a download run,
> but the first two of the 300 index files do not contain any matching
> extension or mime-type.
>
> See https://commoncrawl.github.io/cc-crawl-statistics/plots/mimetypes for
> mimetype-statistics in the crawl.
>
> Dominik.
>
> On Sat, Oct 6, 2018 at 8:14 PM Andreas Beeker <kiwiwi...@apache.org>
> wrote:
>
> > Hi Tim / Dominik,
> >
> > please give me a few pointers on how I could access a pool of EMF files,
> > e.g. (but not only) within the Common Crawl corpus. My focus is currently
> > on rendering, but as I extend the supported records, I would also like to
> > validate the parsing.
> > As the EMF parsing is relatively new, you might still have a corpus for
> > it, Tim?
> >
> > I have a few old mails about the Common Crawl corpus [2], but I guess
> > some restructuring has taken place and there might be an easier
> > option than downloading the whole index.
> >
> > Of course, office files that I can parse for embedded EMFs are also OK.
> >
> > I have to admit that I haven't yet tested Dominik's tool [1].
> >
> > Alternatively, I can use the govdocs1 corpus [3].
> >
> > Best wishes,
> > Andi
> >
> >
> > [1] https://github.com/centic9/CommonCrawlDocumentDownload
> >
> > [2] http://apache-poi.1045710.n5.nabble.com/Using-CommonCrawl-for-POI-regression-mass-testing-td5721585.html
> >
> > [3] http://downloads.digitalcorpora.org/corpora/files/govdocs1/by_type/
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: dev-unsubscr...@poi.apache.org
> > For additional commands, e-mail: dev-h...@poi.apache.org
> >
> >
>
