>I'd go for corpora.tika.apache.org too.

Infra ticket updated.  Thank you, all!

On Wed, Jun 3, 2020 at 2:07 AM Maruan Sahyoun <sahy...@fileaffairs.de>
wrote:

>
> > On 02.06.20 at 23:29, Tim Allison wrote:
> > > https://issues.apache.org/jira/browse/INFRA-20372
> > >
> > > On Slack, Gavin suggested something like corpora.tika.apache.org.  I'm
> > > happy with corpora.pdfbox.apache.org or anything else.  Please let us know
> > > what you think over on that ticket.
> > IMHO it should be either corpora.pdfbox.apache.org or
> > corpora.tika.apache.org. I would prefer the latter, as Tika is the tool
> > that is mainly used here.
>
> I'd go for corpora.tika.apache.org too.
> BR
> Maruan
>
> >
> > Andreas
> >
> > > Thank you, again!
> > >
> > > Cheers,
> > >
> > >               Tim
> > >
> > > On Tue, Jun 2, 2020 at 3:57 PM Tim Allison <talli...@apache.org> wrote:
> > >
> > > > > proper domain for https access
> > > >
> > > > I just pinged infra on slack.
> > > >
> > > > If they're able to do it, what would we want?
> > > >
> > > > file-corpora.apache.org
> > > > corpora.apache.org
> > > > corpora-pdfbox.apache.org
> > > > corpora-tika.apache.org
> > > >
> > > > Something else?  I'm also happy to buy a domain if that won't work.
> > > > There are a couple available that are close enough.
> > > >
> > > > On Tue, Jun 2, 2020 at 1:08 PM Maruan Sahyoun <sahy...@fileaffairs.de>
> > > > wrote:
> > > >
> > > > > > AMD Ryzen looks fantastic.  Others would be great as well.
> > > > > >
> > > > > > If Ubuntu is possible at all, that's what I've been working with most
> > > > > > recently.
> > > > >
> > > > > OK - will set it up with that distro
> > > > >
> > > > > > Other than that, ssh access and sudo privileges would be all I'd need.
> > > > > >
> > > > > > Are you ok if we set up apache httpd to host files for the public or
> > > > > > will this be a community only resource?
> > > > >
> > > > > It can be used for whatever we want, so if you consider public
> > > > > file sharing useful, of course we can do that. It would be good
> > > > > to get a proper domain for https access. Would that be something
> > > > > infra can do?
> > > > >
> > > > > > If this is corporate sponsored, please let me know how/if we should
> > > > > > mention the sponsorship.
> > > > >
> > > > > no need to mention it - happy to help.
> > > > >
> > > > > > Again...wow.  Thank you!
> > > > > >
> > > > > > Best,
> > > > > >
> > > > > >        Tim
> > > > > >
> > > > > > On Tue, Jun 2, 2020 at 9:22 AM Maruan Sahyoun <sahy...@fileaffairs.de>
> > > > > > wrote:
> > > > > >
> > > > > > > Could fund either:
> > > > > > >
> > > > > > > AMD Ryzen 5 3600
> > > > > > > 64 GB RAM
> > > > > > > 2x2TB
> > > > > > >
> > > > > > > or
> > > > > > >
> > > > > > > AMD Ryzen 7 3700X based Server
> > > > > > > 64 GB RAM
> > > > > > > 2x8TB
> > > > > > >
> > > > > > > or
> > > > > > > Intel® Core™ i9-9900K
> > > > > > > 64 GB RAM
> > > > > > > 2x8TB
> > > > > > >
> > > > > > > All are root servers, so someone has to volunteer to take care of
> > > > > > > them (I can do the initial setup).
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > BR
> > > > > > > Maruan
> > > > > > >
> > > > > > > > There are two use cases.
> > > > > > > >
> > > > > > > > 1) host shared data so that we can all point to and work from the
> > > > > > > > same data, ideally both literal docs and also extracts
> > > > > > > > (text/metadata .json files representing extracted information).
> > > > > > > >
> > > > > > > > 2) a modest vm to allow all of us to run the regression tests
> > > > > > > >
> > > > > > > > We could use help with either or both.
> > > > > > > >
> > > > > > > > What we had before:
> > > > > > > > 8 GB RAM
> > > > > > > > 8 cores
> > > > > > > > 4 TB -- 2TB for docs, 1TB for extracts, 1TB for staging
> > > > > > > >
> > > > > > > > We can always use more RAM and more cores up to the point of I/O
> > > > > > > > bottlenecks.
> > > > > > > >
> > > > > > > > On Tue, Jun 2, 2020 at 6:37 AM Maruan Sahyoun <sahy...@fileaffairs.de>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Is that a storage box only, or does it need to do some computing
> > > > > > > > > too? Maybe you could write a small spec for the server
> > > > > > > > > requirements?
> > > > > > > > >
> > > > > > > > > BR
> > > > > > > > > Maruan
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > Still haven’t had time to put the server in a dmz. Ugh.
> > > > > > > > > >
> > > > > > > > > >   Yes, more than happy to share.
> > > > > > > > > >
> > > > > > > > > > If anyone has recommendations for file hosting for a couple of
> > > > > > > > > > TB, let me know.
> > > > > > > > > >
> > > > > > > > > > One option would be to work with CommonCrawl to bump the max
> > > > > > > > > > file size one crawl a year...
> > > > > > > > > >
> > > > > > > > > > On Tue, Jun 2, 2020 at 1:48 AM Tilman Hausherr <thaush...@t-online.de>
> > > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Can we / I access these files? Most differences are improvements
> > > > > > > > > > > or not meaningful, but there are a few I'd like to have a look
> > > > > > > > > > > at, e.g.
> > > > > > > > > > >
> > > > > > > > > > > commoncrawl3/commoncrawl3/XO/XOAAGISRMRPZQRZF4LSMJERGEYK5QI2T
> > > > > > > > > > >
> > > > > > > > > > > the word "antrag" loses the first "a". Although maybe the "a"
> > > > > > > > > > > was a big one and gets assigned to another line.
> > > > > > > > > > >
> > > > > > > > > > > Tilman
> > > > > > > > > > >
> > > > > > > > > > > On 02.06.2020 at 02:58, Tim Allison wrote:
> > > > > > > > > > > > > > Reports are available here:
> > > > > > > > > > > > > > https://github.com/tballison/share/blob/master/tika_comparisons/reports-pdfbox-2.0.20.tgz
> > > > > > > > > > > > Looks like there are trivial differences in content with a
> > > > > > > > > > > > slight improvement over 2.0.19.  I don't see any differences
> > > > > > > > > > > > in exceptions or attachments.
> > > > > > > > > > > >
> > > > > > > > > > > > Cheers,
> > > > > > > > > > > >
> > > > > > > > > > > >           Tim
> > > > > > > > > > > >
> > > > > > >
> ---------------------------------------------------------------------
> > > > > > > > > > > To unsubscribe, e-mail:
> dev-unsubscr...@pdfbox.apache.org
> > > > > > > > > > > For additional commands, e-mail:
> dev-h...@pdfbox.apache.org
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > --
> > > > > > > Maruan Sahyoun
> > > > > > >
> > > > > > > FileAffairs GmbH
> > > > > > > Josef-Schappe-Straße 21
> > > > > > > 40882 Ratingen
> > > > > > >
> > > > > > > Tel: +49 (2102) 89497 88
> > > > > > > Fax: +49 (2102) 89497 91
> > > > > > > sahy...@fileaffairs.de
> > > > > > > www.fileaffairs.de
> > > > > > >
> > > > > > > Managing Director: Maruan Sahyoun
> > > > > > > Commercial Register: AG Düsseldorf, HRB 53837
> > > > > > > VAT ID: DE248275827
> > > > > > >
> > > > > > >
