> I'd go for corpora.tika.apache.org too.

Infra ticket updated. Thank you, all!
On Wed, Jun 3, 2020 at 2:07 AM Maruan Sahyoun <sahy...@fileaffairs.de> wrote:

> > On 02.06.20 at 23:29, Tim Allison wrote:
> > > https://issues.apache.org/jira/browse/INFRA-20372
> > >
> > > On Slack, Gavin suggested something like corpora.tika.apache.org. I'm
> > > happy with corpora.pdfbox.apache.org or anything else. Please let us
> > > know what you think over on that ticket.
> > IMHO it should be either corpora.pdfbox.apache.org or
> > corpora.tika.apache.org. I would prefer the latter, as Tika is the tool
> > that is mainly used here.
>
> I'd go for corpora.tika.apache.org too.
>
> BR
> Maruan
>
> > Andreas
> >
> > > Thank you, again!
> > >
> > > Cheers,
> > >
> > > Tim
> > >
> > > On Tue, Jun 2, 2020 at 3:57 PM Tim Allison <talli...@apache.org> wrote:
> > >
> > > > > proper domain for https access
> > > >
> > > > I just pinged infra on Slack.
> > > >
> > > > If they're able to do it, what would we want?
> > > >
> > > > file-corpora.apache.org
> > > > corpora.apache.org
> > > > corpora-pdfbox.apache.org
> > > > corpora-tika.apache.org
> > > >
> > > > Something else? I'm also happy to buy a domain if that won't work.
> > > > There are a couple available that are close enough.
> > > >
> > > > On Tue, Jun 2, 2020 at 1:08 PM Maruan Sahyoun <sahy...@fileaffairs.de> wrote:
> > > >
> > > > > > AMD Ryzen looks fantastic. Others would be great as well.
> > > > > >
> > > > > > If Ubuntu is possible at all, that's what I've been working with
> > > > > > most recently.
> > > > >
> > > > > OK - will set up with that distro
> > > > >
> > > > > > Other than that, ssh access and sudo privileges would be all I'd need.
> > > > > >
> > > > > > Are you ok if we set up Apache httpd to host files for the public
> > > > > > or will this be a community-only resource?
> > > > >
> > > > > it can be used for whatever we want it to - so if you consider public
> > > > > file sharing useful of course we can do that.
> > > > > Would be good if we get a proper domain for https access. Would that
> > > > > be something infra can do?
> > > > >
> > > > > > If this is corporate sponsored, please let me know how/if we should
> > > > > > mention the sponsorship.
> > > > >
> > > > > no need to mention it - happy to help.
> > > > >
> > > > > > Again...wow. Thank you!
> > > > > >
> > > > > > Best,
> > > > > >
> > > > > > Tim
> > > > > >
> > > > > > On Tue, Jun 2, 2020 at 9:22 AM Maruan Sahyoun <sahy...@fileaffairs.de> wrote:
> > > > > >
> > > > > > > Could fund either:
> > > > > > >
> > > > > > > AMD Ryzen 5 3600
> > > > > > > 64 GB RAM
> > > > > > > 2x2TB
> > > > > > >
> > > > > > > or
> > > > > > >
> > > > > > > AMD Ryzen 7 3700X based Server
> > > > > > > 64 GB RAM
> > > > > > > 2x8TB
> > > > > > >
> > > > > > > or
> > > > > > >
> > > > > > > Intel® Core™ i9-9900K
> > > > > > > 64 GB RAM
> > > > > > > 2x8TB
> > > > > > >
> > > > > > > All are root servers so one has to vote for taking care of them (I
> > > > > > > can do the initial setup).
> > > > > > >
> > > > > > > BR
> > > > > > > Maruan
> > > > > > >
> > > > > > > > There are two use cases.
> > > > > > > >
> > > > > > > > 1) host shared data so that we can all point to and work from the
> > > > > > > > same data, ideally both literal docs and also extracts
> > > > > > > > (text/metadata .json files representing extracted information).
> > > > > > > >
> > > > > > > > 2) a modest vm to allow all of us to run the regression tests
> > > > > > > >
> > > > > > > > We could use help with either or both.
> > > > > > > >
> > > > > > > > What we had before:
> > > > > > > > 8 GB RAM
> > > > > > > > 8 cores
> > > > > > > > 4 TB -- 2TB for docs, 1TB for extracts, 1TB for staging
> > > > > > > >
> > > > > > > > We can always use more RAM and more cores up to the point of I/O
> > > > > > > > bottlenecks.
> > > > > > > >
> > > > > > > > On Tue, Jun 2, 2020 at 6:37 AM Maruan Sahyoun <sahy...@fileaffairs.de> wrote:
> > > > > > > >
> > > > > > > > > is that a storage box only or does it need to do some computing
> > > > > > > > > too? Maybe you could write a small spec for the server
> > > > > > > > > requirements?
> > > > > > > > >
> > > > > > > > > BR
> > > > > > > > > Maruan
> > > > > > > > >
> > > > > > > > > > Still haven't had time to put the server in a DMZ. Ugh.
> > > > > > > > > >
> > > > > > > > > > Yes, more than happy to share.
> > > > > > > > > >
> > > > > > > > > > If anyone has recommendations for file hosting for a couple of
> > > > > > > > > > TB, let me know.
> > > > > > > > > >
> > > > > > > > > > One option would be to work with CommonCrawl to bump the max
> > > > > > > > > > file size one crawl a year...
> > > > > > > > > >
> > > > > > > > > > On Tue, Jun 2, 2020 at 1:48 AM Tilman Hausherr <thaush...@t-online.de> wrote:
> > > > > > > > > >
> > > > > > > > > > > Can we / I access these files? Most differences are improvements
> > > > > > > > > > > or not meaningful, but there are a few I'd like to have a look
> > > > > > > > > > > at, e.g.
> > > > > > > > > > >
> > > > > > > > > > > commoncrawl3/commoncrawl3/XO/XOAAGISRMRPZQRZF4LSMJERGEYK5QI2T
> > > > > > > > > > >
> > > > > > > > > > > the word "antrag" loses the first "a". Although maybe the "a"
> > > > > > > > > > > was a big one and gets assigned to another line.
> > > > > > > > > > >
> > > > > > > > > > > Tilman
> > > > > > > > > > >
> > > > > > > > > > > On 02.06.2020 at 02:58, Tim Allison wrote:
> > > > > > > > > > > > Reports are available here:
> > > > > > > > > > > > https://github.com/tballison/share/blob/master/tika_comparisons/reports-pdfbox-2.0.20.tgz
> > > > > > > > > > > > Looks like there are trivial differences in content with a
> > > > > > > > > > > > slight improvement over 2.0.19. I don't see any differences
> > > > > > > > > > > > in exceptions or attachments.
> > > > > > > > > > > >
> > > > > > > > > > > > Cheers,
> > > > > > > > > > > >
> > > > > > > > > > > > Tim
> > > > > > > > > > >
> > > > > > > > > > > ---------------------------------------------------------------------
> > > > > > > > > > > To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
> > > > > > > > > > > For additional commands, e-mail: dev-h...@pdfbox.apache.org
> > > > > > >
> > > > > > > --
> > > > > > > Maruan Sahyoun
> > > > > > >
> > > > > > > FileAffairs GmbH
> > > > > > > Josef-Schappe-Straße 21
> > > > > > > 40882 Ratingen
> > > > > > >
> > > > > > > Tel: +49 (2102) 89497 88
> > > > > > > Fax: +49 (2102) 89497 91
> > > > > > > sahy...@fileaffairs.de
> > > > > > > www.fileaffairs.de
> > > > > > >
> > > > > > > Geschäftsführer: Maruan Sahyoun
> > > > > > > Handelsregister: AG Düsseldorf, HRB 53837
> > > > > > > UST.-ID: DE248275827