On 02.06.20 at 23:29, Tim Allison wrote:
https://issues.apache.org/jira/browse/INFRA-20372

On Slack, Gavin suggested something like corpora.tika.apache.org.  I'm
happy with corpora.pdfbox.apache.org or anything else.  Please let us know
what you think over on that ticket.
IMHO it should be either corpora.pdfbox.apache.org or corpora.tika.apache.org. I would prefer the latter, as Tika is the tool mainly used here.

Andreas


Thank you, again!

Cheers,

              Tim

On Tue, Jun 2, 2020 at 3:57 PM Tim Allison <talli...@apache.org> wrote:

proper domain for https access

I just pinged infra on slack.

If they're able to do it, what would we want?

file-corpora.apache.org
corpora.apache.org
corpora-pdfbox.apache.org
corpora-tika.apache.org

Something else?  I'm also happy to buy a domain if that won't work.  There
are a couple available that are close enough.

On Tue, Jun 2, 2020 at 1:08 PM Maruan Sahyoun <sahy...@fileaffairs.de>
wrote:


AMD ryzen looks fantastic.  Others would be great as well.

If ubuntu is possible at all, that's what I've been working with most
recently.

OK - will set up with that distro


Other than that, ssh access and sudo privileges would be all I'd need.

Are you ok if we set up apache httpd to host files for the public or will
this be a community only resource?

It can be used for whatever we want it to - so if you consider public
file sharing useful, of course we can do that. It would be good if we
got a proper domain for https access. Would that be something infra can do?


If this is corporate sponsored, please let me know how/if we should mention
the sponsorship.

no need to mention it - happy to help.


Again...wow.  Thank you!

Best,

       Tim

On Tue, Jun 2, 2020 at 9:22 AM Maruan Sahyoun <sahy...@fileaffairs.de>
wrote:

Could fund either:

AMD Ryzen 5 3600
64 GB RAM
2x2TB

or

AMD Ryzen 7 3700X based Server
64 GB RAM
2x8TB

or
Intel® Core™ i9-9900K
64 GB RAM
2x8TB

All are root servers so one has to vote for taking care of them (I can do
the initial setup).



BR
Maruan

There are two use cases.

1) host shared data so that we can all point to and work from the same
data, ideally both literal docs and also extracts (text/metadata .json
files representing extracted information).

2) a modest vm to allow all of us to run the regression tests

We could use help with either or both.

What we had before:
8 GB RAM
8 cores
4 TB -- 2TB for docs, 1TB for extracts, 1TB for staging

We can always use more RAM and more cores up to the point of I/O
bottlenecks.

On Tue, Jun 2, 2020 at 6:37 AM Maruan Sahyoun <sahy...@fileaffairs.de>
wrote:

Is that a storage box only or does it need to do some computing too?

Maybe you could write a small spec for the server requirement?

BR
Maruan


Still haven’t had time to put the server in a dmz. Ugh.

  Yes, more than happy to share.

If anyone has recommendations for file hosting for a couple of TB, let me
know.

One option would be to work with CommonCrawl to bump the max file size one
crawl a year...

On Tue, Jun 2, 2020 at 1:48 AM Tilman Hausherr <thaush...@t-online.de>
wrote:

Can we / I access these files? Most differences are improvements or not
meaningful, but there are a few I'd like to have a look at, e.g.

commoncrawl3/commoncrawl3/XO/XOAAGISRMRPZQRZF4LSMJERGEYK5QI2T

the word "antrag" loses the first "a". Although maybe the "a" was a big
one and gets assigned to another line.

Tilman

On 02.06.2020 at 02:58, Tim Allison wrote:
Reports are available here:

https://github.com/tballison/share/blob/master/tika_comparisons/reports-pdfbox-2.0.20.tgz
Looks like there are trivial differences in content with a slight
improvement over 2.0.19.  I don't see any differences in exceptions or
attachments.

Cheers,

          Tim

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org


--
Maruan Sahyoun

FileAffairs GmbH
Josef-Schappe-Straße 21
40882 Ratingen

Tel: +49 (2102) 89497 88
Fax: +49 (2102) 89497 91
sahy...@fileaffairs.de
www.fileaffairs.de

Geschäftsführer: Maruan Sahyoun
Handelsregister: AG Düsseldorf, HRB 53837
UST.-ID: DE248275827


