Re: New file vm WAS: Re: Release 2.0.20 ?

2020-06-03 Thread Tim Allison
>I'd go for corpora.tika.apache.org too.

Infra ticket updated.  Thank you, all!

On Wed, Jun 3, 2020 at 2:07 AM Maruan Sahyoun 
wrote:

>
> > On 02.06.20 at 23:29, Tim Allison wrote:
> > > https://issues.apache.org/jira/browse/INFRA-20372
> > >
> > > On Slack, Gavin suggested something like corpora.tika.apache.org.  I'm
> > > happy with corpora.pdfbox.apache.org or anything else.  Please let us
> > > know what you think over on that ticket.
> > IMHO it should be either corpora.pdfbox.apache.org or
> > corpora.tika.apache.org. I would prefer the latter, as Tika is the tool
> > that is mainly used here.
>
> I'd go for corpora.tika.apache.org too.
> BR
> Maruan
>
> >
> > Andreas
> >
> > > Thank you, again!
> > >
> > > Cheers,
> > >
> > >   Tim
> > >
> > > On Tue, Jun 2, 2020 at 3:57 PM Tim Allison wrote:
> > >
> > > > > proper domain for https access
> > > >
> > > > I just pinged infra on slack.
> > > >
> > > > If they're able to do it, what would we want?
> > > >
> > > > file-corpora.apache.org
> > > > corpora.apache.org
> > > > corpora-pdfbox.apache.org
> > > > corpora-tika.apache.org
> > > >
> > > > Something else?  I'm also happy to buy a domain if that won't work.
> > > > There are a couple available that are close enough.
> > > >
> > > > On Tue, Jun 2, 2020 at 1:08 PM Maruan Sahyoun <sahy...@fileaffairs.de>
> > > > wrote:
> > > >
> > > > > > AMD ryzen looks fantastic.  Others would be great as well.
> > > > > >
> > > > > > If ubuntu is possible at all, that's what I've been working with
> > > > > > most recently.
> > > > >
> > > > > OK - will setup with that distro
> > > > >
> > > > > > Other than that, ssh access and sudo privileges would be all I'd
> > > > > > need.
> > > > > >
> > > > > > Are you ok if we set up apache httpd to host files for the
> > > > > > public or will this be a community only resource?
> > > > >
> > > > > it can be used for whatever we want it to - so if you consider
> > > > > public file sharing useful of course we can do that. Would be
> > > > > good if we get a proper domain for https access. Would that be
> > > > > something infra can do?
> > > > >
> > > > > > If this is corporate sponsored, please let me know how/if we
> > > > > > should mention the sponsorship.
> > > > >
> > > > > no need to mention it - happy to help.
> > > > >
> > > > > > Again...wow.  Thank you!
> > > > > >
> > > > > > Best,
> > > > > >
> > > > > >Tim
> > > > > >
> > > > > > On Tue, Jun 2, 2020 at 9:22 AM Maruan Sahyoun <sahy...@fileaffairs.de>
> > > > > > wrote:
> > > > > >
> > > > > > > Could fund either:
> > > > > > >
> > > > > > > AMD Ryzen 5 3600
> > > > > > > 64 GB RAM
> > > > > > > 2x2TB
> > > > > > >
> > > > > > > or
> > > > > > >
> > > > > > > AMD Ryzen 7 3700X based Server
> > > > > > > 64 GB RAM
> > > > > > > 2x8TB
> > > > > > >
> > > > > > > or
> > > > > > > Intel® Core™ i9-9900K
> > > > > > > 64 GB RAM
> > > > > > > 2x8TB
> > > > > > >
> > > > > > > All are root servers so one has to vote for taking care of
> > > > > > > them (I can do the initial setup).
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > BR
> > > > > > > Maruan
> > > > > > >
> > > > > > > > There are two use cases.
> > > > > > > >
> > > > > > > > 1) host shared data so that we can all point to and work
> > > > > > > > from the same data, ideally both literal docs and also
> > > > > > > > extracts (text/metadata .json files representing extracted
> > > > > > > > information).
> > > > > > > >
> > > > > > > > 2) a modest vm to allow all of us to run the regression tests
> > > > > > > >
> > > > > > > > We could use help with either or both.
> > > > > > > >
> > > > > > > > What we had before:
> > > > > > > > 8 GB RAM
> > > > > > > > 8 cores
> > > > > > > > 4 TB -- 2TB for docs, 1TB for extracts, 1TB for staging
> > > > > > > >
> > > > > > > > We can always use more RAM and more cores up to the point of
> > > > > > > > I/O bottlenecks.
> > > > > > > >
> > > > > > > > On Tue, Jun 2, 2020 at 6:37 AM Maruan Sahyoun <sahy...@fileaffairs.de>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > is that a storage box only or does it need to do some
> > > > > > > > > computing too?
> > > > > > > > > Maybe you could write a small spec for the server
> > > > > > > > > requirement?
> > > > > > > > >
> > > > > > > > > BR
> > > > > > > > > Maruan
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > Still haven’t had time to put the server in a dmz. Ugh.
> > > > > > > > > >
> > > > > > > > > >   Yes, more than happy to share.
> > > > > > > > > >
> > > > > > > > > > If anyone has recommendations for file hosting for a
> > > > > > > > > > couple of TB, let me know.
> > > > > > > > > >
> > > > > > > > > > One option would be to work with CommonCrawl to bump the
> > > > > > > > > > max file size one crawl a year...
> > > > > > > > > >
> > > > > > > > > > On Tue, Jun 2, 2020 at 1:48 AM Tilman 

Re: New file vm WAS: Re: Release 2.0.20 ?

2020-06-03 Thread Maruan Sahyoun
 
> On 02.06.20 at 23:29, Tim Allison wrote:
> > https://issues.apache.org/jira/browse/INFRA-20372
> > 
> > On Slack, Gavin suggested something like corpora.tika.apache.org.  I'm
> > happy with corpora.pdfbox.apache.org or anything else.  Please let us know
> > what you think over on that ticket.
> IMHO it should be either corpora.pdfbox.apache.org or 
> corpora.tika.apache.org. I 
> would prefer the latter, as tika is the tools which is mainly used here.

I'd go for corpora.tika.apache.org too.
BR
Maruan

> 
> Andreas
> 
> > Thank you, again!
> > 
> > Cheers,
> > 
> >   Tim
> > 
> > On Tue, Jun 2, 2020 at 3:57 PM Tim Allison  wrote:
> > 
> > > > proper domain for https access
> > > 
> > > I just pinged infra on slack.
> > > 
> > > If they're able to do it, what would we want?
> > > 
> > > file-corpora.apache.org
> > > corpora.apache.org
> > > corpora-pdfbox.apache.org
> > > corpora-tika.apache.org
> > > 
> > > Something else?  I'm also happy to buy a domain if that won't work.  There
> > > are a couple available that are close enough.
> > > 
> > > On Tue, Jun 2, 2020 at 1:08 PM Maruan Sahyoun 
> > > wrote:
> > > 
> > > > > AMD ryzen looks fantastic.  Others would be great as well.
> > > > > 
> > > > > If ubuntu is possible at all, that's what I've been working with most
> > > > > recently.
> > > > 
> > > > OK - will setup with that distro
> > > > 
> > > > > Other than that, ssh access and sudo privileges would be all I'd need.
> > > > > 
> > > > > Are you ok if we set up apache httpd to host files for the public or
> > > > will
> > > > > this be a community only resource?
> > > > 
> > > > it can be used for whatever we want it to - so if you consider public
> > > > file sharing useful of course we can do that. Would be
> > > > good if we get a proper domain for https access. Would that be something
> > > > infra can do?
> > > > 
> > > > > If this is corporate sponsored, please let me know how/if we should
> > > > mention
> > > > > the sponsorship.
> > > > 
> > > > no need to mention it - happy to help.
> > > > 
> > > > > Again...wow.  Thank you!
> > > > > 
> > > > > Best,
> > > > > 
> > > > >Tim
> > > > > 
> > > > > On Tue, Jun 2, 2020 at 9:22 AM Maruan Sahyoun 
> > > > > wrote:
> > > > > 
> > > > > > Could fund either:
> > > > > > 
> > > > > > AMD Ryzen 5 3600
> > > > > > 64 GB RAM
> > > > > > 2x2TB
> > > > > > 
> > > > > > or
> > > > > > 
> > > > > > AMD Ryzen 7 3700X based Server
> > > > > > 64 GB RAM
> > > > > > 2x8TB
> > > > > > 
> > > > > > or
> > > > > > Intel® Core™ i9-9900K
> > > > > > 64 GB RAM
> > > > > > 2x8TB
> > > > > > 
> > > > > > All are root servers so one has to vote for taking care of them (I
> > > > can do
> > > > > > the initial setup).
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > BR
> > > > > > Maruan
> > > > > > 
> > > > > > > There are two use cases.
> > > > > > > 
> > > > > > > 1) host shared data so that we can all point to and work from the
> > > > same
> > > > > > > data, ideally both literal docs and also extracts (text/metadata
> > > > .json
> > > > > > > files representing extracted information).
> > > > > > > 
> > > > > > > 2) a modest vm to allow all of us to run the regression tests
> > > > > > > 
> > > > > > > We could use help with either or both.
> > > > > > > 
> > > > > > > What we had before:
> > > > > > > 8 GB RAM
> > > > > > > 8 cores
> > > > > > > 4 TB -- 2TB for docs, 1TB for extracts, 1TB for staging
> > > > > > > 
> > > > > > > We can always use more RAM and more cores up to the point of I/O
> > > > > > > bottlenecks.
> > > > > > > 
> > > > > > > On Tue, Jun 2, 2020 at 6:37 AM Maruan Sahyoun <
> > > > sahy...@fileaffairs.de>
> > > > > > > wrote:
> > > > > > > 
> > > > > > > > is that a storage box only or does it need to do some computings
> > > > too?
> > > > > > > > Maybe you could write a small spec for the server requirement?
> > > > > > > > 
> > > > > > > > BR
> > > > > > > > Maruan
> > > > > > > > 
> > > > > > > > 
> > > > > > > > > Still haven’t had time to put the server in a dmz. Ugh.
> > > > > > > > > 
> > > > > > > > >   Yes, more than happy to share.
> > > > > > > > > 
> > > > > > > > > If anyone has recommendations for file hosting for a couple of
> > > > TB,
> > > > > > let me
> > > > > > > > > know.
> > > > > > > > > 
> > > > > > > > > One option would be to work with CommonCrawl to bump the max
> > > > file
> > > > > > size
> > > > > > > > one
> > > > > > > > > crawl a year...
> > > > > > > > > 
> > > > > > > > > On Tue, Jun 2, 2020 at 1:48 AM Tilman Hausherr <
> > > > > > thaush...@t-online.de>
> > > > > > > > > wrote:
> > > > > > > > > 
> > > > > > > > > > Can we / I access these files? Most differences are
> > > > improvements
> > > > > > or not
> > > > > > > > > > meaningful, but there are a few I'd like to have a look, 
> > > > > > > > > > e.g.
> > > > > > > > > > 
> > > > > > > > > > commoncrawl3/commoncrawl3/XO/XOAAGISRMRPZQRZF4LSMJERGEYK5QI2T
> > > > > > > > > > 
> > > > > > > > > > the word "antrag" loses 

Re: New file vm WAS: Re: Release 2.0.20 ?

2020-06-03 Thread Maruan Sahyoun
 
> On 02.06.20 at 22:20, Maruan Sahyoun wrote:
> >   
> > > Maruan,
> > >To confirm, you're ok if we grant access to the server to our 
> > > colleagues
> > > on Tika and POI?
> > 
> > to be clear - my company is only sponsoring the box. It's the projects 
> > decision who needs access not mine. So feel free.
> Thanks a lot Maruan! Should we mention your company somewhere as sponsor?

I'm glad that I can give something back to the projects. Mentioning the
sponsorship is not needed. With the corpora files, PDFBox, Tika, POI and others
have one of the best bases of real-world files. I remember that when I gave a
presentation at PDF Days some time ago, people were really impressed by our
testing.
BR
Maruan

> 
> Andreas
> 
> 
> > BR
> > Maruan
> > 
> > 
> > >Again, wow, THANK YOU!
> > > 
> > > Best,
> > > 
> > >Tim
> > > 
> > > On Tue, Jun 2, 2020 at 3:57 PM Tim Allison  wrote:
> > > 
> > > > > proper domain for https access
> > > > 
> > > > I just pinged infra on slack.
> > > > 
> > > > If they're able to do it, what would we want?
> > > > 
> > > > file-corpora.apache.org
> > > > corpora.apache.org
> > > > corpora-pdfbox.apache.org
> > > > corpora-tika.apache.org
> > > > 
> > > > Something else?  I'm also happy to buy a domain if that won't work.  
> > > > There
> > > > are a couple available that are close enough.
> > > > 
> > > > On Tue, Jun 2, 2020 at 1:08 PM Maruan Sahyoun 
> > > > wrote:
> > > > 
> > > > > > AMD ryzen looks fantastic.  Others would be great as well.
> > > > > > 
> > > > > > If ubuntu is possible at all, that's what I've been working with 
> > > > > > most
> > > > > > recently.
> > > > > 
> > > > > OK - will setup with that distro
> > > > > 
> > > > > > Other than that, ssh access and sudo privileges would be all I'd 
> > > > > > need.
> > > > > > 
> > > > > > Are you ok if we set up apache httpd to host files for the public or
> > > > > will
> > > > > > this be a community only resource?
> > > > > 
> > > > > it can be used for whatever we want it to - so if you consider public
> > > > > file sharing useful of course we can do that. Would be
> > > > > good if we get a proper domain for https access. Would that be 
> > > > > something
> > > > > infra can do?
> > > > > 
> > > > > > If this is corporate sponsored, please let me know how/if we should
> > > > > mention
> > > > > > the sponsorship.
> > > > > 
> > > > > no need to mention it - happy to help.
> > > > > 
> > > > > > Again...wow.  Thank you!
> > > > > > 
> > > > > > Best,
> > > > > > 
> > > > > >Tim
> > > > > > 
> > > > > > On Tue, Jun 2, 2020 at 9:22 AM Maruan Sahyoun 
> > > > > > 
> > > > > > wrote:
> > > > > > 
> > > > > > > Could fund either:
> > > > > > > 
> > > > > > > AMD Ryzen 5 3600
> > > > > > > 64 GB RAM
> > > > > > > 2x2TB
> > > > > > > 
> > > > > > > or
> > > > > > > 
> > > > > > > AMD Ryzen 7 3700X based Server
> > > > > > > 64 GB RAM
> > > > > > > 2x8TB
> > > > > > > 
> > > > > > > or
> > > > > > > Intel® Core™ i9-9900K
> > > > > > > 64 GB RAM
> > > > > > > 2x8TB
> > > > > > > 
> > > > > > > All are root servers so one has to vote for taking care of them (I
> > > > > can do
> > > > > > > the initial setup).
> > > > > > > 
> > > > > > > 
> > > > > > > 
> > > > > > > BR
> > > > > > > Maruan
> > > > > > > 
> > > > > > > > There are two use cases.
> > > > > > > > 
> > > > > > > > 1) host shared data so that we can all point to and work from 
> > > > > > > > the
> > > > > same
> > > > > > > > data, ideally both literal docs and also extracts (text/metadata
> > > > > .json
> > > > > > > > files representing extracted information).
> > > > > > > > 
> > > > > > > > 2) a modest vm to allow all of us to run the regression tests
> > > > > > > > 
> > > > > > > > We could use help with either or both.
> > > > > > > > 
> > > > > > > > What we had before:
> > > > > > > > 8 GB RAM
> > > > > > > > 8 cores
> > > > > > > > 4 TB -- 2TB for docs, 1TB for extracts, 1TB for staging
> > > > > > > > 
> > > > > > > > We can always use more RAM and more cores up to the point of I/O
> > > > > > > > bottlenecks.
> > > > > > > > 
> > > > > > > > On Tue, Jun 2, 2020 at 6:37 AM Maruan Sahyoun <
> > > > > sahy...@fileaffairs.de>
> > > > > > > > wrote:
> > > > > > > > 
> > > > > > > > > is that a storage box only or does it need to do some 
> > > > > > > > > computings
> > > > > too?
> > > > > > > > > Maybe you could write a small spec for the server requirement?
> > > > > > > > > 
> > > > > > > > > BR
> > > > > > > > > Maruan
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > > Still haven’t had time to put the server in a dmz. Ugh.
> > > > > > > > > > 
> > > > > > > > > >   Yes, more than happy to share.
> > > > > > > > > > 
> > > > > > > > > > If anyone has recommendations for file hosting for a couple 
> > > > > > > > > > of
> > > > > TB,
> > > > > > > let me
> > > > > > > > > > know.
> > > > > > > > > > 
> > > > > > > > > > One option would be to work with CommonCrawl to bump the 

Re: Release 2.0.20 ?

2020-06-03 Thread Andreas Lehmkuehler

Thanks Tim and Tilman,

it looks like we are good to go. I'm going to cut the release tomorrow evening 
CEST.

Andreas

On 02.06.20 at 19:12, Tilman Hausherr wrote:
After checking two actual files (thanks Tim) I agree. The differences are minor 
and related to cases where it is difficult to get anything. Other differences 
are improvements.


Tilman

On 02.06.2020 at 02:58, Tim Allison wrote:



Reports are available here:
https://github.com/tballison/share/blob/master/tika_comparisons/reports-pdfbox-2.0.20.tgz 



Looks like there are trivial differences in content with a slight
improvement over 2.0.19.  I don't see any differences in exceptions or
attachments.

Cheers,

 Tim




-
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org





Re: New file vm WAS: Re: Release 2.0.20 ?

2020-06-02 Thread Andreas Lehmkuehler

On 02.06.20 at 23:29, Tim Allison wrote:

https://issues.apache.org/jira/browse/INFRA-20372

On Slack, Gavin suggested something like corpora.tika.apache.org.  I'm
happy with corpora.pdfbox.apache.org or anything else.  Please let us know
what you think over on that ticket.
IMHO it should be either corpora.pdfbox.apache.org or corpora.tika.apache.org. I
would prefer the latter, as Tika is the tool that is mainly used here.


Andreas



Thank you, again!

Cheers,

  Tim

On Tue, Jun 2, 2020 at 3:57 PM Tim Allison  wrote:


proper domain for https access


I just pinged infra on slack.

If they're able to do it, what would we want?

file-corpora.apache.org
corpora.apache.org
corpora-pdfbox.apache.org
corpora-tika.apache.org

Something else?  I'm also happy to buy a domain if that won't work.  There
are a couple available that are close enough.

On Tue, Jun 2, 2020 at 1:08 PM Maruan Sahyoun 
wrote:




AMD ryzen looks fantastic.  Others would be great as well.

If ubuntu is possible at all, that's what I've been working with most
recently.


OK - will setup with that distro



Other than that, ssh access and sudo privileges would be all I'd need.

Are you ok if we set up apache httpd to host files for the public or
will this be a community only resource?


it can be used for whatever we want it to - so if you consider public
file sharing useful of course we can do that. Would be
good if we get a proper domain for https access. Would that be something
infra can do?



If this is corporate sponsored, please let me know how/if we should
mention the sponsorship.


no need to mention it - happy to help.



Again...wow.  Thank you!

Best,

   Tim

On Tue, Jun 2, 2020 at 9:22 AM Maruan Sahyoun 
wrote:


Could fund either:

AMD Ryzen 5 3600
64 GB RAM
2x2TB

or

AMD Ryzen 7 3700X based Server
64 GB RAM
2x8TB

or
Intel® Core™ i9-9900K
64 GB RAM
2x8TB

All are root servers so one has to vote for taking care of them (I
can do the initial setup).



BR
Maruan


There are two use cases.

1) host shared data so that we can all point to and work from the
same data, ideally both literal docs and also extracts (text/metadata
.json files representing extracted information).

2) a modest vm to allow all of us to run the regression tests

We could use help with either or both.

What we had before:
8 GB RAM
8 cores
4 TB -- 2TB for docs, 1TB for extracts, 1TB for staging

We can always use more RAM and more cores up to the point of I/O
bottlenecks.

On Tue, Jun 2, 2020 at 6:37 AM Maruan Sahyoun <sahy...@fileaffairs.de>
wrote:


is that a storage box only or does it need to do some computing
too?


Maybe you could write a small spec for the server requirement?

BR
Maruan



Still haven’t had time to put the server in a dmz. Ugh.

  Yes, more than happy to share.

If anyone has recommendations for file hosting for a couple of
TB, let me know.

One option would be to work with CommonCrawl to bump the max
file size one crawl a year...

On Tue, Jun 2, 2020 at 1:48 AM Tilman Hausherr <thaush...@t-online.de>
wrote:


Can we / I access these files? Most differences are improvements
or not meaningful, but there are a few I'd like to have a look at, e.g.

commoncrawl3/commoncrawl3/XO/XOAAGISRMRPZQRZF4LSMJERGEYK5QI2T

the word "antrag" loses the first "a". Although maybe the "a" was
a big one and gets assigned to another line.

Tilman
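
For readers who want to locate a corpus file like the one named above, here is a
minimal Python sketch. It assumes (the thread does not state this) that files
are stored under a directory named after the first two characters of the file
name, which is what the example path suggests. The function name `corpus_path`
and its `root` and `shard_len` parameters are hypothetical.

```python
from pathlib import PurePosixPath


def corpus_path(root: str, name: str, shard_len: int = 2) -> str:
    """Build a relative path of the form <root>/<leading chars of name>/<name>.

    Assumption: the shard directory is the first `shard_len` characters of
    the file name, as in commoncrawl3/commoncrawl3/XO/XOAAG... above.
    """
    if len(name) < shard_len:
        raise ValueError("file name is shorter than the shard prefix")
    return str(PurePosixPath(root) / name[:shard_len] / name)


if __name__ == "__main__":
    # Rebuilds the path quoted in the message from its root and file name.
    print(corpus_path("commoncrawl3/commoncrawl3",
                      "XOAAGISRMRPZQRZF4LSMJERGEYK5QI2T"))
```

If the real layout differs, only the slicing of `name` would need to change.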

On 02.06.2020 at 02:58, Tim Allison wrote:

Reports are available here:



https://github.com/tballison/share/blob/master/tika_comparisons/reports-pdfbox-2.0.20.tgz

Looks like there are trivial differences in content with a slight
improvement over 2.0.19.  I don't see any differences in exceptions
or attachments.

Cheers,

  Tim





--
Maruan Sahyoun

FileAffairs GmbH
Josef-Schappe-Straße 21
40882 Ratingen

Tel: +49 (2102) 89497 88
Fax: +49 (2102) 89497 91
sahy...@fileaffairs.de
www.fileaffairs.de

Geschäftsführer: Maruan Sahyoun
Handelsregister: AG Düsseldorf, HRB 53837
UST.-ID: DE248275827





Re: New file vm WAS: Re: Release 2.0.20 ?

2020-06-02 Thread Andreas Lehmkuehler

On 02.06.20 at 22:20, Maruan Sahyoun wrote:
  

Maruan,
   To confirm, you're ok if we grant access to the server to our colleagues
on Tika and POI?


To be clear - my company is only sponsoring the box. It's the projects' decision
who needs access, not mine. So feel free.

Thanks a lot Maruan! Should we mention your company somewhere as sponsor?

Andreas




BR
Maruan



   Again, wow, THANK YOU!

Best,

   Tim

On Tue, Jun 2, 2020 at 3:57 PM Tim Allison  wrote:


proper domain for https access


I just pinged infra on slack.

If they're able to do it, what would we want?

file-corpora.apache.org
corpora.apache.org
corpora-pdfbox.apache.org
corpora-tika.apache.org

Something else?  I'm also happy to buy a domain if that won't work.  There
are a couple available that are close enough.

On Tue, Jun 2, 2020 at 1:08 PM Maruan Sahyoun 
wrote:


AMD ryzen looks fantastic.  Others would be great as well.

If ubuntu is possible at all, that's what I've been working with most
recently.


OK - will setup with that distro


Other than that, ssh access and sudo privileges would be all I'd need.

Are you ok if we set up apache httpd to host files for the public or
will this be a community only resource?


it can be used for whatever we want it to - so if you consider public
file sharing useful of course we can do that. Would be
good if we get a proper domain for https access. Would that be something
infra can do?


If this is corporate sponsored, please let me know how/if we should
mention the sponsorship.


no need to mention it - happy to help.


Again...wow.  Thank you!

Best,

   Tim

On Tue, Jun 2, 2020 at 9:22 AM Maruan Sahyoun 
wrote:


Could fund either:

AMD Ryzen 5 3600
64 GB RAM
2x2TB

or

AMD Ryzen 7 3700X based Server
64 GB RAM
2x8TB

or
Intel® Core™ i9-9900K
64 GB RAM
2x8TB

All are root servers so one has to vote for taking care of them (I
can do the initial setup).



BR
Maruan


There are two use cases.

1) host shared data so that we can all point to and work from the
same data, ideally both literal docs and also extracts (text/metadata
.json files representing extracted information).

2) a modest vm to allow all of us to run the regression tests

We could use help with either or both.

What we had before:
8 GB RAM
8 cores
4 TB -- 2TB for docs, 1TB for extracts, 1TB for staging

We can always use more RAM and more cores up to the point of I/O
bottlenecks.

On Tue, Jun 2, 2020 at 6:37 AM Maruan Sahyoun <sahy...@fileaffairs.de>
wrote:


is that a storage box only or does it need to do some computing
too?

Maybe you could write a small spec for the server requirement?

BR
Maruan



Still haven’t had time to put the server in a dmz. Ugh.

  Yes, more than happy to share.

If anyone has recommendations for file hosting for a couple of
TB, let me know.

One option would be to work with CommonCrawl to bump the max
file size one crawl a year...

On Tue, Jun 2, 2020 at 1:48 AM Tilman Hausherr <thaush...@t-online.de>
wrote:


Can we / I access these files? Most differences are improvements
or not meaningful, but there are a few I'd like to have a look at, e.g.

commoncrawl3/commoncrawl3/XO/XOAAGISRMRPZQRZF4LSMJERGEYK5QI2T

the word "antrag" loses the first "a". Although maybe the "a" was
a big one and gets assigned to another line.

Tilman

On 02.06.2020 at 02:58, Tim Allison wrote:

Reports are available here:

https://github.com/tballison/share/blob/master/tika_comparisons/reports-pdfbox-2.0.20.tgz

Looks like there are trivial differences in content with a slight
improvement over 2.0.19.  I don't see any differences in exceptions
or attachments.

Cheers,

  Tim





--
Maruan Sahyoun

FileAffairs GmbH
Josef-Schappe-Straße 21
40882 Ratingen

Tel: +49 (2102) 89497 88
Fax: +49 (2102) 89497 91
sahy...@fileaffairs.de
www.fileaffairs.de

Geschäftsführer: Maruan Sahyoun
Handelsregister: AG Düsseldorf, HRB 53837
UST.-ID: DE248275827





Re: New file vm WAS: Re: Release 2.0.20 ?

2020-06-02 Thread Maruan Sahyoun
 
> I'm rsync'ing the data over now.  I probably won't get around to setting up 
> httpd this week, but if anyone else wants to take it, go for it.  This will 
> at least get team members access to the files asap.

I can take care of httpd, but I would prefer to wait until the subdomain/cert is
done, as I'd go for https only. If access is needed sooner, let me know - I'd do
an initial setup in that case.

BR
Maruan

> 
> I've disabled login via password. 
> 
> If anyone feels that I'm doing something wrong, please let me know!
> 
> Cheers and thank you Maruan!
> 
> Tim
> 
> On Tue, Jun 2, 2020 at 4:20 PM Maruan Sahyoun  wrote:
> >  
> > > Maruan,
> > >   To confirm, you're ok if we grant access to the server to our colleagues
> > > on Tika and POI?
> > 
> > to be clear - my company is only sponsoring the box. It's the projects 
> > decision who needs access not mine. So feel free.
> > 
> > BR
> > Maruan
> > 
> > 
> > >   Again, wow, THANK YOU!
> > > 
> > >Best,
> > > 
> > >   Tim
> > > 
> > > On Tue, Jun 2, 2020 at 3:57 PM Tim Allison  wrote:
> > > 
> > > > > proper domain for https access
> > > > 
> > > > I just pinged infra on slack.
> > > > 
> > > > If they're able to do it, what would we want?
> > > > 
> > > > file-corpora.apache.org
> > > > corpora.apache.org
> > > > corpora-pdfbox.apache.org
> > > > corpora-tika.apache.org
> > > > 
> > > > Something else?  I'm also happy to buy a domain if that won't work.  
> > > > There
> > > > are a couple available that are close enough.
> > > > 
> > > > On Tue, Jun 2, 2020 at 1:08 PM Maruan Sahyoun 
> > > > wrote:
> > > > 
> > > > > > AMD ryzen looks fantastic.  Others would be great as well.
> > > > > > 
> > > > > > If ubuntu is possible at all, that's what I've been working with 
> > > > > > most
> > > > > > recently.
> > > > > 
> > > > > OK - will setup with that distro
> > > > > 
> > > > > > Other than that, ssh access and sudo privileges would be all I'd 
> > > > > > need.
> > > > > > 
> > > > > > Are you ok if we set up apache httpd to host files for the public or
> > > > > will
> > > > > > this be a community only resource?
> > > > > 
> > > > > it can be used for whatever we want it to - so if you consider public
> > > > > file sharing useful of course we can do that. Would be
> > > > > good if we get a proper domain for https access. Would that be 
> > > > > something
> > > > > infra can do?
> > > > > 
> > > > > > If this is corporate sponsored, please let me know how/if we should
> > > > > mention
> > > > > > the sponsorship.
> > > > > 
> > > > > no need to mention it - happy to help.
> > > > > 
> > > > > > Again...wow.  Thank you!
> > > > > > 
> > > > > > Best,
> > > > > > 
> > > > > >   Tim
> > > > > > 
> > > > > > On Tue, Jun 2, 2020 at 9:22 AM Maruan Sahyoun 
> > > > > > 
> > > > > > wrote:
> > > > > > 
> > > > > > > Could fund either:
> > > > > > > 
> > > > > > > AMD Ryzen 5 3600
> > > > > > > 64 GB RAM
> > > > > > > 2x2TB
> > > > > > > 
> > > > > > > or
> > > > > > > 
> > > > > > > AMD Ryzen 7 3700X based Server
> > > > > > > 64 GB RAM
> > > > > > > 2x8TB
> > > > > > > 
> > > > > > > or
> > > > > > > Intel® Core™ i9-9900K
> > > > > > > 64 GB RAM
> > > > > > > 2x8TB
> > > > > > > 
> > > > > > > All are root servers so one has to vote for taking care of them (I
> > > > > can do
> > > > > > > the initial setup).
> > > > > > > 
> > > > > > > 
> > > > > > > 
> > > > > > > BR
> > > > > > > Maruan
> > > > > > > 
> > > > > > > > There are two use cases.
> > > > > > > > 
> > > > > > > > 1) host shared data so that we can all point to and work from 
> > > > > > > > the
> > > > > same
> > > > > > > > data, ideally both literal docs and also extracts (text/metadata
> > > > > .json
> > > > > > > > files representing extracted information).
> > > > > > > > 
> > > > > > > > 2) a modest vm to allow all of us to run the regression tests
> > > > > > > > 
> > > > > > > > We could use help with either or both.
> > > > > > > > 
> > > > > > > > What we had before:
> > > > > > > > 8 GB RAM
> > > > > > > > 8 cores
> > > > > > > > 4 TB -- 2TB for docs, 1TB for extracts, 1TB for staging
> > > > > > > > 
> > > > > > > > We can always use more RAM and more cores up to the point of I/O
> > > > > > > > bottlenecks.
> > > > > > > > 
> > > > > > > > On Tue, Jun 2, 2020 at 6:37 AM Maruan Sahyoun <
> > > > > sahy...@fileaffairs.de>
> > > > > > > > wrote:
> > > > > > > > 
> > > > > > > > > is that a storage box only or does it need to do some 
> > > > > > > > > computings
> > > > > too?
> > > > > > > > > Maybe you could write a small spec for the server requirement?
> > > > > > > > > 
> > > > > > > > > BR
> > > > > > > > > Maruan
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > > Still haven’t had time to put the server in a dmz. Ugh.
> > > > > > > > > > 
> > > > > > > > > >  Yes, more than happy to share.
> > > > > > > > > > 
> > > > > > > > > > If anyone has recommendations for file hosting for a couple 
> > > > > > > > > > of

Re: New file vm WAS: Re: Release 2.0.20 ?

2020-06-02 Thread Tim Allison
https://issues.apache.org/jira/browse/INFRA-20372

On Slack, Gavin suggested something like corpora.tika.apache.org.  I'm
happy with corpora.pdfbox.apache.org or anything else.  Please let us know
what you think over on that ticket.

Thank you, again!

Cheers,

 Tim

On Tue, Jun 2, 2020 at 3:57 PM Tim Allison  wrote:

> >proper domain for https access
>
> I just pinged infra on slack.
>
> If they're able to do it, what would we want?
>
> file-corpora.apache.org
> corpora.apache.org
> corpora-pdfbox.apache.org
> corpora-tika.apache.org
>
> Something else?  I'm also happy to buy a domain if that won't work.  There
> are a couple available that are close enough.
>
> On Tue, Jun 2, 2020 at 1:08 PM Maruan Sahyoun 
> wrote:
>
>>
>> > AMD ryzen looks fantastic.  Others would be great as well.
>> >
>> > If ubuntu is possible at all, that's what I've been working with most
>> > recently.
>>
>> OK - will setup with that distro
>>
>> >
>> > Other than that, ssh access and sudo privileges would be all I'd need.
>> >
>> > Are you ok if we set up apache httpd to host files for the public or
>> will
>> > this be a community only resource?
>>
>> it can be used for whatever we want it to - so if you consider public
>> file sharing useful of course we can do that. Would be
>> good if we get a proper domain for https access. Would that be something
>> infra can do?
>>
>> >
>> > If this is corporate sponsored, please let me know how/if we should mention
>> > the sponsorship.
>>
>> no need to mention it - happy to help.
>>
>> >
>> > Again...wow.  Thank you!
>> >
>> > Best,
>> >
>> >   Tim
>> >

Re: New file vm WAS: Re: Release 2.0.20 ?

2020-06-02 Thread Tim Allison
I'm rsync'ing the data over now.  I probably won't get around to setting up
httpd this week, but if anyone else wants to take it, go for it.  This will
at least get team members access to the files asap.

I've disabled login via password.

If anyone feels that I'm doing something wrong, please let me know!

Cheers and thank you Maruan!

Tim

On Tue, Jun 2, 2020 at 4:20 PM Maruan Sahyoun 
wrote:

>
> > Maruan,
> >   To confirm, you're ok if we grant access to the server to our colleagues
> > on Tika and POI?
>
> To be clear - my company is only sponsoring the box. It's the project's
> decision who needs access, not mine. So feel free.
>
> BR
> Maruan
>
>
> >   Again, wow, THANK YOU!
> >
> >Best,
> >
> >   Tim
> >
> > On Tue, Jun 2, 2020 at 3:57 PM Tim Allison  wrote:
> >
> > > > proper domain for https access
> > >
> > > I just pinged infra on slack.
> > >
> > > If they're able to do it, what would we want?
> > >
> > > file-corpora.apache.org
> > > corpora.apache.org
> > > corpora-pdfbox.apache.org
> > > corpora-tika.apache.org
> > >
> > > Something else?  I'm also happy to buy a domain if that won't work.  There
> > > are a couple available that are close enough.
> > >
> > > On Tue, Jun 2, 2020 at 1:08 PM Maruan Sahyoun 
> > > wrote:
> > >
> > > > > AMD ryzen looks fantastic.  Others would be great as well.
> > > > >
> > > > > If ubuntu is possible at all, that's what I've been working with most
> > > > > recently.
> > > >
> > > > OK - will setup with that distro
> > > >
> > > > > Other than that, ssh access and sudo privileges would be all I'd need.
> > > > >
> > > > > Are you ok if we set up apache httpd to host files for the public or will
> > > > > this be a community only resource?
> > > >
> > > > it can be used for whatever we want it to - so if you consider public
> > > > file sharing useful of course we can do that. Would be
> > > > good if we get a proper domain for https access. Would that be something
> > > > infra can do?
> > > >
> > > > > If this is corporate sponsored, please let me know how/if we should mention
> > > > > the sponsorship.
> > > >
> > > > no need to mention it - happy to help.
> > > >
> > > > > Again...wow.  Thank you!
> > > > >
> > > > > Best,
> > > > >
> > > > >   Tim
> > > > >

Re: New file vm WAS: Re: Release 2.0.20 ?

2020-06-02 Thread Maruan Sahyoun
 
> Maruan,
>   To confirm, you're ok if we grant access to the server to our colleagues
> on Tika and POI?

To be clear - my company is only sponsoring the box. It's the project's decision
who needs access, not mine. So feel free.

BR
Maruan


>   Again, wow, THANK YOU!
> 
>Best,
> 
>   Tim
> 
> On Tue, Jun 2, 2020 at 3:57 PM Tim Allison  wrote:
> 
> > > proper domain for https access
> > 
> > I just pinged infra on slack.
> > 
> > If they're able to do it, what would we want?
> > 
> > file-corpora.apache.org
> > corpora.apache.org
> > corpora-pdfbox.apache.org
> > corpora-tika.apache.org
> > 
> > Something else?  I'm also happy to buy a domain if that won't work.  There
> > are a couple available that are close enough.
> > 
> > On Tue, Jun 2, 2020 at 1:08 PM Maruan Sahyoun 
> > wrote:
> > 
> > > > AMD ryzen looks fantastic.  Others would be great as well.
> > > > 
> > > > If ubuntu is possible at all, that's what I've been working with most
> > > > recently.
> > > 
> > > OK - will setup with that distro
> > > 
> > > > Other than that, ssh access and sudo privileges would be all I'd need.
> > > > 
> > > > Are you ok if we set up apache httpd to host files for the public or will
> > > > this be a community only resource?
> > > 
> > > it can be used for whatever we want it to - so if you consider public
> > > file sharing useful of course we can do that. Would be
> > > good if we get a proper domain for https access. Would that be something
> > > infra can do?
> > > 
> > > > If this is corporate sponsored, please let me know how/if we should mention
> > > > the sponsorship.
> > > 
> > > no need to mention it - happy to help.
> > > 
> > > > Again...wow.  Thank you!
> > > > 
> > > > Best,
> > > > 
> > > >   Tim
> > > > 

Re: New file vm WAS: Re: Release 2.0.20 ?

2020-06-02 Thread Tim Allison
Maruan,
  To confirm, you're ok if we grant access to the server to our colleagues
on Tika and POI?
  Again, wow, THANK YOU!

   Best,

  Tim

On Tue, Jun 2, 2020 at 3:57 PM Tim Allison  wrote:

> >proper domain for https access
>
> I just pinged infra on slack.
>
> If they're able to do it, what would we want?
>
> file-corpora.apache.org
> corpora.apache.org
> corpora-pdfbox.apache.org
> corpora-tika.apache.org
>
> Something else?  I'm also happy to buy a domain if that won't work.  There
> are a couple available that are close enough.
>
> On Tue, Jun 2, 2020 at 1:08 PM Maruan Sahyoun 
> wrote:
>
>>
>> > AMD ryzen looks fantastic.  Others would be great as well.
>> >
>> > If ubuntu is possible at all, that's what I've been working with most
>> > recently.
>>
>> OK - will setup with that distro
>>
>> >
>> > Other than that, ssh access and sudo privileges would be all I'd need.
>> >
>> > Are you ok if we set up apache httpd to host files for the public or will
>> > this be a community only resource?
>>
>> it can be used for whatever we want it to - so if you consider public
>> file sharing useful of course we can do that. Would be
>> good if we get a proper domain for https access. Would that be something
>> infra can do?
>>
>> >
>> > If this is corporate sponsored, please let me know how/if we should mention
>> > the sponsorship.
>>
>> no need to mention it - happy to help.
>>
>> >
>> > Again...wow.  Thank you!
>> >
>> > Best,
>> >
>> >   Tim
>> >

New file vm WAS: Re: Release 2.0.20 ?

2020-06-02 Thread Tim Allison
>proper domain for https access

I just pinged infra on slack.

If they're able to do it, what would we want?

file-corpora.apache.org
corpora.apache.org
corpora-pdfbox.apache.org
corpora-tika.apache.org

Something else?  I'm also happy to buy a domain if that won't work.  There
are a couple available that are close enough.

On Tue, Jun 2, 2020 at 1:08 PM Maruan Sahyoun 
wrote:

>
> > AMD ryzen looks fantastic.  Others would be great as well.
> >
> > If ubuntu is possible at all, that's what I've been working with most
> > recently.
>
> OK - will setup with that distro
>
> >
> > Other than that, ssh access and sudo privileges would be all I'd need.
> >
> > Are you ok if we set up apache httpd to host files for the public or will
> > this be a community only resource?
>
> it can be used for whatever we want it to - so if you consider public file
> sharing useful of course we can do that. Would be
> good if we get a proper domain for https access. Would that be something
> infra can do?
>
> >
> > If this is corporate sponsored, please let me know how/if we should mention
> > the sponsorship.
>
> no need to mention it - happy to help.
>
> >
> > Again...wow.  Thank you!
> >
> > Best,
> >
> >   Tim
> >

Re: Release 2.0.20 ?

2020-06-02 Thread Tilman Hausherr

On 02.06.2020 at 19:24, Maruan Sahyoun wrote:

Order placed. Once the server is available and the initial setup done I'll post 
here. Should be done by end of week depending on
my other workload.



Thanks!!

Tilman


-
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org



Re: Release 2.0.20 ?

2020-06-02 Thread Maruan Sahyoun
 
>  
> > AMD ryzen looks fantastic.  Others would be great as well.
> > 
> > If ubuntu is possible at all, that's what I've been working with most
> > recently.
> 
> OK - will setup with that distro
> 
> > Other than that, ssh access and sudo privileges would be all I'd need.
> > 
> > Are you ok if we set up apache httpd to host files for the public or will
> > this be a community only resource?
> 
> it can be used for whatever we want it to - so if you consider public file 
> sharing useful of course we can do that. Would be
> good if we get a proper domain for https access. Would that be something 
> infra can do?
> 
> > If this is corporate sponsored, please let me know how/if we should mention
> > the sponsorship.
> 
> no need to mention it - happy to help. 
> 
> > Again...wow.  Thank you!

Order placed. Once the server is available and the initial setup is done, I'll
post here. It should be done by the end of the week, depending on my other
workload.

BR
Maruan


> > 
> > Best,
> > 
> >   Tim
> > 
> > On Tue, Jun 2, 2020 at 9:22 AM Maruan Sahyoun 
> > wrote:
> > 
> > > Could fund either:
> > > 
> > > AMD Ryzen 5 3600
> > > 64 GB RAM
> > > 2x2TB
> > > 
> > > or
> > > 
> > > AMD Ryzen 7 3700X based Server
> > > 64 GB RAM
> > > 2x8TB
> > > 
> > > or
> > > Intel® Core™ i9-9900K
> > > 64 GB RAM
> > > 2x8TB
> > > 
> > > All are root servers so one has to vote for taking care of them (I can do
> > > the initial setup).
> > > 
> > > 
> > > 
> > > BR
> > > Maruan
> > > 
> > > > There are two use cases.
> > > > 
> > > > 1) host shared data so that we can all point to and work from the same
> > > > data, ideally both literal docs and also extracts (text/metadata .json
> > > > files representing extracted information).
> > > > 
> > > > 2) a modest vm to allow all of us to run the regression tests
> > > > 
> > > > We could use help with either or both.
> > > > 
> > > > What we had before:
> > > > 8 GB RAM
> > > > 8 cores
> > > > 4 TB -- 2TB for docs, 1TB for extracts, 1TB for staging
> > > > 
> > > > We can always use more RAM and more cores up to the point of I/O
> > > > bottlenecks.
> > > > 
> > > > On Tue, Jun 2, 2020 at 6:37 AM Maruan Sahyoun 
> > > > wrote:
> > > > 
> > > > > is that a storage box only or does it need to do some computings too?
> > > > > 
> > > > > Maybe you could write a small spec for the server requirement?
> > > > > 
> > > > > BR
> > > > > Maruan
> > > > > 
> > > > > 
> > > > > > Still haven’t had time to put the server in a dmz. Ugh.
> > > > > > 
> > > > > >  Yes, more than happy to share.
> > > > > > 
> > > > > > If anyone has recommendations for file hosting for a couple of TB, let me
> > > > > > know.
> > > > > >
> > > > > > One option would be to work with CommonCrawl to bump the max file size one
> > > > > > crawl a year...
> > > > > >
> > > > > > On Tue, Jun 2, 2020 at 1:48 AM Tilman Hausherr <thaush...@t-online.de>
> > > > > > wrote:
> > > > > >
> > > > > > > Can we / I access these files? Most differences are improvements or not
> > > > > > > meaningful, but there are a few I'd like to have a look, e.g.
> > > > > > >
> > > > > > > commoncrawl3/commoncrawl3/XO/XOAAGISRMRPZQRZF4LSMJERGEYK5QI2T
> > > > > > >
> > > > > > > the word "antrag" loses the first "a". Although maybe the "a" was a big
> > > > > > > one and gets assigned to another line.
> > > > > > >
> > > > > > > Tilman
> > > > > > >
> > > > > > > On 02.06.2020 at 02:58, Tim Allison wrote:
> > > > > > > > > > Reports are available here:
> > > > > > > > > > https://github.com/tballison/share/blob/master/tika_comparisons/reports-pdfbox-2.0.20.tgz
> > > > > > > > Looks like there are trivial differences in content with a slight
> > > > > > > > improvement over 2.0.19.  I don't see any differences in exceptions or
> > > > > > > > attachments.
> > > > > > > > 
> > > > > > > > Cheers,
> > > > > > > > 
> > > > > > > >  Tim
> > > > > > > > 
> > > -
> > > > > > > To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
> > > > > > > For additional commands, e-mail: dev-h...@pdfbox.apache.org
> > > > > > > 
> > > > > > > 
> > > --
> > > Maruan Sahyoun
> > > 
> > > FileAffairs GmbH
> > > Josef-Schappe-Straße 21
> > > 40882 Ratingen
> > > 
> > > Tel: +49 (2102) 89497 88
> > > Fax: +49 (2102) 89497 91
> > > sahy...@fileaffairs.de
> > > www.fileaffairs.de
> > > 
> > > Geschäftsführer: Maruan Sahyoun
> > > Handelsregister: AG Düsseldorf, HRB 53837
> > > UST.-ID: DE248275827
> > > 
> > > 
-- 
Maruan Sahyoun

FileAffairs GmbH
Josef-Schappe-Straße 21
40882 Ratingen

Tel: +49 (2102) 89497 88
Fax: +49 (2102) 89497 91
sahy...@fileaffairs.de
www.fileaffairs.de

Geschäftsführer: Maruan Sahyoun
Handelsregister: AG Düsseldorf, HRB 53837
UST.-ID: DE248275827



Re: Release 2.0.20 ?

2020-06-02 Thread Tilman Hausherr
After checking two actual files (thanks Tim) I agree. The differences 
are minor and related to cases where it is difficult to get anything. 
Other differences are improvements.


Tilman

On 02.06.2020 at 02:58, Tim Allison wrote:



Reports are available here:

https://github.com/tballison/share/blob/master/tika_comparisons/reports-pdfbox-2.0.20.tgz

Looks like there are trivial differences in content with a slight
improvement over 2.0.19.  I don't see any differences in exceptions or
attachments.

Cheers,

 Tim







Re: Release 2.0.20 ?

2020-06-02 Thread Maruan Sahyoun
 
> AMD ryzen looks fantastic.  Others would be great as well.
> 
> If ubuntu is possible at all, that's what I've been working with most
> recently.

OK - will set up with that distro

> 
> Other than that, ssh access and sudo privileges would be all I'd need.
> 
> Are you ok if we set up apache httpd to host files for the public or will
> this be a community only resource?

It can be used for whatever we want - so if you consider public file sharing
useful, of course we can do that. It would be good if we got a proper domain
for https access. Would that be something infra can do?

> 
> If this is corporate sponsored, please let me know how/if we should mention
> the sponsorship.

no need to mention it - happy to help. 

> 
> Again...wow.  Thank you!
> 
> Best,
> 
>   Tim
> 
> On Tue, Jun 2, 2020 at 9:22 AM Maruan Sahyoun 
> wrote:
> 
> > Could fund either:
> > 
> > AMD Ryzen 5 3600
> > 64 GB RAM
> > 2x2TB
> > 
> > or
> > 
> > AMD Ryzen 7 3700X based Server
> > 64 GB RAM
> > 2x8TB
> > 
> > or
> > Intel® Core™ i9-9900K
> > 64 GB RAM
> > 2x8TB
> > 
> > All are root servers so one has to vote for taking care of them (I can do
> > the initial setup).
> > 
> > 
> > 
> > BR
> > Maruan
> > 
> > > There are two use cases.
> > > 
> > > 1) host shared data so that we can all point to and work from the same
> > > data, ideally both literal docs and also extracts (text/metadata .json
> > > files representing extracted information).
> > > 
> > > 2) a modest vm to allow all of us to run the regression tests
> > > 
> > > We could use help with either or both.
> > > 
> > > What we had before:
> > > 8 GB RAM
> > > 8 cores
> > > 4 TB -- 2TB for docs, 1TB for extracts, 1TB for staging
> > > 
> > > We can always use more RAM and more cores up to the point of I/O
> > > bottlenecks.
> > > 
> > > On Tue, Jun 2, 2020 at 6:37 AM Maruan Sahyoun 
> > > wrote:
> > > 
> > > > is that a storage box only or does it need to do some computings too?
> > > > 
> > > > Maybe you could write a small spec for the server requirement?
> > > > 
> > > > BR
> > > > Maruan
> > > > 
> > > > 
> > > > > Still haven’t had time to put the server in a dmz. Ugh.
> > > > > 
> > > > >  Yes, more than happy to share.
> > > > > 
> > > > > If anyone has recommendations for file hosting for a couple of TB, let me
> > > > > know.
> > > > >
> > > > > One option would be to work with CommonCrawl to bump the max file size one
> > > > > crawl a year...
> > > > >
> > > > > On Tue, Jun 2, 2020 at 1:48 AM Tilman Hausherr <thaush...@t-online.de>
> > > > > wrote:
> > > > >
> > > > > > Can we / I access these files? Most differences are improvements or not
> > > > > > meaningful, but there are a few I'd like to have a look, e.g.
> > > > > >
> > > > > > commoncrawl3/commoncrawl3/XO/XOAAGISRMRPZQRZF4LSMJERGEYK5QI2T
> > > > > >
> > > > > > the word "antrag" loses the first "a". Although maybe the "a" was a big
> > > > > > one and gets assigned to another line.
> > > > > >
> > > > > > Tilman
> > > > > >
> > > > > > On 02.06.2020 at 02:58, Tim Allison wrote:
> > > > > > > > > Reports are available here:
> > > > > > > > > https://github.com/tballison/share/blob/master/tika_comparisons/reports-pdfbox-2.0.20.tgz
> > > > > > > Looks like there are trivial differences in content with a slight
> > > > > > > improvement over 2.0.19.  I don't see any differences in exceptions or
> > > > > > > attachments.
> > > > > > > 
> > > > > > > Cheers,
> > > > > > > 
> > > > > > >  Tim
> > > > > > > 
> > -
> > > > > > To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
> > > > > > For additional commands, e-mail: dev-h...@pdfbox.apache.org
> > > > > > 
> > > > > > 
> > --
> > Maruan Sahyoun
> > 
> > FileAffairs GmbH
> > Josef-Schappe-Straße 21
> > 40882 Ratingen
> > 
> > Tel: +49 (2102) 89497 88
> > Fax: +49 (2102) 89497 91
> > sahy...@fileaffairs.de
> > www.fileaffairs.de
> > 
> > Geschäftsführer: Maruan Sahyoun
> > Handelsregister: AG Düsseldorf, HRB 53837
> > UST.-ID: DE248275827
> > 
> > 
-- 
Maruan Sahyoun

FileAffairs GmbH
Josef-Schappe-Straße 21
40882 Ratingen

Tel: +49 (2102) 89497 88
Fax: +49 (2102) 89497 91
sahy...@fileaffairs.de
www.fileaffairs.de

Geschäftsführer: Maruan Sahyoun
Handelsregister: AG Düsseldorf, HRB 53837
UST.-ID: DE248275827


-
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org



Re: Release 2.0.20 ?

2020-06-02 Thread Tilman Hausherr
After checking two actual files (thanks Tim) I agree. The differences 
are minor and related to cases where it is difficult to get anything. 
Other differences are improvements.


Tilman




Re: Release 2.0.20 ?

2020-06-02 Thread Tim Allison
AMD Ryzen looks fantastic.  Others would be great as well.

If Ubuntu is possible at all, that's what I've been working with most
recently.

Other than that, ssh access and sudo privileges would be all I'd need.

Are you OK if we set up Apache httpd to host files for the public, or will
this be a community-only resource?

If this is corporate sponsored, please let me know how/if we should mention
the sponsorship.

Again...wow.  Thank you!

Best,

  Tim



Re: Release 2.0.20 ?

2020-06-02 Thread Tim Allison
I'd be more than happy to help with maintenance.  This would be AMAZING!



Re: Release 2.0.20 ?

2020-06-02 Thread Maruan Sahyoun
Could fund either:

AMD Ryzen 5 3600 
64 GB RAM
2x2TB

or

AMD Ryzen 7 3700X based Server
64 GB RAM
2x8TB

or

Intel® Core™ i9-9900K
64 GB RAM
2x8TB

All are root servers, so someone has to volunteer to take care of them (I can do
the initial setup).
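
As a rough aid for comparing these options, the sketch below checks each disk configuration against the 4 TB layout used on the previous machine (2 TB docs, 1 TB extracts, 1 TB staging, as stated elsewhere in this thread). Whether the two disks would be mirrored or combined is not stated anywhere in the thread, so both cases are shown purely as assumptions. The function and option names are made up for illustration, and filesystem overhead is ignored.

```python
# Previous layout from the thread: 2 TB docs + 1 TB extracts + 1 TB staging.
REQUIRED_TB = 2 + 1 + 1

# Disk options from the offer, as (number of disks, TB per disk).
OFFERS = {
    "Ryzen 5 3600": (2, 2),
    "Ryzen 7 3700X": (2, 8),
    "Core i9-9900K": (2, 8),
}


def usable_tb(disks, tb_per_disk, mirrored):
    """Usable capacity in TB.

    Mirroring is an assumption (not stated in the thread): a mirrored
    pair keeps only one disk's worth of space.
    """
    return tb_per_disk if mirrored else disks * tb_per_disk


def fits(offers, required_tb, mirrored):
    """Map each offer name to True if its usable capacity covers required_tb."""
    return {
        name: usable_tb(disks, tb, mirrored) >= required_tb
        for name, (disks, tb) in offers.items()
    }


if __name__ == "__main__":
    for mirrored in (False, True):
        label = "mirrored" if mirrored else "combined"
        print(label, fits(OFFERS, REQUIRED_TB, mirrored))
```

Under these assumptions the 2x2TB option only covers the old layout if the disks are combined, while either 2x8TB option leaves room in both cases.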



BR
Maruan
 



Re: Release 2.0.20 ?

2020-06-02 Thread Tim Allison
There are two use cases.

1) host shared data so that we can all point to and work from the same
data, ideally both literal docs and also extracts (text/metadata .json
files representing extracted information).

2) a modest VM to allow all of us to run the regression tests

We could use help with either or both.

What we had before:
8 GB RAM
8 cores
4 TB -- 2TB for docs, 1TB for extracts, 1TB for staging

We can always use more RAM and more cores up to the point of I/O
bottlenecks.



Re: Release 2.0.20 ?

2020-06-02 Thread Maruan Sahyoun
Is that a storage box only, or does it need to do some computing too?

Maybe you could write a small spec for the server requirements?

BR
Maruan

 



Re: Release 2.0.20 ?

2020-06-02 Thread Tim Allison
Our commoncrawl slice + bug trackers are currently 1 TB; govdocs1 is another
0.5 TB.

2 TB would safely cover the source documents that we're currently using.
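
The arithmetic behind that estimate can be written down as a small sketch. The corpus sizes are the ones given in this message; the dictionary and function names are only illustrative.

```python
# Corpus sizes in TB, as given in this message.
CORPORA_TB = {
    "commoncrawl slice + bug trackers": 1.0,
    "govdocs1": 0.5,
}


def headroom_tb(corpora_tb, capacity_tb):
    """Capacity left after storing every corpus, in TB (negative means it does not fit)."""
    return capacity_tb - sum(corpora_tb.values())


if __name__ == "__main__":
    print(headroom_tb(CORPORA_TB, 2.0))  # 0.5
```

With 2 TB of capacity this leaves 0.5 TB spare for the source documents alone; extracts and staging space are not included in this figure.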





Re: Release 2.0.20 ?

2020-06-02 Thread Maruan Sahyoun
How many TB would that be?
 



Re: Release 2.0.20 ?

2020-06-02 Thread Tim Allison
Still haven’t had time to put the server in a DMZ. Ugh.

 Yes, more than happy to share.

If anyone has recommendations for file hosting for a couple of TB, let me
know.

One option would be to work with CommonCrawl to bump the max file size one
crawl a year...



Re: Release 2.0.20 ?

2020-06-01 Thread Tilman Hausherr
Can we / I access these files? Most differences are improvements or not
meaningful, but there are a few I'd like to have a look at, e.g.


commoncrawl3/commoncrawl3/XO/XOAAGISRMRPZQRZF4LSMJERGEYK5QI2T

The word "antrag" loses the first "a", although maybe the "a" was a big
one and gets assigned to another line.
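
One way to spot this kind of difference between two extracts is a character-level diff. The following is only a sketch using Python's standard `difflib`; it is not how the comparison reports in this thread are produced, and the function name and sample strings are made up for illustration.

```python
from difflib import SequenceMatcher


def dropped_spans(old_text, new_text):
    """Return (offset, text) pairs for spans present in old_text but missing from new_text."""
    matcher = SequenceMatcher(None, old_text, new_text, autojunk=False)
    return [
        (i1, old_text[i1:i2])
        for tag, i1, i2, _j1, _j2 in matcher.get_opcodes()
        if tag == "delete"
    ]


if __name__ == "__main__":
    # Hypothetical extracts: the newer one has lost the leading "a".
    print(dropped_spans("antrag", "ntrag"))  # [(0, 'a')]
```

A diff like this cannot tell whether the character was really lost or only moved to another line, which is the open question above; for that the original file is still needed.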


Tilman




Re: Release 2.0.20 ?

2020-06-01 Thread Tim Allison
Reports are available here:
https://github.com/tballison/share/blob/master/tika_comparisons/reports-pdfbox-2.0.20.tgz

Looks like there are trivial differences in content with a slight
improvement over 2.0.19.  I don't see any differences in exceptions or
attachments.

Cheers,

Tim


Re: Release 2.0.20 ?

2020-06-01 Thread Tim Allison
Got it.  Thank you.  That makes good sense.  Onwards!

On Mon, Jun 1, 2020 at 3:51 PM Tilman Hausherr 
wrote:

> Yes, we use this to test compiling on jenkins with a jdk6 system while
> running the build on a jdk8 system, at the request of Simon Steiner.
>
> It worked fine for a long time, although currently it doesn't. (Because
> it works only with some maven versions)
>
> Tilman
>

Re: Release 2.0.20 ?

2020-06-01 Thread Tilman Hausherr
Yes, we use this to test compiling on jenkins with a jdk6 system while 
running the build on a jdk8 system, at the request of Simon Steiner.


It worked fine for a long time, although currently it doesn't. (Because 
it works only with some maven versions)


Tilman

On 01.06.2020 at 21:01, Tim Allison wrote:

> Do we need this line?  Are we getting benefit from specifying a different
> jdk via JAVA_HOME that we wouldn't get from letting maven use the active
> "alternative"?
>
> Anyways, all good.  Sorry for the noise.


On Mon, Jun 1, 2020 at 1:07 PM Tilman Hausherr wrote:

> On 01.06.2020 at 18:51, Tim Allison wrote:
> > I'm having problems building...likely user error.
> >
> > On ubuntu with Java 8, maven 3.6.3, `mvn clean install` on the commandline:
> > fontbox fails to build "maven-compiler-plugin" Compilation failure with no
> > warnings or info on what failed even with -e -X.
>
> Which java 8? I remember this came with early jdk8 versions, or when
> java wasn't there.
>
> Another possibility would be to update the maven-compiler-plugin to
> 3.8.0.
>
> > On a mac with Java 8, I'm getting test failures testFlattenPDFBOX563() and
> > testFlattenPDFBOX2469Filled().  If I disable those tests, I am able to
> > build it.
>
> I've disabled them now, likely small rendering differences.
>
> Tilman
>
> > All is good for now, but weird...
> >
> > On Sat, May 30, 2020 at 7:37 AM Andreas Lehmkuehler wrote:
> > > @Tim
> > > Cool, yes, please! I'm going to postpone the release for a couple of days
> > > depending on the results.
> > >
> > > Thanks in advance!
> > >
> > > Andreas
> > >
> > > On 30.05.20 at 12:59, Tim Allison wrote:
> > > > I can run the tests on Monday w results by the end of the day EDT if
> > > > desired.
> > > >
> > > > On Sat, May 30, 2020 at 5:53 AM Andreas Lehmkuehler <andr...@lehmi.de>
> > > > wrote:
> > > > > Hi,
> > > > >
> > > > > I just realized that we didn't run Tims tests yet.  I've had a look at
> > > > > the tickets and most of them are not related to text extraction. The
> > > > > remaining are
> > > > > mostly dealing

Re: Release 2.0.20 ?

2020-06-01 Thread Tim Allison
Do we need this line?  Are we getting benefit from specifying a different
jdk via JAVA_HOME that we wouldn't get from letting maven use the active
"alternative"?

Anyways, all good.  Sorry for the noise.
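
For anyone hitting the same silent compilation failure, a quick check of JAVA_HOME before invoking Maven can save some time. This is a sketch only: it assumes, based on the messages in this thread, that the build resolves the compiler as JAVA_HOME/bin/javac. The helper names are hypothetical, and it does not handle `javac.exe` on Windows.

```python
import os
from pathlib import Path


def javac_from_java_home(env=None):
    """Return the javac path implied by JAVA_HOME, or None if JAVA_HOME is unset or empty."""
    env = os.environ if env is None else env
    java_home = env.get("JAVA_HOME", "").strip()
    if not java_home:
        return None
    return Path(java_home) / "bin" / "javac"


def describe(env=None):
    """Return a one-line, human-readable status of the JAVA_HOME setup."""
    javac = javac_from_java_home(env)
    if javac is None:
        return "JAVA_HOME is not set"
    if not javac.is_file():
        return f"JAVA_HOME is set, but {javac} does not exist"
    return f"javac found at {javac}"


if __name__ == "__main__":
    print(describe())
```

Running this in the same shell as the build shows whether the environment the build depends on is actually in place.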

On Mon, Jun 1, 2020 at 2:43 PM Tim Allison  wrote:

> User error...I recently reimaged my laptop and forgot to set JAVA_HOME,
> which needs to be picked up here:
>
> ${env.JAVA_HOME}
>
> for use here:
>
> ${jdk.path}/bin/javac
>
>
> On Mon, Jun 1, 2020 at 2:37 PM Tim Allison  wrote:
>
>> I get the same behavior (compiler failing at fontbox) with Java 11 on
>> ubuntu.  I'm sure this is user error, but it is weird.
>>
>> Apache Maven 3.6.3
>> Maven home: /usr/share/maven
>> Java version: 11.0.7, vendor: AdoptOpenJDK, runtime:
>> /usr/lib/jvm/adoptopenjdk-11-hotspot-amd64
>> Default locale: en_US, platform encoding: UTF-8
>> OS name: "linux", version: "5.4.0-33-generic", arch: "amd64", family:
>> "unix"
>>
>> On Mon, Jun 1, 2020 at 2:24 PM Tim Allison  wrote:
>>
>>> Thank you, Tilman.
>>>
>>> Apache Maven 3.6.3
>>> Maven home: /usr/share/maven
>>> Java version: 1.8.0_252, vendor: AdoptOpenJDK, runtime:
>>> /usr/lib/jvm/adoptopenjdk-8-hotspot-amd64/jre
>>> Default locale: en_US, platform encoding: UTF-8
>>> OS name: "linux", version: "5.4.0-33-generic", arch: "amd64", family:
>>> "unix"
>>>
>>> I specified 3.8.0 and 3.8.1 and still got the following...with no useful
>>> information...or...where do I look for useful information?
>>>
>>> [INFO] ------------------------------------------------------------------------
>>> [INFO] Reactor Summary for PDFBox reactor 2.0.20-SNAPSHOT:
>>> [INFO]
>>> [INFO] PDFBox parent .. SUCCESS [  2.719 s]
>>> [INFO] Apache FontBox . FAILURE [  1.304 s]
>>> [INFO] Apache XmpBox .. SKIPPED
>>> [INFO] Apache PDFBox .. SKIPPED
>>> [INFO] Apache Preflight ... SKIPPED
>>> [INFO] Apache Preflight application ... SKIPPED
>>> [INFO] Apache PDFBox Debugger . SKIPPED
>>> [INFO] Apache PDFBox tools  SKIPPED
>>> [INFO] Apache PDFBox application .. SKIPPED
>>> [INFO] Apache PDFBox Debugger application . SKIPPED
>>> [INFO] Apache PDFBox examples . SKIPPED
>>> [INFO] PDFBox reactor . SKIPPED
>>> [INFO] ------------------------------------------------------------------------
>>> [INFO] BUILD FAILURE
>>> [INFO] ------------------------------------------------------------------------
>>> [INFO] Total time:  5.190 s
>>> [INFO] Finished at: 2020-06-01T14:21:11-04:00
>>> [INFO] ------------------------------------------------------------------------
>>> [ERROR] Failed to execute goal
>>> org.apache.maven.plugins:maven-compiler-plugin:3.8.1:compile
>>> (default-compile) on project fontbox: Compilation failure -> [Help 1]
>>> [ERROR]
>>> [ERROR] To see the full stack trace of the errors, re-run Maven with the
>>> -e switch.
>>> [ERROR] Re-run Maven using the -X switch to enable full debug logging.
>>> [ERROR]
>>> [ERROR] For more information about the errors and possible solutions,
>>> please read the following articles:
>>> [ERROR] [Help 1]
>>> http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
>>> [ERROR]
>>> [ERROR] After correcting the problems, you can resume the build with the
>>> command
>>> [ERROR]   mvn  -rf :fontbox
>>>
>>> On Mon, Jun 1, 2020 at 1:07 PM Tilman Hausherr 
>>> wrote:
>>>
>>>> On 01.06.2020 at 18:51, Tim Allison wrote:
>>>> > I'm having problems building...likely user error.
>>>> >
>>>> > On ubuntu with Java 8, maven 3.6.3, `mvn clean install` on the commandline:
>>>> > fontbox fails to build "maven-compiler-plugin" Compilation failure with no
>>>> > warnings or info on what failed even with -e -X.
>>>>
>>>> Which java 8? I remember this came with early jdk8 versions, or when
>>>> java wasn't there.
>>>>
>>>> Another possibility would be to update the maven-compiler-plugin to
>>>> 3.8.0.
>>>>
>>>> > On a mac with Java 8, I'm getting test failures testFlattenPDFBOX563() and
>>>> > testFlattenPDFBOX2469Filled().  If I disable those tests, I am able to
>>>> > build it.
>>>>
>>>> I've disabled them now, likely small rendering differences.
>>>>
>>>> Tilman
>>>>
>>>> >
>>>> > All is good for now, but weird...
>>>> >
>>>> > On Sat, May 30, 2020 at 7:37 AM Andreas Lehmkuehler
>>>> > wrote:
>>>> >
>>>> >> @Tim
>>>> >> Cool, yes, please! I'm going to postpone the release for a couple of days
>>>> >> depending on the results.
>>>> >>
>>>> >> Thanks in advance!
>>>> >>
>>>> >> Andreas
>>>> >>
>>>> >> On 30.05.20 at 12:59, Tim Allison wrote:
>>>> >>> I can run the tests on Monday w results by the end of the day EDT if
>>>> >>> desired.
>>>> >>>

Re: Release 2.0.20 ?

2020-06-01 Thread Tim Allison
User error...I recently reimaged my laptop and forgot to set JAVA_HOME,
which needs to be picked up here:

${env.JAVA_HOME}

for use here:

${jdk.path}/bin/javac
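
A minimal sketch of a pre-flight check for this situation, assuming only what is described above: the build takes the compiler from JAVA_HOME and expects `bin/javac` underneath it. The function name `check_java_home` and its messages are hypothetical, not part of the PDFBox build.

```shell
#!/bin/sh
# Hypothetical helper (not part of the PDFBox build): confirm that JAVA_HOME
# is set and contains an executable bin/javac before invoking Maven.
check_java_home() {
    if [ -z "${JAVA_HOME:-}" ]; then
        echo "JAVA_HOME is not set" >&2
        return 1
    fi
    if [ ! -x "${JAVA_HOME}/bin/javac" ]; then
        echo "no executable javac under ${JAVA_HOME}/bin" >&2
        return 2
    fi
    echo "using ${JAVA_HOME}/bin/javac"
}

# Usage:
#   check_java_home && mvn clean install
```

Running the check first turns the silent "Compilation failure" into an explicit message about the missing variable.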


On Mon, Jun 1, 2020 at 2:37 PM Tim Allison  wrote:

> I get the same behavior (compiler failing at fontbox) with Java 11 on
> ubuntu.  I'm sure this is user error, but it is weird.
>
> Apache Maven 3.6.3
> Maven home: /usr/share/maven
> Java version: 11.0.7, vendor: AdoptOpenJDK, runtime:
> /usr/lib/jvm/adoptopenjdk-11-hotspot-amd64
> Default locale: en_US, platform encoding: UTF-8
> OS name: "linux", version: "5.4.0-33-generic", arch: "amd64", family:
> "unix"
>
> On Mon, Jun 1, 2020 at 2:24 PM Tim Allison  wrote:
>
>> Thank you, Tilman.
>>
>> Apache Maven 3.6.3
>> Maven home: /usr/share/maven
>> Java version: 1.8.0_252, vendor: AdoptOpenJDK, runtime:
>> /usr/lib/jvm/adoptopenjdk-8-hotspot-amd64/jre
>> Default locale: en_US, platform encoding: UTF-8
>> OS name: "linux", version: "5.4.0-33-generic", arch: "amd64", family:
>> "unix"
>>
>> I specified 3.8.0 and 3.8.1 and still got the following...with no useful
>> information...or...where do I look for useful information?
>>
>> [INFO]
>> 
>> [INFO] Reactor Summary for PDFBox reactor 2.0.20-SNAPSHOT:
>> [INFO]
>> [INFO] PDFBox parent .. SUCCESS [
>>  2.719 s]
>> [INFO] Apache FontBox . FAILURE [
>>  1.304 s]
>> [INFO] Apache XmpBox .. SKIPPED
>> [INFO] Apache PDFBox .. SKIPPED
>> [INFO] Apache Preflight ... SKIPPED
>> [INFO] Apache Preflight application ... SKIPPED
>> [INFO] Apache PDFBox Debugger . SKIPPED
>> [INFO] Apache PDFBox tools  SKIPPED
>> [INFO] Apache PDFBox application .. SKIPPED
>> [INFO] Apache PDFBox Debugger application . SKIPPED
>> [INFO] Apache PDFBox examples . SKIPPED
>> [INFO] PDFBox reactor . SKIPPED
>> [INFO]
>> 
>> [INFO] BUILD FAILURE
>> [INFO]
>> 
>> [INFO] Total time:  5.190 s
>> [INFO] Finished at: 2020-06-01T14:21:11-04:00
>> [INFO]
>> 
>> [ERROR] Failed to execute goal
>> org.apache.maven.plugins:maven-compiler-plugin:3.8.1:compile
>> (default-compile) on project fontbox: Compilation failure -> [Help 1]
>> [ERROR]
>> [ERROR] To see the full stack trace of the errors, re-run Maven with the
>> -e switch.
>> [ERROR] Re-run Maven using the -X switch to enable full debug logging.
>> [ERROR]
>> [ERROR] For more information about the errors and possible solutions,
>> please read the following articles:
>> [ERROR] [Help 1]
>> http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
>> [ERROR]
>> [ERROR] After correcting the problems, you can resume the build with the
>> command
>> [ERROR]   mvn  -rf :fontbox
>>
>> On Mon, Jun 1, 2020 at 1:07 PM Tilman Hausherr 
>> wrote:
>>
>>> On 01.06.2020 at 18:51, Tim Allison wrote:
>>> > I'm having problems building...likely user error.
>>> >
>>> > On ubuntu with Java 8, maven 3.6.3, `mvn clean install` on the
>>> commandline:
>>> > fontbox fails to build "maven-compiler-plugin" Compilation failure
>>> with no
>>> > warnings or info on what failed even with -e -X.
>>>
>>> Which java 8? I remember this came with early jdk8 versions, or when
>>> java wasn't there.
>>>
>>> Another possibility would be to update the maven-compiler-plugin to
>>> 3.8.0.
>>>
>>>
>>> >
>>> >
>>> > On a mac with Java 8, I'm getting test failures testFlattenPDFBOX563()
>>> and
>>> > testFlattenPDFBOX2469Filled().  If I disable those tests, I am able to
>>> > build it.
>>>
>>>
>>> I've disabled them now, likely small rendering differences.
>>>
>>> Tilman
>>>
>>> >
>>> > All is good for now, but weird...
>>> >
>>> > On Sat, May 30, 2020 at 7:37 AM Andreas Lehmkuehler 
>>> > wrote:
>>> >
>>> >> @Tim
>>> >> Cool, yes, please! I'm going to postpone the release for a couple of
>>> days
>>> >> depending on the results.
>>> >>
>>> >> Thanks in advance!
>>> >>
>>> >> Andreas
>>> >>
>>> >> On 30.05.20 at 12:59, Tim Allison wrote:
>>> >>> I can run the tests on Monday w results by the end of the day EDT if
>>> >>> desired.
>>> >>>
>>> >>> On Sat, May 30, 2020 at 5:53 AM Andreas Lehmkuehler <
>>> andr...@lehmi.de>
>>> >>> wrote:
>>> >>>
>>> >>>> Hi,
>>> >>>>
>>> >>>> I just realized that we didn't run Tim's tests yet.  I've had a look at the
>>> >>>> tickets and most of them are not related to text extraction. The remaining
>>> >>>> are mostly dealing with corner cases so that we should be safe.
>>> >>>>

Re: Release 2.0.20 ?

2020-06-01 Thread Tim Allison
I get the same behavior (compiler failing at fontbox) with Java 11 on
ubuntu.  I'm sure this is user error, but it is weird.

Apache Maven 3.6.3
Maven home: /usr/share/maven
Java version: 11.0.7, vendor: AdoptOpenJDK, runtime:
/usr/lib/jvm/adoptopenjdk-11-hotspot-amd64
Default locale: en_US, platform encoding: UTF-8
OS name: "linux", version: "5.4.0-33-generic", arch: "amd64", family: "unix"
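
When comparing environments like the two above, it can help to pull just the JDK version and runtime path out of the `mvn -version` output. The following is a rough sketch; the function name `maven_jdk` is hypothetical, and it assumes the "Java version: ..., vendor: ..., runtime: ..." line arrives unwrapped on a single line, as Maven prints it on a terminal.

```shell
#!/bin/sh
# Hypothetical helper: read `mvn -version` output on stdin and print the JDK
# version and runtime path that Maven reports. Assumes the "Java version"
# line is on a single line and the vendor name contains no comma.
maven_jdk() {
    sed -n 's/^Java version: \([^,]*\), vendor: [^,]*, runtime: \(.*\)$/\1 \2/p'
}

# Usage:
#   mvn -version | maven_jdk
```

This reports the JDK that runs Maven itself, which is not necessarily the compiler a build is configured to fork.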

On Mon, Jun 1, 2020 at 2:24 PM Tim Allison  wrote:

> Thank you, Tilman.
>
> Apache Maven 3.6.3
> Maven home: /usr/share/maven
> Java version: 1.8.0_252, vendor: AdoptOpenJDK, runtime:
> /usr/lib/jvm/adoptopenjdk-8-hotspot-amd64/jre
> Default locale: en_US, platform encoding: UTF-8
> OS name: "linux", version: "5.4.0-33-generic", arch: "amd64", family:
> "unix"
>
> I specified 3.8.0 and 3.8.1 and still got the following...with no useful
> information...or...where do I look for useful information?
>
> [INFO]
> 
> [INFO] Reactor Summary for PDFBox reactor 2.0.20-SNAPSHOT:
> [INFO]
> [INFO] PDFBox parent .. SUCCESS [
>  2.719 s]
> [INFO] Apache FontBox . FAILURE [
>  1.304 s]
> [INFO] Apache XmpBox .. SKIPPED
> [INFO] Apache PDFBox .. SKIPPED
> [INFO] Apache Preflight ... SKIPPED
> [INFO] Apache Preflight application ... SKIPPED
> [INFO] Apache PDFBox Debugger . SKIPPED
> [INFO] Apache PDFBox tools  SKIPPED
> [INFO] Apache PDFBox application .. SKIPPED
> [INFO] Apache PDFBox Debugger application . SKIPPED
> [INFO] Apache PDFBox examples . SKIPPED
> [INFO] PDFBox reactor . SKIPPED
> [INFO]
> 
> [INFO] BUILD FAILURE
> [INFO]
> 
> [INFO] Total time:  5.190 s
> [INFO] Finished at: 2020-06-01T14:21:11-04:00
> [INFO]
> 
> [ERROR] Failed to execute goal
> org.apache.maven.plugins:maven-compiler-plugin:3.8.1:compile
> (default-compile) on project fontbox: Compilation failure -> [Help 1]
> [ERROR]
> [ERROR] To see the full stack trace of the errors, re-run Maven with the
> -e switch.
> [ERROR] Re-run Maven using the -X switch to enable full debug logging.
> [ERROR]
> [ERROR] For more information about the errors and possible solutions,
> please read the following articles:
> [ERROR] [Help 1]
> http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
> [ERROR]
> [ERROR] After correcting the problems, you can resume the build with the
> command
> [ERROR]   mvn  -rf :fontbox
>
> On Mon, Jun 1, 2020 at 1:07 PM Tilman Hausherr 
> wrote:
>
>> On 01.06.2020 at 18:51, Tim Allison wrote:
>> > I'm having problems building...likely user error.
>> >
>> > On ubuntu with Java 8, maven 3.6.3, `mvn clean install` on the
>> commandline:
>> > fontbox fails to build "maven-compiler-plugin" Compilation failure with
>> no
>> > warnings or info on what failed even with -e -X.
>>
>> Which java 8? I remember this came with early jdk8 versions, or when
>> java wasn't there.
>>
>> Another possibility would be to update the maven-compiler-plugin to 3.8.0.
>>
>>
>> >
>> >
>> > On a mac with Java 8, I'm getting test failures testFlattenPDFBOX563()
>> and
>> > testFlattenPDFBOX2469Filled().  If I disable those tests, I am able to
>> > build it.
>>
>>
>> I've disabled them now, likely small rendering differences.
>>
>> Tilman
>>
>> >
>> > All is good for now, but weird...
>> >
>> > On Sat, May 30, 2020 at 7:37 AM Andreas Lehmkuehler 
>> > wrote:
>> >
>> >> @Tim
>> >> Cool, yes, please! I'm going to postpone the release for a couple of
>> days
>> >> depending on the results.
>> >>
>> >> Thanks in advance!
>> >>
>> >> Andreas
>> >>
>> >> On 30.05.20 at 12:59, Tim Allison wrote:
>> >>> I can run the tests on Monday w results by the end of the day EDT if
>> >>> desired.
>> >>>
>> >>> On Sat, May 30, 2020 at 5:53 AM Andreas Lehmkuehler
>> >>> wrote:
>> >>>
>> >>>> Hi,
>> >>>>
>> >>>> I just realized that we didn't run Tim's tests yet.  I've had a look at the
>> >>>> tickets and most of them are not related to text extraction. The remaining
>> >>>> are mostly dealing with corner cases so that we should be safe.
>> >>>>
>> >>>> WDYT, are we safe enough or do we need to run the tests before cutting the
>> >>>> release?
>> >>>>
>> >>>> Andreas
>> >>>>
>> >>>> On 26.05.20 at 08:02, Andreas Lehmkuehler wrote:
>> >>>>> I'm planning to cut the release on next Monday 1st of June.
>> >>>>>
>> >>>>> Andreas
>> >>>>>
>> >>>>> On 20.05.20 at 08:15, Andreas Lehmkuehler wrote:
>> >>>>>> Hi,

Re: Release 2.0.20 ?

2020-06-01 Thread Tim Allison
Thank you, Tilman.

Apache Maven 3.6.3
Maven home: /usr/share/maven
Java version: 1.8.0_252, vendor: AdoptOpenJDK, runtime:
/usr/lib/jvm/adoptopenjdk-8-hotspot-amd64/jre
Default locale: en_US, platform encoding: UTF-8
OS name: "linux", version: "5.4.0-33-generic", arch: "amd64", family: "unix"

I specified 3.8.0 and 3.8.1 and still got the following...with no useful
information...or...where do I look for useful information?

[INFO]
[INFO] Reactor Summary for PDFBox reactor 2.0.20-SNAPSHOT:
[INFO]
[INFO] PDFBox parent .. SUCCESS [  2.719 s]
[INFO] Apache FontBox . FAILURE [  1.304 s]
[INFO] Apache XmpBox .. SKIPPED
[INFO] Apache PDFBox .. SKIPPED
[INFO] Apache Preflight ... SKIPPED
[INFO] Apache Preflight application ... SKIPPED
[INFO] Apache PDFBox Debugger . SKIPPED
[INFO] Apache PDFBox tools  SKIPPED
[INFO] Apache PDFBox application .. SKIPPED
[INFO] Apache PDFBox Debugger application . SKIPPED
[INFO] Apache PDFBox examples . SKIPPED
[INFO] PDFBox reactor . SKIPPED
[INFO]
[INFO] BUILD FAILURE
[INFO]
[INFO] Total time:  5.190 s
[INFO] Finished at: 2020-06-01T14:21:11-04:00
[INFO]
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.8.1:compile (default-compile) on project fontbox: Compilation failure -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
[ERROR]
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR]   mvn  -rf :fontbox
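
On the question of where to look: one possible approach, sketched here without any claim about what the compiler plugin actually prints in this case, is to save the full `-e -X` output to a file and filter it for lines that mention the compiler executable or an error. The function name `compiler_clues` and the log file name `build.log` are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: list the lines of a saved Maven log that mention
# javac, an "executable" setting, or an [ERROR] marker, with line numbers.
compiler_clues() {
    grep -n -i -E 'javac|executable|\[ERROR\]' "$1"
}

# Usage (build.log is an assumed file name):
#   mvn -e -X clean install -rf :fontbox > build.log 2>&1
#   compiler_clues build.log
```

The filter is deliberately broad, so it narrows a very long debug log rather than pinpointing the cause.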

On Mon, Jun 1, 2020 at 1:07 PM Tilman Hausherr 
wrote:

> On 01.06.2020 at 18:51, Tim Allison wrote:
> > I'm having problems building...likely user error.
> >
> > On ubuntu with Java 8, maven 3.6.3, `mvn clean install` on the
> commandline:
> > fontbox fails to build "maven-compiler-plugin" Compilation failure with
> no
> > warnings or info on what failed even with -e -X.
>
> Which java 8? I remember this came with early jdk8 versions, or when
> java wasn't there.
>
> Another possibility would be to update the maven-compiler-plugin to 3.8.0.
>
>
> >
> >
> > On a mac with Java 8, I'm getting test failures testFlattenPDFBOX563()
> and
> > testFlattenPDFBOX2469Filled().  If I disable those tests, I am able to
> > build it.
>
>
> I've disabled them now, likely small rendering differences.
>
> Tilman
>
> >
> > All is good for now, but weird...
> >
> > On Sat, May 30, 2020 at 7:37 AM Andreas Lehmkuehler 
> > wrote:
> >
> >> @Tim
> >> Cool, yes, please! I'm going to postpone the release for a couple of
> days
> >> depending on the results.
> >>
> >> Thanks in advance!
> >>
> >> Andreas
> >>
> >> On 30.05.20 at 12:59, Tim Allison wrote:
> >>> I can run the tests on Monday w results by the end of the day EDT if
> >>> desired.
> >>>
> >>> On Sat, May 30, 2020 at 5:53 AM Andreas Lehmkuehler 
> >>> wrote:
> >>>
> >>>> Hi,
> >>>>
> >>>> I just realized that we didn't run Tim's tests yet.  I've had a look at the
> >>>> tickets and most of them are not related to text extraction. The remaining
> >>>> are mostly dealing with corner cases so that we should be safe.
> >>>>
> >>>> WDYT, are we safe enough or do we need to run the tests before cutting the
> >>>> release?
> >>>>
> >>>> Andreas
> >>>>
> >>>> On 26.05.20 at 08:02, Andreas Lehmkuehler wrote:
> >>>>> I'm planning to cut the release on next Monday 1st of June.
> >>>>>
> >>>>> Andreas
> >>>>>
> >>>>> On 20.05.20 at 08:15, Andreas Lehmkuehler wrote:
> >>>>>> Hi,
> >>>>>>
> >>>>>> how about cutting a 2.0.20 release in 2 or 3 weeks from now?
> >>>>>>
> >>>>>> Andreas
> >>>>>>
> >>>>>> -
> >>>>>> To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
> >>>>>> For additional commands, e-mail: dev-h...@pdfbox.apache.org

Re: Release 2.0.20 ?

2020-06-01 Thread Tilman Hausherr

On 01.06.2020 at 18:51, Tim Allison wrote:
> I'm having problems building...likely user error.
>
> On ubuntu with Java 8, maven 3.6.3, `mvn clean install` on the commandline:
> fontbox fails to build "maven-compiler-plugin" Compilation failure with no
> warnings or info on what failed even with -e -X.

Which java 8? I remember this came with early jdk8 versions, or when
java wasn't there.

Another possibility would be to update the maven-compiler-plugin to 3.8.0.

> On a mac with Java 8, I'm getting test failures testFlattenPDFBOX563() and
> testFlattenPDFBOX2469Filled().  If I disable those tests, I am able to
> build it.

I've disabled them now, likely small rendering differences.

Tilman

> All is good for now, but weird...
>
> On Sat, May 30, 2020 at 7:37 AM Andreas Lehmkuehler
> wrote:
>
>> @Tim
>> Cool, yes, please! I'm going to postpone the release for a couple of days
>> depending on the results.
>>
>> Thanks in advance!
>>
>> Andreas
>>
>> On 30.05.20 at 12:59, Tim Allison wrote:
>>> I can run the tests on Monday w results by the end of the day EDT if
>>> desired.
>>>
>>> On Sat, May 30, 2020 at 5:53 AM Andreas Lehmkuehler
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I just realized that we didn't run Tim's tests yet.  I've had a look at the
>>>> tickets and most of them are not related to text extraction. The remaining
>>>> are mostly dealing with corner cases so that we should be safe.
>>>>
>>>> WDYT, are we safe enough or do we need to run the tests before cutting the
>>>> release?
>>>>
>>>> Andreas
>>>>
>>>> On 26.05.20 at 08:02, Andreas Lehmkuehler wrote:
>>>>> I'm planning to cut the release on next Monday 1st of June.
>>>>>
>>>>> Andreas
>>>>>
>>>>> On 20.05.20 at 08:15, Andreas Lehmkuehler wrote:
>>>>>> Hi,
>>>>>>
>>>>>> how about cutting a 2.0.20 release in 2 or 3 weeks from now?
>>>>>>
>>>>>> Andreas



Re: Release 2.0.20 ?

2020-06-01 Thread Tim Allison
I'm having problems building...likely user error.

On ubuntu with Java 8, maven 3.6.3, `mvn clean install` on the commandline:
fontbox fails to build "maven-compiler-plugin" Compilation failure with no
warnings or info on what failed even with -e -X.


On a mac with Java 8, I'm getting test failures testFlattenPDFBOX563() and
testFlattenPDFBOX2469Filled().  If I disable those tests, I am able to
build it.

All is good for now, but weird...

On Sat, May 30, 2020 at 7:37 AM Andreas Lehmkuehler 
wrote:

> @Tim
> Cool, yes, please! I'm going to postpone the release for a couple of days
> depending on the results.
>
> Thanks in advance!
>
> Andreas
>
> On 30.05.20 at 12:59, Tim Allison wrote:
> > I can run the tests on Monday w results by the end of the day EDT if
> > desired.
> >
> > On Sat, May 30, 2020 at 5:53 AM Andreas Lehmkuehler 
> > wrote:
> >
> >> Hi,
> >>
> >> I just realized that we didn't run Tim's tests yet.  I've had a look at the
> >> tickets and most of them are not related to text extraction. The remaining
> >> are mostly dealing with corner cases so that we should be safe.
> >>
> >> WDYT, are we safe enough or do we need to run the tests before cutting the
> >> release?
> >>
> >> Andreas
> >>
> >> On 26.05.20 at 08:02, Andreas Lehmkuehler wrote:
> >>> I'm planning to cut the release on next Monday 1st of June.
> >>>
> >>> Andreas
> >>>
> >>> On 20.05.20 at 08:15, Andreas Lehmkuehler wrote:
> >>>> Hi,
> >>>>
> >>>> how about cutting a 2.0.20 release in 2 or 3 weeks from now?
> >>>>
> >>>> Andreas


Re: Release 2.0.20 ?

2020-05-30 Thread Andreas Lehmkuehler

@Tim
Cool, yes, please! I'm going to postpone the release for a couple of days 
depending on the results.


Thanks in advance!

Andreas

On 30.05.20 at 12:59, Tim Allison wrote:
> I can run the tests on Monday w results by the end of the day EDT if
> desired.
>
> On Sat, May 30, 2020 at 5:53 AM Andreas Lehmkuehler
> wrote:
>
>> Hi,
>>
>> I just realized that we didn't run Tim's tests yet.  I've had a look at the
>> tickets and most of them are not related to text extraction. The remaining
>> are mostly dealing with corner cases so that we should be safe.
>>
>> WDYT, are we safe enough or do we need to run the tests before cutting the
>> release?
>>
>> Andreas
>>
>> On 26.05.20 at 08:02, Andreas Lehmkuehler wrote:
>>> I'm planning to cut the release on next Monday 1st of June.
>>>
>>> Andreas
>>>
>>> On 20.05.20 at 08:15, Andreas Lehmkuehler wrote:
>>>> Hi,
>>>>
>>>> how about cutting a 2.0.20 release in 2 or 3 weeks from now?
>>>>
>>>> Andreas




Re: Release 2.0.20 ?

2020-05-30 Thread Tim Allison
I can run the tests on Monday w results by the end of the day EDT if
desired.

On Sat, May 30, 2020 at 5:53 AM Andreas Lehmkuehler 
wrote:

> Hi,
>
> I just realized that we didn't run Tim's tests yet.  I've had a look at the
> tickets and most of them are not related to text extraction. The remaining
> are mostly dealing with corner cases so that we should be safe.
>
> WDYT, are we safe enough or do we need to run the tests before cutting the
> release?
>
> Andreas
>
> On 26.05.20 at 08:02, Andreas Lehmkuehler wrote:
> > I'm planning to cut the release on next Monday 1st of June.
> >
> > Andreas
> >
> > On 20.05.20 at 08:15, Andreas Lehmkuehler wrote:
> >> Hi,
> >>
> >> how about cutting a 2.0.20 release in 2 or 3 weeks from now?
> >>
> >> Andreas
> >>


Re: Release 2.0.20 ?

2020-05-30 Thread Andreas Lehmkuehler

Hi,

I just realized that we didn't run Tim's tests yet.  I've had a look at the
tickets and most of them are not related to text extraction. The remaining are
mostly dealing with corner cases so that we should be safe.

WDYT, are we safe enough or do we need to run the tests before cutting the
release?

Andreas

On 26.05.20 at 08:02, Andreas Lehmkuehler wrote:
> I'm planning to cut the release on next Monday 1st of June.
>
> Andreas
>
> On 20.05.20 at 08:15, Andreas Lehmkuehler wrote:
>> Hi,
>>
>> how about cutting a 2.0.20 release in 2 or 3 weeks from now?
>>
>> Andreas




Re: Release 2.0.20 ?

2020-05-26 Thread Andreas Lehmkuehler

I'm planning to cut the release on next Monday 1st of June.

Andreas

On 20.05.20 at 08:15, Andreas Lehmkuehler wrote:
> Hi,
>
> how about cutting a 2.0.20 release in 2 or 3 weeks from now?
>
> Andreas




Re: Release 2.0.20 ?

2020-05-20 Thread Maruan Sahyoun
+1

BR Maruan 
> Hi,
> 
> how about cutting a 2.0.20 release in 2 or 3 weeks from now?
> 
> Andreas
> 
-- 
Maruan Sahyoun

FileAffairs GmbH
Josef-Schappe-Straße 21
40882 Ratingen

Tel: +49 (2102) 89497 88
Fax: +49 (2102) 89497 91
sahy...@fileaffairs.de
www.fileaffairs.de

Geschäftsführer: Maruan Sahyoun
Handelsregister: AG Düsseldorf, HRB 53837
UST.-ID: DE248275827





Re: Release 2.0.20 ?

2020-05-20 Thread Tilman Hausherr

+1

Tilman

On 20.05.2020 at 08:15, Andreas Lehmkuehler wrote:
> Hi,
>
> how about cutting a 2.0.20 release in 2 or 3 weeks from now?
>
> Andreas




Release 2.0.20 ?

2020-05-20 Thread Andreas Lehmkuehler

Hi,

how about cutting a 2.0.20 release in 2 or 3 weeks from now?

Andreas
