Re: New file vm WAS: Re: Release 2.0.20 ?

2020-06-02 Thread Andreas Lehmkuehler
On 02.06.20 at 23:29, Tim Allison wrote:
> https://issues.apache.org/jira/browse/INFRA-20372 On Slack, Gavin suggested something like corpora.tika.apache.org. I'm happy with corpora.pdfbox.apache.org or anything else. Please let us know what you think over on that ticket.
IMHO it should be

Re: New file vm WAS: Re: Release 2.0.20 ?

2020-06-02 Thread Andreas Lehmkuehler
On 02.06.20 at 22:20, Maruan Sahyoun wrote:
> > Maruan, To confirm, you're ok if we grant access to the server to our colleagues on Tika and POI?
> To be clear - my company is only sponsoring the box. It's the project's decision who needs access, not mine. So feel free.
Thanks a lot Maruan!

Re: New file vm WAS: Re: Release 2.0.20 ?

2020-06-02 Thread Maruan Sahyoun
> I'm rsync'ing the data over now. I probably won't get around to setting up httpd this week, but if anyone else wants to take it, go for it. This will at least get team members access to the files asap.
I can take care of httpd but would prefer to wait until the subdomain/cert is done

[jira] [Commented] (PDFBOX-4856) Can't read the embedded Type1C font

2020-06-02 Thread Tilman Hausherr (Jira)
[ https://issues.apache.org/jira/browse/PDFBOX-4856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17124582#comment-17124582 ] Tilman Hausherr commented on PDFBOX-4856: It is difficult, often impossible to do anything when

[jira] [Created] (PDFBOX-4856) Can't read the embedded Type1C font

2020-06-02 Thread Dushyanth Balasubramanian (Jira)
Dushyanth Balasubramanian created PDFBOX-4856:
Summary: Can't read the embedded Type1C font
Key: PDFBOX-4856
URL: https://issues.apache.org/jira/browse/PDFBOX-4856
Project: PDFBox

[jira] [Commented] (PDFBOX-3654) Parse error reading embedded Type1 font

2020-06-02 Thread Dushyanth Balasubramanian (Jira)
[ https://issues.apache.org/jira/browse/PDFBOX-3654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17124481#comment-17124481 ] Dushyanth Balasubramanian commented on PDFBOX-3654: Thanks [~lehmi] Let me do that.

Re: New file vm WAS: Re: Release 2.0.20 ?

2020-06-02 Thread Tim Allison
https://issues.apache.org/jira/browse/INFRA-20372 On Slack, Gavin suggested something like corpora.tika.apache.org. I'm happy with corpora.pdfbox.apache.org or anything else. Please let us know what you think over on that ticket. Thank you, again! Cheers, Tim On Tue, Jun 2,

Re: New file vm WAS: Re: Release 2.0.20 ?

2020-06-02 Thread Tim Allison
I'm rsync'ing the data over now. I probably won't get around to setting up httpd this week, but if anyone else wants to take it, go for it. This will at least get team members access to the files asap. I've disabled login via password. If anyone feels that I'm doing something wrong, please let

Re: New file vm WAS: Re: Release 2.0.20 ?

2020-06-02 Thread Maruan Sahyoun
> Maruan,
> To confirm, you're ok if we grant access to the server to our colleagues on Tika and POI?
To be clear - my company is only sponsoring the box. It's the project's decision who needs access, not mine. So feel free.
BR Maruan
> Again, wow, THANK YOU!
> Best,

Re: New file vm WAS: Re: Release 2.0.20 ?

2020-06-02 Thread Tim Allison
Maruan, To confirm, you're ok if we grant access to the server to our colleagues on Tika and POI? Again, wow, THANK YOU! Best, Tim On Tue, Jun 2, 2020 at 3:57 PM Tim Allison wrote: > >proper domain for https access > > I just pinged infra on slack.

New file vm WAS: Re: Release 2.0.20 ?

2020-06-02 Thread Tim Allison
> proper domain for https access
I just pinged infra on slack. If they're able to do it, what would we want?
file-corpora.apache.org
corpora.apache.org
corpora-pdfbox.apache.org
corpora-tika.apache.org
Something else?
I'm also happy to buy a domain if that won't work. There are a couple

[jira] [Commented] (PDFBOX-4071) Improve code quality (3)

2020-06-02 Thread ASF subversion and git services (Jira)
[ https://issues.apache.org/jira/browse/PDFBOX-4071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17124116#comment-17124116 ] ASF subversion and git services commented on PDFBOX-4071: Commit 1878402 from

Re: Release 2.0.20 ?

2020-06-02 Thread Tilman Hausherr
On 02.06.2020 at 19:24, Maruan Sahyoun wrote:
> Order placed. Once the server is available and the initial setup done I'll post here. Should be done by end of week depending on my other workload.
Thanks!! Tilman

Re: Release 2.0.20 ?

2020-06-02 Thread Maruan Sahyoun
> > AMD ryzen looks fantastic. Others would be great as well.
> > If ubuntu is possible at all, that's what I've been working with most recently.
> OK - will set up with that distro
> > Other than that, ssh access and sudo privileges would be all I'd need.
> > Are you ok

[jira] [Commented] (PDFBOX-4855) Add support for memory mapped file reading

2020-06-02 Thread Tilman Hausherr (Jira)
[ https://issues.apache.org/jira/browse/PDFBOX-4855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17124082#comment-17124082 ] Tilman Hausherr commented on PDFBOX-4855: Two tests in CCITTFactoryTest fail on Windows

Re: Release 2.0.20 ?

2020-06-02 Thread Tilman Hausherr
After checking two actual files (thanks Tim) I agree. The differences are minor and related to cases where it is difficult to get anything. Other differences are improvements. Tilman
On 02.06.2020 at 02:58, Tim Allison wrote:
> Reports are available here:

Re: Release 2.0.20 ?

2020-06-02 Thread Maruan Sahyoun
> AMD ryzen looks fantastic. Others would be great as well.
> If ubuntu is possible at all, that's what I've been working with most recently.
OK - will set up with that distro
> Other than that, ssh access and sudo privileges would be all I'd need.
> Are you ok if we set up apache

Re: Release 2.0.20 ?

2020-06-02 Thread Tim Allison
AMD ryzen looks fantastic. Others would be great as well. If ubuntu is possible at all, that's what I've been working with most recently. Other than that, ssh access and sudo privileges would be all I'd need. Are you ok if we set up apache httpd to host files for the public or will this be a

[jira] [Commented] (PDFBOX-4836) Reduce the usage of ScatchFileBuffer when parsing a pdf

2020-06-02 Thread ASF subversion and git services (Jira)
[ https://issues.apache.org/jira/browse/PDFBOX-4836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17123941#comment-17123941 ] ASF subversion and git services commented on PDFBOX-4836: Commit 1878398 from

[jira] [Commented] (PDFBOX-4836) Reduce the usage of ScatchFileBuffer when parsing a pdf

2020-06-02 Thread ASF subversion and git services (Jira)
[ https://issues.apache.org/jira/browse/PDFBOX-4836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17123907#comment-17123907 ] ASF subversion and git services commented on PDFBOX-4836: Commit 1878397 from

Re: Release 2.0.20 ?

2020-06-02 Thread Tim Allison
I'd be more than happy to help with maintenance. This would be AMAZING!
On Tue, Jun 2, 2020 at 9:22 AM Maruan Sahyoun wrote:
> Could fund either:
> AMD Ryzen 5 3600, 64 GB RAM, 2x2TB
> or
> AMD Ryzen 7 3700X based Server, 64 GB RAM, 2x8TB
> or
> Intel® Core™ i9-9900K, 64 GB RAM

Re: Release 2.0.20 ?

2020-06-02 Thread Maruan Sahyoun
Could fund either:
AMD Ryzen 5 3600, 64 GB RAM, 2x2TB
or
AMD Ryzen 7 3700X based Server, 64 GB RAM, 2x8TB
or
Intel® Core™ i9-9900K, 64 GB RAM, 2x8TB
All are root servers so one has to vote for taking care of them (I can do the initial setup).
BR Maruan
> There are two use cases.
> 1) host

Re: Release 2.0.20 ?

2020-06-02 Thread Tim Allison
There are two use cases.
1) host shared data so that we can all point to and work from the same data, ideally both literal docs and also extracts (text/metadata .json files representing extracted information).
2) a modest vm to allow all of us to run the regression tests
We could use help with

Re: Release 2.0.20 ?

2020-06-02 Thread Maruan Sahyoun
Is that a storage box only, or does it need to do some computing too? Maybe you could write a small spec for the server requirements?
BR Maruan
> Still haven’t had time to put the server in a dmz. Ugh.
> Yes, more than happy to share.
> If anyone has recommendations for file hosting for

Re: Release 2.0.20 ?

2020-06-02 Thread Tim Allison
Our commoncrawl slice+bugtrackers are currently 1 TB, govdocs1 is another 0.5 TB. 2 TB would safely cover the source documents that we're currently using.
On Tue, Jun 2, 2020 at 6:08 AM Maruan Sahyoun wrote:
> How many TB would that be?
> > Still haven’t had time to put the server in a dmz.

Re: Release 2.0.20 ?

2020-06-02 Thread Maruan Sahyoun
How many TB would that be?
> Still haven’t had time to put the server in a dmz. Ugh.
> Yes, more than happy to share.
> If anyone has recommendations for file hosting for a couple of TB, let me know.
> One option would be to work with CommonCrawl to bump the max file size one crawl

Re: Release 2.0.20 ?

2020-06-02 Thread Tim Allison
Still haven’t had time to put the server in a dmz. Ugh. Yes, more than happy to share. If anyone has recommendations for file hosting for a couple of TB, let me know. One option would be to work with CommonCrawl to bump the max file size one crawl a year... On Tue, Jun 2, 2020 at 1:48 AM

[jira] [Commented] (PDFBOX-4848) Automate building website without local install

2020-06-02 Thread Andreas Lehmkühler (Jira)
[ https://issues.apache.org/jira/browse/PDFBOX-4848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17123551#comment-17123551 ] Andreas Lehmkühler commented on PDFBOX-4848: Sounds good to me > Automate building website

[jira] [Commented] (PDFBOX-3654) Parse error reading embedded Type1 font

2020-06-02 Thread Andreas Lehmkühler (Jira)
[ https://issues.apache.org/jira/browse/PDFBOX-3654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17123546#comment-17123546 ] Andreas Lehmkühler commented on PDFBOX-3654: [~dbalasub] Your issue is most likely not

[jira] [Commented] (PDFBOX-3654) Parse error reading embedded Type1 font

2020-06-02 Thread Dushyanth Balasubramanian (Jira)
[ https://issues.apache.org/jira/browse/PDFBOX-3654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17123482#comment-17123482 ] Dushyanth Balasubramanian commented on PDFBOX-3654: Looks like this issue is not