> I'd go for corpora.tika.apache.org too.
Infra ticket updated. Thank you, all!
On Wed, Jun 3, 2020 at 2:07 AM Maruan Sahyoun
wrote:
>
> > On 02.06.20 at 23:29, Tim Allison wrote:
> > > https://issues.apache.org/jira/browse/INFRA-20372
> > >
> > > On Slack, Gavin suggested something like corpo
> On 02.06.20 at 23:29, Tim Allison wrote:
> > https://issues.apache.org/jira/browse/INFRA-20372
> >
> > On Slack, Gavin suggested something like corpora.tika.apache.org. I'm
> > happy with corpora.pdfbox.apache.org or anything else. Please let us know
> > what you think over on that ticket.
> On 02.06.20 at 22:20, Maruan Sahyoun wrote:
> >
> > > Maruan,
> > > To confirm, you're ok if we grant access to the server to our
> > > colleagues
> > > on Tika and POI?
> >
> > To be clear: my company is only sponsoring the box. It's the project's
> > decision who needs access, not mi
Thanks Tim and Tilman,
it looks like we are good to go. I'm going to cut the release tomorrow evening
CEST.
Andreas
On 02.06.20 at 19:12, Tilman Hausherr wrote:
After checking two actual files (thanks Tim) I agree. The differences are minor
and related to cases where it is difficult to get a
On 02.06.20 at 23:29, Tim Allison wrote:
https://issues.apache.org/jira/browse/INFRA-20372
On Slack, Gavin suggested something like corpora.tika.apache.org. I'm
happy with corpora.pdfbox.apache.org or anything else. Please let us know
what you think over on that ticket.
IMHO it should be eith
On 02.06.20 at 22:20, Maruan Sahyoun wrote:
Maruan,
To confirm, you're ok if we grant access to the server to our colleagues
on Tika and POI?
To be clear: my company is only sponsoring the box. It's the project's decision
who needs access, not mine. So feel free.
Thanks a lot Maruan! Sh
> I'm rsync'ing the data over now. I probably won't get around to setting up
> httpd this week, but if anyone else wants to take it, go for it. This will
> at least get team members access to the files asap.
I can take care of httpd but would prefer to wait until the subdomain/cert is
done
https://issues.apache.org/jira/browse/INFRA-20372
On Slack, Gavin suggested something like corpora.tika.apache.org. I'm
happy with corpora.pdfbox.apache.org or anything else. Please let us know
what you think over on that ticket.
Thank you, again!
Cheers,
Tim
On Tue, Jun 2, 2020
I'm rsync'ing the data over now. I probably won't get around to setting up
httpd this week, but if anyone else wants to take it, go for it. This will
at least get team members access to the files asap.
I've disabled login via password.
If anyone feels that I'm doing something wrong, please let
> Maruan,
> To confirm, you're ok if we grant access to the server to our colleagues
> on Tika and POI?
To be clear: my company is only sponsoring the box. It's the project's decision
who needs access, not mine. So feel free.
BR
Maruan
> Again, wow, THANK YOU!
>
> Best,
>
Maruan,
To confirm, you're ok if we grant access to the server to our colleagues
on Tika and POI?
Again, wow, THANK YOU!
Best,
Tim
On Tue, Jun 2, 2020 at 3:57 PM Tim Allison wrote:
> > proper domain for https access
>
> I just pinged infra on slack.
> proper domain for https access
I just pinged infra on slack.
If they're able to do it, what would we want?
file-corpora.apache.org
corpora.apache.org
corpora-pdfbox.apache.org
corpora-tika.apache.org
Something else? I'm also happy to buy a domain if that won't work. There
are a couple availa
On 02.06.2020 at 19:24, Maruan Sahyoun wrote:
Order placed. Once the server is available and the initial setup done I'll post
here. Should be done by end of week depending on
my other workload.
Thanks!!
Tilman
-
To unsubs
>
> > AMD ryzen looks fantastic. Others would be great as well.
> >
> > If ubuntu is possible at all, that's what I've been working with most
> > recently.
>
> OK - will setup with that distro
>
> > Other than that, ssh access and sudo privileges would be all I'd need.
> >
> > Are you ok i
After checking two actual files (thanks Tim) I agree. The differences
are minor and related to cases where it is difficult to get anything.
Other differences are improvements.
Tilman
On 02.06.2020 at 02:58, Tim Allison wrote:
Reports are available here:
https://github.com/tballison/share/
> AMD ryzen looks fantastic. Others would be great as well.
>
> If ubuntu is possible at all, that's what I've been working with most
> recently.
OK - will setup with that distro
>
> Other than that, ssh access and sudo privileges would be all I'd need.
>
> Are you ok if we set up apache ht
AMD ryzen looks fantastic. Others would be great as well.
If ubuntu is possible at all, that's what I've been working with most
recently.
Other than that, ssh access and sudo privileges would be all I'd need.
Are you ok if we set up apache httpd to host files for the public or will
this be a co
I'd be more than happy to help with maintenance. This would be AMAZING!
On Tue, Jun 2, 2020 at 9:22 AM Maruan Sahyoun
wrote:
> Could fund either:
>
> AMD Ryzen 5 3600
> 64 GB RAM
> 2x2TB
>
> or
>
> AMD Ryzen 7 3700X based Server
> 64 GB RAM
> 2x8TB
>
> or
> Intel® Core™ i9-9900K
> 64 GB RAM
> 2
Could fund either:
AMD Ryzen 5 3600
64 GB RAM
2x2TB
or
AMD Ryzen 7 3700X based Server
64 GB RAM
2x8TB
or
Intel® Core™ i9-9900K
64 GB RAM
2x8TB
All are root servers so one has to vote for taking care of them (I can do the
initial setup).
BR
Maruan
> There are two use cases.
>
> 1) host
There are two use cases.
1) host shared data so that we can all point to and work from the same
data, ideally both literal docs and also extracts (text/metadata .json
files representing extracted information).
2) a modest vm to allow all of us to run the regression tests
We could use help with e
Is that a storage box only, or does it need to do some computing too?
Maybe you could write a small spec for the server requirements?
BR
Maruan
> Still haven’t had time to put the server in a dmz. Ugh.
>
> Yes, more than happy to share.
>
> If anyone has recommendations for file hosting for
Our commoncrawl slice+bugtrackers are currently 1 TB, govdocs1 is another
.5 TB.
2 TB would safely cover the source documents that we're currently using.
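The sizing above can be checked with a few lines of arithmetic. This is only a sketch using the figures quoted in this message; the function name is made up for illustration, and it ignores extracts and filesystem overhead.

```python
def storage_headroom_tb(corpus_sizes_tb, capacity_tb):
    """Return (total, headroom) in TB for the given corpora and capacity."""
    total = sum(corpus_sizes_tb.values())
    return total, capacity_tb - total


# Sizes in TB as quoted in the message; 2 TB is the capacity being considered.
sizes = {"commoncrawl slice + bugtrackers": 1.0, "govdocs1": 0.5}
total, headroom = storage_headroom_tb(sizes, capacity_tb=2.0)
print(f"source documents: {total} TB, headroom at 2 TB: {headroom} TB")
```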
On Tue, Jun 2, 2020 at 6:08 AM Maruan Sahyoun
wrote:
> How many TB would that be?
>
> > Still haven’t had time to put the server in a dmz.
How many TB would that be?
> Still haven’t had time to put the server in a dmz. Ugh.
>
> Yes, more than happy to share.
>
> If anyone has recommendations for file hosting for a couple of TB, let me
> know.
>
> One option would be to work with CommonCrawl to bump the max file size one
> crawl
Still haven’t had time to put the server in a dmz. Ugh.
Yes, more than happy to share.
If anyone has recommendations for file hosting for a couple of TB, let me
know.
One option would be to work with CommonCrawl to bump the max file size one
crawl a year...
On Tue, Jun 2, 2020 at 1:48 AM Tilma
Can we / I access these files? Most differences are improvements or not
meaningful, but there are a few I'd like to have a look at, e.g.
commoncrawl3/commoncrawl3/XO/XOAAGISRMRPZQRZF4LSMJERGEYK5QI2T
the word "antrag" loses the first "a". Although maybe the "a" was a big
one and gets assigned to a
>
>
>> Reports are available here:
https://github.com/tballison/share/blob/master/tika_comparisons/reports-pdfbox-2.0.20.tgz
Looks like there are trivial differences in content with a slight
improvement over 2.0.19. I don't see any differences in exceptions or
attachments.
Cheers,
Tim
Got it. Thank you. That makes good sense. Onwards!
On Mon, Jun 1, 2020 at 3:51 PM Tilman Hausherr
wrote:
> Yes, we use this to test compiling on jenkins with a jdk6 system while
> running the build on a jdk8 system, at the request of Simon Steiner.
>
> It worked fine for a long time, although
Yes, we use this to test compiling on jenkins with a jdk6 system while
running the build on a jdk8 system, at the request of Simon Steiner.
It worked fine for a long time, although currently it doesn't, because
it works only with some maven versions.
Tilman
On 01.06.2020 at 21:01, Ti
Do we need this line? Are we getting benefit from specifying a different
jdk via JAVA_HOME that we wouldn't get from letting maven use the active
"alternative"?
Anyways, all good. Sorry for the noise.
On Mon, Jun 1, 2020 at 2:43 PM Tim Allison wrote:
> User error...I recently reimaged my lapt
User error...I recently reimaged my laptop and forgot to set JAVA_HOME,
which needs to be picked up here:
${env.JAVA_HOME}
for use here:
${jdk.path}/bin/javac
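A quick way to catch this before starting a build is to check that JAVA_HOME resolves to a compiler. The sketch below is a hypothetical helper, not part of the PDFBox build; `find_javac` is an invented name, and it assumes only what is described above, namely that the build expects a `javac` under `$JAVA_HOME/bin`.

```python
import os
from pathlib import Path


def find_javac(env=None):
    """Return the javac path under JAVA_HOME, or None if it cannot be resolved."""
    env = os.environ if env is None else env
    java_home = env.get("JAVA_HOME")
    if not java_home:
        # JAVA_HOME unset or empty: the situation described in the message.
        return None
    name = "javac.exe" if os.name == "nt" else "javac"
    javac = Path(java_home) / "bin" / name
    return javac if javac.is_file() else None


if __name__ == "__main__":
    javac = find_javac()
    if javac is None:
        print("JAVA_HOME is unset or has no bin/javac")
    else:
        print(f"compiler found at: {javac}")
```

Run it in the same shell that will run `mvn`, since JAVA_HOME is read from that shell's environment.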
On Mon, Jun 1, 2020 at 2:37 PM Tim Allison wrote:
> I get the same behavior (compiler failing at fontbox) with Java 11 on
> ubuntu.
I get the same behavior (compiler failing at fontbox) with Java 11 on
ubuntu. I'm sure this is user error, but it is weird.
Apache Maven 3.6.3
Maven home: /usr/share/maven
Java version: 11.0.7, vendor: AdoptOpenJDK, runtime:
/usr/lib/jvm/adoptopenjdk-11-hotspot-amd64
Default locale: en_US, platfo
Thank you, Tilman.
Apache Maven 3.6.3
Maven home: /usr/share/maven
Java version: 1.8.0_252, vendor: AdoptOpenJDK, runtime:
/usr/lib/jvm/adoptopenjdk-8-hotspot-amd64/jre
Default locale: en_US, platform encoding: UTF-8
OS name: "linux", version: "5.4.0-33-generic", arch: "amd64", family: "unix"
I s
On 01.06.2020 at 18:51, Tim Allison wrote:
I'm having problems building...likely user error.
On ubuntu with Java 8, maven 3.6.3, `mvn clean install` on the command line:
fontbox fails to build with a "maven-compiler-plugin" compilation failure, with
no warnings or info on what failed even with -e -X.
I'm having problems building...likely user error.
On ubuntu with Java 8, maven 3.6.3, `mvn clean install` on the command line:
fontbox fails to build with a "maven-compiler-plugin" compilation failure, with
no warnings or info on what failed even with -e -X.
On a mac with Java 8, I'm getting test failure
@Tim
Cool, yes, please! I'm going to postpone the release for a couple of days
depending on the results.
Thanks in advance!
Andreas
On 30.05.20 at 12:59, Tim Allison wrote:
I can run the tests on Monday w results by the end of the day EDT if
desired.
On Sat, May 30, 2020 at 5:53 AM Andreas
I can run the tests on Monday w results by the end of the day EDT if
desired.
On Sat, May 30, 2020 at 5:53 AM Andreas Lehmkuehler
wrote:
> Hi,
>
> I just realized that we didn't run Tim's tests yet. I've had a look at the
> tickets and most of them are not related to text extraction. The remaini
Hi,
I just realized that we didn't run Tim's tests yet. I've had a look at the
tickets and most of them are not related to text extraction. The remaining ones
mostly deal with corner cases, so we should be safe.
WDYT, are we safe enough or do we need to run the tests before cutting the
I'm planning to cut the release next Monday, the 1st of June.
Andreas
On 20.05.20 at 08:15, Andreas Lehmkuehler wrote:
Hi,
how about cutting a 2.0.20 release in 2 or 3 weeks from now?
Andreas
+1
BR Maruan
> Hi,
>
> how about cutting a 2.0.20 release in 2 or 3 weeks from now?
>
> Andreas
+1
Tilman
On 20.05.2020 at 08:15, Andreas Lehmkuehler wrote:
Hi,
how about cutting a 2.0.20 release in 2 or 3 weeks from now?
Andreas
Hi,
how about cutting a 2.0.20 release in 2 or 3 weeks from now?
Andreas
-
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org