Re: [VOTE] Release Apache Tika 3.0.0 Candidate #1

2024-10-16 Thread Oleg Tikhonov
+1
Oleg

On Wed, 16 Oct 2024 at 16:07 Tilman Hausherr  wrote:

> +1
>
> Tilman
>
> On 16.10.2024 13:24, Tim Allison wrote:
> > A candidate for the Tika 3.0.0 release is available at:
> > https://dist.apache.org/repos/dist/dev/tika/3.0.0
> >
> > The release candidate is a zip archive of the sources in:
> > https://github.com/apache/tika/tree/3.0.0-rc1/
> >
> > The SHA-512 checksum of the archive is
> >
> c5eb92bc895d96492b2d2577d14df6187e46ab7c8a9f64aaf19d4f140f07caf1223d073c2cbb47b5519bb952eee50f39563004b8ad49906f45dffc9b6df74350.
> >
> > In addition, a staged maven repository is available here:
> >
> https://repository.apache.org/content/repositories/orgapachetika-1107/org/apache/tika
> >
> > Please vote on releasing this package as Apache Tika 3.0.0.
> > The vote is open for the next 72 hours and passes if a majority of at
> > least three +1 Tika PMC votes are cast.
> >
> > [ ] +1 Release this package as Apache Tika 3.0.0
> > [ ] -1 Do not release this package because...
> >
> >
> > Here's my +1.
> >
> > Thank you, all!
> >
> > Best,
> >
> >   Tim
>
>
>
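
A minimal sketch of how a voter might check a downloaded archive against the SHA-512 value quoted in the announcement above. The announcement names only the directory, not the archive file, so any file name passed in is the caller's assumption. The helper names are hypothetical. The full stop after the digest in the mail is sentence punctuation and is stripped before comparing.

```python
import hashlib


def sha512_of_file(path, chunk_size=1 << 20):
    """Return the lowercase hex SHA-512 digest of the file at `path`."""
    digest = hashlib.sha512()
    with open(path, "rb") as handle:
        # Read in chunks so a large source archive is never held in memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def normalize_announced_digest(text):
    """Clean a digest copied from the mail: drop quote markers, whitespace and the trailing full stop."""
    return text.strip().lstrip("> ").rstrip(".").strip().lower()


def matches_announced_digest(path, announced):
    """True if the file's SHA-512 equals the digest quoted in the vote mail."""
    return sha512_of_file(path) == normalize_announced_digest(announced)
```

A match only shows the download is the file the release manager hashed; the GPG signature check mentioned by other voters is a separate step.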


Re: [VOTE] Release Apache Tika 2.9.2 Candidate #2

2024-04-01 Thread Oleg Tikhonov
+1,
Thanks.

On Mon, 1 Apr 2024 at 23:36 Tim Allison  wrote:

> Any fellow devs able to vote? We need one more vote. Thank you!
>
> On Tue, Mar 26, 2024 at 12:22 PM Tilman Hausherr 
> wrote:
>
> > +1
> >
> > successful build on Windows 10, oracle jdk 1.8.0_391
> >
> > Tilman
> >
> > On 26.03.2024 16:52, Tim Allison wrote:
> > > A candidate for the Tika 2.9.2 release is available at:
> > > https://dist.apache.org/repos/dist/dev/tika/2.9.2
> > >
> > > The release candidate is a zip archive of the sources in:
> > > https://github.com/apache/tika/tree/2.9.2-rc2/
> > >
> > > The SHA-512 checksum of the archive is
> > >
> >
> 5ac7b981aa89d44e177dfb457d6f6b73dd54d43641da31e76b3e8bd9dbc236b9d2e6f6958d9182f36cbee6409293f3f21421f9c89837f693f5e10f997e9b063c.
> > >
> > > In addition, a staged maven repository is available here:
> > >
> >
> https://repository.apache.org/content/repositories/orgapachetika-1099/org/apache/tika
> > >
> > > Please vote on releasing this package as Apache Tika 2.9.2.
> > > The vote is open for the next 72 hours and passes if a majority of at
> > > least three +1 Tika PMC votes are cast.
> > >
> > > [ ] +1 Release this package as Apache Tika 2.9.2
> > > [ ] -1 Do not release this package because...
> > >
> > > Here's my +1
> > >
> > > Best,
> > >
> > >Tim
> >
> >
> >
>
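
The voting rule quoted above ("open for the next 72 hours and passes if a majority of at least three +1 Tika PMC votes are cast") can be sketched as follows. This is one reading of the rule, not an official definition: it assumes only binding PMC votes are counted, that "majority" means more +1 than -1, and that the 72 hours is a minimum window (the reminder sent after the window in this thread suggests votes are still accepted later). All names are hypothetical.

```python
from datetime import datetime, timedelta

VOTE_WINDOW = timedelta(hours=72)   # minimum time the vote stays open
MIN_BINDING_PLUS_ONES = 3           # "at least three +1 Tika PMC votes"


def earliest_close(announced_at):
    """Earliest moment the vote can be closed, given the announcement time."""
    return announced_at + VOTE_WINDOW


def vote_passes(binding_plus_ones, binding_minus_ones):
    """Assumed rule: at least three binding +1s and more +1s than -1s."""
    return (binding_plus_ones >= MIN_BINDING_PLUS_ONES
            and binding_plus_ones > binding_minus_ones)


def votes_still_needed(binding_plus_ones):
    """How many more binding +1s are required to reach the minimum."""
    return max(0, MIN_BINDING_PLUS_ONES - binding_plus_ones)
```

Applied to this thread, and assuming the two +1s visible before the reminder (Tim's and Tilman's) were both binding, `votes_still_needed(2)` gives 1, which is consistent with "We need one more vote".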


Re: [VOTE] Release Apache Tika 2.9.1 Candidate #1

2023-10-18 Thread Oleg Tikhonov
+1
Jdk 8 and 11, ubuntu 20


On Tue, 17 Oct 2023 at 21:05 Tilman Hausherr  wrote:

> +1
>
> successful build on german windows on jdk 11.0.20
>
> Tilman
>
> On 17.10.2023 13:13, Tim Allison wrote:
> > A candidate for the Tika 2.9.1 release is available at:
> > https://dist.apache.org/repos/dist/dev/tika/2.9.1
> >
> > The release candidate is a zip archive of the sources in:
> > https://github.com/apache/tika/tree/2.9.1-rc1
> >
> > The SHA-512 checksum of the archive is
> >
> ba13a0d22994ca84cccd9ad2931e099051870d46a5a3440258f93bd63f6e3b03de51709c51cf0e4029e57ba9c44cdb243ac440d76e695dfc081dfd9d956d8777.
> >
> > In addition, a staged maven repository is available here:
> >
> https://repository.apache.org/content/repositories/orgapachetika-1096/org/apache/tika
> >
> > Please vote on releasing this package as Apache Tika 2.9.1.
> > The vote is open for the next 72 hours and passes if a majority of at
> > least three +1 Tika PMC votes are cast.
> >
> > [ ] +1 Release this package as Apache Tika 2.9.1
> > [ ] -1 Do not release this package because...
> >
> > Best,
> >   Tim
> >
>
>


Re: [VOTE] Release Apache Tika 2.9.0 Candidate #1

2023-08-25 Thread Oleg Tikhonov
Here is my +1
Thanks

On Fri, 25 Aug 2023 at 11:48 Konstantin Gribov  wrote:

> Hi, folks.
>
> Built successfully on ArchLinux, OpenJdk 11 & 17 (Temurin 11.0.20+8 and
>  17.0.8+7) with tesseract 5.3.2 and leptonica 1.83.1.
>
> SHA512 and GPG signatures look fine to me.
>
> [x] +1 Release this package as Apache Tika 2.9.0
> [ ] -1 Do not release this package because...
>
> --
> Best regards,
> Konstantin Gribov.
>
>
> On Wed, Aug 23, 2023 at 5:50 PM Tim Allison  wrote:
>
> > A candidate for the Tika 2.9.0 release is available at:
> > https://dist.apache.org/repos/dist/dev/tika/2.9.0
> >
> > The release candidate is a zip archive of the sources in:
> > https://github.com/apache/tika/tree/2.9.0-rc1/
> >
> > The SHA-512 checksum of the archive is
> >
> >
> 4b54172163a2e86b805e7077b11d21902dc2137a849eb0d58ca06a904a91007ed14ac78ee8266531ff62cd666059409d728b679c571304c7b672c6446d9c5a15.
> >
> > In addition, a staged maven repository is available here:
> >
> >
> https://repository.apache.org/content/repositories/orgapachetika-1095/org/apache/tika
> >
> > Please vote on releasing this package as Apache Tika 2.9.0.
> > The vote is open for the next 72 hours and passes if a majority of at
> > least three +1 Tika PMC votes are cast.
> >
> > [ ] +1 Release this package as Apache Tika 2.9.0
> > [ ] -1 Do not release this package because...
> >
>


Re: [VOTE] Release Apache Tika 2.8.0 Candidate #2

2023-05-13 Thread Oleg Tikhonov
[x] +1 Release this package as Apache Tika 2.8.0
Ubuntu 20.04, open jdk 11, basic stuff.


On Sat, 13 May 2023 at 18:03 Tilman Hausherr  wrote:

> +1
>
> Successful build on latest oracle jdk8 on german windows.
>
> Tilman
>
> On 11.05.2023 22:07, Tim Allison wrote:
> > A candidate for the Tika 2.8.0 release is available at:
> > https://dist.apache.org/repos/dist/dev/tika/2.8.0
> >
> > The release candidate is a zip archive of the sources in:
> > https://github.com/apache/tika/tree/2.8.0-rc2/
> >
> > The SHA-512 checksum of the archive is
> >
> b39d485c8046019fb9319d7d76c68d14b8494dea25619209058244cb567d0c51e6c243ca2a478d611e079ed47d64294c82cf9475889f23cd73cbba13ee4e6cd9.
> >
> > In addition, a staged maven repository is available here:
> >
> https://repository.apache.org/content/repositories/orgapachetika-1094/org/apache/tika
> >
> > Please vote on releasing this package as Apache Tika 2.8.0.
> > The vote is open for the next 72 hours and passes if a majority of at
> > least three +1 Tika PMC votes are cast.
> >
> > [ ] +1 Release this package as Apache Tika 2.8.0
> > [ ] -1 Do not release this package because...
> >
> > Here's my +1.
> >
> > Thank you!
> >
> > Best,
> >
> >  Tim
>
>
>


Re: [VOTE] Apache Tika 2.8.0 Release Candidate 1

2023-05-10 Thread Oleg Tikhonov
 [x] +1 Release this package as Apache Tika 2.8.0
Ubuntu 20, java 11.
Thanks,
Oleg



> On Tue, May 9, 2023, 11:40 AM Tim Allison  wrote:
>
> > A candidate for the Tika 2.8.0 release is available at:
> > https://dist.apache.org/repos/dist/dev/tika/2.8.0
> >
> > The release candidate is a zip archive of the sources in:
> > https://github.com/apache/tika/tree/2.8.0-rc1/
> >
> > The SHA-512 checksum of the archive is
> >
> >
> 6b514a45b87013c566e57af2b6a526bce0b3bf02a1dabefe998068aa49672ec4a7ec2ecfa538a84aca719607f339a44341caeaab1ca313fc1c161154ec095bbb.
> >
> > In addition, a staged maven repository is available here:
> >
> >
> https://repository.apache.org/content/repositories/orgapachetika-1093/org/apache/tika
> >
> > Please vote on releasing this package as Apache Tika 2.8.0.
> > The vote is open for the next 72 hours and passes if a majority of at
> > least three +1 Tika PMC votes are cast.
> >
> > [ ] +1 Release this package as Apache Tika 2.8.0
> > [ ] -1 Do not release this package because...
> >
> > Here's my +1.
> >
> > Best,
> >
> > Tim
> >
>


Re: [VOTE] Release Apache Tika 2.7.0 Candidate #1

2023-02-02 Thread Oleg Tikhonov
Hey,
+1
Ubuntu, jdk 8 (Oracle).

Thanks,
Oleg

On Fri, Feb 3, 2023 at 6:09 AM Tilman Hausherr 
wrote:

> +1
>
> builds on german W10 with jdk8
>
> Tilman
>
> On 31.01.2023 20:13, Tim Allison wrote:
> > A candidate for the Tika 2.7.0 release is available at:
> > https://dist.apache.org/repos/dist/dev/tika/2.7.0
> >
> > The release candidate is a zip archive of the sources in:
> > https://github.com/apache/tika/tree/2.7.0-rc1/
> >
> > The SHA-512 checksum of the archive is
> >
> 7f3505f6a86b617a37f25f31f4c6b3e4028d2baab700a5fe4070d38d6f625dba3c18db4010da84acb71af14ffdb1259cc64ea10d8ec2a22fc56667bfe1b52ad7.
> >
> > In addition, a staged maven repository is available here:
> >
> https://repository.apache.org/content/repositories/orgapachetika-1092/org/apache/tika
> >
> > Please vote on releasing this package as Apache Tika 2.7.0.
> > The vote is open for the next 72 hours and passes if a majority of at
> > least three +1 Tika PMC votes are cast.
> >
> > [ ] +1 Release this package as Apache Tika 2.7.0
> > [ ] -1 Do not release this package because...
> >
> >
> > Here's my +1.
> >
> > Thank you!
> >
> > Best,
> >
> >Tim
>
>
>


Re: [VOTE] Release Apache Tika 2.6.0 Candidate #1

2022-11-04 Thread Oleg Tikhonov
[x] +1 Release this package as Apache Tika 2.6.0

[INFO]

[INFO] BUILD SUCCESS
[INFO]

[INFO] Total time:  38:54 min
[INFO] Finished at: 2022-11-04T23:47:16+02:00
[INFO]

oleg@oleg-vb:~/tika/tika$ java -version
openjdk version "11.0.16" 2022-07-19
OpenJDK Runtime Environment (build 11.0.16+8-post-Ubuntu-0ubuntu120.04)
OpenJDK 64-Bit Server VM (build 11.0.16+8-post-Ubuntu-0ubuntu120.04, mixed
mode, sharing)

Distributor ID: Ubuntu
Description: Ubuntu 20.04.5 LTS
Release: 20.04
Codename: focal

On Thu, Nov 3, 2022 at 10:19 PM Tilman Hausherr 
wrote:

> +1
>
> built on W10, jdk 1.8.0_351 and 11.0.16
>
> Tilman
>
> On 03.11.2022 14:47, Tim Allison wrote:
> > A candidate for the Tika 2.6.0 release is available at:
> > https://dist.apache.org/repos/dist/dev/tika/2.6.0
> >
> > The release candidate is a zip archive of the sources in:
> > https://github.com/apache/tika/tree/2.6.0-rc1/
> >
> > The SHA-512 checksum of the archive is
> >
> 6b1011304da6a43e17697695fa78f86bfafd6828be52baefadb9d562ea328e43a0ae99fa7e0f020a234173470ee29ae19c917c4562dfdc4cff27945bd7e46e69.
> >
> > In addition, a staged maven repository is available here:
> >
> https://repository.apache.org/content/repositories/orgapachetika-1091/org/apache/tika
> >
> > Please vote on releasing this package as Apache Tika 2.6.0.
> > The vote is open for the next 72 hours and passes if a majority of at
> > least three +1 Tika PMC votes are cast.
> >
> > [ ] +1 Release this package as Apache Tika 2.6.0
> > [ ] -1 Do not release this package because...
> >
> > Here's my +1.  Thank you, all!
> >
> > Best,
> >
> >Tim
> >
>
>


Re: Possibly speeding up tests with Gradle - anyone interested?

2022-10-05 Thread Oleg Tikhonov
Hi Nick,
Honestly, I am trying to port our project to Gradle, but it is not going well.
It is a good idea. If some folks can help, we can do it together.
+1
Cheers,
Oleg

On Wed, Oct 5, 2022, 22:05 Nick Burch  wrote:

> Hi All
>
> At ApacheCon this week, Bob and I ended up chatting with the folks
> from Gradle, who are keen to help ASF projects, and are discussing with
> the Infra team.
>
> The easier bit - they think they might be able to help speed up our maven
> build, especially the running of tests. Anyone have some time to give that
> a try? Will pass details along to anyone with the volunteer cycles
>
> The interesting bit - we told them about the regression corpus, and they
> got very excited as it sounds completely different to most of their normal
> "my build is slow" type problems. The size of it, and the fact that it
> isn't a simple pass/fail, seemed to catch their interest. Anyone (though
> probably only Tim...) interested in talking them through how it works, and
> maybe getting one of their team access to the VM?
>
> Cheers
> Nick
>


Re: [VOTE] Release Apache Tika 2.5.0 Candidate #1

2022-09-30 Thread Oleg Tikhonov
Ubuntu 20.04, java sdk 11,
+1

Thanks

On Fri, Sep 30, 2022, 21:33 Tilman Hausherr  wrote:

> +1
>
> builds on windows 10, oracle jdk1.8.0_341
>
> Tilman
>
> On 30.09.2022 16:12, Tim Allison wrote:
> > A candidate for the Tika 2.5.0 release is available at:
> > https://dist.apache.org/repos/dist/dev/tika/2.5.0
> >
> > The release candidate is a zip archive of the sources in:
> > https://github.com/apache/tika/tree/2.5.0-rc1/
> >
> > The SHA-512 checksum of the archive is
> >
> efaa33763dc5ddf01476ea6d10086595c372d8d80a269fc14ffe266becd247bf8080968c904cff8bcb460506de7ea492820d2fc0398b78017131bdb6e3e95b80.
> >
> > In addition, a staged maven repository is available here:
> >
> https://repository.apache.org/content/repositories/orgapachetika-1090/org/apache/tika
> >
> > Please vote on releasing this package as Apache Tika 2.5.0.
> > The vote is open for the next 72 hours and passes if a majority of at
> > least three +1 Tika PMC votes are cast.
> >
> > [ ] +1 Release this package as Apache Tika 2.5.0
> > [ ] -1 Do not release this package because...
> >
> > Here's my +1.
> >
> > Best,
> >
> >Tim
>
>
>


Re: [VOTE] Release Apache Tika 1.28.4 Candidate #1

2022-06-16 Thread Oleg Tikhonov
Hey,

[x] +1 Release this package as Apache Tika 1.28.4
Java 8, ubuntu 20, basic stuff.
Thanks,
Oleg

On Thu, Jun 16, 2022, 17:42 Konstantin Gribov  wrote:

> Built successfully on ArchLinux, OpenJDK 11 & 17 (Temurin-11.0.15+10 &
> 17.0.3+7) w/ Tesseract 5.1.0, Leptonica 1.82.
> The issue with the tesseract multipage test is still the same, it extracts
> "Page?2" instead of "Page 2" on my laptop.
>
> GPG signatures and SHA512 hashes are fine.
>
> [x] +1 Release this package as Apache Tika 1.28.4
> [ ] -1 Do not release this package because...
>
> --
> Best regards,
> Konstantin Gribov.
>
>
> On Tue, Jun 14, 2022 at 8:08 PM Tilman Hausherr 
> wrote:
>
> > +1
> >
> > Builds on Windows jdk8 and jdk11, test fail in
> > GeographicInformationParserTest.testISO19139() on jdk18 but we discussed
> > this before
> >
> > Tilman
> >
> > Am 13.06.2022 um 19:52 schrieb Tim Allison:
> > > A candidate for the Tika 1.28.4 release is available at:
> > > > https://dist.apache.org/repos/dist/dev/tika/1.28.4
> > >
> > > The release candidate is a zip archive of the sources in:
> > > > https://github.com/apache/tika/tree/1.28.4-rc1/
> > >
> > > The SHA-512 checksum of the archive is
> > >
> > >
> >
> 14ef2ba2a6ab3171c73db23867bc3e55095ada7d7d79ae707c2a2ae2d9ff5fcb23397ba4ce369746149a60e788a83c6653691b771e63378e13f5bf9cb4a306f3.
> > >
> > > In addition, a staged maven repository is available here:
> > >
> > >
> >
> https://repository.apache.org/content/repositories/orgapachetika-1087/org/apache/tika
> > >
> > > Please vote on releasing this package as Apache Tika 1.28.4.
> > > The vote is open for the next 72 hours and passes if a majority of at
> > > least three +1 Tika PMC votes are cast.
> > >
> > > [ ] +1 Release this package as Apache Tika 1.28.4
> > > [ ] -1 Do not release this package because...
> > >
> > > Here's my +1.
> > >
> >
> >
>


Re: [VOTE] Release Apache Tika 2.4.1 Candidate #1

2022-06-15 Thread Oleg Tikhonov
[x] +1 Release this package as Apache Tika 2.4.1

[INFO]

[INFO] BUILD SUCCESS
[INFO]

[INFO] Total time:  23:55 min
[INFO] Finished at: 2022-06-15T12:29:43+03:00
[INFO]


openjdk version "1.8.0_312"
OpenJDK Runtime Environment (build 1.8.0_312-8u312-b07-0ubuntu1~20.04-b07)
OpenJDK 64-Bit Server VM (build 25.312-b07, mixed mode)




On Wed, Jun 15, 2022 at 7:07 AM Tilman Hausherr 
wrote:

> +1
>
> builds on jdk8, 11 and 18 on windows 10
>
> Tilman
>
> Am 14.06.2022 um 19:45 schrieb Tim Allison:
> > A candidate for the Tika 2.4.1 release is available at:
> > https://dist.apache.org/repos/dist/dev/tika/2.4.1
> >
> > The release candidate is a zip archive of the sources in:
> > https://github.com/apache/tika/tree/2.4.1-rc1/
> >
> > The SHA-512 checksum of the archive is
> >
> 50fab2e229a559db5eb4fc84d18506add032ba142784ca8db8a76d128ef1daa08994b37cdd8559f96908497f79fcf8f8c54185122cc8fa7451b0624a48e9a204.
> >
> > In addition, a staged maven repository is available here:
> >
> https://repository.apache.org/content/repositories/orgapachetika-1088/org/apache/tika
> >
> > Please vote on releasing this package as Apache Tika 2.4.1.
> > The vote is open for the next 72 hours and passes if a majority of at
> > least three +1 Tika PMC votes are cast.
> >
> > [ ] +1 Release this package as Apache Tika 2.4.1
> > [ ] -1 Do not release this package because...
> >
> > Here's my +1.
> >
> > Best,
> >
> > Tim
> >
>
>


Re: [VOTE] Release Apache Tika 1.28.3 Candidate #1

2022-05-26 Thread Oleg Tikhonov
Hi,
Here is +1, ubuntu, java 11, x86_64.
Thanks,
Oleg

On Thu, May 26, 2022, 11:04 Tilman Hausherr  wrote:

> +1
>
> Tilman
>
> Am 23.05.2022 um 20:38 schrieb Tim Allison:
> > I'm indifferent but lean slightly towards going forward as is.
> >
> > If anyone has a hesitation, I'm happy to revert the upgrade and re-roll
> rc2.
> >
> > On Mon, May 23, 2022 at 1:21 PM Tilman Hausherr 
> > wrote:
> >
> >> Am 23.05.2022 um 18:54 schrieb Tim Allison:
> >>> If you revert org.apache.sis to the earlier version, do you get a clean
> >>> build on jdk18?  I just upgraded that earlier today.
> >> Yes that works.
> >>
> >> So the question is, do we care about this? Are there people who would
> >> use the "outdated" tika 1.* but still update to jdk18?
> >>
> >> Tilman
> >>
> >>
> >>>
> >>>
> >>> On Mon, May 23, 2022 at 12:40 PM Tilman Hausherr <
> thaush...@t-online.de>
> >>> wrote:
> >>>
> >>>> I get a failure when running a build on oracle jdk18 on windows (jdk8
> >>>> and jdk11 builds are fine):
> >>>>
> >>>> [ERROR] GeographicInformationParserTest.testISO19139:31->TikaTest.getXML:196->TikaTest.getXML:178->TikaTest.getXML:214 » Tika UnsupportedStorageException
> >>>>
>
>


Re: next release: 1.28.3?

2022-05-18 Thread Oleg Tikhonov
Good idea! +1.
Cheers,
Oleg

On Wed, May 18, 2022, 17:11 Tim Allison  wrote:

> All,
>   I propose kicking off a release for 1.28.3 early next week.  I've updated
> some dependencies.  What do you think?
>
>   Best,
>
>   Tim
>


Re: [VOTE] Release Apache Tika 2.4.0 Candidate #1

2022-04-29 Thread Oleg Tikhonov
Hi,
+1, Ubuntu 20, x86, Java 11.

Thanks!


> On 29 Apr 2022, at 2:23, Tim Allison  wrote:
> 
> A candidate for the Tika 2.4.0 release is available at:
> https://dist.apache.org/repos/dist/dev/tika/2.4.0
> 
> The release candidate is a zip archive of the sources in:
> https://github.com/apache/tika/tree/2.4.0-rc1/
> 
> The SHA-512 checksum of the archive is
> aff68637527fa4fa1ec21678ef2771a1dcd5eb3944bc1b1171c59459274295b903e093dc63ade0b6532bf137834d32bcb9cdf0d6a32efca187b9d6b8ac64f690.
> 
> In addition, a staged maven repository is available here:
> https://repository.apache.org/content/repositories/orgapachetika-1085/org/apache/tika
> 
> Please vote on releasing this package as Apache Tika 2.4.0.
> The vote is open for the next 72 hours and passes if a majority of at
> least three +1 Tika PMC votes are cast.
> 
> [ ] +1 Release this package as Apache Tika 2.4.0
> [ ] -1 Do not release this package because...
> 
> Here's my +1
> 
> Best,
> 
>  Tim



Re: [VOTE] Release Apache Tika 1.28.2 Candidate #2

2022-04-29 Thread Oleg Tikhonov
Hi,
+1.
Basic stuff, linux ubuntu 20, x86, java 11.
Thanks.

On Thu, Apr 28, 2022, 20:23 Tilman Hausherr  wrote:

> +1
>
> Tilman
>
> Am 28.04.2022 um 16:54 schrieb Tim Allison:
> > A candidate for the Tika 1.28.2 release is available at:
> > > https://dist.apache.org/repos/dist/dev/tika/1.28.2
> >
> > The release candidate is a zip archive of the sources in:
> > > https://github.com/apache/tika/tree/1.28.2-rc2/
> >
> > The SHA-512 checksum of the archive is
> >
> 035f3643a302e2a88f99ca549c4d5c5c6eecd7736d03e4a686b17028f519f6a7a40229e48f2aac0bdf1653391e0bd7d34d0c7d099a2e5a2cb6141df00a4181bf.
> >
> > In addition, a staged maven repository is available here:
> >
> https://repository.apache.org/content/repositories/orgapachetika-1083/org/apache/tika
> >
> > Please vote on releasing this package as Apache Tika 1.28.2.
> > The vote is open for the next 72 hours and passes if a majority of at
> > least three +1 Tika PMC votes are cast.
> >
> > [ ] +1 Release this package as Apache Tika 1.28.2
> > [ ] -1 Do not release this package because...
> >
> >
> > Here's my +1.
> >
> > Best,
> >
> > Tim
>
>
>


Re: [VOTE] Release Apache Tika 1.28.1 Candidate #1

2022-02-10 Thread Oleg Tikhonov
+1 , ubuntu 20.04, open jdk 11.
Thanks,
Oleg

On Fri, Feb 11, 2022, 04:34 David Meikle  wrote:

> Hello,
>
> On Tue, 8 Feb 2022 at 18:22, Tim Allison  wrote:
>
> > A candidate for the Tika 1.28.1 release is available at:
> >   https://dist.apache.org/repos/dist/dev/tika/1.28.1
> >
> > The release candidate is a zip archive of the sources in:
> >   https://github.com/apache/tika/tree/1.28.1-rc1/
> >
> > The SHA-512 checksum of the archive is
> >
> >
> 17e92425d1cb53932d883b890a98491d5744345a75fa159bab90d47449470705b0c23aa75af845c0c5e4a2c175879eafd368b59e7559168f12722428d4b45fa4.
> >
> > In addition, a staged maven repository is available here:
> >
> >
> https://repository.apache.org/content/repositories/orgapachetika-1081/org/apache/tika
> >
> > Please vote on releasing this package as Apache Tika 1.28.1.
> > The vote is open for the next 72 hours and passes if a majority of at
> > least three +1 Tika PMC votes are cast.
> >
> > [ ] +1 Release this package as Apache Tika 1.28.1
> > [ ] -1 Do not release this package because...
> >
>
> +1 from me.
>
> Cheers,
> Dave
>


Re: [VOTE] Release Apache Tika 2.3.0 Candidate 1

2022-02-06 Thread Oleg Tikhonov
Hi,
Linux Ubuntu 20.04, java 11.
+1
Thanks,
Oleg

On Sun, Feb 6, 2022, 22:05 Konstantin Gribov  wrote:

> Hi, folks.
>
> SHA512 checksums and GPG signatures are fine.
>
> Built successfully on ArchLinux, OpenJDK 17 & 11 (Temurin-17.0.1+12 &
> Temurin-11.0.13+8), Tesseract 5.0.1-2, Leptonica 1.82.0-1.
> Same issue with tesseract on multipage tiff.
>
> [x] +1 Release this package as Apache Tika 2.3.0
> [ ] -1 Do not release this package because...
>
> --
> Best regards,
> Konstantin Gribov.
>
>
> On Thu, Feb 3, 2022 at 10:03 PM Tilman Hausherr 
> wrote:
>
> > +1
> >
> > Successful build on german Windows 10. The build fail from a few days
> > ago is gone.
> >
> > Tilman
> >
> > Am 03.02.2022 um 02:29 schrieb Tim Allison:
> > > A candidate for the Tika 2.3.0 release is available at:
> > > https://dist.apache.org/repos/dist/dev/tika/2.3.0
> > >
> > > The release candidate is a zip archive of the sources in:
> > > https://github.com/apache/tika/tree/2.3.0-rc1/
> > >
> > > The SHA-512 checksum of the archive is
> > >
> >
> f7b50a72e59bbf0db46e34a0547ac35109319eb51a2b816b65353b648a615540d98f0f86852eb2aa3ff6c0892f57952750399f9969537feac5aed744de4eb5b3.
> > >
> > > In addition, a staged maven repository is available here:
> > >
> >
> https://repository.apache.org/content/repositories/orgapachetika-1080/org/apache/tika
> > >
> > > Please vote on releasing this package as Apache Tika 2.3.0.
> > > The vote is open for the next 72 hours and passes if a majority of at
> > > least three +1 Tika PMC votes are cast.
> > >
> > > [ ] +1 Release this package as Apache Tika 2.3.0
> > > [ ] -1 Do not release this package because...
> > >
> > > Here's my +1.
> > >
> > > Cheers,
> > >
> > > Tim
> >
> >
> >
>


Re: [VOTE] Release Apache Tika 1.28 Candidate #3

2021-12-21 Thread Oleg Tikhonov
Hi,
[x] +1 Release this package as Apache Tika 1.28

mvn clean install -U OK

*OS and arch*:
Linux oleg-vb 5.11.0-41-generic #45~20.04.1-Ubuntu SMP Wed Nov 10 10:20:10
UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

*Java version*:
openjdk version "1.8.0_312"
OpenJDK Runtime Environment (build 1.8.0_312-8u312-b07-0ubuntu1~20.04-b07)
OpenJDK 64-Bit Server VM (build 25.312-b07, mixed mode)

Thanks,
Oleg

On Mon, Dec 20, 2021 at 11:49 PM David Meikle  wrote:

> On Mon, 20 Dec 2021 at 16:31, Tim Allison  wrote:
>
> >
> > The SHA-512 checksum of the archive is
> >
> >
> f8487f58aeec011c993ac46d8e99f8bed64333ccfa57edf8ff9773653204fa2a4e27cb1102e53c181ae7a1e98f892da4c1766f473ce5ee83c1b9229c4f8e5aec.
> >
> > In addition, a staged maven repository is available here:
> >
> >
> https://repository.apache.org/content/repositories/orgapachetika-1079/org/apache/tika
> >
> > Please vote on releasing this package as Apache Tika 1.28.
> > The vote is open for the next 72 hours and passes if a majority of at
> > least three +1 Tika PMC votes are cast.
> >
> > [ ] +1 Release this package as Apache Tika 1.28
> > [ ] -1 Do not release this package because...
> >
>
> +1, thanks Tim!
>
> Cheers,
> Dave
>


Re: [VOTE] Release Apache Tika 2.2.1 Candidate #3

2021-12-20 Thread Oleg Tikhonov
Hi,

[x] +1 Release this package as Apache Tika 2.2.1

mvn clean install -U *OK*

OS and arch:
Linux oleg-vb 5.11.0-41-generic #45~20.04.1-Ubuntu SMP Wed Nov 10 10:20:10
UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

Java version:
openjdk version "1.8.0_312"
OpenJDK Runtime Environment (build 1.8.0_312-8u312-b07-0ubuntu1~20.04-b07)
OpenJDK 64-Bit Server VM (build 25.312-b07, mixed mode)

Thanks!
Oleg

On Tue, Dec 21, 2021 at 12:28 AM Dave Meikle  wrote:

> On Mon, 20 Dec 2021 at 15:59, Tim Allison  wrote:
>
> > A candidate for the Tika 2.2.1 release is available at:
> > https://dist.apache.org/repos/dist/dev/tika/2.2.1
> >
> > The release candidate is a zip archive of the sources in:
> > https://github.com/apache/tika/tree/2.2.1-rc3/
> >
> > The SHA-512 checksum of the archive is
> >
> >
> 42accd01d5f152a9a6b26883b735242fb6e9eb01f85f3ff752fc970d413c746dc0a875c89f9e3942012491954d59e3ea2b05ab4419a8876937b73fedb8c29a4e.
> >
> > In addition, a staged maven repository is available here:
> >
> >
> https://repository.apache.org/content/repositories/orgapachetika-1078/org/apache/tika
> >
> > Please vote on releasing this package as Apache Tika 2.2.1.
> > The vote is open for the next 72 hours and passes if a majority of at
> > least three +1 Tika PMC votes are cast.
> >
> > [ ] +1 Release this package as Apache Tika 2.2.1
> > [ ] -1 Do not release this package because...
> >
>
> +1. Thanks Tim!
>
> Cheers,
> Dave
>


Re: [VOTE] Release Apache Tika 2.2.0 Candidate #1

2021-12-14 Thread Oleg Tikhonov
+1 

> On 15 Dec 2021, at 0:01, Tim Allison  wrote:
> 
> +1
> 
> On Tue, Dec 14, 2021 at 4:31 PM Lewis John McGibbney 
> wrote:
> 
>> I'll submit a PR for the README but I think it's also worthwhile to augment
>> the release management guide so that the message to review the release
>> candidate includes this information.
>> lewismc
>> 
>> On 2021/12/14 20:17:05 Tim Allison wrote:
>>> Y, you're right. Lewis, where should we mention the Docker requirement
>>> on our site?
>>> 
>>> On Tue, Dec 14, 2021 at 3:06 PM Lewis John McGibbney 
>> wrote:
 
>>>> Hi Ken,
>>>> 
>>>> On 2021/12/13 22:38:49 Ken Krugler wrote:
>>>>> That error looks like you’ve got a connection issue with the Maven
>>>>> central repo…
>>>>> 
>>>>> — Ken
>>>> 
>>>> Yes you are correct :)
>>>> 
>>>> Once that issue sorted itself out my local build passed so my +1
>>>> stands.
>>>> 
>>>> I think it is worthwhile us stating that Docker is a prerequisite for
>>>> installing from source. This is required for the tika-pipes* modules.
>>>> 
>>>> lewismc
>>> 
>> 
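
Since the thread above notes that Docker is a prerequisite for building from source (for the tika-pipes* modules), a pre-build check could look roughly like the sketch below. It only tests whether an executable is on PATH; it does not confirm that the Docker daemon is running. The function name is hypothetical, and listing `mvn` and `java` alongside `docker` is an assumption based on how voters in these threads build.

```python
import shutil


def missing_tools(required, which=shutil.which):
    """Return the names in `required` that cannot be found on PATH."""
    return [name for name in required if which(name) is None]


if __name__ == "__main__":
    # "mvn" and "java" are assumed prerequisites; only "docker" is named in the thread.
    missing = missing_tools(["docker", "mvn", "java"])
    if missing:
        print("Missing prerequisites: " + ", ".join(missing))
    else:
        print("All prerequisites found on PATH.")
```

The `which` parameter is injectable so the check can be exercised without depending on what is installed locally.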



Re: [VOTE] Release Apache Tika 2.1.0 Candidate #2

2021-08-23 Thread Oleg Tikhonov
+1 basic stuff, ubuntu 20.04, java 11
Thanks,
Oleg

On Mon, Aug 23, 2021, 20:58 Konstantin Gribov  wrote:

> Hi, Tim.
>
> SHA512 and gpg signatures are fine, build succeeds on Linux/OpenJDK11
> except Tesseract issue (same as before, 4.1.1 extracts "Page?2" instead of
> "Page 2" in multipage test).
> Some tests fail on OpenJDK16 because of more strict jdk internals
> encapsulation.
>
> +1
>
> --
> Best regards,
> Konstantin Gribov.
>
>
> On Fri, Aug 20, 2021 at 7:02 AM Tilman Hausherr 
> wrote:
>
> > +1
> >
> > Tilman
> >
> > Am 19.08.2021 um 20:14 schrieb Tim Allison:
> > > A second candidate for the Tika 2.1.0 release is available at:
> > > https://dist.apache.org/repos/dist/dev/tika/
> > >
> > > The release candidate is a zip archive of the sources in:
> > > https://github.com/apache/tika/tree/2.1.0-rc2/
> > >
> > > The SHA-512 checksum of the archive is
> > >
> >
> c3d695b1d2104c6196a0656f4e8e54f860651de3c767642262c51c061261e5a885fe4519f0cd974673cc33a89d5c05d1937861957454542fc4d51108c0e3c1c5.
> > >
> > > In addition, a staged maven repository is available here:
> > >
> >
> https://repository.apache.org/content/repositories/orgapachetika-1072/org/apache/tika
> > >
> > > Please vote on releasing this package as Apache Tika 2.1.0.
> > > The vote is open for the next 72 hours and passes if a majority of at
> > > least three +1 Tika PMC votes are cast.
> > >
> > > [ ] +1 Release this package as Apache Tika 2.1.0
> > > [ ] -1 Do not release this package because...
> > >
> > > Here's my +1.
> > >
> > > Thank you!
> > >
> > > Best,
> > >
> > >   Tim
> >
> >
> >
>


Re: [DISCUSS] Support Elasticsearch in the tika-pipes module?

2021-07-26 Thread Oleg Tikhonov
Hi Tim,
I would prefer to cut our support for licenses outside the Apache realm.
Thanks,
Oleg

On Tue, Jul 27, 2021, 00:08 Tim Allison  wrote:

> All,
>
>   As you may have heard, Amazon forked the last Apache licensed
> version of Elasticsearch and is now releasing it as pure ASL 2.0 under
> the name "OpenSearch". Elasticsearch's license is no longer open
> source.[1]
>
> Currently the OpenSearch emitter works with the 7.x version of
> Elasticsearch.  Going forward, when the projects diverge:
>
> a) do we want to support Elasticsearch and
> b) are we able to pull in a non-ASL 2.0 docker image for our unit
> tests of Elasticsearch
>
> Cheers,
>
>Tim
>
>
> [1] https://www.elastic.co/pricing/faq/licensing
>


Re: [VOTE] Release Apache Tika 2.0.0 Candidate #1

2021-07-18 Thread Oleg Tikhonov
+1
Thanks,
Oleg

> On 19 Jul 2021, at 4:04, Dave Meikle  wrote:
> 
> +1
> 
> Cheers,
> Dave
> 
> On Wed, 14 Jul 2021 at 19:16, Tim Allison  wrote:
> 
>> All,
>>  A candidate for the Tika 2.0.0 release is available at:
>>  https://dist.apache.org/repos/dist/dev/tika/2.0.0
>> 
>>  The release candidate is a zip archive of the sources in:
>>  https://github.com/apache/tika/tree/2.0.0-rc1/
>> 
>>  The SHA-512 checksum of the archive is
>> 
>> 31d1f2e3deb54c398fa2d4bf00c434aad3f08387debf2a34dabe6d36747bcc49f2874cbd3abe7d1209670db8284ea540bca3b574ccd1d6b8f8675bdc3f704568.
>> 
>>  In addition, a staged maven repository is available here:
>> 
>> https://repository.apache.org/content/repositories/orgapachetika-1070
>> 
>>  Please vote on releasing this package as Apache Tika 2.0.0.
>>  The vote is open for the next 72 hours and passes if a majority of at
>>  least three +1 Tika PMC votes are cast.
>> 
>>  [ ] +1 Release this package as Apache Tika 2.0.0
>>  [ ] -1 Do not release this package because...
>> 
>> Here's my +1.
>> 
>> Cheers,
>> 
>>  Tim
>> 



Re: [VOTE] Release Apache Tika 1.27 Candidate #1

2021-07-02 Thread Oleg Tikhonov
[x] +1 Release this package as Apache Tika 1.27


> On 2 Jul 2021, at 21:21, Tilman Hausherr  wrote:
> 
> +1
> 
> Tilman
> 
> Am 30.06.2021 um 22:03 schrieb Tim Allison:
>> A candidate for the Tika 1.27 release is available at:
>>   https://dist.apache.org/repos/dist/dev/tika/1.27
>> 
>> The KEYS file for the release is available:
>>   https://dist.apache.org/repos/dist/dev/tika/KEYS
>> 
>> The release candidate is a zip archive of the sources in:
>>   https://github.com/apache/tika/tree/1.27-rc1/
>> 
>> The SHA-512 checksum of the archive is
>>   
>> ebdd382fcfdc25ee2f9d35c8231467179a2cc867c6d2c6736536e9cb366846afb439bd4ce778f52118e2b645b6e27a38d343c8f9f82d05131c6f6d4d3bfef43d.
>> 
>> In addition, a staged maven repository is available here:
>>   
>> https://repository.apache.org/content/repositories/orgapachetika-1069/org/apache/tika
>> 
>> Please vote on releasing this package as Apache Tika 1.27.
>> The vote is open for the next 72 hours and passes if a majority of at
>> least three +1 Tika PMC votes are cast.
>> 
>> [ ] +1 Release this package as Apache Tika 1.27
>> [ ] -1 Do not release this package because...
>> 
>> Here's my +1.
>> 
>> Cheers,
>> 
>>Tim
> 
> 



Re: [VOTE] Release Apache Tika 2.0.0-BETA Candidate #1

2021-05-20 Thread Oleg Tikhonov
Hi Tim,
My +1.
Ubuntu 20, basic stuff.
Java 11.

Best regards,
Oleg

> On 19 May 2021, at 18:29, Tim Allison  wrote:
> 
> All,
> 
> A candidate for the Tika 2.0.0-BETA release is available at:
> https://dist.apache.org/repos/dist/dev/tika/
> 
> The release candidate is a zip archive of the sources in:
> https://github.com/apache/tika/tree/tika-2.0.0-BETA-rc1
> 
> The SHA-512 checksum of the archive is
> 
> 8d6376fd87e8d8da55a0ee6b1bf7fc1c2a90145cc6d3c0014a86fbba49f853861d87f94c53a8d04733c019bc8934a5dc9a1e3859b7b544e29527c7f1a70bd751.
> 
> In addition, a staged maven repository is available here:
> https://repository.apache.org/content/repositories/orgapachetika-1068/org/apache/tika
> 
> Please vote on releasing this package as Apache Tika 2.0.0-BETA.
> The vote is open for the next 72 hours and passes if a majority of at
> least three +1 Tika PMC votes are cast.
> 
> [ ] +1 Release this package as Apache Tika 2.0.0-BETA
> [ ] -1 Do not release this package because...
> 
> Here's my +1.
> 
> Cheers,
> 
>   Tim
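
The rule repeated in each vote email ("passes if a majority of at least three +1 Tika PMC votes are cast") can be written down as a small tally. The sketch below is one reading of that sentence, not an official ASF or Tika tool: it assumes only PMC (binding) votes are passed in, that at least three of them must be +1, and that +1 votes must outnumber -1 votes. The function name is hypothetical.

```python
def release_vote_passes(binding_votes):
    """Decide a release vote from PMC votes given as +1, 0, or -1 integers.

    Assumption: non-binding votes have already been filtered out by the caller.
    """
    plus = sum(1 for v in binding_votes if v > 0)
    minus = sum(1 for v in binding_votes if v < 0)
    # At least three +1 votes, and more +1 than -1 votes.
    return plus >= 3 and plus > minus
```

Under this reading, two +1 votes are not enough even with no objections, which is why a release manager sometimes has to ask for one more vote before closing a thread.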



Re: 2.0.0-BETA?

2021-05-11 Thread Oleg Tikhonov
Hi Tim,
Thanks for the effort! +1.
BR,
Oleg

On Tue, May 11, 2021, 16:51 Tim Allison  wrote:

> All,
>   What would you say to a beta release towards the end of this
> week/beginning of next?
>
>  Cheers,
>
>  Tim
>


Re: Release 1.27?

2021-04-28 Thread Oleg Tikhonov
 +1

On Wed, Apr 28, 2021, 19:22 Tim Allison  wrote:

> All,
>
>   There have been a number of key fixes in 1.x and some security fixes
> in some of our dependencies.  Any objections to starting the release
> process for 1.27 in the next few weeks?  Any blockers we need to fix
> for 1.27?
>
>  Cheers,
>
>Tim
>
> ref: https://issues.apache.org/jira/browse/TIKA-3375
>


Re: [VOTE] Accept tika-helm source code into the Apache Tika project

2021-04-09 Thread Oleg Tikhonov
Great! +1


On Fri, Apr 9, 2021, 06:10 Lewis John McGibbney  wrote:

> Hi dev@,
>
> I am opening this VOTE with the goal of donating the tika-helm source code
> [0] into the Apache Tika project.
> Tika-helm is a Helm chart [1] to deploy Apache Tika on Kubernetes (K8s)
> [2]. More specifically the chart is a really lightweight way to configure
> and run the official Tika Docker image [3] on K8s.
>
> This VOTE is a prerequisite before intellectual property clearance [4] can
> begin.
>
> [ ] +1 Accept tika-helm source code into the Apache Tika project
> [ ] +/-0 ... because
> [ ] -1 I DO NOT want to accept the tika-helm source code into the Apache
> Tika project (please state why)
>
> This VOTE will be open for a minimum of 72hrs. Thank you in advance for
> taking the time to review the proposed source code contribution and for
> VOTE'ing.
>
> lewismc
> P.S. +1 from me Tika PMC-binding
>
> [0] https://github.com/lewismc/tika-helm
> [1] https://helm.sh/docs/topics/charts/
> [2] https://kubernetes.io/
> [3] https://github.com/apache/tika-docker
> [4] http://incubator.apache.org/ip-clearance/
>


Re: [VOTE] Release Apache Tika 1.26 Candidate #1

2021-03-25 Thread Oleg Tikhonov
[INFO]
[INFO] Reactor Summary for Apache Tika 1.26:
[INFO]
[INFO] Apache Tika parent ............................... SUCCESS [ 40.841 s]
[INFO] Apache Tika core ................................. SUCCESS [01:08 min]
[INFO] Apache Tika parsers .............................. SUCCESS [12:28 min]
[INFO] Apache Tika OSGi bundle .......................... SUCCESS [01:14 min]
[INFO] Apache Tika XMP .................................. SUCCESS [  3.368 s]
[INFO] Apache Tika serialization ........................ SUCCESS [  2.942 s]
[INFO] Apache Tika batch ................................ SUCCESS [03:08 min]
[INFO] Apache Tika language detection ................... SUCCESS [  7.074 s]
[INFO] Apache Tika application .......................... SUCCESS [02:02 min]
[INFO] Apache Tika translate ............................ SUCCESS [  6.288 s]
[INFO] Apache Tika server ............................... SUCCESS [02:19 min]
[INFO] Apache Tika fuzzing .............................. SUCCESS [  2.566 s]
[INFO] Apache Tika eval ................................. SUCCESS [01:07 min]
[INFO] Apache Tika examples ............................. SUCCESS [ 26.966 s]
[INFO] Apache Tika Java-7 Components .................... SUCCESS [  5.032 s]
[INFO] Apache Tika Deep Learning (powered by DL4J) ...... SUCCESS [08:34 min]
[INFO] Apache Tika Natural Language Processing .......... SUCCESS [02:21 min]
[INFO] Apache Tika ...................................... SUCCESS [  0.033 s]
[INFO]
[INFO] BUILD SUCCESS
[INFO]
[INFO] Total time:  36:13 min
[INFO] Finished at: 2021-03-25T15:29:50+02:00
[INFO]

[x] +1 Release this package as Apache Tika 1.26
Thanks, Tim!

On Thu, Mar 25, 2021 at 5:48 AM Tilman Hausherr 
wrote:

> +1
>
> Tilman
>
> Am 24.03.2021 um 16:07 schrieb Tim Allison:
> > All,
> >
> > A candidate for the Tika 1.26 release is available at:
> >https://dist.apache.org/repos/dist/dev/tika/
> >
> > The release candidate is a zip archive of the sources in:
> >https://github.com/apache/tika/tree/1.26-rc1/
> >
> > The SHA-512 checksum of the archive is
> >
> 4076ebaca9acc9e0d1e82e6e33ba470717c6976b0d674457a08987b8e4da27a107756f169f30ba7b3cb0b18a016d55d6de2cbf70a359e97ab90309c988589181.
> >
> > In addition, a staged maven repository is available here:
> >
> https://repository.apache.org/content/repositories/orgapachetika-1067/org/apache/tika
> >
> > Please vote on releasing this package as Apache Tika 1.26.
> > The vote is open for the next 72 hours and passes if a majority of at
> > least three +1 Tika PMC votes are cast.
> >
> > [ ] +1 Release this package as Apache Tika 1.26
> > [ ] -1 Do not release this package because...
> >
> > Here's my +1.
> >
> > Cheers,
> >
> >Tim
>
>
>


Re: [VOTE] Release Apache Tika 2.0.0-ALPHA Candidate #1

2021-01-15 Thread Oleg Tikhonov
+1.
Good job!


On Thu, Jan 14, 2021 at 8:44 PM Tilman Hausherr 
wrote:

> +1
>
> Tilman
>
> Am 14.01.2021 um 02:19 schrieb Tim Allison:
> > All,
> >
> > A candidate for the Tika 2.0.0-ALPHA release is available at:
> >https://dist.apache.org/repos/dist/dev/tika/
> >
> > The release candidate is a zip archive of the sources in:
> >https://github.com/apache/tika/tree/2.0.0-ALPHA-rc1/
> >
> > The SHA-512 checksum of the archive is
> >
> >
> ae018f4384d2cd63281422cc82ec71a5b6f5d64ac29b343d714737e6b35fee6e5d0190cd065bf069948eadeeea831c5d74a6da6a554f049d3075f40eeb984f13.
> >
> > In addition, a staged maven repository is available here:
> >
> >
> https://repository.apache.org/content/repositories/orgapachetika-1065/org/apache/tika
> >
> > Please vote on releasing this package as Apache Tika 2.0.0-ALPHA.
> > The vote is open for the next 72 hours and passes if a majority of at
> > least three +1 Tika PMC votes are cast.
> >
> > Note: there may be still breaking changes before the formal release of
> > 2.0.0.
> >
> > Here's my +1.
> >
> > Best,
> >
> >  Tim
> >
> > [ ] +1 Release this package as Apache Tika 2.0.0-ALPHA
> > [ ] -1 Do not release this package because...
> >
>
>


Re: [VOTE] Release Apache Tika 1.25 Candidate #2

2020-11-27 Thread Oleg Tikhonov
Here is my +1.
Did basic stuff. Seems ok.
Thanks!

On Thu, Nov 26, 2020, 01:15 Ken Krugler  wrote:

> +1
>
> Thanks Tim.
>
> — Ken
>
> > On Nov 25, 2020, at 4:20 AM, Tim Allison  wrote:
> >
> > A candidate for the Tika 1.25 release is available at:
> >   https://dist.apache.org/repos/dist/dev/tika/
> >
> > The release candidate is a zip archive of the sources in:
> >   https://github.com/apache/tika/tree/1.25-rc2/
> >
> > The SHA-512 checksum of the archive is
> >
>  
> 542a04724c6e3852845b6793b8abd60b2baa7a96aed8f50d372f5ed1ede3d62c3c438e56f7d1ddf77ec8a4663eb9f06dc69863c633abc0c3ee4d4b4bd6086ec0.
> >
> > In addition, a staged maven repository is available here:
> >
> >   https://repository.apache.org/content/repositories/orgapachetika-1063/org/apache/tika
> >
> > Please vote on releasing this package as Apache Tika 1.25.
> > The vote is open for the next 72 hours and passes if a majority of at
> > least three +1 Tika PMC votes are cast.
> >
> > [ ] +1 Release this package as Apache Tika 1.25
> > [ ] -1 Do not release this package because...
> >
> > Here's my +1.
> >
> > Cheers,
> >
> >Tim
>
> --
> Ken Krugler
> http://www.scaleunlimited.com
> custom big data solutions & training
> Hadoop, Cascading, Cassandra & Solr
>
>


Re: [EXTERNAL] Tika 2.0 modularization

2020-08-18 Thread Oleg Tikhonov
Hi Tim,
looks awesome.
Somehow I did not find a couple of parsers; that is probably because the
work is still ongoing ...
In addition, I was thinking about "getting rid of" Maven. If we are going
to make Tika more modern, maybe Gradle can do the trick?
Do we plan to adopt newer Java "goodies" like lambdas, the foreign-memory
access API, records ...?

WDYT?
BR,
Oleg




On Tue, Aug 18, 2020 at 5:41 PM Tim Allison  wrote:

> If anyone has any time, please take a look here:
> https://github.com/apache/tika/tree/branch_2x/tika-parser-modules
>
> Does this basically look ok?
>
> I've put the integration tests in
> https://github.com/apache/tika/tree/branch_2x/tika-parser-integration-tests
> ... that doesn't build yet.
>
> I've flipped Bob's design so that the integration tests pull test files
> from the individual parser modules via test-jar.
>
> On Fri, Aug 14, 2020 at 3:30 PM Bob Paulin  wrote:
>
> > +1 excited about this.
> >
> > - Bob
> > On 8/14/2020 11:29 AM, Sergey Beryozkin wrote:
> >
> > +1 😀
> >
> > Cheers Sergey
> >
> > On Fri 14 Aug 2020, 18:26 Chris Mattmann <mattm...@apache.org> wrote:
> >
> >
> > Haha  I’m down and supportive!
> >
> >
> >
> > Time’s TIME FOR 2.x 😊
> >
> >
> >
> >
> >
> >
> >
> > From: Tim Allison  
> > Reply-To: "dev@tika.apache.org" <dev@tika.apache.org>, "Allison, Tim (US 174B-Affiliate)" <timothy.b.alli...@jpl.nasa.gov>
> > Date: Friday, August 14, 2020 at 6:06 AM
> > To: " " 
> 
> > Subject: [EXTERNAL] Tika 2.0 modularization
> >
> >
> >
> > All,
> >
> >   I _think_ I might have some time to start working on integrating Bob's
> > work on the current main branch.  I'll have to ignore most of the incoming
> > issues for a bit...unlike the last 4 years...this time I mean it. :)
> >
> >   Let me know if there are any objections to heading down this path now.
> >
> >
> >
> >Cheers,
> >
> >
> >
> >   Tim
> >
> >
> >
> >
> >
> >
>


Re: renaming master?

2020-06-16 Thread Oleg Tikhonov
Hi Tim,
for me, "main" makes more sense.
But, no objection to any other option!

Thanks,
Oleg

On Tue, Jun 16, 2020 at 8:31 PM Tim Allison  wrote:

> All,
>
>   As you may have seen, there's a movement to rename the "master" branch to
> "main" or "trunk" (at least in the U.S.)[1][2].  Github is doing this, and
> I personally think this makes sense.
>
>   Are there any objections if we change "master"?  If we do change it, is
> there a preference for "main", "trunk" or something else?
>
>   My personal preference would be for trunk, but I'm open.
>
>  Best,
>
>  Tim
>
> [1]
>
> https://www.zdnet.com/article/github-to-replace-master-with-alternative-term-to-avoid-slavery-references/
> [2] https://www.bbc.com/news/technology-53050955
>


Re: [VOTE] Release Apache Tika 1.24.1 Candidate #1

2020-04-18 Thread Oleg Tikhonov
Hi Tim,
Thanks for doing this!
I've run all the basic stuff on Ubuntu 18 with Java 8.
All tests passed.
Here is my +1.

BR,
Oleg


On Sat, Apr 18, 2020 at 12:38 AM Tim Allison  wrote:

> A candidate for the Tika 1.24.1 release is available at:
>   https://dist.apache.org/repos/dist/dev/tika/
>
> The release candidate is a zip archive of the sources in:
>   https://github.com/apache/tika/tree/1.24.1-rc1/
>
> The SHA-512 checksum of the archive is
>
>
> fd9ede54484483f39bbefcc6cb556c25e73ee37be3ebf8d905a6de664e0d0c5ea766798f7a54c52502ad32b6f9de3f8869c84021d9b1ba8fae1661d6c2c1f43b.
>
> In addition, a staged maven repository is available here:
>
>
> https://repository.apache.org/content/repositories/orgapachetika-1061/org/apache/tika
>
> Please vote on releasing this package as Apache Tika 1.24.1.
> The vote is open for the next 72 hours and passes if a majority of at
> least three +1 Tika PMC votes are cast.
>
> [ ] +1 Release this package as Apache Tika 1.24.1
> [ ] -1 Do not release this package because...
>
> Here's my +1.
>
> Cheers,
>
>   Tim
>


Re: 1.24.1?

2020-04-15 Thread Oleg Tikhonov
+1. Seems ok to me.
Thanks,
Oleg

On Wed, Apr 15, 2020, 00:18 Tim Allison  wrote:

> I fixed the hwp5 multithreading problem.
>
> I looked into tar files, and the handful I reviewed had a "skip the rest of
> the final block with x bytes", but there weren't actually x bytes.  This
> didn't harm extraction because this happened on the last block.  Folks will
> get more exceptions, but will get the same content.  I think this is ok on
> balance given the improved safety we're getting with skip->skipFully in
> TikaInputStream.
>
> We do have more exceptions in mp4, but I think that is mostly on truncated
> files.
>
> In short, I _think_ we're ready to go for 1.24.1.  Please take a look at
> the reports and let me know what you think.
>
> Best,
>
>  Tim
>
> On Tue, Apr 14, 2020 at 10:36 AM Tim Allison  wrote:
>
> > All,
> >   We've made some important bug fixes since 1.24.  I recently ran the
> > regression tests locally.  The reports are here:
> >
> >
> >
> https://github.com/tballison/share/blob/master/tika_comparisons/tika_1_24_1_reports.tgz
> >
> >   We're getting more exceptions with .tar on "read the rest of the
> > block".  I'll look into this; my initial impression is that these files
> > are not truncated.
> >
> >   We're also getting more exceptions on mp4 with 0-length records, which,
> > I think, is a side effect of truncation.
> >
> >   Let me know what else you see.
> >
> >Cheers,
> >
> >   Tim
> >
>
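
The tar behaviour Tim describes above (a final block that asks to skip x bytes when fewer than x bytes remain) comes down to the difference between a lenient skip and a strict one. The sketch below illustrates that difference in Python rather than Java; it is not the TikaInputStream code, and both function names are hypothetical. The lenient version silently stops at end of stream, while the strict version raises, which matches the report of more exceptions with the same extracted content.

```python
import io


def skip_lenient(stream, n):
    """Skip up to n bytes and return how many were actually skipped."""
    skipped = 0
    while skipped < n:
        chunk = stream.read(min(n - skipped, 8192))
        if not chunk:
            # End of stream: give up quietly, the caller may never notice.
            break
        skipped += len(chunk)
    return skipped


def skip_fully(stream, n):
    """Skip exactly n bytes, or raise EOFError if the stream ends early."""
    remaining = n
    while remaining > 0:
        chunk = stream.read(min(remaining, 8192))
        if not chunk:
            raise EOFError(
                f"wanted to skip {n} bytes, stream ended {remaining} bytes short"
            )
        remaining -= len(chunk)
```

With a short trailing block, `skip_lenient` returns a smaller count and processing continues, whereas `skip_fully` surfaces the truncation as an error after the content has already been read.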


Re: [EXTERNAL] Re: JDK 12 build issues

2020-03-18 Thread Oleg Tikhonov
Hi Chris,
I'm currently trying to build an env with Java 12/13 in order to try
your setup.
Which Java version are you using? OpenJDK or Oracle?
Once upon a time there was a bug in OpenJDK:
https://bugs.openjdk.java.net/browse/JDK-8131146
But it seems to be OK in recent releases.

I'll keep you updated.
Cheers,
Oleg


On Wed, Mar 18, 2020 at 4:35 PM Chris Mattmann  wrote:

> So I was able to get past my issues with Tesseract by reinstalling the
> latest version with Brew.
>
>
>
> I have a new issue!
>
> I’ve tried in JDK12 and JDK13 to build tika-dl, but it keeps failing:
>
>
>
> [INFO]
>
> [INFO] --- maven-compiler-plugin:3.8.0:testCompile (default-testCompile) @
> tika-dl ---
>
> [INFO] Changes detected - recompiling the module!
>
> [INFO] Compiling 2 source files to
> /Users/mattmann/src/tika/tika-dl/target/test-classes
>
> [INFO]
>
> [INFO] --- maven-surefire-plugin:3.0.0-M4:test (default-test) @ tika-dl ---
>
> [INFO]
>
> [INFO] ---
>
> [INFO]  T E S T S
>
> [INFO] ---
>
> [INFO] Running org.apache.tika.dl.imagerec.DL4JVGG16NetTest
>
> log4j:WARN No appenders could be found for logger
> (org.nd4j.linalg.factory.Nd4jBackend).
>
> log4j:WARN Please initialize the log4j system properly.
>
> log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for
> more info.
>
> [ERROR] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed:
> 3.38 s <<< FAILURE! - in org.apache.tika.dl.imagerec.DL4JVGG16NetTest
>
> [ERROR] org.apache.tika.dl.imagerec.DL4JVGG16NetTest.recognise  Time
> elapsed: 3.29 s  <<< ERROR!
>
> org.apache.tika.exception.TikaConfigException: java.io.UTFDataFormatException:
> malformed input around byte 11
>
>at
> org.apache.tika.dl.imagerec.DL4JVGG16NetTest.recognise(DL4JVGG16NetTest.java:36)
>
> Caused by: java.lang.RuntimeException: java.io.UTFDataFormatException:
> malformed input around byte 11
>
>at
> org.apache.tika.dl.imagerec.DL4JVGG16NetTest.recognise(DL4JVGG16NetTest.java:36)
>
> Caused by: java.io.UTFDataFormatException: malformed input around byte 11
>
>at
> org.apache.tika.dl.imagerec.DL4JVGG16NetTest.recognise(DL4JVGG16NetTest.java:36)
>
>
>
> [INFO] Running org.apache.tika.dl.imagerec.DL4JInceptionV3NetTest
>
> [INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed:
> 5.392 s - in org.apache.tika.dl.imagerec.DL4JInceptionV3NetTest
>
> [INFO]
>
> [INFO] Results:
>
> [INFO]
>
> [ERROR] Errors:
>
> [ERROR]   DL4JVGG16NetTest.recognise:36 » TikaConfig 
> java.io.UTFDataFormatException:
> mal...
>
> [INFO]
>
> [ERROR] Tests run: 2, Failures: 0, Errors: 1, Skipped: 0
>
> [INFO]
>
> [INFO]
> 
>
> [INFO] BUILD FAILURE
>
> [INFO]
> 
>
> [INFO] Total time:  25.628 s
>
> [INFO] Finished at: 2020-03-18T07:34:08-07:00
>
> [INFO]
> 
>
> [ERROR] Failed to execute goal
> org.apache.maven.plugins:maven-surefire-plugin:3.0.0-M4:test (default-test)
> on project tika-dl: There are test failures.
>
> [ERROR]
>
> [ERROR] Please refer to
> /Users/mattmann/src/tika/tika-dl/target/surefire-reports for the individual
> test results.
>
> [ERROR] Please refer to dump files (if any exist) [date].dump,
> [date]-jvmRun[N].dump and [date].dumpstream.
>
> [ERROR] -> [Help 1]
>
> [ERROR]
>
> [ERROR] To see the full stack trace of the errors, re-run Maven with the
> -e switch.
>
> [ERROR] Re-run Maven using the -X switch to enable full debug logging.
>
> [ERROR]
>
> [ERROR] For more information about the errors and possible solutions,
> please read the following articles:
>
> [ERROR] [Help 1]
> http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
>
> pomodoro:tika-dl mattmann$
>
>
>
> Thamme, do you have any ideas what is going on here?
>
>
> Cheers,
>
> Chris
>
>
>
>
>
>
>
>
>
> From: Tim Allison 
> Reply-To: "dev@tika.apache.org" , "Allison, Timothy
> B (US 1760-Affiliate)" 
> Date: Wednesday, March 18, 2020 at 2:35 AM
> To: "dev@tika.apache.org" 
> Subject: [EXTERNAL] Re: JDK 12 build issues
>
>
>
> Haven’t tried...we should add java 12-14 to Jenkins.
>
>
>
> Wait, are we up to 18 yet...
>
>
>
> Will look into it...
>
>
>
> On Tue, Mar 17, 2020 at 10:07 PM Chris Mattmann 
> wrote:
>
>
>
> Hey Tim et al.,
>
>
>
>
>
>
>
> Do the tests fail for you with Java 12?
>
>
>
>
>
>
>
> [INFO] Running org.apache.tika.parser.pkg.GzipParserTest
>
>
>
> [INFO] Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed:
>
> 0.397 s - in org.apache.tika.parser.pkg.GzipParserTest
>
>
>
> [INFO] Running org.apache.tika.TestXMLEntityExpansion
>
>
>
> [WARNING] Tests run: 3, Failures: 0, Errors: 0, Skipped: 1, Time elapsed:
>
> 0.085 s - in org.apache.tika.TestXMLEntityExpansion
>
>
>
> [INFO] Running org.apache.tika.mime.MimeTypeTest
>
>
>
> [INFO]

Re: 1.24?

2020-02-05 Thread Oleg Tikhonov
>> Should we wait for the next version of PDFBox?
Maybe it's worth waiting.
>> what would you think of the week of the 23rd/ first week of March?
Sounds good.

BR,
Oleg

On Wed, Feb 5, 2020 at 4:41 PM Tim Allison  wrote:

> All,
>
>   The new version of POI will be out soon.  I have a couple of more things
> I'd like to get in.  What else do we want to do before the next release?
>
>Should we wait for the next version of PDFBox?
>
>   Tentatively, what would you think of the week of the 23rd/ first week of
> March?
>
> Cheers,
>
>   Tim
>


Re: [VOTE] Release Apache Tika 1.23 Candidate #2

2019-12-03 Thread Oleg Tikhonov
[x] +1 Release this package as Apache Tika 1.23

Thanks,
Oleg

On Tue, Dec 3, 2019 at 5:15 AM Tim Allison  wrote:

> A candidate for the Tika 1.23 release is available at:
>   https://dist.apache.org/repos/dist/dev/tika/
>
> The release candidate is a zip archive of the sources in:
>   https://github.com/apache/tika/tree/1.23-rc2/
>
> The SHA-512 checksum of the archive is
>
>
> d6e91f6b29183f836ccb4faabb690c07f4c33408d846f3d93e65b780745ca8c1dd6bb7cea6c265e987a06c318cbea2fcedc4c7ca723c030da46bbcd3423b49cf.
>
> In addition, a staged maven repository is available here:
>
>
> https://repository.apache.org/content/repositories/orgapachetika-1057/org/apache/tika
>
> Please vote on releasing this package as Apache Tika 1.23.
> The vote is open for the next 72 hours and passes if a majority of at
> least three +1 Tika PMC votes are cast.
>
> [ ] +1 Release this package as Apache Tika 1.23
> [ ] -1 Do not release this package because...
>
> Here's my +1.
>
> Cheers,
>
>  Tim
>


Re: [VOTE] Release Apache Tika 1.23 Candidate #1

2019-11-29 Thread Oleg Tikhonov
Hi, here is my +1.
All tests passed on Ubuntu 19.04.
Thanks Tim!

Best Regards,
Oleg

On Thu, Nov 28, 2019, 15:39 Markus Jelsma 
wrote:

> +1!
>
> All tests pass and i can seamlessly update our internal software to 1.23.
>
> Thanks!
>
> -Original message-
> > From:Tim Allison 
> > Sent: Tuesday 26th November 2019 22:34
> > To:  ; u...@tika.apache.org
> > Subject: [VOTE] Release Apache Tika 1.23 Candidate #1
> >
> > All,
> >
> > A candidate for the Tika 1.23 release is available at:
> >   https://dist.apache.org/repos/dist/dev/tika/
> >
> > The release candidate is a zip archive of the sources in:
> >   https://github.com/apache/tika/tree/1.23-rc1/
> >
> > The SHA-512 checksum of the archive is
> >
> b0c277216e05c90f3cc40f591ef5d92707e94b47b54da0503bd54c0a3bdc1df41c63b0f996529206bca87afa28f6b62300113514959ac2470405b764094f9f8b.
> >
> > In addition, a staged maven repository is available here:
> >
> >   https://repository.apache.org/content/repositories/orgapachetika-1056/org/apache/tika
> >
> > Please vote on releasing this package as Apache Tika 1.23.
> > The vote is open for the next 72 hours and passes if a majority of at
> > least three +1 Tika PMC votes are cast.
> >
> > [ ] +1 Release this package as Apache Tika 1.23
> > [ ] -1 Do not release this package because...
> >
> > This is my first time building on Ubuntu...please do look carefully!
> >
> > Here's my +1.
> >
> > Cheers,
> >
> >  Tim
>


Re: [EXTERNAL] Docker image along with 1.23?

2019-11-21 Thread Oleg Tikhonov
My question is more pragmatic.
What do we put inside the Dockerfile, and which base image will it use (say
Ubuntu)?
What will the entrypoint contain? Tika Server? Should we "install"
Tesseract? Anything more?

Thanks,
Oleg

On Thu, Nov 21, 2019 at 4:46 AM Chris Mattmann  wrote:

> Yeah producing the actual image is tricky and my recommendation is for Tika
> to stay out of the business of that. Leave it to LogicalSpark or others to
> do this. It’s tricky with licenses and I doubt ASF will ever develop an
> optimal solution to this due to the nature of its core mission as Nick
> stated.
>
>
>
>
>
>
>
>
>
> From: Eric Pugh 
> Reply-To: "dev@tika.apache.org" 
> Date: Wednesday, November 20, 2019 at 6:02 PM
> To: "dev@tika.apache.org" 
> Cc: "Allison, Timothy B (US 1760-Affiliate)" <
> timothy.b.alli...@jpl.nasa.gov>
> Subject: Re: [EXTERNAL] Docker image along with 1.23?
>
>
>
> I was thinking more of producing the actual image, so that others don’t
> have to go through the pain of compiling an image.   Having the Dockerfile
> made available as well does give a nice recipe for modifying the “official”
> image.   I recently tested Tesseract 3 with the latest Tika, and I did it
> by tweaking the existing Dockerfile that LogicalSpark has published.
>
>
>
> I don’t know how other projects at ASF handle the image publishing.
>
>
>
>
>
>
>
>
>
> On Nov 20, 2019, at 7:02 PM, Chris Mattmann  wrote:
>
> Nick, TBH, I don’t get it. If we ship the “Dockerfile” we are simply
> shipping text file, code. Under a license. If we create a “docker image”
> and then publish it to the ASF hub then I agree with you.
>
> My suggestion and my interpretation of Tim’s is to ship a standard
> “Dockerfile”. Do you agree with this? It should be air covered (as former
> VP, Legal, at least it would have been with me).
>
> Cheers,
>
> Chris
>
> From: Nick Burch 
>
> Reply-To: "dev@tika.apache.org" 
>
> Date: Wednesday, November 20, 2019 at 3:57 PM
>
> To: "Allison, Timothy B (US 1760-Affiliate)" <
> timothy.b.alli...@jpl.nasa.gov>
>
> Cc: "" 
>
> Subject: [EXTERNAL] Re: Docker image along with 1.23?
>
> On Wed, 20 Nov 2019, Tim Allison wrote:
>
> Eric Pugh recently asked on another channel if we had any plans to
> release an official docker image for 1.23.
>
> Depending on what we put in the container, we do need to be a little
> careful. There's "platform dependencies" under non-compatible licenses
> that we can optionally use if people have installed them, which we
> ourselves can't directly ship under ASF rules. (Tesseract is fine as
> that's Apache Licenses, Java itself is trickier, see the Netbeans
> discussions on legal-discuss@ and LEGAL jira)
>
> Shipping an official docker container with the Tika Server on seems to me
> to be a helpful step for users, but we just need to make sure we're
> following ASF policies. (The Apache Software Foundation mission is to
> "provide software for the public good", but source code is the main focus
> for the mission, binaries are trickier!)
>
> Nick
>
>
>
> ___
>
> Eric Pugh | Founder & CEO | OpenSource Connections, LLC | 434.466.1467 |
> http://www.opensourceconnections.com <
> http://www.opensourceconnections.com/> | My Free/Busy <
> http://tinyurl.com/eric-cal>
>
> Co-Author: Apache Solr Enterprise Search Server, 3rd Ed <
> https://www.packtpub.com/big-data-and-business-intelligence/apache-solr-enterprise-search-server-third-edition-raw>
>
>
>
>
>
>
>
>


Re: [ANNOUNCE] Welcome Tilman Hausherr as Tika PMC member and committer

2019-10-04 Thread Oleg Tikhonov
Welcome aboard, Tilman!!!

Best regards,
Oleg


On Fri, Oct 4, 2019 at 5:37 PM Tilman Hausherr 
wrote:

> Am 04.10.2019 um 16:19 schrieb Tim Allison:
> > All,
> >
> > The Tika PMC has elected to add Tilman Hausherr to our ranks.  Tilman,
> > please feel free to introduce yourself, and welcome aboard!
> >
> > Cheers,
> >
> >   Tim
>
> Hello everybody,
>
> Thanks for the honor. A bit about me: I'm from Germany (coincidentally,
> yesterday was our national holiday, 29 years of reunification), I'm 50+
> years old, studied CS at TU Berlin, still living in Berlin, and now
> working at an IT company where my main job is document capture /
> classification / processing for our clients. I have known tika mostly
> from the PDFBox issues. Because I know next to nothing about the tika
> code I'll probably focus on refactoring / build issues (e.g. version
> updates) / documentation.
>
> Best regards
>
> Tilman Hausherr
>
>


Re: [VOTE] Release Apache Tika 1.22 Candidate #4

2019-07-30 Thread Oleg Tikhonov
Hi Tim,
thanks for the release !!!
Here is my +1, tested on Ubuntu 18.04.2 LTS, x86 arch.

Best wishes,
Oleg

On Mon, Jul 29, 2019 at 8:50 PM Tim Allison  wrote:

> A candidate for the Tika 1.22 release is available at:
>
>   https://dist.apache.org/repos/dist/dev/tika/
>
>
> The release candidate is a zip archive of the sources in:
>
>   https://github.com/apache/tika/tree/1.22-rc4/
>
>
> The SHA-512 checksum of the archive is
>
>
> bbdf2683a63a0e5fbe66f10eb88c29cd14128c3dd8c680bf1c86352c8068cd6d61358eb506f728f494c0dcd084af48f4312f832f6467863f58c3b90ab59e9966.
>
>
> In addition, a staged maven repository is available here:
>
>
> https://repository.apache.org/content/repositories/orgapachetika-1055/org/apache/tika
>
>
> Please vote on releasing this package as Apache Tika 1.22.
>
> The vote is open for the next 72 hours and passes if a majority of at
> least three +1 Tika PMC votes are cast.
>
>
> [ ] +1 Release this package as Apache Tika 1.22
>
> [ ] -1 Do not release this package because...
>
> Here's my +1.  I've built this on Windows and Mac w and w/out spaces
> in the path. :P . Thank you for your patience.
>
> Cheers,
>
>Tim
>


Re: 1.22?

2019-07-15 Thread Oleg Tikhonov
+1

On Mon, Jul 15, 2019 at 2:41 PM Tim Allison  wrote:

> Anyone have anything they want to get into 1.22? If not, I’ll kick off the
> regression tests shortly.
>
> Cheers,
>  Tim
>


Re: Tika 1.22?

2019-06-25 Thread Oleg Tikhonov
Would be great!!!
Cheers,
Oleg

On Tue, Jun 25, 2019, 17:45 Tim Allison  wrote:

> All,
>   The vote for the next version of PDFBox is under way.  I think we've
> had a number of useful upgrades since our last release.  Any
> objections to starting the release process for Tika 1.22 a week or so
> after we integrate PDFBox?
>
>  Cheers,
>
>   Tim
>


Re: [jira] [Commented] (TIKA-2878) Update dependencies for 1.21.1 or 1.22

2019-05-20 Thread Oleg Tikhonov
Today I've also used a master branch and got the same result.


On Mon, May 20, 2019 at 8:59 PM Tim Allison (JIRA)  wrote:

>
> [
> https://issues.apache.org/jira/browse/TIKA-2878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16844167#comment-16844167
> ]
>
> Tim Allison commented on TIKA-2878:
> ---
>
> Yay!  Thank you!
>
> > Update dependencies for 1.21.1 or 1.22
> > --
> >
> > Key: TIKA-2878
> > URL: https://issues.apache.org/jira/browse/TIKA-2878
> > Project: Tika
> >  Issue Type: Task
> >Reporter: Tim Allison
> >Priority: Major
> >
> > And in the category of "stuff you can't make up"...while generating the
> javadocs for the 1.21 release:
> > We're now getting this in {{tika-parsers}}:
> > {noformat}
> >   c3p0:c3p0:jar:0.9.1.1:compile;
> https://ossindex.sonatype.org/component/pkg:maven/c3p0/c3p0@0.9.1.1
> > * [CVE-2019-5427]  Resource Management Errors (7.5);
> https://ossindex.sonatype.org/vuln/d25f4c21-9e76-4fc2-9d73-3770aa3aec56
> > {noformat}
> > and in {{tika-server}}:
> > {noformat}
> > * [CVE-2019-10247]  Information Exposure (5.3);
> https://ossindex.sonatype.org/vuln/47ad4d7e-b9c3-414f-9bfa-1dfaa92b0aba
> > * [CVE-2019-10241]  Improper Neutralization of Input During Web Page
> Generation ("Cross-site Scripting") (6.1);
> https://ossindex.sonatype.org/vuln/970aece8-4a1d-4a9e-ab97-0982b13dac4d
> >   org.eclipse.jetty:jetty-server:jar:9.4.14.v20181114:compile;
> https://ossindex.sonatype.org/component/pkg:maven/org.eclipse.jetty/jetty-server@9.4.14.v20181114
> > * [CVE-2019-10247]  Information Exposure (5.3);
> https://ossindex.sonatype.org/vuln/47ad4d7e-b9c3-414f-9bfa-1dfaa92b0aba
> > * [CVE-2019-10241]  Improper Neutralization of Input During Web Page
> Generation ("Cross-site Scripting") (6.1);
> https://ossindex.sonatype.org/vuln/970aece8-4a1d-4a9e-ab97-0982b13dac4d
> > {noformat}
>
>
>
>


Re: [VOTE] Release Apache Tika 1.21 Candidate #2

2019-05-15 Thread Oleg Tikhonov
Here is my +1.
Thanks, Tim!


On Wed, May 15, 2019 at 5:16 AM Tim Allison  wrote:

> A candidate for the Tika 1.21 release is available at:
>
>   https://dist.apache.org/repos/dist/dev/tika/
>
> The release candidate is a zip archive of the sources in:
>   https://github.com/apache/tika/tree/1.21-rc2/
>
> The SHA-512 checksum of the archive is:
>
> 67748553a44b3acb009f0e99ac595c5babfe04d4a75abd2efde614ca26f177c863f7aa598d6911a7b3ca146075c84ecdf0fc3c337d7145d050c889fb4cc4f14f
>
>
> In addition, a staged maven repository is available here:
>
> https://repository.apache.org/content/repositories/orgapachetika-1048/org/apache/tika
>
>
> Please vote on releasing this package as Apache Tika 1.21.
>
> The vote is open for the next 72 hours and passes if a majority of at
> least three +1 Tika PMC votes are cast.
>
> [ ] +1 Release this package as Apache Tika 1.21
> [ ] -1 Do not release this package because...
>
> Here's my +1.
>
> Cheers,
>
>   Tim
>


Re: [VOTE] Release Apache Tika 1.21 Candidate #1

2019-05-14 Thread Oleg Tikhonov
:-)
I'm good with any option. RC1 seems to be good from my point of view.
Cheers,
Oleg

On Tue, May 14, 2019 at 3:56 PM Tim Allison  wrote:

> All,
>   I'm happy to close rc1 and respin an rc2 after Oleg's findings
> (TIKA-2871 and TIKA-2872)...many thanks, Oleg!  I'm also happy to
> proceed with rc1 as is...Let me know your preferences.
>
>   Cheers,
>
>Tim
>
> On Mon, May 13, 2019 at 1:32 PM Tim Allison  wrote:
> >
> > A candidate for the Tika 1.21 release is available at:
> >
> >   https://dist.apache.org/repos/dist/dev/tika/
> >
> > The release candidate is a zip archive of the sources in:
> >   https://github.com/apache/tika/tree/1.21-rc1/
> >
> > The SHA-512 checksum of the archive is:
> >
> 4bc861f3b9ba37df14726d8acf173185a5414b88774c0b00c1f82140e290ebdac1a146952a0dd3755a29e7281cb45f55dceb96c7d7de5aef55fa5923f1164ac2.
> >
> >
> > In addition, a staged maven repository is available here:
> >
> > https://repository.apache.org/content/repositories/orgapachetika-1047/org/apache/tika
> >
> >
> > Please vote on releasing this package as Apache Tika 1.21.
> >
> > The vote is open for the next 72 hours and passes if a majority of at
> > least three +1 Tika PMC votes are cast.
> >
> > [ ] +1 Release this package as Apache Tika 1.21
> > [ ] -1 Do not release this package because...
> >
> > Here's my +1.
> >
> > Cheers,
> >
> >   Tim
>


Re: [VOTE] Release Apache Tika 1.21 Candidate #1

2019-05-14 Thread Oleg Tikhonov
Hi all,

[x] +1 Release this package as Apache Tika 1.21

I've run just the basic stuff: mvn clean install (Ubuntu x86, Java 8).
Seems to be good.

Thanks,
Oleg

On Mon, May 13, 2019 at 8:33 PM Tim Allison  wrote:

> A candidate for the Tika 1.21 release is available at:
>
>   https://dist.apache.org/repos/dist/dev/tika/
>
> The release candidate is a zip archive of the sources in:
>   https://github.com/apache/tika/tree/1.21-rc1/
>
> The SHA-512 checksum of the archive is:
>
> 4bc861f3b9ba37df14726d8acf173185a5414b88774c0b00c1f82140e290ebdac1a146952a0dd3755a29e7281cb45f55dceb96c7d7de5aef55fa5923f1164ac2.
>
>
> In addition, a staged maven repository is available here:
>
> https://repository.apache.org/content/repositories/orgapachetika-1047/org/apache/tika
>
>
> Please vote on releasing this package as Apache Tika 1.21.
>
> The vote is open for the next 72 hours and passes if a majority of at
> least three +1 Tika PMC votes are cast.
>
> [ ] +1 Release this package as Apache Tika 1.21
> [ ] -1 Do not release this package because...
>
> Here's my +1.
>
> Cheers,
>
>   Tim
>


[jira] [Commented] (TIKA-2872) tika-dl - add slf4j-log4j12 dependency to pom.xml

2019-05-14 Thread Oleg Tikhonov (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16839311#comment-16839311
 ] 

Oleg Tikhonov commented on TIKA-2872:
-

Possible fix attached.

> tika-dl - add slf4j-log4j12 dependency to pom.xml
> -
>
> Key: TIKA-2872
> URL: https://issues.apache.org/jira/browse/TIKA-2872
> Project: Tika
>  Issue Type: Bug
>    Reporter: Oleg Tikhonov
>Priority: Trivial
> Attachments: tika-dl-pom.xml.patch
>
>
> On start, the unit test throws an exception: the log4j jar is missing from the classpath.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (TIKA-2872) tika-dl - add slf4j-log4j12 dependency to pom.xml

2019-05-14 Thread Oleg Tikhonov (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-2872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Oleg Tikhonov updated TIKA-2872:

Attachment: tika-dl-pom.xml.patch

> tika-dl - add slf4j-log4j12 dependency to pom.xml
> -
>
> Key: TIKA-2872
> URL: https://issues.apache.org/jira/browse/TIKA-2872
> Project: Tika
>  Issue Type: Bug
>    Reporter: Oleg Tikhonov
>Priority: Trivial
> Attachments: tika-dl-pom.xml.patch
>
>
> On start, the unit test throws an exception: the log4j jar is missing from the classpath.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (TIKA-2872) tika-dl - add slf4j-log4j12 dependency to pom.xml

2019-05-14 Thread Oleg Tikhonov (JIRA)
Oleg Tikhonov created TIKA-2872:
---

 Summary: tika-dl - add slf4j-log4j12 dependency to pom.xml
 Key: TIKA-2872
 URL: https://issues.apache.org/jira/browse/TIKA-2872
 Project: Tika
  Issue Type: Bug
Reporter: Oleg Tikhonov


On start, the unit test throws an exception: the log4j jar is missing from the classpath.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (TIKA-2871) TestChmExtraction - testMultiThreaded throws exception 1.21-rc1

2019-05-14 Thread Oleg Tikhonov (JIRA)
Oleg Tikhonov created TIKA-2871:
---

 Summary: TestChmExtraction - testMultiThreaded throws exception 1.21-rc1
 Key: TIKA-2871
 URL: https://issues.apache.org/jira/browse/TIKA-2871
 Project: Tika
  Issue Type: Bug
Reporter: Oleg Tikhonov


During mvn clean install I saw an exception:



[INFO] Running org.apache.tika.parser.chm.TestChmExtraction
org.apache.tika.exception.TikaException: can't copy beyond array length
    at org.apache.tika.parser.chm.core.ChmCommons.copyOfRange(ChmCommons.java:347)
    at org.apache.tika.parser.chm.accessor.ChmDirectoryListingSet.enumerateChmDirectoryListingList(ChmDirectoryListingSet.java:147)
    at org.apache.tika.parser.chm.accessor.ChmDirectoryListingSet.<init>(ChmDirectoryListingSet.java:65)
    at org.apache.tika.parser.chm.core.ChmExtractor.<init>(ChmExtractor.java:182)
    at org.apache.tika.parser.chm.ChmParser.parse(ChmParser.java:63)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
    at org.apache.tika.parser.RecursiveParserWrapper.parse(RecursiveParserWrapper.java:224)
    at org.apache.tika.MultiThreadedTikaTest.getRecursiveMetadata(MultiThreadedTikaTest.java:291)
    at org.apache.tika.MultiThreadedTikaTest.getBaseline(MultiThreadedTikaTest.java:271)
    at org.apache.tika.MultiThreadedTikaTest.testAll(MultiThreadedTikaTest.java:182)
    at org.apache.tika.MultiThreadedTikaTest.testEach(MultiThreadedTikaTest.java:163)
    at org.apache.tika.MultiThreadedTikaTest.testMultiThreaded(MultiThreadedTikaTest.java:77)
    at org.apache.tika.parser.chm.TestChmExtraction.testMultiThreaded(TestChmExtraction.java:225)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
    at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
    at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
    at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
    at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:305)
    at org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100)
    at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:365)
    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103)
    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63)
    at org.junit.runners.ParentRunner$4.run(ParentRunner.java:330)
    at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:78)
    at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:328)
    at org.junit.runners.ParentRunner.access$100(ParentRunner.java:65)
    at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:292)
    at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:305)
    at org.junit.runners.ParentRunner.run(ParentRunner.java:412)
    at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
    at org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
    at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
    at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
    at org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
    at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
    at org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126)
    at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418)
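
The message "can't copy beyond array length" reads like a range guard in front of an array copy. As an illustration only, a guard of that kind might look like the sketch below. RangeCopy and RangeException are hypothetical names, and this is not the actual ChmCommons code; it says nothing about why the range was out of bounds in the multi-threaded test.

```java
import java.util.Arrays;

// Hypothetical illustration of a bounds-checked range copy; not Tika's ChmCommons.
class RangeCopy {

    // Checked exception standing in for a parser-level exception type.
    static class RangeException extends Exception {
        RangeException(String message) {
            super(message);
        }
    }

    // Copies src[from, to) after validating the range against the array length.
    static byte[] copyOfRange(byte[] src, int from, int to) throws RangeException {
        if (src == null) {
            throw new RangeException("source array is null");
        }
        if (from < 0 || to < from) {
            throw new RangeException("invalid range: " + from + ".." + to);
        }
        if (to > src.length) {
            throw new RangeException("can't copy beyond array length");
        }
        return Arrays.copyOfRange(src, from, to);
    }

    public static void main(String[] args) throws RangeException {
        byte[] data = {1, 2, 3, 4};
        System.out.println(Arrays.toString(copyOfRange(data, 1, 3)));
        try {
            copyOfRange(data, 2, 5);
        } catch (RangeException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```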





--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: Tika 1.21?

2019-04-22 Thread Oleg Tikhonov
+1 to wait if needed.

On Mon, Apr 22, 2019, 23:23 Tim Allison  wrote:

> All,
>   I just made a bunch of upgrades to our dependencies.  I still want
> to take a first pass at TIKA-2749...maybe by the end of this week with
> release process kicking off the following week?  I could start the
> regression tests now (well, tomorrowish), though, unless anyone has
> anything they want to get in...I'm happy to wait, though, till next
> week to start the regression tests.
>  WDYT?
>
>Cheers,
>
>    Tim
>
> On Mon, Apr 8, 2019 at 2:25 PM Oleg Tikhonov 
> wrote:
> >
> > Great!
> > +1.
> > Thanks,
> > Oleg
> >
> > On Mon, Apr 8, 2019, 21:11 Tim Allison  wrote:
> >
> > > All,
> > >   PDFBox will be out in a few days, and POI should be out soon as
> > > well.  I _think_ I'd like to get in a first draft of "auto" mode for
> > > OCR'ing PDFs (TIKA-2749), but other than that, I'd be willing to run a
> > > release of 1.21 in the next few weeks.
> > >   WDYT?
> > >
> > > Best,
> > >
> > >Tim
> > >
>


Re: Tika 1.21?

2019-04-08 Thread Oleg Tikhonov
Great!
+1.
Thanks,
Oleg

On Mon, Apr 8, 2019, 21:11 Tim Allison  wrote:

> All,
>   PDFBox will be out in a few days, and POI should be out soon as
> well.  I _think_ I'd like to get in a first draft of "auto" mode for
> OCR'ing PDFs (TIKA-2749), but other than that, I'd be willing to run a
> release of 1.21 in the next few weeks.
>   WDYT?
>
> Best,
>
>Tim
>


[jira] [Commented] (TIKA-2650) Soft-hyphen is not extracted properly

2019-03-31 Thread Oleg Tikhonov (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16806141#comment-16806141
 ] 

Oleg Tikhonov commented on TIKA-2650:
-

There is no simple solution. Here is some research related to
[link automatic text correction|https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=11&ved=2ahUKEwiW5P23r6zhAhU-TxUIHXawD2wQFjAKegQIAxAC&url=https%3A%2F%2Flinguistics.washington.edu%2Ffile%2F532%2Fdownload%3Ftoken%3DhlHhM4Qw&usg=AOvVaw09nb2qj9vESK5LHV-LORcn]

> Soft-hyphen is not extracted properly
> -
>
> Key: TIKA-2650
> URL: https://issues.apache.org/jira/browse/TIKA-2650
> Project: Tika
>  Issue Type: Bug
>  Components: app
>Affects Versions: 1.18
>Reporter: Saurabh Patil
>Priority: Blocker
> Attachments: Peter Rabbit.pdf, output.txt
>
>
> We are trying to extract text from a PDF. If the PDF has a long word at the
> end of a line, the word is broken with a soft hyphen and the rest of the word
> goes to the next line. But when extracting this text, Tika automatically
> replaces the hyphen with a space.
>
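
To make the easy part of the problem concrete, here is a minimal sketch of post-processing that rejoins words when the extracted text still contains the soft hyphen character (U+00AD). SoftHyphens and rejoin are hypothetical names, not Tika API, and the sketch assumes the soft hyphen survives extraction. It does nothing for the harder case discussed above, where the hyphen has already been replaced by a space or by an ordinary hyphen.

```java
// Hypothetical post-processing helper; not part of Tika.
class SoftHyphens {

    // Removes each soft hyphen (U+00AD) together with any whitespace,
    // including a line break, that directly follows it.
    static String rejoin(String text) {
        return text.replaceAll("\u00AD\\s*", "");
    }

    public static void main(String[] args) {
        String extracted = "The rabbit ran into the gar\u00AD\nden and hid.";
        System.out.println(rejoin(extracted));
    }
}
```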



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: [VOTE] Release Apache Tika 1.20 Candidate #1

2018-12-22 Thread Oleg Tikhonov
*stuff

On Sat, Dec 22, 2018, 11:01 Oleg Tikhonov wrote:

> All basic staff passed.
> +1.
> Oleg
>
> On Fri, Dec 21, 2018, 22:02 Ken Krugler  wrote:
>
>> Hi Tim,
>>
>> Thanks for rolling the release.
>>
>> Built & validated on Mac OS X 10.12
>>
>> Updated flink-crawler, all tests pass.
>>
>> So here’s my +1
>>
>> — Ken
>>
>>
>> > On Dec 17, 2018, at 6:14 PM, Tim Allison  wrote:
>> >
>> > A candidate for the Tika 1.20 release is available at:
>> >
>> >  https://dist.apache.org/repos/dist/dev/tika/
>> >
>> > The release candidate is a zip archive of the sources in:
>> >  https://github.com/apache/tika/tree/1.20-rc1/
>> >
>> > The SHA-512 checksum of the archive is
>> >
>> add29bebe0486f01bd57fe5bec5405df1af9f319a87c74295aea2e628c5ebc0d49c09570511033e30ac44c4a6e09ec81cbba9f485b3c21a35b01014a74f1852e.
>> >
>> > In addition, a staged maven repository is available here:
>> >
>> https://repository.apache.org/content/repositories/orgapachetika-1046/org/apache/tika
>> >
>> >
>> > Please vote on releasing this package as Apache Tika 1.20.
>> >
>> > The vote is open for the next 72 hours and passes if a majority of at
>> > least three +1 Tika PMC votes are cast.
>> >
>> > [ ] +1 Release this package as Apache Tika 1.20
>> > [ ] -1 Do not release this package because...
>> >
>> > Here's my +1.
>> >
>> > Cheers,
>> >
>> >  Tim
>>
>> --
>> Ken Krugler
>> +1 530-210-6378
>> http://www.scaleunlimited.com
>> Custom big data solutions & training
>> Flink, Solr, Hadoop, Cascading & Cassandra
>>
>>


Re: [VOTE] Release Apache Tika 1.20 Candidate #1

2018-12-22 Thread Oleg Tikhonov
All basic staff passed.
+1.
Oleg

On Fri, Dec 21, 2018, 22:02 Ken Krugler wrote:

> Hi Tim,
>
> Thanks for rolling the release.
>
> Built & validated on Mac OS X 10.12
>
> Updated flink-crawler, all tests pass.
>
> So here’s my +1
>
> — Ken
>
>
> > On Dec 17, 2018, at 6:14 PM, Tim Allison  wrote:
> >
> > A candidate for the Tika 1.20 release is available at:
> >
> >  https://dist.apache.org/repos/dist/dev/tika/
> >
> > The release candidate is a zip archive of the sources in:
> >  https://github.com/apache/tika/tree/1.20-rc1/
> >
> > The SHA-512 checksum of the archive is
> >
> add29bebe0486f01bd57fe5bec5405df1af9f319a87c74295aea2e628c5ebc0d49c09570511033e30ac44c4a6e09ec81cbba9f485b3c21a35b01014a74f1852e.
> >
> > In addition, a staged maven repository is available here:
> >
> https://repository.apache.org/content/repositories/orgapachetika-1046/org/apache/tika
> >
> >
> > Please vote on releasing this package as Apache Tika 1.20.
> >
> > The vote is open for the next 72 hours and passes if a majority of at
> > least three +1 Tika PMC votes are cast.
> >
> > [ ] +1 Release this package as Apache Tika 1.20
> > [ ] -1 Do not release this package because...
> >
> > Here's my +1.
> >
> > Cheers,
> >
> >  Tim
>
> --
> Ken Krugler
> +1 530-210-6378
> http://www.scaleunlimited.com
> Custom big data solutions & training
> Flink, Solr, Hadoop, Cascading & Cassandra
>
>


[jira] [Comment Edited] (TIKA-2368) Clean up SentimentParser dependencies

2018-10-14 Thread Oleg Tikhonov (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16649290#comment-16649290
 ] 

Oleg Tikhonov edited comment on TIKA-2368 at 10/14/18 8:14 AM:
---

{code:java}
[INFO] Apache Tika parent ................................ SUCCESS [ 21.607 s]
[INFO] Apache Tika core .................................. SUCCESS [  1.692 s]
[INFO] Apache Tika parsers ............................... SUCCESS [  7.597 s]
[INFO] Apache Tika XMP ................................... SUCCESS [  3.818 s]
[INFO] Apache Tika serialization ......................... SUCCESS [  1.064 s]
[INFO] Apache Tika batch ................................. SUCCESS [  2.633 s]
[INFO] Apache Tika language detection .................... SUCCESS [  1.204 s]
[INFO] Apache Tika application ........................... SUCCESS [  3.097 s]
[INFO] Apache Tika OSGi bundle ........................... SUCCESS [  3.941 s]
[INFO] Apache Tika translate ............................. SUCCESS [  1.443 s]
[INFO] Apache Tika server ................................ SUCCESS [  5.369 s]
[INFO] Apache Tika examples .............................. FAILURE [  4.380 s]
[INFO] Apache Tika Java-7 Components ..................... SKIPPED
[INFO] Apache Tika eval .................................. SKIPPED
[INFO] Apache Tika Deep Learning (powered by DL4J) ....... SKIPPED
[INFO] Apache Tika Natural Language Processing ........... SKIPPED
[INFO] Apache Tika ....................................... SKIPPED
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 01:04 min
[INFO] Finished at: 2018-10-14T11:07:24+03:00
[INFO] Final Memory: 33M/275M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.sonatype.ossindex.maven:ossindex-maven-plugin:3.0.1:audit (default-cli) on project tika-example: Detected 1 vulnerable components:
[ERROR]   org.springframework:spring-core:jar:3.2.16.RELEASE:compile; https://ossindex.sonatype.org/component/pkg:maven/org.springframework/spring-core@3.2.16.RELEASE
[ERROR]     * [CVE-2018-1270]  Improperly Implemented Security Check for Standard (9.8); https://ossindex.sonatype.org/vuln/9a3de118-b038-49ed-9af7-533210c9d85f
[ERROR]     * [CVE-2018-1271]  Improper Limitation of a Pathname to a Restricted Directory ("Path Traversal") (5.9); https://ossindex.sonatype.org/vuln/580d61c3-20df-4bb8-99c3-36c89e0d7550
[ERROR]     * [CVE-2018-1272]  Permissions, Privileges, and Access Controls (7.5); https://ossindex.sonatype.org/vuln/aa7190e3-4c47-42d6-82f6-afaf1da5762e
[ERROR]
[ERROR] ->
{code}
I've upgraded spring-context to 4.3.19.RELEASE and it passed the scanning.

 

 

 



[jira] [Commented] (TIKA-2368) Clean up SentimentParser dependencies

2018-10-14 Thread Oleg Tikhonov (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16649290#comment-16649290
 ] 

Oleg Tikhonov commented on TIKA-2368:
-

{code:java}
[INFO] Apache Tika parent ................................ SUCCESS [ 21.607 s]
[INFO] Apache Tika core .................................. SUCCESS [  1.692 s]
[INFO] Apache Tika parsers ............................... SUCCESS [  7.597 s]
[INFO] Apache Tika XMP ................................... SUCCESS [  3.818 s]
[INFO] Apache Tika serialization ......................... SUCCESS [  1.064 s]
[INFO] Apache Tika batch ................................. SUCCESS [  2.633 s]
[INFO] Apache Tika language detection .................... SUCCESS [  1.204 s]
[INFO] Apache Tika application ........................... SUCCESS [  3.097 s]
[INFO] Apache Tika OSGi bundle ........................... SUCCESS [  3.941 s]
[INFO] Apache Tika translate ............................. SUCCESS [  1.443 s]
[INFO] Apache Tika server ................................ SUCCESS [  5.369 s]
[INFO] Apache Tika examples .............................. FAILURE [  4.380 s]
[INFO] Apache Tika Java-7 Components ..................... SKIPPED
[INFO] Apache Tika eval .................................. SKIPPED
[INFO] Apache Tika Deep Learning (powered by DL4J) ....... SKIPPED
[INFO] Apache Tika Natural Language Processing ........... SKIPPED
[INFO] Apache Tika ....................................... SKIPPED
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 01:04 min
[INFO] Finished at: 2018-10-14T11:07:24+03:00
[INFO] Final Memory: 33M/275M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.sonatype.ossindex.maven:ossindex-maven-plugin:3.0.1:audit (default-cli) on project tika-example: Detected 1 vulnerable components:
[ERROR]   org.springframework:spring-core:jar:3.2.16.RELEASE:compile; https://ossindex.sonatype.org/component/pkg:maven/org.springframework/spring-core@3.2.16.RELEASE
[ERROR]     * [CVE-2018-1270]  Improperly Implemented Security Check for Standard (9.8); https://ossindex.sonatype.org/vuln/9a3de118-b038-49ed-9af7-533210c9d85f
[ERROR]     * [CVE-2018-1271]  Improper Limitation of a Pathname to a Restricted Directory ("Path Traversal") (5.9); https://ossindex.sonatype.org/vuln/580d61c3-20df-4bb8-99c3-36c89e0d7550
[ERROR]     * [CVE-2018-1272]  Permissions, Privileges, and Access Controls (7.5); https://ossindex.sonatype.org/vuln/aa7190e3-4c47-42d6-82f6-afaf1da5762e
[ERROR]
[ERROR] ->
{code}
I've upgraded spring-context to 4.3.19.RELEASE and it passed the scanning.

 

 

 

> Clean up SentimentParser dependencies
> -
>
> Key: TIKA-2368
> URL: https://issues.apache.org/jira/browse/TIKA-2368
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Priority: Blocker
>
> Is there any way to avoid reliance on edu.usc.ir's sentiment-analysis-parser? 
>  I ask because:
> {noformat}
> [WARNING] sentiment-analysis-parser-0.1.jar, tika-parsers-1.15-SNAPSHOT.jar 
> define 1 overlapping classes: 
> [WARNING]   - org.apache.tika.parser.sentiment.analysis.SentimentParser
> [WARNING] tika-core-1.15-SNAPSHOT.jar, tika-translate-1.15-SNAPSHOT.jar 
> define 4 overlapping classes: 
> [WARNING]   - org.apache.tika.language.translate.DefaultTranslator$1
> [WARNING]   - org.apache.tika.language.translate.EmptyTranslator
> [WARNING]   - org.apache.tika.language.translate.DefaultTranslator
> [WARNING]   - org.apache.tika.language.translate.Translator
> {noformat}
> We should be ok keeping things as they are and excluding SentimentParser and 
> tika-translate, but can we easily move the code that's still in edu.usc.ir's 
> package into Tika?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Fwd: DIH for TikaEntityProcessor

2018-10-12 Thread Oleg Tikhonov
-- Forwarded message -
From: Martin Frank Hansen (MHQ) 
Date: Wed, Oct 10, 2018, 11:15
Subject: DIH for TikaEntityProcessor
To: solr-u...@lucene.apache.org 


Hi,



I am trying to read documents from a file system into Solr, using
dataimporthandler but keep getting the following errors:



Exception while processing: files document : null:org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.ClassCastException: java.io.InputStreamReader cannot be cast to java.io.InputStream
    at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:61)
    at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:270)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:476)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:517)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:415)
    at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:330)
    at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:233)
    at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:424)
    at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:483)
    at org.apache.solr.handler.dataimport.DataImporter.lambda$runAsync$0(DataImporter.java:466)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassCastException: java.io.InputStreamReader cannot be cast to java.io.InputStream
    at org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:132)
    at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:267)
    ... 9 more









Full Import failed:java.lang.RuntimeException: java.lang.RuntimeException: org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.ClassCastException: java.io.InputStreamReader cannot be cast to java.io.InputStream
    at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:271)
    at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:424)
    at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:483)
    at org.apache.solr.handler.dataimport.DataImporter.lambda$runAsync$0(DataImporter.java:466)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.RuntimeException: org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.ClassCastException: java.io.InputStreamReader cannot be cast to java.io.InputStream
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:417)
    at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:330)
    at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:233)
    ... 4 more
Caused by: org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.ClassCastException: java.io.InputStreamReader cannot be cast to java.io.InputStream
    at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:61)
    at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:270)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:476)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:517)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:415)
    ... 6 more
Caused by: java.lang.ClassCastException: java.io.InputStreamReader cannot be cast to java.io.InputStream
    at org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:132)
    at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:267)
    ... 9 more





My data-config file looks as follows:





  

  

  







  



  

  





And in the Schema I basically have two fields:









Any help is appreciated.
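
The root cause in both traces is a cast from java.io.InputStreamReader to java.io.InputStream. A Reader wraps a byte stream but is not itself one, so the cast always fails at runtime. The sketch below reproduces that outside Solr; ReaderIsNotAStream and its methods are hypothetical names used only for illustration. Because the data-config did not survive in this archive, it is only an assumption that the entity was fed by a character-oriented data source where a binary one (for example Solr's BinFileDataSource) was expected.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-alone reproduction; not Solr or Tika code.
class ReaderIsNotAStream {

    // Mimics a consumer that blindly expects binary data.
    static InputStream asStream(Object data) {
        return (InputStream) data; // ClassCastException if data is a Reader
    }

    // A defensive check a consumer could make before casting.
    static boolean isBinary(Object data) {
        return data instanceof InputStream;
    }

    public static void main(String[] args) {
        InputStream bytes = new ByteArrayInputStream("demo".getBytes(StandardCharsets.UTF_8));
        Object chars = new InputStreamReader(bytes, StandardCharsets.UTF_8);

        System.out.println("stream is binary: " + isBinary(bytes));
        System.out.println("reader is binary: " + isBinary(chars));
        try {
            asStream(chars);
        } catch (ClassCastException e) {
            System.out.println("cast failed: " + e.getMessage());
        }
    }
}
```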





*Martin Frank Hansen*




Re: [VOTE] Release Apache Tika 1.19.1 Candidate #2

2018-10-09 Thread Oleg Tikhonov
sorry.
+1

On Tue, Oct 9, 2018 at 7:26 PM Tim Allison  wrote:

> Thank you, Dave!
>
> Fellow devs, would anyone else have a chance to vote?  We need a third
> for the release.  Thank you!
> On Mon, Oct 8, 2018 at 4:36 AM  wrote:
> >
> > Hello,
> >
> > On Thu, 4 Oct 2018 at 23:03, Tim Allison  wrote:
> >>
> >> A candidate for the Tika 1.19.1 release is available at:
> >>   https://dist.apache.org/repos/dist/dev/tika/
> >>
> >> The release candidate is a zip archive of the sources in:
> >>   https://github.com/apache/tika/tree/1.19.1-rc2/
> >>
> >> The SHA-512 checksum of the archive is
> >>
>  
> 4f89216eb3332288c4839139e4af78395fefb3c03be4a6d41a8c9ffadebf69e1732afced25e7fe3c563fb6ce95726a89bd9924c69ddab8e6875a45eec1564fcb
> >>
> >> In addition, a staged maven repository is available here:
> >>
> https://repository.apache.org/content/repositories/orgapachetika-1045/org/apache/tika
> >>
> >> Please vote on releasing this package as Apache Tika 1.19.1.
> >>
> >> The vote is open for the next 72 hours and passes if a majority of at
> >> least three +1 Tika PMC votes are cast.
> >>
> >> [ ] +1 Release this package as Apache Tika 1.19.1
> >> [ ] -1 Do not release this package because...
> >
> >
> > +1 from me.
> >
> > Thanks for rolling the release Tim!
> >
> > Cheers,
> > Dave
>


Re: Release Announcement: General Availability of JDK 11

2018-09-26 Thread Oleg Tikhonov
Good news!!!

On Thu, Sep 27, 2018, 00:06 Tim Allison  wrote:

> +1 successful build
> On Wed, Sep 26, 2018 at 5:20 AM Rory O'Donnell 
> wrote:
> >
> > Hi Tim,
> >
> > *1) Release Announcement: General Availability of JDK 11 *
> >
> >   * JDK 11, the reference implementation of Java 11 and the first
> > long-term support release produced under the six-month rapid-cadence
> > release model [1][2], is now Generally Available.
> >   * GPL-licensed OpenJDK builds from Oracle are available here:
> > https://jdk.java.net/11
> >
> > This release includes seventeen features:
> >
> >   * 181: Nest-Based Access Control
> >   * 309: Dynamic Class-File Constants
> >   * 315: Improve Aarch64 Intrinsics
> >   * 318: Epsilon: A No-Op Garbage Collector
> >   * 320: Remove the Java EE and CORBA Modules
> >   * 321: HTTP Client (Standard)
> >   * 323: Local-Variable Syntax for Lambda Parameters
> >   * 324: Key Agreement with Curve25519 and Curve448
> >   * 327: Unicode 10
> >   * 328: Flight Recorder
> >   * 329: ChaCha20 and Poly1305 Cryptographic Algorithms
> >   * 330: Launch Single-File Source-Code Programs
> >   * 331: Low-Overhead Heap Profiling
> >   * 332: Transport Layer Security (TLS) 1.3
> >   * 333: ZGC: A Scalable Low-Latency Garbage Collector (Experimental)
> >   * 335: Deprecate the Nashorn JavaScript Engine
> >   * 336: Deprecate the Pack200 Tools and API
> >
> >
> > *2) Quality Outreach Report for September 2018 is available*
> >
> >   * Quality Outreach report September 2018
> >
> > Thanks to everyone who contributed to JDK 11 by downloading and testing
> > the early-access builds, in particular the following developers, who
> > logged 18 issues in the JDK Bug System:
> >
> >   * Netty
> >   * Eclipse Jetty
> >   * Apache Lucene
> >   * JUnit5
> >   * Apache Tomcat
> >   * Apache Ant
> >   * Apache POI
> >   * AssertJ
> >   * Eclipse Collections
> >   * Byte Buddy
> >   * RxJava
> >
> > 3) JDK 12 EA build 12, under both the GPL and Oracle EA licenses, are
> > now available at http://jdk.java.net/11 .
> >
> >   * Schedule , Status & Features
> >   o http://openjdk.java.net/projects/jdk/12/
> >   * Release Notes:
> >   o http://jdk.java.net/12/release-notes
> >
> >
> > Rgds,Rory
> >
> > --
> > Rgds,Rory O'Donnell
> > Quality Engineering Manager
> > Oracle EMEA, Dublin,Ireland
> >
>


Re: [jira] [Created] (TIKA-2730) parseToString fails for a simple mp3

2018-09-19 Thread Oleg Tikhonov
Hi,
It would be great if you could attach such a file. Or does it fail on any file?


On Wed, Sep 19, 2018, 13:13 Boris Petrov (JIRA)  wrote:

> Boris Petrov created TIKA-2730:
> --
>
>  Summary: parseToString fails for a simple mp3
>  Key: TIKA-2730
>  URL: https://issues.apache.org/jira/browse/TIKA-2730
>  Project: Tika
>   Issue Type: Bug
> Affects Versions: 1.19
> Reporter: Boris Petrov
>  Attachments: demo.mp3
>
> This is a regression from 1.18. I've attached the mp3 that fails. The
> exception I get is:
> {noformat}
> org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.mp3.Mp3Parser@cefe6c6
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:286)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
>     at org.apache.tika.Tika.parseToString(Tika.java:527)
>     at com.company.TextExtractor.getText(TextExtractor.java:39)
>
> Caused by:
> java.io.EOFException: EOF: tried to skip 361 but could only skip 247
>     at org.apache.tika.parser.mp3.MpegStream.skipFrame(MpegStream.java:166)
>     at org.apache.tika.parser.mp3.Mp3Parser.getAllTagHandlers(Mp3Parser.java:204)
>     at org.apache.tika.parser.mp3.Mp3Parser.parse(Mp3Parser.java:71)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>     ... 5 more
> {noformat}
>
>
>
> --
> This message was sent by Atlassian JIRA
> (v7.6.3#76005)
>
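
For context on the EOFException in the quoted report: InputStream.skip may skip fewer bytes than requested, so code that must skip an exact count usually loops and reports a short stream explicitly. The sketch below shows that pattern in general terms. SkipFully is a hypothetical name and this is not MpegStream's implementation; the message format merely mirrors the one in the trace.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical illustration of an exact-count skip; not Tika's MpegStream.
class SkipFully {

    // Skips exactly n bytes or throws EOFException if the stream ends first.
    static void skipFully(InputStream in, long n) throws IOException {
        long skipped = 0;
        while (skipped < n) {
            long step = in.skip(n - skipped);
            if (step <= 0) {
                // skip() may return 0 without being at EOF, so probe with a read.
                if (in.read() == -1) {
                    throw new EOFException(
                            "EOF: tried to skip " + n + " but could only skip " + skipped);
                }
                step = 1; // the probe consumed one byte
            }
            skipped += step;
        }
    }

    public static void main(String[] args) throws IOException {
        try {
            skipFully(new ByteArrayInputStream(new byte[247]), 361);
        } catch (EOFException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Whether a truncated final frame should be treated as an error or silently tolerated is a separate design decision for the parser.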


Re: [VOTE] Release Apache Tika 1.19 Candidate #1

2018-09-17 Thread Oleg Tikhonov
Hi Tim,
Thanks!

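[INFO] Apache Tika parent ................................ SUCCESS [  5.138 s]
[INFO] Apache Tika core .................................. SUCCESS [ 58.722 s]
[INFO] Apache Tika parsers ............................... SUCCESS [04:20 min]
[INFO] Apache Tika XMP ................................... SUCCESS [ 10.705 s]
[INFO] Apache Tika serialization ......................... SUCCESS [  6.820 s]
[INFO] Apache Tika batch ................................. SUCCESS [02:32 min]
[INFO] Apache Tika language detection .................... SUCCESS [  5.612 s]
[INFO] Apache Tika application ........................... SUCCESS [01:27 min]
[INFO] Apache Tika OSGi bundle ........................... SUCCESS [ 47.224 s]
[INFO] Apache Tika translate ............................. SUCCESS [  5.712 s]
[INFO] Apache Tika server ................................ SUCCESS [01:23 min]
[INFO] Apache Tika examples .............................. SUCCESS [ 24.945 s]
[INFO] Apache Tika Java-7 Components ..................... SUCCESS [  6.356 s]
[INFO] Apache Tika eval .................................. SUCCESS [ 51.488 s]
[INFO] Apache Tika Deep Learning (powered by DL4J) ....... SUCCESS [05:41 min]
[INFO] Apache Tika Natural Language Processing ........... SUCCESS [ 56.145 s]
[INFO] Apache Tika ....................................... SUCCESS [  0.088 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 20:09 min
[INFO] Finished at: 2018-09-17T17:47:18+03:00
[INFO] Final Memory: 187M/1674M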
+1 To release.

Did only basic stuff, on CentOS 7.4.

Oleg

On Sat, Sep 15, 2018 at 2:42 PM Tim Allison  wrote:

> A candidate for the Tika 1.19 release is available at:
>   https://dist.apache.org/repos/dist/dev/tika/
>
> The release candidate is a zip archive of the sources in:
>   https://github.com/apache/tika/tree/1.19-rc1/
>
> The SHA-512 checksum of the archive is
>
> b0ec5f1746ceb002e3f33d2a55680952dad63ec9421f5245d28e33398d077547b88a6f521a4b76563f38bf887aa33b8a07de318c5c546039623be3ae65d34eec.
>
> In addition, a staged maven repository is available here:
>
> https://repository.apache.org/content/repositories/orgapachetika-1036/org/apache/tika
>
> Please vote on releasing this package as Apache Tika 1.19.
> The vote is open for the next 72 hours and passes if a majority of at
> least three +1 Tika PMC votes are cast.
>
> [ ] +1 Release this package as Apache Tika 1.19
> [ ] -1 Do not release this package because...
>
> Here's my +1.
>
> Cheers,
>
>   Tim
>
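
The vote mail above publishes a SHA-512 checksum for the source archive. A minimal sketch of how a voter might check a downloaded archive against it, using only the JDK: the class name Sha512Check and its methods are hypothetical helpers invented for this illustration and are not part of the Tika release tooling.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Locale;

/** Hypothetical helper for illustration; not part of the Tika release tooling. */
final class Sha512Check {

    /** Streams the file through SHA-512 and returns the lowercase hex digest. */
    static String sha512Hex(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance("SHA-512");
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                digest.update(buffer, 0, read);
            }
        }
        StringBuilder hex = new StringBuilder(128);
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    /**
     * Compares the file digest with a published value. Whitespace, letter case
     * and the trailing period used in the vote mails are ignored.
     */
    static boolean matches(Path file, String published)
            throws IOException, NoSuchAlgorithmException {
        String expected = published.replaceAll("\\s+", "")
                .replaceAll("\\.$", "")
                .toLowerCase(Locale.ROOT);
        return sha512Hex(file).equals(expected);
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.out.println("usage: Sha512Check <archive> <published-checksum>");
            return;
        }
        boolean ok = matches(Paths.get(args[0]), args[1]);
        System.out.println(ok ? "checksum OK" : "checksum MISMATCH");
    }
}
```

Stripping whitespace before comparing also tolerates checksums that a mail client has wrapped across several lines.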


Re: [jira] [Commented] (TIKA-2725) Make tika-server robust against ooms/infinite loops/memory leaks

2018-09-07 Thread Oleg Tikhonov
Yep, that seems to be the best match ... non-blocking execution.


On Thu, Sep 6, 2018, 23:47 Tim Allison (JIRA)  wrote:

>
> [
> https://issues.apache.org/jira/browse/TIKA-2725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16606373#comment-16606373
> ]
>
> Tim Allison commented on TIKA-2725:
> ---
>
> {quote}
> Ideally, tika-server is dockerized and runs on swarm as a service. In
> addition, it has a healthcheck mechanism, say something like an HTTP GET
> request that expects return code 200. Docker will run this healthcheck
> periodically and, if it fails, will restart tika-server.
> However, we are far from that. Two ways to go, from my point of view: 1. your
> second option, or 2. an OS daemon which checks tika-server availability or
> something like that. We can use cron on Linux to run our "healthcheck" and, if
> it detects an anomaly, restart the server. We can probably find such a
> mechanism for Windows as well.
> {quote}
>
> CommonsExec?
>
> > Make tika-server robust against ooms/infinite loops/memory leaks
> > 
> >
> > Key: TIKA-2725
> > URL: https://issues.apache.org/jira/browse/TIKA-2725
> > Project: Tika
> >  Issue Type: Task
> >Reporter: Tim Allison
> >Assignee: Tim Allison
> >Priority: Major
> >
> > Currently, tika-server is vulnerable to ooms, infinite loops and memory
> leaks.  I see two ways of making it robust:
> > 1) use the ForkParser
> > 2) have tika-server spawn a child process that actually runs the server,
> put a watcher thread in the child that will kill the child on
> oom/timeout/after x files.  The parent process can then restart the child
> if it dies.
> > I somewhat prefer 2) so that we don't have to doubly pass the
> inputstream.  I propose 2), and I propose making it optional in Tika 1.x,
> but then the default in Tika 2.x.  We could also add a status ping from
> parent to child in case the child gets caught up in stop the world gc (h/t
> [~bleskes]).
> > Other options/recommendations?
>
>
>
>


Re: [jira] [Commented] (TIKA-2725) Make tika-server robust against ooms/infinite loops/memory leaks

2018-09-06 Thread Oleg Tikhonov
Ideally, tika-server is dockerized and runs on swarm as a service. In
addition, it has a healthcheck mechanism, say something like an HTTP GET
request that expects return code 200. Docker will run this healthcheck
periodically and, if it fails, will restart tika-server.
However, we are far from that. Two ways to go, from my point of view: 1. your
second option, or 2. an OS daemon which checks tika-server availability or
something like that. We can use cron on Linux to run our "healthcheck" and, if
it detects an anomaly, restart the server. We can probably find such a
mechanism for Windows as well.
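
The healthcheck idea above (an HTTP GET that must answer 200, with a restart when it keeps failing) could be sketched roughly as follows. This is only an illustration under stated assumptions: the class HealthProbe, the threshold of consecutive failures, and the URL passed on the command line are all hypothetical, and nothing here is an existing tika-server feature.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

/** Hypothetical health probe for illustration; not part of tika-server. */
final class HealthProbe {
    private final int failureThreshold;
    private int consecutiveFailures;

    HealthProbe(int failureThreshold) {
        this.failureThreshold = failureThreshold;
    }

    /** Returns true only if a GET on the URL answers HTTP 200 within the timeout. */
    static boolean isHealthy(String url, int timeoutMillis) {
        HttpURLConnection conn = null;
        try {
            conn = (HttpURLConnection) URI.create(url).toURL().openConnection();
            conn.setRequestMethod("GET");
            conn.setConnectTimeout(timeoutMillis);
            conn.setReadTimeout(timeoutMillis);
            return conn.getResponseCode() == HttpURLConnection.HTTP_OK;
        } catch (IOException | RuntimeException e) {
            // Unreachable, timed out, or not a usable HTTP URL: treat as unhealthy.
            return false;
        } finally {
            if (conn != null) {
                conn.disconnect();
            }
        }
    }

    /**
     * Records one probe result. Returns true once the number of consecutive
     * failures reaches the threshold, i.e. when a restart should be triggered.
     */
    boolean record(boolean healthy) {
        consecutiveFailures = healthy ? 0 : consecutiveFailures + 1;
        return consecutiveFailures >= failureThreshold;
    }

    public static void main(String[] args) throws InterruptedException {
        if (args.length < 1) {
            System.out.println("usage: HealthProbe <url>");
            return;
        }
        HealthProbe probe = new HealthProbe(3); // assumed threshold
        while (!probe.record(isHealthy(args[0], 2000))) {
            Thread.sleep(10_000); // assumed polling interval
        }
        System.out.println("restart needed");
    }
}
```

Requiring several consecutive failures before restarting is a design choice made here so that a single slow response, for example during a long garbage collection pause, does not bounce the server.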


On Thu, Sep 6, 2018, 18:29 Tim Allison (JIRA)  wrote:

>
> [
> https://issues.apache.org/jira/browse/TIKA-2725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16605925#comment-16605925
> ]
>
> Tim Allison commented on TIKA-2725:
> ---
>
> bq. What is tika-server's typical env? Stand-alone, distributed ... like
> replicas in a cluster?
>
> It varies, I'm sure.  Not sure what most common use case is.  I would hope
> distributed -- swarm or similar.
>
> bq. Is there a time limit for recovery?
>
> I think whoever starts the server should be able to set the threshold for
> timeouts per file...although I may misunderstand your question.
>
> bq.  How do we know what point to start processing from?
> That wouldn't be tika-server's problem.  Clients calling tika-server would
> get an error message, or potentially no response within a socket/http
> timeout range.  They should not reprocess those docs.
>
> bq. Do we mark documents which were processed?
> Same as above, that's a client concern.
>
> bq. For example, if tika-server had run on Docker swarm/K8S then
> orchestrator would have restarted a failed replica itself
> To confirm that I understand this correctly, currently, if the tika-server
> process dies, swarm/k8s will automatically restart it?  That's good to
> hear.  However, we don't currently have the watcher thread within
> tika-server to kill its own process on oom/timeout...so as it is now, it
> would have to be something catastrophic taking down tika-server (operating
> system, perhaps?).
>
>
>
>
> > Make tika-server robust against ooms/infinite loops/memory leaks
> > 
> >
> > Key: TIKA-2725
> > URL: https://issues.apache.org/jira/browse/TIKA-2725
> > Project: Tika
> >  Issue Type: Task
> >Reporter: Tim Allison
> >Assignee: Tim Allison
> >Priority: Major
> >
> > Currently, tika-server is vulnerable to ooms, infinite loops and memory
> leaks.  I see two ways of making it robust:
> > 1) use the ForkParser
> > 2) have tika-server spawn a child process that actually runs the server,
> put a watcher thread in the child that will kill the child on
> oom/timeout/after x files.  The parent process can then restart the child
> if it dies.
> > I somewhat prefer 2) so that we don't have to doubly pass the
> inputstream.  I propose 2), and I propose making it optional in Tika 1.x,
> but then the default in Tika 2.x.  We could also add a status ping from
> parent to child in case the child gets caught up in stop the world gc (h/t
> [~bleskes]).
> > Other options/recommendations?
>
>
>
>
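
The discussion above mentions a watcher thread inside tika-server that ends the process on a per-file timeout or after x files. A minimal sketch of that bookkeeping follows. The class ParseWatcher and its method names are hypothetical, OOM detection is not covered, and the "kill" action is passed in as a Runnable so the sketch can be exercised without terminating a JVM; in a real child process it might call Runtime.getRuntime().halt(1).

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical watcher for illustration; not the tika-server implementation. */
final class ParseWatcher implements Runnable {
    private final long timeoutMillis;
    private final long maxFiles;
    private final Runnable onViolation;
    // Task id -> start time in millis, for every parse currently in flight.
    private final Map<Long, Long> startedAt = new ConcurrentHashMap<>();
    private final AtomicLong nextId = new AtomicLong();
    private final AtomicLong filesDone = new AtomicLong();

    ParseWatcher(long timeoutMillis, long maxFiles, Runnable onViolation) {
        this.timeoutMillis = timeoutMillis;
        this.maxFiles = maxFiles;
        this.onViolation = onViolation;
    }

    /** Called by a request thread just before it starts parsing a file. */
    long start(long nowMillis) {
        long id = nextId.incrementAndGet();
        startedAt.put(id, nowMillis);
        return id;
    }

    /** Called by a request thread when its parse has completed. */
    void finish(long id) {
        startedAt.remove(id);
        filesDone.incrementAndGet();
    }

    /** True if the file budget is used up or any in-flight parse is over time. */
    boolean violated(long nowMillis) {
        if (filesDone.get() >= maxFiles) {
            return true;
        }
        for (long started : startedAt.values()) {
            if (nowMillis - started > timeoutMillis) {
                return true;
            }
        }
        return false;
    }

    /** Polls until a limit is exceeded, then runs the kill action once. */
    @Override
    public void run() {
        try {
            while (!violated(System.currentTimeMillis())) {
                Thread.sleep(100);
            }
            onViolation.run();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

A request thread would call start(System.currentTimeMillis()) before parsing and finish(id) afterwards, while the watcher runs on its own daemon thread. As raised on the list, this thread can itself get stuck, which is why a ping from the parent process is suggested as a second line of defence.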


Re: [jira] [Commented] (TIKA-2725) Make tika-server robust against ooms/infinite loops/memory leaks

2018-09-06 Thread Oleg Tikhonov
With this approach, it is probably the only way ...
What is tika-server's typical env? Stand-alone, distributed ... like replicas
in a cluster?
Is there a time limit for recovery? How do we know what point to
start processing from?
Do we mark documents which were processed?
For example, if tika-server ran on Docker swarm/K8s, then the orchestrator
would restart a failed replica itself ...


On Thu, Sep 6, 2018 at 4:58 PM Tim Allison (JIRA)  wrote:

>
> [
> https://issues.apache.org/jira/browse/TIKA-2725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16605816#comment-16605816
> ]
>
> Tim Allison commented on TIKA-2725:
> ---
>
> From [~o...@apache.org] on the dev list:
>
> bq. What if the watcher thread fails, gets stuck, etc.?
>
> To confirm, that's the watcher thread in the child process.  Y, that's why
> I think we should also have a ping from the parent process.  WDYT?
>
> > Make tika-server robust against ooms/infinite loops/memory leaks
> > 
> >
> > Key: TIKA-2725
> > URL: https://issues.apache.org/jira/browse/TIKA-2725
> > Project: Tika
> >  Issue Type: Task
> >Reporter: Tim Allison
> >Assignee: Tim Allison
> >Priority: Major
> >
> > Currently, tika-server is vulnerable to ooms, infinite loops and memory
> leaks.  I see two ways of making it robust:
> > 1) use the ForkParser
> > 2) have tika-server spawn a child process that actually runs the server,
> put a watcher thread in the child that will kill the child on
> oom/timeout/after x files.  The parent process can then restart the child
> if it dies.
> > I somewhat prefer 2) so that we don't have to doubly pass the
> inputstream.  I propose 2), and I propose making it optional in Tika 1.x,
> but then the default in Tika 2.x.  We could also add a status ping from
> parent to child in case the child gets caught up in stop the world gc (h/t
> [~bleskes]).
> > Other options/recommendations?
>
>
>
>


Re: [jira] [Created] (TIKA-2725) Make tika-server robust against ooms/infinite loops/memory leaks

2018-09-06 Thread Oleg Tikhonov
Hi Tim,
What if the watcher thread fails, gets stuck, etc.?



On Thu, Sep 6, 2018 at 3:27 PM Tim Allison (JIRA)  wrote:

> Tim Allison created TIKA-2725:
> -
>
>  Summary: Make tika-server robust against ooms/infinite
> loops/memory leaks
>  Key: TIKA-2725
>  URL: https://issues.apache.org/jira/browse/TIKA-2725
>  Project: Tika
>   Issue Type: Task
> Reporter: Tim Allison
> Assignee: Tim Allison
>
>
> Currently, tika-server is vulnerable to ooms, infinite loops and memory
> leaks.  I see two ways of making it robust:
>
> 1) use the ForkParser
> 2) have tika-server spawn a child process that actually runs the server,
> put a watcher thread in the child that will kill the child on
> oom/timeout/after x files.  The parent process can then restart the child
> if it dies.
>
> I somewhat prefer 2) so that we don't have to doubly pass the
> inputstream.  I propose 2), and I propose making it optional in Tika 1.x,
> but then the default in Tika 2.x.  We could also add a status ping from
> parent to child in case the child gets caught up in stop the world gc (h/t
> [~bleskes]).
>
> Other options/recommendations?
>
>
>
>
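
Option 2 in the issue above has a parent process that restarts the child whenever it dies. A minimal sketch of that restart loop with the plain JDK (not Commons Exec) is shown here. The class ChildSupervisor, the restart limit, and the rule that exit code 0 means "do not restart" are assumptions made for the illustration, and the status ping from parent to child is not covered.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;

/** Hypothetical parent-side supervisor for illustration; not the tika-server implementation. */
final class ChildSupervisor {

    /** Builds a task that starts the command, waits for it, and returns its exit code. */
    static Callable<Integer> processRunner(List<String> command) {
        return () -> new ProcessBuilder(command).inheritIO().start().waitFor();
    }

    /**
     * Runs the child, restarting it each time it exits abnormally, until it
     * exits with code 0 or maxRestarts restarts have been used.
     * Returns the total number of times the child was started.
     */
    static int supervise(Callable<Integer> runChildOnce, int maxRestarts) throws Exception {
        int starts = 0;
        while (starts <= maxRestarts) {
            starts++;
            int exitCode = runChildOnce.call();
            if (exitCode == 0) {
                break; // assumed convention: clean shutdown, do not restart
            }
            System.err.println("child exited with code " + exitCode);
        }
        return starts;
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.out.println("usage: ChildSupervisor <command> [args...]");
            return;
        }
        int starts = supervise(processRunner(Arrays.asList(args)), 3);
        System.out.println("child was started " + starts + " time(s)");
    }
}
```

Taking the child as a Callable keeps the restart policy separate from process handling, so the loop can be tried out with a stub. A production version would likely add a back-off delay between restarts.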


Re: [jira] [Created] (TIKA-2647) Create a "security" page on our website

2018-05-22 Thread Oleg Tikhonov
Hi Tim,
That would definitely be helpful!
+1
Thanks,
Oleg

On Tue, May 22, 2018 at 3:38 PM, Tim Allison (JIRA)  wrote:

> Tim Allison created TIKA-2647:
> -
>
>  Summary: Create a "security" page on our website
>  Key: TIKA-2647
>  URL: https://issues.apache.org/jira/browse/TIKA-2647
>  Project: Tika
>   Issue Type: New Feature
> Reporter: Tim Allison
>
>
> I think it would be helpful for us to document any CVEs we've had on one
> central page on our website.  WDYT?
>
>
>
>


Re: [VOTE] Release Apache Tika 1.18 Candidate #3

2018-04-22 Thread Oleg Tikhonov
Hi,
thanks a lot.
[x] +1 Release this package as Apache Tika 1.18

I even ran a security scan:
mvn org.owasp:dependency-check-maven:3.1.2:check

Report is attached.

Best regards,
Oleg


On Sat, Apr 21, 2018 at 12:54 AM, talli...@apache.org 
wrote:

> All,
> A candidate for the Tika 1.18 release is available at:
> https://dist.apache.org/repos/dist/dev/tika/
> The release candidate is a zip archive of the sources in:
> https://github.com/apache/tika/tree/1.18-rc3
> The SHA-512 checksum of the archive is
> f69ee27b31cf7bcb1eaf114b93c23dd85b974356cc7e6e265b1c9366a11d711a3341e690f5b452a3e8b0c5cc6f5839db01b3ef6ec3a2a29ffcd332ff7a63dcf3.
> In addition, a staged maven repository is available here:
> https://repository.apache.org/content/repositories/orgapachetika-1033
> Please vote on releasing this package as Apache Tika 1.18.
> The vote is open for the next 72 hours and passes if a majority of at
> least three +1 Tika PMC votes are cast.
> [ ] +1 Release this package as Apache Tika 1.18
> [ ] -1 Do not release this package because...
> +1 from me; third time's the charm...
> Cheers,
> Tim


Re: [VOTE] Release Apache Tika 1.18 Candidate #1

2018-04-11 Thread Oleg Tikhonov
[+] Release this package as Apache Tika 1.18

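[INFO] Apache Tika parent ............................. SUCCESS [ 12.379 s]
[INFO] Apache Tika core ............................... SUCCESS [ 55.650 s]
[INFO] Apache Tika parsers ............................ SUCCESS [05:55 min]
[INFO] Apache Tika XMP ................................ SUCCESS [  7.254 s]
[INFO] Apache Tika serialization ...................... SUCCESS [  3.857 s]
[INFO] Apache Tika batch .............................. SUCCESS [02:13 min]
[INFO] Apache Tika language detection ................. SUCCESS [  8.152 s]
[INFO] Apache Tika application ........................ SUCCESS [01:13 min]
[INFO] Apache Tika OSGi bundle ........................ SUCCESS [ 57.625 s]
[INFO] Apache Tika translate .......................... SUCCESS [  8.393 s]
[INFO] Apache Tika server ............................. SUCCESS [01:05 min]
[INFO] Apache Tika examples ........................... SUCCESS [ 19.053 s]
[INFO] Apache Tika Java-7 Components .................. SUCCESS [  5.646 s]
[INFO] Apache Tika eval ............................... SUCCESS [ 44.564 s]
[INFO] Apache Tika Deep Learning (powered by DL4J) .... SUCCESS [07:45 min]
[INFO] Apache Tika Natural Language Processing ........ SUCCESS [01:47 min]
[INFO] Apache Tika .................................... SUCCESS [  0.145 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS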

CentOS 7.3. Did only basic stuff.

I've seen that we have a Docker image build script. Is there any
documentation?
I will dig into it ...
Thanks a lot,
Oleg

On Tue, Apr 10, 2018 at 3:36 PM, Tim Allison  wrote:

> A candidate for the Tika 1.18 release is available at:
>   https://dist.apache.org/repos/dist/dev/tika/
>
> The release candidate is a zip archive of the sources in:
>   https://github.com/apache/tika/tree/1.18-rc1/
>
> The SHA-512 checksum of the archive is
>   7f2e76e2973c9a0c3ba572afa74686ff95f0628136940b592c61d3639fe8123f977fe321693a6c02a650172f3ef442e7a3adfa93d81d1d770233e47d8911b79e.
>
> In addition, a staged maven repository is available here:
>   https://repository.apache.org/content/repositories/orgapachetika-1031
>
>
>
> Please vote on releasing this package as Apache Tika 1.18.
> The vote is open for the next 72 hours and passes if a majority of at
> least three +1 Tika PMC votes are cast.
>
> [ ] +1 Release this package as Apache Tika 1.18
> [ ] -1 Do not release this package because...
>
> Here's my +1
>
> On behalf of the Apache Tika team,
>
>  Tim
>


Re: tsdb extraction

2018-03-29 Thread Oleg Tikhonov
OK, time to read the spec :-)

On Thu, Mar 29, 2018 at 4:02 PM, Allison, Timothy B. 
wrote:

> Sorry...not aware of anything...
>
> -Original Message-
> From: olegtikho...@gmail.com [mailto:olegtikho...@gmail.com] On Behalf Of
> Oleg Tikhonov
> Sent: Thursday, March 29, 2018 1:46 AM
> To: tika-...@lucene.apache.org
> Subject: tsdb extraction
>
> Hi guys,
> I am wondering whether we have a parser that can deal with time series
> data, like InfluxDB or Prometheus?
>
> Maybe you know of one that is a work in progress; that would be good too.
>
> Thanks in advance,
> Oleg
>


tsdb extraction

2018-03-28 Thread Oleg Tikhonov
Hi guys,
I am wondering whether we have a parser that can deal with time series
data, like InfluxDB or Prometheus?

Maybe you know of one that is a work in progress; that would be good too.

Thanks in advance,
Oleg


Re: [VOTE] Release Apache Tika 1.16 Candidate #1

2017-07-12 Thread Oleg Tikhonov
[x]+1  Release this package as Apache Tika 1.16
Basic tests and build on Ubuntu 17.04 + Java 8 (Oracle).

Thanks,
Oleg

On Wed, Jul 12, 2017 at 11:03 AM, Dave Meikle  wrote:

> On 8 July 2017 at 03:40, Tim Allison  wrote:
>
> >
> > A candidate for the Tika 1.16 release is available at:
> > https://dist.apache.org/repos/dist/dev/tika/
> >
> > The release candidate is a zip archive of the sources in:
> > https://github.com/apache/tika/tree/1.16-rc1
> >
> > The SHA1 checksum of the archive is
> > e6884af0209ace42bf0b9b59d72c3c5a0052055e
> >
> > In addition, a staged maven repository is available here:
> > https://repository.apache.org/content/repositories/orgapachetika-1025
> >
> > Please vote on releasing this package as Apache Tika 1.16.
> > The vote is open for the next 72 hours and passes if a majority of at
> > least three +1 Tika PMC votes are cast.
> >
> > [ ] +1 Release this package as Apache Tika 1.16
> > [ ] -1 Do not release this package because...
> >
> >
> +1 from me. Checksums and signatures good. Built and tested on various
> machines using Java 8. Been run in a production workload and all good.
>
> Cheers,
> Dave
>


Re: experiences with Tika in Docker

2017-06-02 Thread Oleg Tikhonov
Guys, I can help with Tika dockerization. Let's just design/plan what we
are going to do.

On Thu, Jun 1, 2017 at 4:02 PM, Eric Pugh 
wrote:

> As the Tika project starts embracing more non Java tools (I’m thinking of
> Tesseract for example), dockerizing your Tika setup becomes more and more
> valuable.
>
> For example, I run the tests for my application on my local Mac, as well as
> on CircleCI. I have a dockerized Tika service that does the OCR stuff,
> and I know it works the same on both. It’s less exciting if I’m in an
> “all Java” world.
>
>
> > On Jun 1, 2017, at 7:55 AM, Allison, Timothy B. 
> wrote:
> >
> > Thank you, Thejan!
> >
> > -Original Message-
> > From: Thejan Wijesinghe [mailto:thejan.k.wijesin...@gmail.com]
> > Sent: Wednesday, May 31, 2017 5:40 PM
> > To: dev@tika.apache.org
> > Subject: Re: experiences with Tika in Docker
> >
> > Hi Tim,
> >
> > I've used tika-server in Docker, but only as a single instance. Yes, Docker's
> > ability to limit a container's memory and CPU usage on the host machine is
> > great. It gives us a lot of flexibility: we can enforce hard/soft memory
> > limits and even cap the container's share of the host machine's CPU cycles.
> > Yes, it also limits the risk of arbitrary code execution and XXE
> > vulnerabilities. I already asked Prof. Chris Mattmann about officially
> > moving to Docker Hub. He said I need to send a mail to Apache infra asking
> > about this. Unfortunately, I still haven't found the time to write that mail.
> >
> > We already have multiple Dockerfiles in Tika: the Dockerfile in
> > tika-server, InceptionRestDockerfile, InceptionVideoRestDockerfile, and
> > Im2txtRestDockerfile (PR #180, for image captioning).
> >
> > Part of my GSoC project is to unify the existing REST services, such as
> > object recognition and image captioning. My idea is to unify all of those
> > REST services so that the user can start/terminate any REST service and see
> > its statistics through a web-based GUI. I'm expecting to use a combination of
> > nginx (as the reverse proxy server) and Docker to make it work. So obviously
> > we will see Docker much more often in Tika.
> >
> > +1 for your idea of looking into hardening tika-server with the help
> > of Docker.
> >
> > best,
> > ThejanW
> >
> > On Thu, Jun 1, 2017 at 1:03 AM, Allison, Timothy B. 
> > wrote:
> >
> >> Dave Meikle, Tom and All,
> >>
> >>How many of us are using Tika in Docker?  If so, how exactly are
> >> you using it?  Single instance, swarm, Kubernetes, something else?
> >> People fear I/O hit with tika-server...what are your experiences?
> >> I really like the ability to limit the number of CPUs in the Docker
> >> container.  If a single doc causes multithreaded gc to go nuts, that
> >> won't kill an entire machine.  This also cleanly limits the risk from
> >> XXE or arbitrary code execution, right?
> >>
> >> If this is one of the ways of the future for big data, we might want
> >> to look into hardening tika-server (OOMs, timeouts).  What do you all
> think?
> >>
> >>Cheers,
> >>
> >>Tim
> >>
> >> Timothy B. Allison, Ph.D.
> >> Principal Artificial Intelligence Engineer Group Lead K83E/Human
> >> Language Technology The MITRE Corporation
> >> 7515 Colshire Drive, McLean, VA  22102
> >> 703-983-2473 (phone); 703-983-1379 (fax)
> >>
> >>
>
>
> ___
> Eric Pugh | Founder & CEO | OpenSource Connections, LLC | 434.466.1467 |
> http://www.opensourceconnections.com
> Co-Author: Apache Solr Enterprise Search Server, 3rd Ed
> <https://www.packtpub.com/big-data-and-business-intelligence/apache-solr-enterprise-search-server-third-edition-raw>
> This e-mail and all contents, including attachments, is considered to be
> Company Confidential unless explicitly stated otherwise, regardless of
> whether attachments are marked as such.
>
>


Re: [VOTE] Release Apache Tika 1.15 Candidate #1

2017-05-24 Thread Oleg Tikhonov
Cannot reproduce after having done some workarounds ...



On Wed, May 24, 2017 at 3:05 AM, Allison, Timothy B. 
wrote:

> Hi Oleg,
>   What's your error on that unit test?
>
> -Original Message-
> From: olegtikho...@gmail.com [mailto:olegtikho...@gmail.com] On Behalf Of
> Oleg Tikhonov
> Sent: Tuesday, May 23, 2017 4:33 PM
> To: dev@tika.apache.org
> Subject: Re: [VOTE] Release Apache Tika 1.15 Candidate #1
>
> I also put @Ignore on
> ./tika-dl/src/test/java/org/apache/tika/dl/imagerec/DL4JInceptionV3NetTest.java
> because I do not have any DL libraries installed on my computer.
>
>
> On Tue, May 23, 2017 at 11:00 PM, Oleg Tikhonov  wrote:
>
> > Hi guys,
> > Something is wrong here ...
> > <parent>
> >   <groupId>org.apache.tika</groupId>
> >   <artifactId>tika-parent</artifactId>
> >   <version>1.16-SNAPSHOT</version>
> >   <relativePath>tika-parent/pom.xml</relativePath>
> > </parent>
> >
> >
> > If you clone the project, the top-level pom contains this.
> > The fix is to change 1.16-SNAPSHOT to 1.15.
> >
> > What I did was:
> > git clone https://github.com/apache/tika.git
> >
> > Any suggestions?
> >
> > BR,
> > Oleg
> >
> >
> >
> >
> > On Tue, May 23, 2017 at 3:01 PM, Allison, Timothy B.
> > 
> > wrote:
> >
> >> I _think_ it is included.  See below for the two options for parsing
> >> testZipEncrypted.zip.
> >>
> >> Are you not seeing this behavior?  Were you expecting different
> behavior?
> >>
> >>
> >> 1) RecursiveParserWrapper
> >>
> >> List<Metadata> metadataList = getRecursiveMetadata("testZipEncrypted.zip");
> >> debug(metadataList);
> >>
> >> yields:
> >>
> >> 0: X-Parsed-By : org.apache.tika.parser.DefaultParser
> >> 0: X-Parsed-By : org.apache.tika.parser.pkg.PackageParser
> >> 0: X-TIKA:EXCEPTION:embedded_stream_exception : org.apache.tika.exception.EncryptedDocumentException: stream (encrypted.txt) is encrypted
> >> at org.apache.tika.parser.pkg.PackageParser.parseEntry(PackageParser.java:306)
> >> at org.apache.tika.parser.pkg.PackageParser.parse(PackageParser.java:230)
> >> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> >> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> >> at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
> >> at org.apache.tika.parser.RecursiveParserWrapper.parse(RecursiveParserWrapper.java:158)
> >> at org.apache.tika.TikaTest.getRecursiveMetadata(TikaTest.java:221)
> >> at org.apache.tika.TikaTest.getRecursiveMetadata(TikaTest.java:213)
> >> at org.apache.tika.parser.pkg.ZipParserTest.testZipEncrypted(ZipParserTest.java:213)
> >> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> >> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> >> at java.lang.reflect.Method.invoke(Method.java:498)
> >> at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
> >> at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> >> at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
> >> at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> >> at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
> >> at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
> >> at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
> >> at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
> >> at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
> >> at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
> >> at org.juni

Re: [VOTE] Release Apache Tika 1.15 Candidate #2

2017-05-24 Thread Oleg Tikhonov
[x] +1 Release this package as Apache Tika 1.15

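[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 19:41 min
[INFO] Finished at: 2017-05-24T22:22:17+03:00
[INFO] Final Memory: 116M/983M
[INFO] ------------------------------------------------------------------------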

Tested on Ubuntu 1.16 x86_64

Thanks !!!



On Wed, May 24, 2017 at 4:22 AM, Tim Allison  wrote:

> A second candidate for the Tika 1.15 release is available at:
> https://dist.apache.org/repos/dist/dev/tika/
>
> The release candidate is a zip archive of the sources in:
> https://github.com/apache/tika/tree/1.15-rc2/
>
> The SHA1 checksum of the archive is
> e283468e47855f9142578c126e12f02eb5b08d2b.
>
> In addition, a staged maven repository is available here:
>
> https://repository.apache.org/content/repositories/
> orgapachetika-1023/org/apache/tika/
>
> Please vote on releasing this package as Apache Tika 1.15.
> The vote is open for the next 72 hours and passes if a majority of at
> least three +1 Tika PMC votes are cast.
>
> [ ] +1 Release this package as Apache Tika 1.15
> [ ] -1 Do not release this package because...
>
>
> -Tim
>
> P.S. This is my +1.
>


Re: [VOTE] Release Apache Tika 1.15 Candidate #1

2017-05-23 Thread Oleg Tikhonov
I also put @Ignore on
./tika-dl/src/test/java/org/apache/tika/dl/imagerec/DL4JInceptionV3NetTest.java
because I do not have any DL libraries installed on my computer.


On Tue, May 23, 2017 at 11:00 PM, Oleg Tikhonov  wrote:

> Hi guys,
> Something is wrong here ...
> <parent>
>   <groupId>org.apache.tika</groupId>
>   <artifactId>tika-parent</artifactId>
>   <version>1.16-SNAPSHOT</version>
>   <relativePath>tika-parent/pom.xml</relativePath>
> </parent>
>
>
> If you clone the project, the top-level pom contains this.
> The fix is to change 1.16-SNAPSHOT to 1.15.
>
> What I did was:
> git clone https://github.com/apache/tika.git
>
> Any suggestions?
>
> BR,
> Oleg
>
>
>
>
> On Tue, May 23, 2017 at 3:01 PM, Allison, Timothy B. 
> wrote:
>
>> I _think_ it is included.  See below for the two options for parsing
>> testZipEncrypted.zip.
>>
>> Are you not seeing this behavior?  Were you expecting different behavior?
>>
>>
>> 1) RecursiveParserWrapper
>>
>> List<Metadata> metadataList = getRecursiveMetadata("testZipEncrypted.zip");
>> debug(metadataList);
>>
>> yields:
>>
>> 0: X-Parsed-By : org.apache.tika.parser.DefaultParser
>> 0: X-Parsed-By : org.apache.tika.parser.pkg.PackageParser
>> 0: X-TIKA:EXCEPTION:embedded_stream_exception : org.apache.tika.exception.EncryptedDocumentException: stream (encrypted.txt) is encrypted
>> at org.apache.tika.parser.pkg.PackageParser.parseEntry(PackageParser.java:306)
>> at org.apache.tika.parser.pkg.PackageParser.parse(PackageParser.java:230)
>> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>> at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
>> at org.apache.tika.parser.RecursiveParserWrapper.parse(RecursiveParserWrapper.java:158)
>> at org.apache.tika.TikaTest.getRecursiveMetadata(TikaTest.java:221)
>> at org.apache.tika.TikaTest.getRecursiveMetadata(TikaTest.java:213)
>> at org.apache.tika.parser.pkg.ZipParserTest.testZipEncrypted(ZipParserTest.java:213)
>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>> at java.lang.reflect.Method.invoke(Method.java:498)
>> at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>> at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>> at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>> at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>> at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
>> at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
>> at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
>> at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
>> at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>> at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>> at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>> at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>> at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>> at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>> at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
>> at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:68)
>> at com.intellij.rt.execution.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:51)
>> at com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:242)
>> at com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:70)
>>
>> 0: X-TIKA:parse_time_millis : 34
>> 0: X-TIKA:content : <html xmlns="http://www.w3.org/1999/xhtml">
>> unencrypte

Re: [VOTE] Release Apache Tika 1.15 Candidate #1

2017-05-23 Thread Oleg Tikhonov
Hi guys,
Something is wrong here ...
<parent>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-parent</artifactId>
  <version>1.16-SNAPSHOT</version>
  <relativePath>tika-parent/pom.xml</relativePath>
</parent>


If you clone the project, the top-level pom contains this.
The fix is to change 1.16-SNAPSHOT to 1.15.

What I did was:
git clone https://github.com/apache/tika.git

Any suggestions?

BR,
Oleg




On Tue, May 23, 2017 at 3:01 PM, Allison, Timothy B. 
wrote:

> I _think_ it is included.  See below for the two options for parsing
> testZipEncrypted.zip.
>
> Are you not seeing this behavior?  Were you expecting different behavior?
>
>
> 1) RecursiveParserWrapper
>
> List<Metadata> metadataList = getRecursiveMetadata("testZipEncrypted.zip");
> debug(metadataList);
>
> yields:
>
> 0: X-Parsed-By : org.apache.tika.parser.DefaultParser
> 0: X-Parsed-By : org.apache.tika.parser.pkg.PackageParser
> 0: X-TIKA:EXCEPTION:embedded_stream_exception : org.apache.tika.exception.EncryptedDocumentException: stream (encrypted.txt) is encrypted
> at org.apache.tika.parser.pkg.PackageParser.parseEntry(PackageParser.java:306)
> at org.apache.tika.parser.pkg.PackageParser.parse(PackageParser.java:230)
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
> at org.apache.tika.parser.RecursiveParserWrapper.parse(RecursiveParserWrapper.java:158)
> at org.apache.tika.TikaTest.getRecursiveMetadata(TikaTest.java:221)
> at org.apache.tika.TikaTest.getRecursiveMetadata(TikaTest.java:213)
> at org.apache.tika.parser.pkg.ZipParserTest.testZipEncrypted(ZipParserTest.java:213)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
> at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
> at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
> at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
> at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
> at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
> at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
> at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
> at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
> at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
> at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
> at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
> at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
> at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:68)
> at com.intellij.rt.execution.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:51)
> at com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:242)
> at com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:70)
>
> 0: X-TIKA:parse_time_millis : 34
> 0: X-TIKA:content : http://www.w3.org/1999/xhtml";>
> 
> 
>  />
> 
> 
> 
> 
> unencrypted.txt
> 
> encrypted.txt
> 
> 0: Content-Type : application/zip
> 1: date : 2017-03-21T13:07:48Z
> 1: X-Parsed-By : org.apache.tika.parser.DefaultParser
> 1: X-Parsed-By : org.apache.tika.parser.txt.TXTParser
> 1: resourceName : unencrypted.txt
> 1: dcterms:modified : 2017-03-21T13:07:48Z
> 1: Last-Modified : 2017-03-21T13:07:48Z
> 1: Last-Save-Date : 2017-03-21T13:07:48Z
> 1: embeddedRelationshipId : unencrypted.txt
> 1: meta:save-date : 2017-03-21T13:07:48Z
> 1: Content-Encoding : windows-1252
> 1: X-TIKA:parse_time_millis : 3
> 1: modified : 2017-03-21T13:07:48Z
> 1: X-TIKA:content : http://www.w3.org/1999/xhtml";>
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> hello world
> 
> 
> 1: Content-Length : 13
> 1: X-TIKA:embedded_resource_path : /unencrypted.txt
> 1: Content-Type : text/plain; charset=windows-1252
>
> 2) Classic XML:
>
> XMLResult r = getXML("testZipEncrypted.zip");
> for (String n : r.metadata.names()) {
> for (String v : r.metadata.getValues(n)) {
> System.out.println("meta: "+n + "

Re: 1.15?

2017-04-17 Thread Oleg Tikhonov
+1 for the release.

On Mon, Apr 17, 2017 at 8:39 PM, David Meikle  wrote:

> +1 from me too.
>
> Cheers,
> Dave
>
> On 13 April 2017 at 13:08, Konstantin Gribov  wrote:
>
> > Preliminary +1 from me, I'll take a closer look this weekend
> >
> > On Thu, 13 Apr 2017, 0:00 Allison, Timothy B. wrote:
> >
> > > All,
> > >   POI is voting on rc1 of the next release.  Once that's released and
> > > integrated into Tika, let's start the release process for Tika 1.15,
> > > end of next week, middle of following?  Any blockers?
> > >
> > >  Cheers,
> > >
> > >  Tim
> > >
> > >
> > > --
> >
> > Best regards,
> > Konstantin Gribov
> >
>


Re: [DISCUSS] Contribution guide & style enforcement

2017-03-30 Thread Oleg Tikhonov
Definitely true, +1

On Wed, Mar 29, 2017 at 9:19 PM, Allison, Timothy B. 
wrote:

> +1  Y, thank you!
>
> -Original Message-
> From: Ken Krugler [mailto:kkrugler_li...@transpac.com]
> Sent: Wednesday, March 29, 2017 2:07 PM
> To: dev@tika.apache.org
> Subject: Re: [DISCUSS] Contribution guide & style enforcement
>
> Hi Konstantin,
>
> Thanks for the thoughtful and detailed writeup.
>
> And yes, +1 to all 5 top-level suggestions.
>
> — Ken
>
> > On Mar 29, 2017, at 10:39am, Konstantin Gribov 
> wrote:
> >
> > Hi, folks.
> >
> > Currently we have something like contribution guide parts in several
> > places (I thought about [1] and [2] and Chris also mentioned [3])
> > covering different facets of contributing to Apache Tika.
> >
> > One thing that upsets me is that we have a very inconsistent
> > codebase with differing style, formatting, and dependency management. It
> > seems inevitable at some stage of any popular open source project
> > developed by many contributors. But we can make it more consistent,
> > with moderate effort needed to maintain the status quo afterwards.
> >
> > I propose:
> >
> >   1. make one source of truth for the contribution guide and then
> >   automatically mirror it to README.md/CONTRIBUTING.md for github,
> >   publish on tika.a.o etc;
> >   2. add info about logging in tika-core and other packages to this
> >   contribution guide to make all contributions consistent with current
> >   policy (with examples of how logging should be used in different
> >   modules):
> >      1. JUL in tika-core
> >      2. SLF4J in `private static final Logger LOG` field in all other
> >      modules;
> >      3. Allow use of a logging backend (log4j) in tests (e.g. for tuning
> >      log levels for upstream libraries) and standalone applications
> >      (e.g. to support `--quiet` and `--verbose` CLI keys);
> >      4. Document logging configuration in case OSGi bundle is used;
> >   3. add info about dependency handling (e.g. no additional deps in
> >   tika-core policy, exclusion of
> >   commons-logging/commons-logging-api/log4j from dependencies etc);
> >   4. integrate checkstyle plugin [5], [6] into the Maven build to allow
> >   contributors to easily check that their code conforms to a simple
> >   policy to start (4 spaces indent, no TABs, spaces before opening
> >   braces, spaces after if/else/try/catch/finally, egyptian-style braces);
> >   5. add documentation about checkstyle [5] configuration in IDE to
> >   simplify its usage (I can write one for JetBrains IDEA at least).
> >
> > The main point is to bring the Tika codebase to a more consistent and
> > clear state, simplify its maintenance and make it easier for contributors
> > to make clean and pretty patches. The Checkstyle configuration should be
> > simple enough that refactoring to it is realistic.
> >
> > Also, these items should be integrated gradually, step by step.
> >
> > What do you think, folks?
> > Would it be good thing for Tika and its community?
> > Would it bring any serious challenges that I've forgotten?
> >
> > [1]: http://tika.apache.org/contribute.html
> > [2]: https://wiki.apache.org/tika/DeveloperResources
> > [3]: https://github.com/apache/tika/#contributing-via-github
> > [4]: https://issues.apache.org/jira/browse/TIKA-2316 tracking issue
> > [5]: http://checkstyle.sourceforge.net/
> > [6]: https://maven.apache.org/plugins/maven-checkstyle-plugin/
> >
> >
> >
> > --
> >
> > Best regards,
> > Konstantin Gribov
>
> --
> Ken Krugler
> +1 530-210-6378
> http://www.scaleunlimited.com
> custom big data solutions & training
> Hadoop, Cascading, Cassandra & Solr
>
>
>
>
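Item 2 of the proposal quoted above (JUL in tika-core, a `private static final Logger LOG` field everywhere else) could look roughly like the sketch below. It shows only the JUL half, because that needs nothing outside the JDK. The class `ExampleDetector` and its method are hypothetical and not part of Tika; in the other modules the same field would presumably be declared with SLF4J's `Logger` instead.

```java
import java.util.Locale;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical class, for illustration only; it is not part of Tika.
final class ExampleDetector {
    // java.util.logging (JUL) keeps the module free of extra logging dependencies.
    private static final Logger LOG = Logger.getLogger(ExampleDetector.class.getName());

    /** Returns true if the name ends with ".zip"; logs bad input instead of throwing. */
    static boolean looksLikeZip(String resourceName) {
        if (resourceName == null) {
            LOG.log(Level.WARNING, "No resource name supplied; cannot guess the type");
            return false;
        }
        boolean zip = resourceName.toLowerCase(Locale.ROOT).endsWith(".zip");
        // Parameterized message: the string is only formatted if FINE is enabled.
        LOG.log(Level.FINE, "Checked {0}: zip={1}", new Object[] {resourceName, zip});
        return zip;
    }

    public static void main(String[] args) {
        System.out.println(looksLikeZip("testZipEncrypted.zip"));
    }
}
```

The layout also follows the style in item 4: four-space indent, no tabs, a space before each opening brace and after `if`, and egyptian-style braces.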


Re: Master Build Failing

2016-10-25 Thread Oleg Tikhonov
Hi Lewis,
Here is what I did:
git clone https://git-wip-us.apache.org/repos/asf/tika.git
git branch
* master

gdalinfo --version
GDAL 1.11.3, released 2015/09/16

mvn clean install -U

Tests run: 3, Failures: 0, Errors: 0, Skipped: 1, Time elapsed: 42.59 sec -
in org.apache.tika.parser.gdal.TestGDALParser
Running org.apache.tika.parser.executable.ExecutableParserTest


OS: Ubuntu 16, x86_64.
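When comparing two environments like this, it can help to confirm from inside the JVM that `gdalinfo` is actually runnable, since the PATH seen by Maven may differ from the one in the shell. Below is a minimal sketch of such a check. The class `ExternalToolCheck` is hypothetical, and this is not how Tika's GDAL tests decide whether to run. It assumes Java 9 or later for `Redirect.DISCARD`.

```java
import java.io.IOException;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, for illustration only; not part of Tika.
final class ExternalToolCheck {
    /** True if the command starts and exits with status 0 within the timeout. */
    static boolean isAvailable(long timeoutSeconds, String... command) {
        try {
            Process p = new ProcessBuilder(command)
                    .redirectErrorStream(true)
                    // Discard output so the child can never block on a full pipe.
                    .redirectOutput(ProcessBuilder.Redirect.DISCARD)
                    .start();
            if (!p.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
                p.destroyForcibly();
                return false;
            }
            return p.exitValue() == 0;
        } catch (IOException e) {
            // Typically: the executable was not found or is not executable.
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("gdalinfo available: " + isAvailable(10, "gdalinfo", "--version"));
    }
}
```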






On Mon, Oct 24, 2016 at 8:57 PM, lewis john mcgibbney 
wrote:

> Hi Folks,
> Is master build failing for anyone? I got a brand new laptop and have GDAL
> installed.
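>
> -------------------------------------------------------------------------------
> Test set: org.apache.tika.parser.gdal.TestGDALParser
> -------------------------------------------------------------------------------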
> Tests run: 3, Failures: 3, Errors: 0, Skipped: 0, Time elapsed: 0.3 sec <<<
> FAILURE! - in org.apache.tika.parser.gdal.TestGDALParser
> testParseBasicInfo(org.apache.tika.parser.gdal.TestGDALParser)  Time
> elapsed: 0.124 sec  <<< FAILURE!
> java.lang.AssertionError: null
> at org.junit.Assert.fail(Assert.java:86)
> at org.junit.Assert.assertTrue(Assert.java:41)
> at org.junit.Assert.assertNotNull(Assert.java:621)
> at org.junit.Assert.assertNotNull(Assert.java:631)
> at
> org.apache.tika.parser.gdal.TestGDALParser.testParseBasicInfo(
> TestGDALParser.java:79)
>
> testParseFITS(org.apache.tika.parser.gdal.TestGDALParser)  Time elapsed:
> 0.101 sec  <<< FAILURE!
> java.lang.AssertionError: null
> at org.junit.Assert.fail(Assert.java:86)
> at org.junit.Assert.assertTrue(Assert.java:41)
> at org.junit.Assert.assertNotNull(Assert.java:621)
> at org.junit.Assert.assertNotNull(Assert.java:631)
> at
> org.apache.tika.parser.gdal.TestGDALParser.testParseFITS(
> TestGDALParser.java:165)
>
> testParseMetadata(org.apache.tika.parser.gdal.TestGDALParser)  Time
> elapsed: 0.075 sec  <<< FAILURE!
> java.lang.AssertionError: null
> at org.junit.Assert.fail(Assert.java:86)
> at org.junit.Assert.assertTrue(Assert.java:41)
> at org.junit.Assert.assertNotNull(Assert.java:621)
> at org.junit.Assert.assertNotNull(Assert.java:631)
> at
> org.apache.tika.parser.gdal.TestGDALParser.testParseMetadata(
> TestGDALParser.java:117)
>
>
>
> --
> http://home.apache.org/~lewismc/
> @hectorMcSpector
> http://www.linkedin.com/in/lmcgibbney
>


Re: [VOTE] Apache Tika 1.14 Release Candidate #1

2016-10-20 Thread Oleg Tikhonov
Hi,
+1 for release.
Built on Ubuntu 16.04 and CentOS 7.0 x86_64.

All tests passed. Java 8.

BR,
Oleg

On Thu, Oct 20, 2016 at 5:54 PM, Julien Nioche <
lists.digitalpeb...@gmail.com> wrote:

> Hi Tim
>
> I had exiftool installed indeed, so that might explain it. All tests now
> pass. Will have a closer look at it all later.
>
> Thanks
>
> Julien
>
> On 20 October 2016 at 13:45, Allison, Timothy B. 
> wrote:
>
> > https://issues.apache.org/jira/browse/TIKA-2056
> >
> > Perhaps?
> >
> > -Original Message-
> > From: Julien Nioche [mailto:lists.digitalpeb...@gmail.com]
> > Sent: Thursday, October 20, 2016 8:34 AM
> > To: dev@tika.apache.org
> > Subject: Re: [VOTE] Apache Tika 1.14 Release Candidate #1
> >
> > Hi
> >
> > Am getting the following when running 'mvn clean package', have I
> > forgotten something obvious?
> >
> > Julien
> >
> > *Failed tests: *
> > *  ForkParserIntegrationTest.testParserHandlingOfNonSerializable:210
> > expected: but was:*
> > *Tests in error: *
> > *  ForkParserIntegrationTest.testAttachingADebuggerOnTheForkedParserShouldWork:234
> > » Tika*
> > *  ForkParserIntegrationTest.testForkedPDFParsing:257 » Tika Unable to
> > serialize ...*
> > *  ForkParserIntegrationTest.testForkedTextParsing:66 » Tika Unable to
> > serialize ...*
> >
> > *Tests run: 755, Failures: 1, Errors: 3, Skipped: 17*
> >
> > *[INFO] ------------------------------------------------------------------------*
> > *[INFO] Reactor Summary:*
> > *[INFO] *
> > *[INFO] Apache Tika parent ........................ SUCCESS [4.368s]*
> > *[INFO] Apache Tika core .......................... SUCCESS [16.487s]*
> > *[INFO] Apache Tika parsers ....................... FAILURE [4:54.631s]*
> >
> >
> >
> > On 19 October 2016 at 19:48, Chris Mattmann  wrote:
> >
> > > Hi Folks,
> > >
> > > A first candidate for the Tika 1.14 release is available at:
> > >
> > >   https://dist.apache.org/repos/dist/dev/tika/
> > >
> > > The release candidate is a zip archive of the sources in:
> > >
> > > https://git-wip-us.apache.org/repos/asf?p=tika.git;a=tree;hb=687d7706c9778e4f49f2834a07e5a9d99b23042b
> > >
> > > The SHA1 checksum of the archive is:
> > > ad9152392ffe6b620c8102ab538df0579b36c520
> > >
> > > In addition, a staged maven repository is available here:
> > >
> > > https://repository.apache.org/content/repositories/orgapachetika-1020/
> > >
> > > Please vote on releasing this package as Apache Tika 1.14.
> > > The vote is open for the next 72 hours and passes if a majority of at
> > > least three +1 Tika PMC votes are cast.
> > >
> > > [ ] +1 Release this package as Apache Tika 1.14 [ ] -1 Do not release
> > > this package because..
> > >
> > > Cheers,
> > > Chris
> > >
> > > P.S. Of course here is my +1.
> > >
> > >
> > >
> > >
> > >
> > >
> >
> >
> > --
> >
> > *Open Source Solutions for Text Engineering*
> >
> > http://www.digitalpebble.com
> > http://digitalpebble.blogspot.com/
> > #digitalpebble 
> >
>
>
>
> --
>
> *Open Source Solutions for Text Engineering*
>
> http://www.digitalpebble.com
> http://digitalpebble.blogspot.com/
> #digitalpebble 
>


Re: [VOTE] Apache Tika 1.12 Release Candidate #1

2016-01-28 Thread Oleg Tikhonov
Hi Chris,
thanks for doing it.
Yesterday I successfully built Tika using mvn clean install.
All tests passed. Platform: x86_64 Kubuntu with Oracle Java 8. Nothing
special was run.

[x] +1 Release this package as Apache Tika 1.12

Best regards,
Oleg

On Mon, Jan 25, 2016 at 9:58 PM, Mattmann, Chris A (3980) <
chris.a.mattm...@jpl.nasa.gov> wrote:

> Hi Folks,
>
> A first candidate for the Tika 1.12 release is available at:
>
>   https://dist.apache.org/repos/dist/dev/tika/
>
> The release candidate is a zip archive of the sources in:
> https://git-wip-us.apache.org/repos/asf?p=tika.git;a=tag;h=203a26ba5e65db2427f9e84bc4ff31e569ae661c
>
>
> The SHA1 checksum of the archive is:
> 30e64645af643959841ac3bb3c41f7e64eba7e5f
>
> In addition, a staged maven repository is available here:
>
> https://repository.apache.org/content/repositories/orgapachetika-1015/
>
>
> Please vote on releasing this package as Apache Tika 1.12.
> The vote is open for the next 72 hours and passes if a majority of at
> least three +1 Tika PMC votes are cast.
>
> [ ] +1 Release this package as Apache Tika 1.12
> [ ] -1 Do not release this package because…
>
> Cheers,
> Chris
>
> P.S. Of course here is my +1.
>
>
>
> ++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: chris.a.mattm...@nasa.gov
> WWW:  http://sunset.usc.edu/~mattmann/
> ++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++
>
>
>
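For anyone checking a candidate by hand, the SHA1 value quoted in the vote mail can be compared against the downloaded archive with a few lines of JDK-only code. This is a minimal sketch; the class name `ChecksumCheck` and the command-line arguments are assumptions made for illustration, not part of any Tika release tooling. The same method works for the `.md5` files by passing "MD5" as the algorithm.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, for illustration only.
final class ChecksumCheck {
    /** Streams the file through the named digest and returns lowercase hex. */
    static String hexDigest(Path file, String algorithm)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance(algorithm);
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                md.update(buf, 0, n);
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        // args[0]: path to the downloaded archive; args[1]: checksum from the vote mail
        String actual = hexDigest(Paths.get(args[0]), "SHA-1");
        System.out.println(actual.equalsIgnoreCase(args[1])
                ? "checksum OK" : "MISMATCH: " + actual);
    }
}
```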


Re: [DISCUSS] Moving to Git

2015-11-19 Thread Oleg Tikhonov
+1.
There are a bunch of add-ons, for instance git flow.


On Wed, Nov 18, 2015 at 7:15 PM, Mattmann, Chris A (3980) <
chris.a.mattm...@jpl.nasa.gov> wrote:

> Hey Nick,
>
> Git has something similar to svn:externals:
>
> http://stackoverflow.com/questions/571232/svnexternals-equivalent-in-git
>
>
> I’ve seen both used in the same way. Also the examples site code
> is something we could always gin up a script solution to and isn’t
> a blocker by any means - it’s a smallish portion of the overall
> process and even if it had to be done by hand it’s something we don’t
> do often enough for it to be a real burden. I can speak from experience
> having done most or all of Tika’s releases.
>
> As to the discussions of what’s going on with Git/Github/version
> control, etc., the use of writeable Git repositories at the ASF
> has been sanctioned and used pervasively for years. That Git/Github
> /version control *policy* discussion is pretty independent of using
> the ASF’s own sanctioned writeable git repos on ASF hardware, which
> is all I’m proposing to do. AKA I’m proposing we move Tika’s
> canonical repo from:
>
> http://svn.apache.org/repos/asf/tika/
>
> TO:
>
> https://git-wip-us.apache.org/repos/asf/tika.git
>
> Infra has put policies (temporarily) in place to deal with any of
> the branching issues that have shown up etc. So there is already
> enforcement and so on. And like I said, the ASF has allowed writeable
> Git repos for many years now.
>
> Finally it seems like there is good support so far for this, so
> I’ll keep collecting feedback before calling an official vote maybe
> in the next few days. I’m really hoping there is really no big
> difference other than replacing svn co with git clone and replacing
> svn commit with git commit && git push in most places. One last note:
> many of the “issues” brought up on other projects or being discussed
> at a Foundation policy level are issues e.g., with the Incubator,
> some with newer (ish) TLPs that have arisen over the past few years
> and that are pushing the boundaries on how to use Git in ways that
> are forcing the foundation to ask questions at its core policy
> levels. That discussion is ongoing. Tika has been around since 2007,
> includes a strong set of ASF members, has seen the version control
> debates over the years and long since survived them, etc. I see no
> evidence and an extremely low probability that we will use writeable
> ASF git repos in any such way that drives the policy at the foundation
> level in the same way.
>
> Instead, I see pretty boring use of Git writeable repos to become
> more consistent with the way it seems like more and more of us are
> doing development (even today with Tika).
>
> HTH.
>
> Cheers,
> Chris
>
> ++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: chris.a.mattm...@nasa.gov
> WWW:  http://sunset.usc.edu/~mattmann/
> ++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++
>
>
>
>
>
> -Original Message-
> From: Nick Burch 
> Reply-To: "dev@tika.apache.org" 
> Date: Wednesday, November 18, 2015 at 7:44 AM
> To: "dev@tika.apache.org" 
> Subject: Re: [DISCUSS] Moving to Git
>
> >On Wed, 18 Nov 2015, Mattmann, Chris A (3980) wrote:
> >> I propose we move to writeable git repos for Tika for our repository. I
> >> mostly interact with Git & Github nowadays even with Tika using the
> >> mirroring and PR interaction support.
> >
> >I'm -0 on this at the moment
> >
> >Having followed other Apache lists, it seems that there's quite a few ways
> >to use Git, not all of them compatible with the Apache way, and some of
> >them easy to do wrong.
> >
> >Were we to have some proposed guidelines/information/rules on using Git
> >for Tika, such as about what branches squashing might be permitted on,
> >rules for that, information/rules on remote branches, how to handle / when
> >to use / not-use private branches and github branches, and the like, then
> >I'd be minded to change my vote
> >
> >I'm also wondering how it would work with the website pulling in bits of
> >the Tika Examples module from SVN for the examples page? That currently
> >uses a svn:externals, so we can keep the code in a normal module + unit
> >test it, then pulls in snippets, how would that work if the code moved to
> >git?
> >
> >Nick
>
>


Re: [VOTE] Apache Tika 1.11 Release Candidate #1

2015-10-25 Thread Oleg Tikhonov
Hi guys, all looks fine on a basic setup on x86_64 Ubuntu; however, I got the
following:
Running org.apache.tika.parser.journal.JournalParserTest
25 Oct 2015 10:45:53  WARN PhaseInterceptorChain - Interceptor for {
http://localhost:8080/grobid}WebClient has thrown exception, unwinding now
org.apache.cxf.interceptor.Fault: Could not send Message.
at
org.apache.cxf.interceptor.MessageSenderInterceptor$MessageSenderEndingInterceptor.handleMessage(MessageSenderInterceptor.java:64)
at
org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307)
at
org.apache.cxf.jaxrs.client.AbstractClient.doRunInterceptorChain(AbstractClient.java:623)
at
org.apache.cxf.jaxrs.client.WebClient.doChainedInvocation(WebClient.java:1084)
at org.apache.cxf.jaxrs.client.WebClient.doInvoke(WebClient.java:883)
at org.apache.cxf.jaxrs.client.WebClient.doInvoke(WebClient.java:854)
at org.apache.cxf.jaxrs.client.WebClient.invoke(WebClient.java:320)
at org.apache.cxf.jaxrs.client.WebClient.get(WebClient.java:346)
at
org.apache.tika.parser.journal.GrobidRESTParser.canRun(GrobidRESTParser.java:102)
at
org.apache.tika.parser.journal.JournalParserTest.testJournalParser(JournalParserTest.java:39)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
at
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
at
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271)
at
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70)
at
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
at
org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:283)
at
org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:173)
at
org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:153)
at
org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:128)
at
org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:203)
at
org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:155)
at
org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:103)
Caused by: java.net.ConnectException: ConnectException invoking
http://localhost:8080/grobid: Connection refused
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at
org.apache.cxf.transport.http.HTTPConduit$WrappedOutputStream.mapException(HTTPConduit.java:1359)
at
org.apache.cxf.transport.http.HTTPConduit$WrappedOutputStream.close(HTTPConduit.java:1343)
at
org.apache.cxf.transport.AbstractConduit.close(AbstractConduit.java:56)
at org.apache.cxf.transport.http.HTTPConduit.close(HTTPConduit.java:638)
at
org.apache.cxf.interceptor.MessageSenderInterceptor$MessageSenderEndingInterceptor.handleMessage(MessageSenderInterceptor.java:62)
... 33 more
Caused by: java.net.ConnectException: Connection refused
at java.net.PlainSocketImpl.socketConnect(Native Method)
at
java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
at
java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
at
java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:579)
at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
at sun.net.www.http.HttpClient.<init>(HttpClient.java:211)
at sun.net.www.http.HttpClient.New(HttpClient.java
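
The trace above shows `GrobidRESTParser.canRun` calling http://localhost:8080/grobid and getting "Connection refused", which is what one would expect when no GROBID server is running locally. A quieter way to probe for a local service is a plain socket connect with a short timeout, sketched below. The class `ServiceProbe` is hypothetical and is not what `GrobidRESTParser` does; the host and port are taken from the log.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical helper, for illustration only; not part of Tika.
final class ServiceProbe {
    /** True if a TCP connection to host:port succeeds within the timeout. */
    static boolean isListening(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Covers "Connection refused", timeouts and unresolvable hosts.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("grobid reachable: " + isListening("localhost", 8080, 500));
    }
}
```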

Re: [ANNOUNCE] Welcome Bob Paulin as Tika Committer + PMC Member

2015-09-17 Thread Oleg Tikhonov
Good intro. Welcome aboard.
Oleg
On 17 Sep 2015 03:05, "David Meikle"  wrote:

> Hello All,
>
> Please welcome Bob Paulin as he joins us as the latest Tika committer and
> PMC Member.
>
> Bob, please feel free to say a bit about yourself as an introduction to
> the group.
>
> Welcome aboard,
> Dave
>
>
>
>
>


Re: Remove support for building language identifier profiles?

2015-08-29 Thread Oleg Tikhonov
Hi Ken,
I would choose the last option you mentioned.

-- Oleg

On Sat, Aug 29, 2015 at 7:58 PM, Ken Krugler 
wrote:

> Hi all,
>
> As part of integrating language-detector into Tika (see TIKA-1723), I
> noticed TIKA-546 ("Add ability to create language profiles to tika-app")
>
> If we switch over to language-detector, then this code no longer makes
> sense.
>
> Also note that many language detectors require the full set of language
> data in order to generate the most relevant (discriminating) ngrams, thus
> the current support for passing in data for one language doesn't work.
>
> So any suggestions for what to do? Leave the code as is, with deprecated
> annotations, even though the profiles generated won't be useful?
>
> Or wait for pluggable detectors, and someone could port the current Tika
> code - then this profile building support might still make sense, though it
> would want to be moved into the specific plugin.
>
> -- Ken
>
>
>


Re: Apache Tika: In use at Goldman Sachs

2015-08-20 Thread Oleg Tikhonov
Wow !!! Amazing.
How does it perform?

BR,
Oleg

On Thu, Aug 20, 2015 at 9:48 AM, Mattmann, Chris A (3980) <
chris.a.mattm...@jpl.nasa.gov> wrote:

> Just saw this online:
>
> http://www.informationweek.com/software/enterprise-applications/goldman-sachs-puts-elasticsearch-to-work/d/d-id/1321778
>
>
> Apache Tika is a BIG part of this!
>
> Cheers,
> Chris
>
> ++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: chris.a.mattm...@nasa.gov
> WWW:  http://sunset.usc.edu/~mattmann/
> ++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++
>
>
>
>


Re: [VOTE] Apache Tika 1.10 Release Candidate #1

2015-08-04 Thread Oleg Tikhonov
Hi, thanks for doing that !!!
+1 for the release.
Ran on Kubuntu 15 x64. All basic tests passed.

BR,
Oleg

On Tue, Aug 4, 2015 at 6:17 AM, Mattmann, Chris A (3980) <
chris.a.mattm...@jpl.nasa.gov> wrote:

> +1 from me, great work Dave. SIGS and CHECKSUMS are sound:
>
> [chipotle:~/tmp/tika-1.10-rc1] mattmann% /bin/bash
> bash-3.2$ for type in app server; do
> > for version in 1.10 1.10-src; do
> > /Users/mattmann/bin/stage_apache_rc tika-$type $version
> >https://dist.apache.org/repos/dist/dev/tika/
> > done
> > done
>   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
>                                  Dload  Upload   Total   Spent    Left  Speed
> 100 45.0M  100 45.0M    0     0  1481k      0  0:00:31  0:00:31 --:--:-- 1937k
>   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
>                                  Dload  Upload   Total   Spent    Left  Speed
> 100   819  100   819    0     0   2057      0 --:--:-- --:--:-- --:--:--  2062
>   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
>                                  Dload  Upload   Total   Spent    Left  Speed
> 100    33  100    33    0     0     80      0 --:--:-- --:--:-- --:--:--    80
>   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
>                                  Dload  Upload   Total   Spent    Left  Speed
> 100 50.5M  100 50.5M    0     0  1586k      0  0:00:32  0:00:32 --:--:-- 2134k
>   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
>                                  Dload  Upload   Total   Spent    Left  Speed
> 100   819  100   819    0     0   1910      0 --:--:-- --:--:-- --:--:--  1913
>   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
>                                  Dload  Upload   Total   Spent    Left  Speed
> 100    33  100    33    0     0     78      0 --:--:-- --:--:-- --:--:--    78
> bash-3.2$ ls
> tika-app-1.10.jar tika-app-1.10.jar.asc tika-app-1.10.jar.md5
>tika-server-1.10.jar  tika-server-1.10.jar.asc
> tika-server-1.10.jar.md5
> bash-3.2$ $HOME/bin/stage_apache_rc tika 1.10-src
> https://dist.apache.org/repos/dist/dev/tika/
>   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
>                                  Dload  Upload   Total   Spent    Left  Speed
> 100 73.6M  100 73.6M    0     0  2044k      0  0:00:36  0:00:36 --:--:-- 2700k
>   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
>                                  Dload  Upload   Total   Spent    Left  Speed
> 100   819  100   819    0     0   1950      0 --:--:-- --:--:-- --:--:--  1950
>   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
>                                  Dload  Upload   Total   Spent    Left  Speed
> 100    33  100    33    0     0     77      0 --:--:-- --:--:-- --:--:--    78
> bash-3.2$ ls
> tika-1.10-src.zip tika-1.10-src.zip.md5 tika-app-1.10.jar.asc
>tika-server-1.10.jar  tika-server-1.10.jar.md5
> tika-1.10-src.zip.asc tika-app-1.10.jar tika-app-1.10.jar.md5
>tika-server-1.10.jar.asc
> bash-3.2$ exit
> exit
> [chipotle:~/tmp/tika-1.10-rc1] mattmann% $HOME/bin/verify_gpg_sigs
> Verifying Signature for file tika-1.10-src.zip.asc
> gpg: Signature made Sat Aug  1 23:34:31 2015 PDT using RSA key ID 0EB30B07
> gpg: Good signature from "David Meikle (CODE SIGNING KEY)
> "
> gpg: WARNING: This key is not certified with a trusted signature!
> gpg:  There is no indication that the signature belongs to the
> owner.
> Primary key fingerprint: F3F2 3C1E DB33 8077 254E  DEEC 5241 4B0B 0EB3 0B07
> Verifying Signature for file tika-app-1.10.jar.asc
> gpg: Signature made Sat Aug  1 23:24:15 2015 PDT using RSA key ID 0EB30B07
> gpg: Good signature from "David Meikle (CODE SIGNING KEY)
> "
> gpg: WARNING: This key is not certified with a trusted signature!
> gpg:  There is no indication that the signature belongs to the
> owner.
> Primary key fingerprint: F3F2 3C1E DB33 8077 254E  DEEC 5241 4B0B 0EB3 0B07
> Verifying Signature for file tika-server-1.10.jar.asc
> gpg: Signature made Sat Aug  1 23:30:05 2015 PDT using RSA key ID 0EB30B07
> gpg: Good signature from "David Meikle (CODE SIGNING KEY)
> "
> gpg: WARNING: This key is not certified with a trusted signature!
> gpg:  There is no indication that the signature belongs to the
> owner.
> Primary key fingerprint: F3F2 3C1E DB33 8077 254E  DEEC 5241 4B0B 0EB3 0B07
> [chipotle:~/tmp/tika-1.10-rc1] mattmann% $HOME/bin/verify_md5_checksums
> md5sum: stat '*.tar.gz': No such file or directory
> md5sum: stat '*.bz2': No such file or directory
> md5sum: stat '*.tgz': No such file or directory
> tika-1.10-src.zip: OK
> [chipotle:~/tmp/tika-1.10-rc1] mattmann%
>
>
>
>
>
> 

Re: release Tika 1.10?

2015-08-04 Thread Oleg Tikhonov
Thanks!
+1

BR,
Oleg

On Tue, Aug 4, 2015 at 5:37 AM, Mattmann, Chris A (3980) <
chris.a.mattm...@jpl.nasa.gov> wrote:

> +1
> ++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: chris.a.mattm...@nasa.gov
> WWW:  http://sunset.usc.edu/~mattmann/
> ++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++
>
>
>
>
>
> -Original Message-
> From: "Allison, Timothy B." 
> Reply-To: "dev@tika.apache.org" 
> Date: Tuesday, July 28, 2015 at 11:08 AM
> To: "dev@tika.apache.org" 
> Subject: RE: release Tika 1.10?
>
> >Just finished the run against ~2.8 million docs (4.8 million including
> >attachments) from a combination of govdocs1 and Common Crawl.  I compared
> >1.9 with trunk.
> >
> >Most looks good.
> >
> >Some highlights:
> >* Thanks to Andrew Jackson and TIKA-1678, we're now getting better
> >metadata out of ~1300 from 550k PDFs. This appears to be far more common
> >in Common Crawl PDFs than in govdocs1 PDFs.
> >* No significant changes found in the handful of msg files...I wanted to
> >check after the work on TIKA-1238.
> >* Thanks to Andreas Beeker and TIKA-1046/POI 54332, there are far fewer
> >PPT exceptions
> >* There are a very few more files in CommonCrawl that are now incorrectly
> >identified as RFC vs text (TIKA-1602), but this is a tiny handful (total
> >of 4 documents in both CC and govdocs1)
> >
> >A regret:
> >This run used the digesting parser for both container and embedded files.
> > This causes some truncated (=corrupt) package files to throw an
> >exception before they otherwise would.  The opposite happens, too (more
> >embedded files when using the digester), but this is extremely rare. This
> >means that for truncated gz, x-xz and x-archive files there are many more
> >with fewer attachments in Tika 1.10-SNAPSHOT than in Tika 1.9.
> >
> >With Konstantin's and Bob's fix of TIKA-1524, I think we're in good shape
> >for 1.10...from my perspective.
> >
> > Best,
> >
> >   Tim
> >-Original Message-
> >From: David Meikle [mailto:loo...@gmail.com]
> >Sent: Sunday, July 26, 2015 10:50 AM
> >To: dev@tika.apache.org
> >Subject: Re: release Tika 1.10?
> >
> >
> >> On 23 Jul 2015, at 14:07, Allison, Timothy B. 
> >>wrote:
> >>
> >>  With the fix of TIKA-1690, I think it makes sense to roll a new
> >>release (1.10) in the next week or so.  I'd like to get TIKA-1667
> >>(upgrade poi) in before the release.  Are there any other blockers on
> >>1.10?
> >
> >+1 from me too.  As discussed on private, I will roll the release on
> >Tuesday night (UK Time) to give people time to shout for other candidates.
> >
> >Cheers,
> >Dave
>
>


Re: Bayesian N-Gram Language Detection

2015-07-29 Thread Oleg Tikhonov
+1 !!!
My two cents.
Please also add the ability to change/retrain/tote language profiles.

Thanks !!!
BR,
Oleg

On Wed, Jul 29, 2015 at 3:59 AM, Mattmann, Chris A (3980) <
chris.a.mattm...@jpl.nasa.gov> wrote:

> Cool. Well with this one I found, along with language-detector,
> along with Ramirez and the work with Joe Campbell’s group at MIT-LL
> and the Julia stuff, I for one am going to take the step to make it
> pluggable.
>
> I’ll try and take this on over the next week. I’ll use a ServiceLoader
> approach similar to Translators, Detectors, Parsers, etc.
>
> Cheers,
> Chris
>
> ++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: chris.a.mattm...@nasa.gov
> WWW:  http://sunset.usc.edu/~mattmann/
> ++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++
>
>
>
>
>
> -Original Message-
> From: Ken Krugler 
> Reply-To: "dev@tika.apache.org" 
> Date: Tuesday, July 28, 2015 at 5:39 PM
> To: "dev@tika.apache.org" 
> Subject: RE: Bayesian N-Gram Language Detection
>
> >I think switching to language-detector is a reasonable first step (more
> >languages, faster, better accuracy), after which we can evaluate the need
> >to make it pluggable.
> >
> >There were some code & resource packaging issues with the original
> >project, but the fork I've been trying out seems much better.
> >
> >See https://github.com/optimaize/language-detector
> >
> >Still ALv2, and already in the Maven central repo.
> >
> >-- Ken
> >
> >> From: Mattmann, Chris A (3980)
> >> Sent: July 28, 2015 5:30:00pm PDT
> >> To: dev@tika.apache.org
> >> Subject: Bayesian N-Gram Language Detection
> >>
> >> FYI the code is ALv2:
> >>
> >> https://github.com/shuyo/language-detection/blob/wiki/ProjectHome.md
> >>
> >>
> >> I’m going to test this out and see how it compares with our own.
> >> Maybe we need to make the Language Detector pluggable too.
> >>
> >> ++
> >> Chris Mattmann, Ph.D.
> >> Chief Architect
> >> Instrument Software and Science Data Systems Section (398)
> >> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> >> Office: 168-519, Mailstop: 168-527
> >> Email: chris.a.mattm...@nasa.gov
> >> WWW:  http://sunset.usc.edu/~mattmann/
> >> ++
> >> Adjunct Associate Professor, Computer Science Department
> >> University of Southern California, Los Angeles, CA 90089 USA
> >> ++
> >>
> >>
> >
> >--
> >Ken Krugler
> >+1 530-210-6378
> >http://www.scaleunlimited.com
> >custom big data solutions & training
> >Hadoop, Cascading, Cassandra & Solr
> >
> >
> >
> >
> >
>
>
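
The ServiceLoader approach Chris mentions can be sketched roughly as follows. This is only an illustration: the `LanguageDetector` interface, its two methods, and the `FallbackDetector` class are hypothetical names invented for this sketch, not Tika's actual API. Only `java.util.ServiceLoader` is a real JDK class.

```java
import java.util.ServiceLoader;

public class PluggableDetectorSketch {

    /** Hypothetical plug-in contract; NOT Tika's actual interface. */
    public interface LanguageDetector {
        String name();
        String detect(String text);
    }

    /** Hypothetical fallback used when no provider is registered. */
    static final class FallbackDetector implements LanguageDetector {
        public String name() { return "fallback"; }
        public String detect(String text) { return "unknown"; }
    }

    /** Returns the first provider ServiceLoader finds, else the fallback. */
    public static LanguageDetector pick() {
        for (LanguageDetector d : ServiceLoader.load(LanguageDetector.class)) {
            return d;
        }
        return new FallbackDetector();
    }

    public static void main(String[] args) {
        LanguageDetector d = pick();
        System.out.println(d.name() + " -> " + d.detect("Guten Tag"));
    }
}
```

A third-party implementation would be picked up by listing its class name in a `META-INF/services/` file named after the interface's binary name; with nothing registered, `pick()` returns the fallback.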


Re: [VOTE] Release Apache Tika 1.9 Candidate #2

2015-06-09 Thread Oleg Tikhonov
Hi,
All basic tests passed.
java version "1.7.0_75"
Java(TM) SE Runtime Environment (build 1.7.0_75-b13)
Java HotSpot(TM) 64-Bit Server VM (build 24.75-b04, mixed mode)

Linux/Ubuntu x86_64
Superb !!!
[x] +1 Release this package as Apache Tika 1.9

Thanks,
Oleg

On Tue, Jun 9, 2015 at 2:12 PM, Sergey Beryozkin 
wrote:

> +1
>
> Cheers, Sergey
>
>
>
>>> On Mon, Jun 8, 2015 at 1:11 PM Allison, Timothy B. 
>>> wrote:
>>>
>>>  +1

 Built in Windows and Linux.  Works on problems (that I caused!) in rc1.

 Let's make sure to include "last Java 1.6" version in the release notes,
 if that's what we've decided.

 Thank you, Chris!

 Best,

 Tim


 -Original Message-
 From: Mattmann, Chris A (3980) [mailto:chris.a.mattm...@jpl.nasa.gov]
 Sent: Saturday, June 06, 2015 9:47 PM
 To: dev@tika.apache.org
 Cc: u...@tika.apache.org
 Subject: [VOTE] Release Apache Tika 1.9 Candidate #2

 Hi Folks,

 A second candidate for the Tika 1.9 release is available at:

https://dist.apache.org/repos/dist/dev/tika/

 The release candidate is a zip archive of the sources in:
http://svn.apache.org/repos/asf/tika/tags/1.9-rc2/

 The SHA1 checksum of the archive is
 9b78c9e9ce9640b402b7fef8e30f3cdbe384f44c.

 In addition, a staged maven repository is available here:
 https://repository.apache.org/content/repositories/orgapachetika-1011/


 Please vote on releasing this package as Apache Tika 1.9.
 The vote is open for the next 72 hours and passes if a majority of at
 least three +1 Tika PMC votes are cast.

 [ ] +1 Release this package as Apache Tika 1.9
 [ ] -1 Do not release this package because…

 Cheers,
 Chris

 P.S. Of course here is my +1.


 ++
 Chris Mattmann, Ph.D.
 Chief Architect
 Instrument Software and Science Data Systems Section (398)
 NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
 Office: 168-519, Mailstop: 168-527
 Email: chris.a.mattm...@nasa.gov
 WWW:  http://sunset.usc.edu/~mattmann/
 ++
 Adjunct Associate Professor, Computer Science Department
 University of Southern California, Los Angeles, CA 90089 USA
 ++




>>>
>>
>>
>>
>
> --
> Sergey Beryozkin
>
> Talend Community Coders
> http://coders.talend.com/
>
> Blog: http://sberyozkin.blogspot.com
>


Re: [VOTE] Apache Tika 1.8 Release Candidate #2

2015-04-15 Thread Oleg Tikhonov
Hi Tyler,

good job, indeed !!!

[x] +1 Release this package as Apache Tika 1.8

On Wed, Apr 15, 2015 at 8:22 AM, Mattmann, Chris A (3980) <
chris.a.mattm...@jpl.nasa.gov> wrote:

> Thanks Tyler! +1 from me:
>
> SIGS, checksums check out:
>
>
> [chipotle:~/tmp/apache-tika-1.8-rc2] mattmann% $HOME/bin/stage_apache_rc
> tika 1.8-src https://dist.apache.org/repos/dist/dev/tika/
>
>   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
>                                  Dload  Upload   Total   Spent    Left  Speed
> 100 69.2M  100 69.2M    0     0  1524k      0  0:00:46  0:00:46 --:--:-- 1661k
>
>   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
>                                  Dload  Upload   Total   Spent    Left  Speed
> 100   473  100   473    0     0    874      0 --:--:-- --:--:-- --:--:--   874
>
>   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
>                                  Dload  Upload   Total   Spent    Left  Speed
> 100    33  100    33    0     0     62      0 --:--:-- --:--:-- --:--:--    62
>
> [chipotle:~/tmp/apache-tika-1.8-rc2] mattmann% $HOME/bin/stage_apache_rc
> tika-app 1.8 https://dist.apache.org/repos/dist/dev/tika/
>
>   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
>                                  Dload  Upload   Total   Spent    Left  Speed
> 100 44.0M  100 44.0M    0     0  1742k      0  0:00:25  0:00:25 --:--:-- 1825k
>
>   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
>                                  Dload  Upload   Total   Spent    Left  Speed
> 100   473  100   473    0     0    922      0 --:--:-- --:--:-- --:--:--   922
>
>   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
>                                  Dload  Upload   Total   Spent    Left  Speed
> 100    33  100    33    0     0     63      0 --:--:-- --:--:-- --:--:--    63
>
> [chipotle:~/tmp/apache-tika-1.8-rc2] mattmann% $HOME/bin/stage_apache_rc
> tika-server 1.8 https://dist.apache.org/repos/dist/dev/tika/
>
>   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
>                                  Dload  Upload   Total   Spent    Left  Speed
> 100 48.3M  100 48.3M    0     0  1379k      0  0:00:35  0:00:35 --:--:-- 1569k
>
>   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
>                                  Dload  Upload   Total   Spent    Left  Speed
> 100   473  100   473    0     0    891      0 --:--:-- --:--:-- --:--:--   892
>
>   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
>                                  Dload  Upload   Total   Spent    Left  Speed
> 100    33  100    33    0     0     62      0 --:--:-- --:--:-- --:--:--    62
>
> [chipotle:~/tmp/apache-tika-1.8-rc2] mattmann%
>
>
> [chipotle:~/tmp/apache-tika-1.8-rc2] mattmann% $HOME/bin/verify_gpg_sigs
>
> Verifying Signature for file tika-1.8-src.zip.asc
>
> gpg: Signature made Mon Apr 13 13:46:39 2015 EDT using RSA key ID D4F10117
>
> gpg: Good signature from "Tyler Palsulich "
>
> gpg: WARNING: This key is not certified with a trusted signature!
>
> gpg:  There is no indication that the signature belongs to the
> owner.
>
> Primary key fingerprint: 1D32 9CC2 D69C 821B FBE4  183E 8810 BB19 D4F1 0117
>
> Verifying Signature for file tika-app-1.8.jar.asc
>
> gpg: Signature made Mon Apr 13 13:43:13 2015 EDT using RSA key ID D4F10117
>
> gpg: Good signature from "Tyler Palsulich "
>
> gpg: WARNING: This key is not certified with a trusted signature!
>
> gpg:  There is no indication that the signature belongs to the
> owner.
>
> Primary key fingerprint: 1D32 9CC2 D69C 821B FBE4  183E 8810 BB19 D4F1 0117
>
> Verifying Signature for file tika-server-1.8.jar.asc
>
> gpg: Signature made Mon Apr 13 13:45:00 2015 EDT using RSA key ID D4F10117
>
> gpg: Good signature from "Tyler Palsulich "
>
> gpg: WARNING: This key is not certified with a trusted signature!
>
> gpg:  There is no indication that the signature belongs to the
> owner.
>
> Primary key fingerprint: 1D32 9CC2 D69C 821B FBE4  183E 8810 BB19 D4F1 0117
>
> [chipotle:~/tmp/apache-tika-1.8-rc2] mattmann%
> $HOME/bin/verify_md5_checksums
>
> md5sum: stat '*.tar.gz': No such file or directory
>
> md5sum: stat '*.bz2': No such file or directory
>
> md5sum: stat '*.tgz': No such file or directory
>
> tika-1.8-src.zip: OK
>
> [chipotle:~/tmp/apache-tika-1.8-rc2] mattmann%
>
> Cheers!
>
> Chris
>
> 
> From: Tyler Palsulich [tpalsul...@apache.org]
> Sent: Monday, April 13, 2015 10:56 AM
> To: dev@tika.apache.org; u...@tika.apache.org
> Subject: [VOTE] Apache Tika 1.8 Release Candidate #2
>
> Hi Folks,
>
> A candidate for the Tika 1.8 release is available at:
>   https://dist.apache.org/repos/dist/dev/tika/
>
> T

Re: [VOTE] Release Apache Tika 1.8 Candidate #1

2015-04-07 Thread Oleg Tikhonov
Hi,
[x] +1 Release this package as Apache Tika 1.8.

Tested on: Ubuntu 14.10, x86_64. Java 1.7 (Oracle)
Don't we want to update the following dependencies:
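biz.aQute:bndlib .................................... 1.43.0 -> 2.0.0.20130123-133441
org.apache.felix:org.apache.felix.scr.annotations ... 1.6.0 -> 1.9.10
org.osgi:org.osgi.compendium ........................ 4.0.0 -> 5.0.0
org.osgi:org.osgi.core .............................. 4.0.0 -> 6.0.0
com.drewnoakes:metadata-extractor ................... 2.7.2 -> 2.8.0
com.google.guava:guava .............................. 10.0.1 -> 18.0
edu.ucar:grib ....................................... 4.5.5 -> 8.0.29
org.ow2.asm:asm-debug-all ........................... 4.1 -> 5.0.3
commons-io:commons-io ............................... 2.1 -> 2.4
javax.mail:mail ..................................... 1.4.4 -> 1.5.0-b01
org.apache.cxf:cxf-rt-frontend-jaxrs ................ 2.7.8 -> 3.0.4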

BR,
Oleg




On Wed, Apr 8, 2015 at 2:55 AM, Tyler Palsulich 
wrote:

> CC'ing user@tika for visibility.
>
> Tyler
>
> On Tue, Apr 7, 2015 at 4:54 PM, Tyler Palsulich 
> wrote:
>
> > Hi Folks,
> >
> > A candidate for the Tika 1.8 release is available at:
> >   https://dist.apache.org/repos/dist/dev/tika/
> >
> > The release candidate is a zip archive of the sources in:
> >   http://svn.apache.org/repos/asf/tika/tags/1.8-rc1/
> >
> > The SHA1 checksum of the archive is
> >   ddeb3b43ca1c1ef346658a7005434019507e096f.
> >
> > In addition, a staged maven repository is available here:
> >   https://repository.apache.org/content/repositories/orgapachetika-1008
> >
> > Please vote on releasing this package as Apache Tika 1.8.
> > The vote is open for the next 72 hours and passes if a majority of at
> > least three +1 Tika PMC votes are cast.
> >
> > [ ] +1 Release this package as Apache Tika 1.8
> > [ ] -1 Do not release this package because...
> >
> > Have a good night!
> > Tyler
> >
>


Re: FW: Any interest in running Apache Tika as part of CommonCrawl?

2015-04-03 Thread Oleg Tikhonov
Hi Tim,
Having looked at CC, a couple of ideas crossed my mind. I think it's cool.
+1.

BR,
Oleg
On 3 Apr 2015 17:29, "Allison, Timothy B."  wrote:

> All,
>
> What do you think?
>
>
> https://groups.google.com/forum/#!topic/common-crawl/Cv21VRQjGN0
>
>
> On Friday, April 3, 2015 at 8:23:11 AM UTC-4, talliso...@gmail.com talliso...@gmail.com> wrote:
> CommonCrawl currently has the WET format that extracts plain text from web
> pages.  My guess is that this is text stripping from text-y formats.  Let
> me know if I'm wrong!
>
> Would there be any interest in adding another format: WETT (WET-Tika) or
> supplementing the current WET by using Tika to extract contents from binary
> formats too: PDF, MSWord, etc.
>
> Julien Nioche kindly carved out 220 GB for us to experiment with on
> TIKA-1302 on a Rackspace VM.  But I'm wondering now if it would make more
> sense to have CommonCrawl run Tika as part of its regular process and make
> the output available in one of your standard formats.
>
> CommonCrawl consumers would get Tika output, and the Tika dev community
> (including its dependencies, PDFBox, POI, etc.) could get the stacktraces
> to help prioritize bug fixes.
>
> Cheers,
>
>   Tim
>


Re: [DISCUSS] Tika 1.8 or 1.7.1

2015-03-29 Thread Oleg Tikhonov
+1 for 1.8 release.
On 29 Mar 2015 02:04, "Konstantin Gribov"  wrote:

> Also, I think, we should resolve TIKA-1575 (upgrade to pdfbox 1.8.9) since
> pdfbox 1.8.8 hangs on some pdf forms.
>
> --
> Best regards,
> Konstantin Gribov
>
> Sat, 28 Mar 2015 at 23:22, Konstantin Gribov :
>
> > +1 to releasing 1.8.
> >
> > --
> > Best regards,
> > Konstantin Gribov
> >
> > Sat, 28 Mar 2015, 22:25, Tyler Palsulich :
> >
> > I'm also leaning toward 1.8. Especially given the newly identified
> >> regression in TIKA-1584.
> >>
> >> Tyler
> >> On Mar 28, 2015 11:47 AM, "Mattmann, Chris A (3980)" <
> >> chris.a.mattm...@jpl.nasa.gov> wrote:
> >>
> >> > Hi Tyler - I would VOTE for 1.8. Given the stuff associated
> >> > with releasing (updating the website; sending emails; waiting
> >> > periods, etc.) let’s ship all the updates we have too along
> >> > with the jhighlight fix.
> >> >
> >> > Cheers,
> >> > Chris
> >> >
> >> > ++
> >> > Chris Mattmann, Ph.D.
> >> > Chief Architect
> >> > Instrument Software and Science Data Systems Section (398)
> >> > NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> >> > Office: 168-519, Mailstop: 168-527
> >> > Email: chris.a.mattm...@nasa.gov
> >> > WWW:  http://sunset.usc.edu/~mattmann/
> >> > ++
> >> > Adjunct Associate Professor, Computer Science Department
> >> > University of Southern California, Los Angeles, CA 90089 USA
> >> > ++
> >> >
> >> >
> >> >
> >> >
> >> >
> >> >
> >> > -Original Message-
> >> > From: Tyler Palsulich 
> >> > Reply-To: "dev@tika.apache.org" 
> >> > Date: Saturday, March 28, 2015 at 8:01 AM
> >> > To: "dev@tika.apache.org" 
> >> > Subject: [DISCUSS] Tika 1.8 or 1.7.1
> >> >
> >> > >Hi Folks,
> >> > >
> >> > >Now that TIKA-1581 (JHighlight licensing issues) is resolved, we need
> >> to
> >> > >release a new version of Tika. I'll volunteer to be the release
> manager
> >> > >again.
> >> > >
> >> > >Should we release this as 1.8 or 1.7.1?
> >> > >
> >> > >Does anyone have any last minute issues they'd like to finish and see
> >> in
> >> > >Tika 1.X? I'd like to get the example working with CORS (TIKA-1585
> and
> >> > >TIKA-1586). Any others?
> >> > >
> >> > >Have a good weekend,
> >> > >Tyler
> >> >
> >> >
> >>
> >
>


Re: trunk test failure

2015-03-26 Thread Oleg Tikhonov
Hi Chris,
just to confirm:

[INFO]

[INFO] Reactor Summary:
[INFO]
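[INFO] Apache Tika parent ................................. SUCCESS [  9.268 s]
[INFO] Apache Tika core ................................... SUCCESS [ 25.823 s]
[INFO] Apache Tika parsers ................................ SUCCESS [02:41 min]
[INFO] Apache Tika XMP .................................... SUCCESS [  1.986 s]
[INFO] Apache Tika serialization .......................... SUCCESS [  1.604 s]
[INFO] Apache Tika batch .................................. SUCCESS [02:02 min]
[INFO] Apache Tika application ............................ SUCCESS [ 18.983 s]
[INFO] Apache Tika OSGi bundle ............................ SUCCESS [ 29.087 s]
[INFO] Apache Tika server ................................. SUCCESS [ 46.706 s]
[INFO] Apache Tika translate .............................. SUCCESS [  9.163 s]
[INFO] Apache Tika examples ............................... SUCCESS [  4.134 s]
[INFO] Apache Tika Java-7 Components ...................... SUCCESS [  1.236 s]
[INFO] Apache Tika ........................................ SUCCESS [  0.017 s]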
[INFO]

[INFO] BUILD SUCCESS
[INFO]

[INFO] Total time: 07:20 min
[INFO] Finished at: 2015-03-26T09:18:46+02:00
[INFO] Final Memory: 91M/848M
[INFO]



BR,
Oleg

On Thu, Mar 26, 2015 at 1:21 AM, Mattmann, Chris A (3980) <
chris.a.mattm...@jpl.nasa.gov> wrote:

> OK I am nuts - I was applying the patch from TIKA-1580, but didn’t
> update Felix in the bundle pom - done now, building again. Yay.
>
>
> ++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: chris.a.mattm...@nasa.gov
> WWW:  http://sunset.usc.edu/~mattmann/
> ++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++
>
>
>
>
>
>
> -Original Message-
> From: , Chris Mattmann 
> Reply-To: "dev@tika.apache.org" 
> Date: Wednesday, March 25, 2015 at 6:57 PM
> To: "dev@tika.apache.org" 
> Subject: trunk test failure
>
> >Hey, is anyone else seeing this failure in trunk?
> >
> >Running org.apache.tika.bundle.BundleIT
> >[main] INFO org.ops4j.pax.exam.spi.DefaultExamSystem - Pax Exam System
> >(Version: 4.4.0) created.
> >[main] INFO org.ops4j.pax.exam.junit.impl.ProbeRunner - creating PaxExam
> >runner for class org.apache.tika.bundle.BundleIT
> >[main] INFO org.ops4j.pax.exam.junit.impl.ProbeRunner - running test class
> >org.apache.tika.bundle.BundleIT
> >ERROR: Bundle org.apache.tika.bundle [17] Error starting file:/Users/mattmann/tmp/tika/tika-bundle/target/test-bundles/tika-bundle.jar (org.osgi.framework.BundleException: Unresolved constraint in bundle org.apache.tika.bundle [17]: Unable to resolve 17.0: missing requirement [17.0] osgi.wiring.package; (&(osgi.wiring.package=org.apache.commons.csv)(version>=1.0.0)(!(version>=2.0.0
> >org.osgi.framework.BundleException: Unresolved constraint in bundle org.apache.tika.bundle [17]: Unable to resolve 17.0: missing requirement [17.0] osgi.wiring.package; (&(osgi.wiring.package=org.apache.commons.csv)(version>=1.0.0)(!(version>=2.0.0)))
> >   at org.apache.felix.framework.Felix.resolveBundleRevision(Felix.java:4097)
> >   at org.apache.felix.framework.Felix.startBundle(Felix.java:2114)
> >   at org.apache.felix.framework.Felix.setActiveStartLevel(Felix.java:1368)
> >   at org.apache.felix.framework.FrameworkStartLevelImpl.run(FrameworkStartLevelImpl.java:308)
> >   at java.lang.Thread.run(Thread.java:745)
> >[main] ERROR org.ops4j.pax.exam.nat.internal.NativeTestContainer - Bundle
> >[org.apache.tika.bundle [17]] is not resolved
> >ERROR: Bundle org.apache.tika.bundle [17] Error starting file:/Users/mattmann/tmp/tika/tika-bundle/target/test-bundles/tika-bundle.jar (org.osgi.framework.BundleException: Unresolved constraint in bundle org.apache.tika.bundle [17]: Unable to resolve 17.0: missing requirement [17.0] osgi.wiring.package; (&(osgi.wiring.package=org.apache.commons.csv)(version>=1.0.0)(!(version>=2.0.0
> >org.osgi.framework.BundleException: Unresolved constraint in bundle org.apache.tika.bundle [17]: Unable to resolve 17.0: missing requirement [17.0] osgi.wiring.package; (&(osgi.wiring.package=org.apache.commons.csv)(version>=1.0.0)(!(version>=2.0.0)))
> >   at
> >org.apache.felix.framewo
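
The filter in the error above, `(version>=1.0.0)(!(version>=2.0.0))`, amounts to the half-open range [1.0.0, 2.0.0): the bundle needs some exporter of `org.apache.commons.csv` at 1.x. The following is a simplified sketch of that check in plain Java. It does not use the OSGi API; the class and method names are hypothetical, and version qualifiers beyond major.minor.micro are ignored.

```java
public class VersionRangeSketch {

    /** Parses up to three numeric segments; missing ones default to 0. */
    static int[] parse(String version) {
        String[] parts = version.split("\\.");
        int[] out = new int[3];
        for (int i = 0; i < 3 && i < parts.length; i++) {
            out[i] = Integer.parseInt(parts[i]);
        }
        return out;
    }

    /** Negative, zero, or positive, like Comparator.compare. */
    static int compare(String a, String b) {
        int[] x = parse(a);
        int[] y = parse(b);
        for (int i = 0; i < 3; i++) {
            if (x[i] != y[i]) {
                return Integer.compare(x[i], y[i]);
            }
        }
        return 0;
    }

    /** Mirrors (&(version>=1.0.0)(!(version>=2.0.0))). */
    static boolean satisfies(String exportedVersion) {
        return compare(exportedVersion, "1.0.0") >= 0
                && !(compare(exportedVersion, "2.0.0") >= 0);
    }

    public static void main(String[] args) {
        for (String v : new String[] {"0.9", "1.0.0", "1.1", "2.0.0"}) {
            System.out.println(v + " satisfies range: " + satisfies(v));
        }
    }
}
```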

Re: [jira] [Closed] (TIKA-993) Language Detection Fault

2015-03-03 Thread Oleg Tikhonov
The first one found; in this case that would be German. The expected result
is a topic for discussion. I would expect to get both detected languages.
However, that is beyond Tika's language detection.

Bottom line: leave it as is until Ken's implementation.
On 3 Mar 2015 09:09, "Tyler Palsulich"  wrote:

> Hi,
>
> What do you mean, the detection is faulty? What is the expected result in
> that case?
>
> Thanks,
> Tyler
> On Mar 3, 2015 1:10 AM, "Oleg Tikhonov"  wrote:
>
> > Hi,
> > Just for the record ...
> > It can happen if a file contains content written in at least two
> > different languages. For instance, the first half of the file is, say,
> > German and the second half is French. In such a case detection would be
> > faulty.
> > Br,
> > Oleg
> > On 3 Mar 2015 04:03, "Tyler Palsulich (JIRA)"  wrote:
> >
> > >
> > >  [
> > >
> >
> https://issues.apache.org/jira/browse/TIKA-993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
> > > ]
> > >
> > > Tyler Palsulich closed TIKA-993.
> > > 
> > > Resolution: Cannot Reproduce
> > >
> > > This issue is >2 years old and has no attachment for the text. So, I'm
> > > closing as Cannot Reproduce. If you still have the text, please reopen!
> > >
> > > > Language Detection Fault
> > > > 
> > > >
> > > > Key: TIKA-993
> > > > URL: https://issues.apache.org/jira/browse/TIKA-993
> > > > Project: Tika
> > > >  Issue Type: Bug
> > > >  Components: languageidentifier
> > > >Reporter: Iman Reihanian
> > > > Attachments: DetectorImpl.java
> > > >
> > > >
> > > > This text's language is English, but it is detected as Italian.
> > >
> > >
> > >
> > > --
> > > This message was sent by Atlassian JIRA
> > > (v6.3.4#6332)
> > >
> >
>
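
The mixed-language case discussed in this thread can be illustrated with a toy sketch: run detection per chunk instead of once over the whole text, and collect every language seen. The stopword-counting detector below is a deliberately naive stand-in invented for this example. It is not Tika's n-gram identifier, and all names in it are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

public class ChunkedLanguageSketch {

    // Toy stopword lists; a real detector would use n-gram profiles.
    private static final Map<String, Set<String>> STOPWORDS = new LinkedHashMap<>();
    static {
        STOPWORDS.put("de", new HashSet<>(Arrays.asList(
                "der", "die", "das", "und", "ist", "nicht", "ein", "ich")));
        STOPWORDS.put("fr", new HashSet<>(Arrays.asList(
                "le", "la", "les", "et", "est", "pas", "un", "je")));
    }

    /** Returns the language with the most stopword hits, or "unknown". */
    static String detect(String text) {
        String best = "unknown";
        int bestHits = 0;
        String[] words = text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+");
        for (Map.Entry<String, Set<String>> e : STOPWORDS.entrySet()) {
            int hits = 0;
            for (String w : words) {
                if (e.getValue().contains(w)) {
                    hits++;
                }
            }
            if (hits > bestHits) {
                bestHits = hits;
                best = e.getKey();
            }
        }
        return best;
    }

    /** Splits the text into roughly equal word chunks and detects each one. */
    static Set<String> detectPerChunk(String text, int chunks) {
        String[] words = text.trim().split("\\s+");
        int size = Math.max(1, (words.length + chunks - 1) / chunks);
        Set<String> found = new LinkedHashSet<>();
        for (int start = 0; start < words.length; start += size) {
            int end = Math.min(words.length, start + size);
            String lang = detect(String.join(" ", Arrays.copyOfRange(words, start, end)));
            if (!lang.equals("unknown")) {
                found.add(lang);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        String mixed = "Das ist ein Haus und der Hund ist nicht hier. "
                + "Le chat est petit et je ne suis pas un chien.";
        System.out.println("whole text: " + detect(mixed));
        System.out.println("per chunk:  " + detectPerChunk(mixed, 2));
    }
}
```

Whole-text detection reports a single winner, while the per-chunk pass can surface both languages, at the cost of less evidence per decision.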

