[VOTE] Release Apache Tika 2.4.0 Candidate #1

2022-04-28 Thread Tim Allison
A candidate for the Tika 2.4.0 release is available at:
https://dist.apache.org/repos/dist/dev/tika/2.4.0

The release candidate is a zip archive of the sources in:
https://github.com/apache/tika/tree/2.4.0-rc1/

The SHA-512 checksum of the archive is
aff68637527fa4fa1ec21678ef2771a1dcd5eb3944bc1b1171c59459274295b903e093dc63ade0b6532bf137834d32bcb9cdf0d6a32efca187b9d6b8ac64f690.
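
For anyone who wants to check a downloaded archive against the checksum above, here is a minimal Java sketch. It is not part of the Tika release tooling: the class name Sha512Check and its two command-line arguments are assumptions made for illustration, and it relies only on the JDK's MessageDigest.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper for illustration; not part of Tika or its release scripts.
final class Sha512Check {

    // Streams the input through SHA-512 and returns the lowercase hex digest.
    static String sha512Hex(InputStream in) throws IOException, NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance("SHA-512");
        byte[] buffer = new byte[8192];
        int read;
        while ((read = in.read(buffer)) != -1) {
            digest.update(buffer, 0, read);
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        // args[0]: path to the downloaded source archive (assumed to be local)
        // args[1]: the checksum quoted in the vote mail, without the trailing period
        Path archive = Paths.get(args[0]);
        try (InputStream in = Files.newInputStream(archive)) {
            String actual = sha512Hex(in);
            System.out.println(actual.equalsIgnoreCase(args[1])
                    ? "checksum OK"
                    : "checksum MISMATCH: " + actual);
        }
    }
}
```

The digest is computed by streaming so that the whole archive never has to fit in memory.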

In addition, a staged maven repository is available here:
https://repository.apache.org/content/repositories/orgapachetika-1085/org/apache/tika

Please vote on releasing this package as Apache Tika 2.4.0.
The vote is open for the next 72 hours and passes if a majority of at
least three +1 Tika PMC votes are cast.

[ ] +1 Release this package as Apache Tika 2.4.0
[ ] -1 Do not release this package because...

Here's my +1

Best,

  Tim


[jira] [Commented] (TIKA-3742) Advice around DGN7 parser and whether to add to TIKA

2022-04-28 Thread Dan Coldrick (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17529667#comment-17529667
 ] 

Dan Coldrick commented on TIKA-3742:


[~nick] I've made a start today which I can share at some point tomorrow (been 
to the pub tonight lol, so it will have to wait till tomorrow). Are you ok if I 
lean on you 2 for help? I'd rather write something myself which you can rip 
apart so I can learn something. I've learnt a lot in the last week or so 
already :)

I also think there is some metadata in there somewhere which we should be able 
to pull out :)

> Advice around DGN7 parser and whether to add to TIKA
> 
>
> Key: TIKA-3742
> URL: https://issues.apache.org/jira/browse/TIKA-3742
> Project: Tika
> Issue Type: Task
> Components: parser
> Reporter: Dan Coldrick
> Priority: Minor
> Attachments: DGN.zip, ExampleOutput.txt
>
>
> Hi [~tallison] & whoever else.
> I managed to compile the C/C++ library [http://dgnlib.maptools.org/] for
> DGN7, which produces a dgndump.exe that will dump all the data from the DGN.
> From my initial testing it looks pretty good.
> Would you guys think it was worth adding this, or should I just keep it as a
> custom parser rather than in the main source code? It's under the MIT license.
> I've attached the exe (zipped), a copy of the output from the dump, and my very
> dirty test code calling the exe (I was only interested in the strings, so I am
> only pulling those into a string array at the moment to check it's pulling out
> the correct data).



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


Re: Next releases WAS: Re: 2.4.0 release?

2022-04-28 Thread Tim Allison
https://repository.apache.org is having a bad day.  Requests are
timing out left and right.  I'll try to perform the release of
2.4.0-rc1 later today or tomorrow when the repo is happier.

On Thu, Apr 28, 2022 at 9:47 AM Tim Allison  wrote:
>
> I've upgraded junrar in both branches, and the regression results look good.
>
> I'll start 1.28.2-rc2 shortly, and then follow up with 2.4.0-rc1 if
> there aren't any objections.
>
> On Tue, Apr 26, 2022 at 9:10 AM Tim Allison  wrote:
> >
> > All,
> >
> > I'm prepping rc1 for 1.28.2 now.
> >
> > I'm running the regression tests for 2.4.0, and I hope to have results
> > today with possibly an rc later today or early tomorrow if there are
> > no surprises.
> >
> > Please let me know if there are any blockers.
> >
> > Best,
> >
> > Tim
> >
> > On Thu, Apr 7, 2022 at 9:50 AM Tim Allison  wrote:
> > >
> > > All,
> > >   Once the new PDFBox is out, we should probably kick off the 2.4.0
> > > release.  If I'm release manager, given my schedule, that'll probably
> > > be the week of April 18th.
> > >   I want to fix TIKA-3711 (embedded file names), but other than that,
> > > I don't think there are any blockers.
> > >
> > >   WDYT?
> > >
> > >  Best,
> > >
> > >  Tim
> > >
> > > -- Forwarded message -
> > > From: Andreas Lehmkuehler 
> > > Date: Thu, Apr 7, 2022 at 1:41 AM
> > > Subject: 2.0.26 release
> > > To: 
> > >
> > >
> > > Hi,
> > >
> > > > sorry for the delay.  I'm planning to cut the 2.0.26 release next
> > > > Saturday, the day after tomorrow, if nobody objects.
> > >
> > > Andreas
> > >
> > > > P.S.: I'm targeting a new 3.0.0 alpha release once the 2.0.26 release is out
> > >
> > > -
> > > To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
> > > For additional commands, e-mail: dev-h...@pdfbox.apache.org


[jira] [Commented] (TIKA-3743) github actions -- we should install

2022-04-28 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17529551#comment-17529551
 ] 

Hudson commented on TIKA-3743:
--

SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk8 #533 (See 
[https://ci-builds.apache.org/job/Tika/job/tika-main-jdk8/533/])
TIKA-3743 -- install (tallison: 
[https://github.com/apache/tika/commit/7d3911eceb87162947bd77a56250cc5532e38fb8])
* (edit) .github/workflows/main-jdk11-build.yml
* (edit) .github/workflows/branch_1x-jdk11-build.yml
* (edit) .github/workflows/branch_1x-jdk8-build.yml
* (edit) .github/workflows/main-jdk17-build.yml
* (edit) .github/workflows/main-jdk8-build.yml


> github actions -- we should install
> ---
>
> Key: TIKA-3743
> URL: https://issues.apache.org/jira/browse/TIKA-3743
> Project: Tika
> Issue Type: Improvement
> Reporter: Tim Allison
> Priority: Trivial
> Attachments: Screenshot from 2022-04-28 11-39-16.png
>
>
> We're calling {{mvn clean javadoc:aggregate test}}.  This requires github to 
> pull dependencies from the snapshot repo.  We should add {{install}} so that 
> the builds use the dependencies that were just built.





Re: [VOTE] Release Apache Tika 1.28.2 Candidate #2

2022-04-28 Thread Tilman Hausherr

+1

Tilman

Am 28.04.2022 um 16:54 schrieb Tim Allison:

A candidate for the Tika 1.28.2 release is available at:
   https://dist.apache.org/repos/dist/dev/tika/1.28.2

The release candidate is a zip archive of the sources in:
   https://github.com/apache/tika/tree/1.28.2-rc2/

The SHA-512 checksum of the archive is
   
035f3643a302e2a88f99ca549c4d5c5c6eecd7736d03e4a686b17028f519f6a7a40229e48f2aac0bdf1653391e0bd7d34d0c7d099a2e5a2cb6141df00a4181bf.

In addition, a staged maven repository is available here:
   
https://repository.apache.org/content/repositories/orgapachetika-1083/org/apache/tika

Please vote on releasing this package as Apache Tika 1.28.2.
The vote is open for the next 72 hours and passes if a majority of at
least three +1 Tika PMC votes are cast.

[ ] +1 Release this package as Apache Tika 1.28.2
[ ] -1 Do not release this package because...


Here's my +1.

Best,

Tim





[jira] [Commented] (TIKA-3743) github actions -- we should install

2022-04-28 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17529503#comment-17529503
 ] 

Tim Allison commented on TIKA-3743:
---

Hahahahaha.  That didn't work.

{noformat}
[INFO] 
Error:  Failed to execute goal on project tika-parsers: Could not resolve 
dependencies for project org.apache.tika:tika-parsers:pom:2.4.1-SNAPSHOT: Could 
not find artifact org.apache.tika:tika-core:jar:tests:2.4.1-SNAPSHOT in 
apache.snapshots (https://repository.apache.org/snapshots) -> [Help 1]
{noformat}

> github actions -- we should install
> ---
>
> Key: TIKA-3743
> URL: https://issues.apache.org/jira/browse/TIKA-3743
> Project: Tika
> Issue Type: Improvement
> Reporter: Tim Allison
> Priority: Trivial
> Attachments: Screenshot from 2022-04-28 11-39-16.png
>
>
> We're calling {{mvn clean javadoc:aggregate test}}.  This requires github to 
> pull dependencies from the snapshot repo.  We should add {{install}} so that 
> the builds use the dependencies that were just built.





[jira] [Updated] (TIKA-3743) github actions -- we should install

2022-04-28 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-3743:
--
Attachment: Screenshot from 2022-04-28 11-39-16.png

> github actions -- we should install
> ---
>
> Key: TIKA-3743
> URL: https://issues.apache.org/jira/browse/TIKA-3743
> Project: Tika
> Issue Type: Improvement
> Reporter: Tim Allison
> Priority: Trivial
> Attachments: Screenshot from 2022-04-28 11-39-16.png
>
>
> We're calling {{mvn clean javadoc:aggregate test}}.  This requires github to 
> pull dependencies from the snapshot repo.  We should add {{install}} so that 
> the builds use the dependencies that were just built.





Re: How to deal with the recursive content in Tika 2

2022-04-28 Thread Sergey Beryozkin
Great, will give it a try asap

Cheers, Sergey

On Thu, Apr 28, 2022 at 4:22 PM Tim Allison  wrote:

> Give this a try:
>
> https://github.com/apache/tika/blob/main/tika-core/src/main/java/org/apache/tika/metadata/TikaCoreProperties.java#L60
>
> On Thu, Apr 28, 2022 at 11:12 AM Sergey Beryozkin wrote:
> >
> > Hi Tim, All
> >
> > We have a pending issue in Quarkus Tika to upgrade to Tika 2.
> > One of the problems is that according to a user's comment the recursive
> > content is treated somehow differently in Tika2, specifically, this code:
> >
> >
> > https://github.com/quarkiverse/quarkus-tika/blob/main/runtime/src/main/java/io/quarkus/tika/TikaParser.java#L95
> >
> > attempts to get a collection of the parsed outer and embedded documents by
> > accessing them as
> >
> > metadata.get(AbstractRecursiveParserWrapperHandler.TIKA_CONTENT);
> >
> > What is the equivalent way to achieve the same with Tika 2 ?
> >
> > Thanks, Sergey
>


[jira] [Created] (TIKA-3743) github actions -- we should install

2022-04-28 Thread Tim Allison (Jira)
Tim Allison created TIKA-3743:
-

Summary: github actions -- we should install
Key: TIKA-3743
URL: https://issues.apache.org/jira/browse/TIKA-3743
Project: Tika
Issue Type: Improvement
Reporter: Tim Allison


We're calling {{mvn clean javadoc:aggregate test}}.  This requires github to 
pull dependencies from the snapshot repo.  We should add {{install}} so that 
the builds use the dependencies that were just built.





Re: How to deal with the recursive content in Tika 2

2022-04-28 Thread Tim Allison
Give this a try:
https://github.com/apache/tika/blob/main/tika-core/src/main/java/org/apache/tika/metadata/TikaCoreProperties.java#L60

On Thu, Apr 28, 2022 at 11:12 AM Sergey Beryozkin  wrote:
>
> Hi Tim, All
>
> We have a pending issue in Quarkus Tika to upgrade to Tika 2.
> One of the problems is that according to a user's comment the recursive
> content is treated somehow differently in Tika2, specifically, this code:
>
> https://github.com/quarkiverse/quarkus-tika/blob/main/runtime/src/main/java/io/quarkus/tika/TikaParser.java#L95
>
> attempts to get a collection of the parsed outer and embedded documents by
> accessing them as
>
> metadata.get(AbstractRecursiveParserWrapperHandler.TIKA_CONTENT);
>
> What is the equivalent way to achieve the same with Tika 2 ?
>
> Thanks, Sergey


How to deal with the recursive content in Tika 2

2022-04-28 Thread Sergey Beryozkin
Hi Tim, All

We have a pending issue in Quarkus Tika to upgrade to Tika 2.
One of the problems is that, according to a user's comment, recursive
content is treated somewhat differently in Tika 2. Specifically, this code:

https://github.com/quarkiverse/quarkus-tika/blob/main/runtime/src/main/java/io/quarkus/tika/TikaParser.java#L95

attempts to get a collection of the parsed outer and embedded documents by
accessing them as

metadata.get(AbstractRecursiveParserWrapperHandler.TIKA_CONTENT);

What is the equivalent way to achieve the same with Tika 2?

Thanks, Sergey


[jira] [Commented] (TIKA-3740) Update junrar > 7.5.0 when available

2022-04-28 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17529482#comment-17529482
 ] 

Hudson commented on TIKA-3740:
--

SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk8 #531 (See 
[https://ci-builds.apache.org/job/Tika/job/tika-main-jdk8/531/])
TIKA-3740 -- upgrade junrar (tallison: 
[https://github.com/apache/tika/commit/403b7aef24c2cfaa77e7069fc341a91b1d948c49])
* (edit) tika-parent/pom.xml


> Update junrar > 7.5.0 when available
> 
>
> Key: TIKA-3740
> URL: https://issues.apache.org/jira/browse/TIKA-3740
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Minor
> Fix For: 1.28.2, 2.4.0
>
>
> Many thanks to [~tilman] for identifying this regression as we were prepping 
> for our 1.28.2 release.
> I've opened: https://github.com/junrar/junrar/issues/86





[VOTE] Release Apache Tika 1.28.2 Candidate #2

2022-04-28 Thread Tim Allison
A candidate for the Tika 1.28.2 release is available at:
  https://dist.apache.org/repos/dist/dev/tika/1.28.2

The release candidate is a zip archive of the sources in:
  https://github.com/apache/tika/tree/1.28.2-rc2/

The SHA-512 checksum of the archive is
  
035f3643a302e2a88f99ca549c4d5c5c6eecd7736d03e4a686b17028f519f6a7a40229e48f2aac0bdf1653391e0bd7d34d0c7d099a2e5a2cb6141df00a4181bf.

In addition, a staged maven repository is available here:
  
https://repository.apache.org/content/repositories/orgapachetika-1083/org/apache/tika

Please vote on releasing this package as Apache Tika 1.28.2.
The vote is open for the next 72 hours and passes if a majority of at
least three +1 Tika PMC votes are cast.

[ ] +1 Release this package as Apache Tika 1.28.2
[ ] -1 Do not release this package because...


Here's my +1.

Best,

   Tim


[jira] [Commented] (TIKA-3740) Update junrar > 7.5.0 when available

2022-04-28 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17529468#comment-17529468
 ] 

Hudson commented on TIKA-3740:
--

UNSTABLE: Integrated in Jenkins build Tika » tika-branch1x-jdk8 #193 (See 
[https://ci-builds.apache.org/job/Tika/job/tika-branch1x-jdk8/193/])
TIKA-3740 -- upgrade junrar (tallison: 
[https://github.com/apache/tika/commit/c322ec6cdee98c34d050ef6d20db43e9eec80b75])
* (edit) tika-parsers/pom.xml


> Update junrar > 7.5.0 when available
> 
>
> Key: TIKA-3740
> URL: https://issues.apache.org/jira/browse/TIKA-3740
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Minor
> Fix For: 1.28.2, 2.4.0
>
>
> Many thanks to [~tilman] for identifying this regression as we were prepping 
> for our 1.28.2 release.
> I've opened: https://github.com/junrar/junrar/issues/86





[jira] [Commented] (TIKA-3571) Add an interface for rendering engines

2022-04-28 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17529465#comment-17529465
 ] 

Tim Allison commented on TIKA-3571:
---

The other thing we need to account for is multiple renderings per page.  I'd 
rather not add this complexity from the beginning, but the API should be able 
to handle this.

> Add an interface for rendering engines
> --
>
> Key: TIKA-3571
> URL: https://issues.apache.org/jira/browse/TIKA-3571
> Project: Tika
> Issue Type: Wish
> Reporter: Tim Allison
> Priority: Major
>
> We've now seen a few requests for extracting text _and_ rendering PDFs, and 
> certainly it might be useful to have alternatives for rendering files (e.g. 
> this [Alfresco 
> study|https://hub.alfresco.com/t5/alfresco-content-services-blog/pdf-rendering-engine-performance-and-fidelity-comparison/ba-p/287618]),
>  including MSOffice or at least PPTx...
> And there are cases where users don't want the rendered images, but they do 
> want OCR to be run against the rendered images.
> I doubt I'll have a chance to work on this for a while, but I wanted to open 
> an issue for discussion.





[jira] [Commented] (TIKA-3742) Advice around DGN7 parser and whether to add to TIKA

2022-04-28 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17529459#comment-17529459
 ] 

Tim Allison commented on TIKA-3742:
---

[~nick] your gist looks great!  [~monkmachine], I'm passing the baton to you on 
this one.  In general, please use readFully and skipFully and ensure that the 
parse stops if the file is truncated -- check every read for EOF.
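
As a rough illustration of that advice, the sketch below shows strict read and skip helpers that stop with an EOFException as soon as the input turns out to be truncated. StrictReads and its methods are hypothetical names for this example; they are neither Tika's API nor the DGN parser itself, and in real code the equivalent library calls would normally be used instead.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helpers for illustration only.
final class StrictReads {

    // Fills the whole buffer or throws EOFException if the stream ends first.
    static void readFully(InputStream in, byte[] buffer) throws IOException {
        int offset = 0;
        while (offset < buffer.length) {
            int read = in.read(buffer, offset, buffer.length - offset);
            if (read == -1) {
                throw new EOFException(
                        "truncated: wanted " + buffer.length + " bytes, got " + offset);
            }
            offset += read;
        }
    }

    // Skips exactly n bytes or throws EOFException if the stream ends first.
    static void skipFully(InputStream in, long n) throws IOException {
        long remaining = n;
        while (remaining > 0) {
            long skipped = in.skip(remaining);
            if (skipped > 0) {
                remaining -= skipped;
            } else if (in.read() == -1) {
                // skip() may return 0 without being at EOF, so probe with read()
                throw new EOFException("truncated: " + remaining + " bytes left to skip");
            } else {
                remaining--; // the probing read() consumed one byte
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // A made-up 4-byte "file" whose second record is cut short.
        InputStream in = new ByteArrayInputStream(new byte[] {1, 2, 3, 4});
        byte[] header = new byte[2];
        readFully(in, header);
        skipFully(in, 1);
        try {
            readFully(in, new byte[2]); // only one byte is left
        } catch (EOFException e) {
            System.out.println("stopping parse: " + e.getMessage());
        }
    }
}
```

Failing fast like this keeps a parser from silently processing half-filled buffers when a file has been cut off.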

> Advice around DGN7 parser and whether to add to TIKA
> 
>
> Key: TIKA-3742
> URL: https://issues.apache.org/jira/browse/TIKA-3742
> Project: Tika
> Issue Type: Task
> Components: parser
> Reporter: Dan Coldrick
> Priority: Minor
> Attachments: DGN.zip, ExampleOutput.txt
>
>
> Hi [~tallison] & whoever else.
> I managed to compile the C/C++ library [http://dgnlib.maptools.org/] for
> DGN7, which produces a dgndump.exe that will dump all the data from the DGN.
> From my initial testing it looks pretty good.
> Would you guys think it was worth adding this, or should I just keep it as a
> custom parser rather than in the main source code? It's under the MIT license.
> I've attached the exe (zipped), a copy of the output from the dump, and my very
> dirty test code calling the exe (I was only interested in the strings, so I am
> only pulling those into a string array at the moment to check it's pulling out
> the correct data).





[jira] [Resolved] (TIKA-3740) Update junrar > 7.5.0 when available

2022-04-28 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-3740.
---
Fix Version/s: 1.28.2, 2.4.0
Resolution: Fixed

Many thanks to [~gotson] and colleagues on junrar for a blazingly fast fix and 
release!

> Update junrar > 7.5.0 when available
> 
>
> Key: TIKA-3740
> URL: https://issues.apache.org/jira/browse/TIKA-3740
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Minor
> Fix For: 1.28.2, 2.4.0
>
>
> Many thanks to [~tilman] for identifying this regression as we were prepping 
> for our 1.28.2 release.
> I've opened: https://github.com/junrar/junrar/issues/86





Re: Next releases WAS: Re: 2.4.0 release?

2022-04-28 Thread Tim Allison
I've upgraded junrar in both branches, and the regression results look good.

I'll start 1.28.2-rc2 shortly, and then follow up with 2.4.0-rc1 if
there aren't any objections.

On Tue, Apr 26, 2022 at 9:10 AM Tim Allison  wrote:
>
> All,
>
> I'm prepping rc1 for 1.28.2 now.
>
> I'm running the regression tests for 2.4.0, and I hope to have results
> today with possibly an rc later today or early tomorrow if there are
> no surprises.
>
> Please let me know if there are any blockers.
>
> Best,
>
> Tim
>
> On Thu, Apr 7, 2022 at 9:50 AM Tim Allison  wrote:
> >
> > All,
> >   Once the new PDFBox is out, we should probably kick off the 2.4.0
> > release.  If I'm release manager, given my schedule, that'll probably
> > be the week of April 18th.
> >   I want to fix TIKA-3711 (embedded file names), but other than that,
> > I don't think there are any blockers.
> >
> >   WDYT?
> >
> >  Best,
> >
> >  Tim
> >
> > -- Forwarded message -
> > From: Andreas Lehmkuehler 
> > Date: Thu, Apr 7, 2022 at 1:41 AM
> > Subject: 2.0.26 release
> > To: 
> >
> >
> > Hi,
> >
> > sorry for the delay.  I'm planning to cut the 2.0.26 release next
> > Saturday, the day after tomorrow, if nobody objects.
> >
> > Andreas
> >
> > P.S.: I'm targeting a new 3.0.0 alpha release once the 2.0.26 release is out
> >
> > -
> > To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
> > For additional commands, e-mail: dev-h...@pdfbox.apache.org


[jira] [Comment Edited] (TIKA-3740) Update junrar > 7.5.0 when available

2022-04-28 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17529444#comment-17529444
 ] 

Tim Allison edited comment on TIKA-3740 at 4/28/22 1:43 PM:


Regression results on 1.x branch on full set of rar files look good.  These 
compare 1.28.1 with 1.28.2-SNAPSHOT using junrar 7.5.1.

https://corpora.tika.apache.org/base/reports/tika-1.28.2-rar-reports.tgz


was (Author: talli...@mitre.org):
Regression results on 1.x branch on full set of rar files looks good.  These 
compare 1.28.1 with 1.28.2-SNAPSHOT with 7.5.1.

https://corpora.tika.apache.org/base/reports/tika-1.28.2-rar-reports.tgz

> Update junrar > 7.5.0 when available
> 
>
> Key: TIKA-3740
> URL: https://issues.apache.org/jira/browse/TIKA-3740
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Minor
>
> Many thanks to [~tilman] for identifying this regression as we were prepping 
> for our 1.28.2 release.
> I've opened: https://github.com/junrar/junrar/issues/86





[jira] [Commented] (TIKA-3740) Update junrar > 7.5.0 when available

2022-04-28 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17529444#comment-17529444
 ] 

Tim Allison commented on TIKA-3740:
---

Regression results on 1.x branch on full set of rar files look good.  These 
compare 1.28.1 with 1.28.2-SNAPSHOT using junrar 7.5.1.

https://corpora.tika.apache.org/base/reports/tika-1.28.2-rar-reports.tgz

> Update junrar > 7.5.0 when available
> 
>
> Key: TIKA-3740
> URL: https://issues.apache.org/jira/browse/TIKA-3740
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Minor
>
> Many thanks to [~tilman] for identifying this regression as we were prepping 
> for our 1.28.2 release.
> I've opened: https://github.com/junrar/junrar/issues/86





[jira] [Commented] (TIKA-3742) Advice around DGN7 parser and whether to add to TIKA

2022-04-28 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17529431#comment-17529431
 ] 

Tim Allison commented on TIKA-3742:
---

IOUtils.readFully()?

> Advice around DGN7 parser and whether to add to TIKA
> 
>
> Key: TIKA-3742
> URL: https://issues.apache.org/jira/browse/TIKA-3742
> Project: Tika
> Issue Type: Task
> Components: parser
> Reporter: Dan Coldrick
> Priority: Minor
> Attachments: DGN.zip, ExampleOutput.txt
>
>
> Hi [~tallison] & whoever else.
> I managed to compile the C/C++ library [http://dgnlib.maptools.org/] for
> DGN7, which produces a dgndump.exe that will dump all the data from the DGN.
> From my initial testing it looks pretty good.
> Would you guys think it was worth adding this, or should I just keep it as a
> custom parser rather than in the main source code? It's under the MIT license.
> I've attached the exe (zipped), a copy of the output from the dump, and my very
> dirty test code calling the exe (I was only interested in the strings, so I am
> only pulling those into a string array at the moment to check it's pulling out
> the correct data).





[jira] [Commented] (TIKA-3742) Advice around DGN7 parser and whether to add to TIKA

2022-04-28 Thread Nick Burch (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17529417#comment-17529417
 ] 

Nick Burch commented on TIKA-3742:
--

I believe {{readNBytes}} only came in with Java 9, and the particular 
{{readNBytes(int)}} in Java 11, so you'll need to use a newer JVM. We should be 
able to replace it with Commons IO calls once we're happy with the general 
logic + approach.
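
Until then, a Java 8 compatible stand-in could look roughly like the sketch below. Java8Reads and its method are hypothetical names for this example and are not from the gist or from Commons IO. Like the JDK method, it returns a shorter array when the stream ends early, so the caller still has to compare the returned length with the requested one.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical Java 8 compatible stand-in for InputStream.readNBytes(int).
final class Java8Reads {

    // Reads up to len bytes; the result is shorter than len only if EOF was hit.
    static byte[] readNBytes(InputStream in, int len) throws IOException {
        if (len < 0) {
            throw new IllegalArgumentException("len < 0");
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // Cap the working buffer so a bogus length in a corrupt file
        // does not trigger a huge up-front allocation.
        byte[] buffer = new byte[Math.min(len, 8192)];
        int remaining = len;
        while (remaining > 0) {
            int read = in.read(buffer, 0, Math.min(buffer.length, remaining));
            if (read == -1) {
                break; // EOF before len bytes were available
            }
            out.write(buffer, 0, read);
            remaining -= read;
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        InputStream in = new ByteArrayInputStream(new byte[] {10, 20, 30});
        byte[] str = readNBytes(in, 4); // asks for more than is available
        if (str.length < 4) {
            System.out.println("short read: got " + str.length + " of 4 bytes");
        }
    }
}
```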

> Advice around DGN7 parser and whether to add to TIKA
> 
>
> Key: TIKA-3742
> URL: https://issues.apache.org/jira/browse/TIKA-3742
> Project: Tika
> Issue Type: Task
> Components: parser
> Reporter: Dan Coldrick
> Priority: Minor
> Attachments: DGN.zip, ExampleOutput.txt
>
>
> Hi [~tallison] & whoever else.
> I managed to compile the C/C++ library [http://dgnlib.maptools.org/] for
> DGN7, which produces a dgndump.exe that will dump all the data from the DGN.
> From my initial testing it looks pretty good.
> Would you guys think it was worth adding this, or should I just keep it as a
> custom parser rather than in the main source code? It's under the MIT license.
> I've attached the exe (zipped), a copy of the output from the dump, and my very
> dirty test code calling the exe (I was only interested in the strings, so I am
> only pulling those into a string array at the moment to check it's pulling out
> the correct data).





[jira] [Commented] (TIKA-3742) Advice around DGN7 parser and whether to add to TIKA

2022-04-28 Thread Dan Coldrick (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17529409#comment-17529409
 ] 

Dan Coldrick commented on TIKA-3742:


[~nick] I can have a go, although I can't get the following line to compile in 
Eclipse:

byte[] str = is.readNBytes(len);

 

> Advice around DGN7 parser and whether to add to TIKA
> 
>
> Key: TIKA-3742
> URL: https://issues.apache.org/jira/browse/TIKA-3742
> Project: Tika
>  Issue Type: Task
>  Components: parser
>Reporter: Dan Coldrick
>Priority: Minor
> Attachments: DGN.zip, ExampleOutput.txt
>
>
> Hi [~tallison] & whoever else.
> I managed to compile the C/C++ library [http://dgnlib.maptools.org/] for
> DGN7, which produces a dgndump.exe that will dump all the data from the DGN.
> From my initial testing it looks pretty good.
> Would you guys think it was worth adding this, or should I just keep it as a
> custom parser rather than in the main source code? It's under the MIT license.
> I've attached the exe (zipped), a copy of the output from the dump, and my very
> dirty test code calling the exe (I was only interested in the strings, so I am
> only pulling those into a string array at the moment to check it's pulling out
> the correct data).





Re: 1.28.2 regression results

2022-04-28 Thread Tim Allison
Tilman,
  Thank you for looking carefully at the reports!

> commoncrawl3/OR/ORTIXLZEFH4QC5RJTV3L5XBNOVW42KPH
1Sonig is what we're getting in 2.3.0 and in the
2.4.0-soon-to-be-candidate, and it looks correct based on the
underlying xml and when I open it in LibreOffice.  It looks like it
was incorrectly put in a different cell or at least incorrectly
separated by a tab in 1.28.1.

>"file not fully read from stream"
This is a new exception in branch_1x because we made the ICNS parser
more strict than it was
(https://github.com/apache/tika/commit/ab709a5299be867c0e603116491faaa6546ed889#diff-6a7cb1f54ca026509b1eed5dabc7556d7e67fdfc2e68737d82f7e10f2550069a).
Note that the files are ~1MB, which means they are likely
CommonCrawlTruncated(TM).  I confirmed that they are truncated.  The 2.x
branch already throws this exception.



On Thu, Apr 28, 2022 at 2:31 AM Tilman Hausherr  wrote:
>
> Am 28.04.2022 um 00:25 schrieb Tim Allison:
> > Are available here:
> > https://corpora.tika.apache.org/base/reports/tika-1.28.2-reports-20220427.tgz
> >
> > I haven't taken a look yet.
> >
> > Let me know if you find anything.
>
>
> commoncrawl3/OR/ORTIXLZEFH4QC5RJTV3L5XBNOVW42KPH
>
> this is minor and is related to superscript, I don't know if this is
> wanted or not.
>
> The two "file not fully read from stream" exceptions, am I correct to
> assume that these are problems in the batch itself?
>
> Tilman
>


Re: 1.28.2 regression results

2022-04-28 Thread Tilman Hausherr

Am 28.04.2022 um 00:25 schrieb Tim Allison:

Are available here:
https://corpora.tika.apache.org/base/reports/tika-1.28.2-reports-20220427.tgz

I haven't taken a look yet.

Let me know if you find anything.



commoncrawl3/OR/ORTIXLZEFH4QC5RJTV3L5XBNOVW42KPH

This is minor and is related to superscript; I don't know if this is 
wanted or not.


Am I correct to assume that the two "file not fully read from stream" 
exceptions are problems in the batch itself?


Tilman