[VOTE] Release Apache Tika 2.4.0 Candidate #1
A candidate for the Tika 2.4.0 release is available at:
https://dist.apache.org/repos/dist/dev/tika/2.4.0

The release candidate is a zip archive of the sources in:
https://github.com/apache/tika/tree/2.4.0-rc1/

The SHA-512 checksum of the archive is
aff68637527fa4fa1ec21678ef2771a1dcd5eb3944bc1b1171c59459274295b903e093dc63ade0b6532bf137834d32bcb9cdf0d6a32efca187b9d6b8ac64f690.

In addition, a staged maven repository is available here:
https://repository.apache.org/content/repositories/orgapachetika-1085/org/apache/tika

Please vote on releasing this package as Apache Tika 2.4.0. The vote is open for the next 72 hours and passes if a majority of at least three +1 Tika PMC votes are cast.

[ ] +1 Release this package as Apache Tika 2.4.0
[ ] -1 Do not release this package because...

Here's my +1

Best,
Tim
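For anyone checking the candidate before voting, here is a minimal sketch of recomputing the SHA-512 of a downloaded archive and comparing it with the value quoted in the vote mail. It uses only the JDK. The class name `Sha512Check` and its two command-line arguments are illustrative assumptions, not part of Tika or its release tooling.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of Tika: recomputes a SHA-512 digest.
class Sha512Check {

    // Streams the input through SHA-512 and returns the lowercase hex digest.
    static String sha512Hex(InputStream in) throws IOException {
        MessageDigest digest;
        try {
            digest = MessageDigest.getInstance("SHA-512");
        } catch (NoSuchAlgorithmException e) {
            // Not expected with the standard JDK providers; treat as fatal.
            throw new IllegalStateException(e);
        }
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            digest.update(buffer, 0, n);
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            // Mask to 0..255 so negative bytes still print as two hex digits.
            hex.append(String.format("%02x", b & 0xff));
        }
        return hex.toString();
    }

    // Assumed usage: java Sha512Check.java <downloaded-archive> <expected-hex>
    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: Sha512Check <archive> <expected-sha512-hex>");
            return;
        }
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            String actual = sha512Hex(in);
            boolean ok = actual.equalsIgnoreCase(args[1].trim());
            System.out.println(ok ? "checksum OK" : "checksum MISMATCH: " + actual);
        }
    }
}
```

The expected value is passed in rather than hard-coded, so the same sketch works for any candidate. It only checks the digest; verifying the release signature is a separate step.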
[jira] [Commented] (TIKA-3742) Advice around DGN7 parser and whether to add to TIKA
[ https://issues.apache.org/jira/browse/TIKA-3742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17529667#comment-17529667 ]
Dan Coldrick commented on TIKA-3742:
[~nick] I've made a start today, which I can share at some point tomorrow (been to the pub tonight lol, so it will have to wait till tomorrow). Are you OK if I lean on you two for help? I'd rather write something myself which you can rip apart so I can learn something. I've learnt a lot in the last week or so already :) I also think there is some metadata in there somewhere which we should be able to pull out :)
> Advice around DGN7 parser and whether to add to TIKA
>
> Key: TIKA-3742
> URL: https://issues.apache.org/jira/browse/TIKA-3742
> Project: Tika
> Issue Type: Task
> Components: parser
> Reporter: Dan Coldrick
> Priority: Minor
> Attachments: DGN.zip, ExampleOutput.txt
>
> Hi [~tallison] & whoever else.
> I managed to compile the C/C++ library [http://dgnlib.maptools.org/] for DGN7, which produces a dgndump.exe that will dump all the data from the DGN. From my initial testing it looks pretty good.
> Would you guys think it was worth adding this, or should I just keep it as a custom parser rather than in the main source code? It's under the MIT license. I've attached the exe (zipped), a copy of the output from the dump, and my very dirty test code calling the exe (I was only interested in the strings, so I am only pulling those into a string array at the moment to check it's pulling out the correct data).
-- This message was sent by Atlassian Jira (v8.20.7#820007)
Re: Next releases WAS: Re: 2.4.0 release?
https://repository.apache.org is having a bad day. Requests are timing out left and right. I'll try to perform the release of 2.4.0-rc1 later today or tomorrow when the repo is happier. On Thu, Apr 28, 2022 at 9:47 AM Tim Allison wrote: > > I've upgraded junrar in both branches, and the regression results look good. > > I'll start 1.28.2-rc2 shortly, and then follow up with 2.4.0-rc1 if > there aren't any objections. > > On Tue, Apr 26, 2022 at 9:10 AM Tim Allison wrote: > > > > All, > > > > I'm prepping rc1 for 1.28.2 now. > > > > I'm running the regression tests for 2.4.0, and I hope to have results > > today with possibly an rc later today or early tomorrow if there are > > no surprises. > > > > Please let me know if there are any blockers. > > > > Best, > > > > Tim > > > > On Thu, Apr 7, 2022 at 9:50 AM Tim Allison wrote: > > > > > > All, > > > Once the new PDFBox is out, we should probably kick off the 2.4.0 > > > release. If I'm release manager, given my schedule, that'll probably > > > be the week of April 18th. > > > I want to fix TIKA-3711 (embedded file names), but other than that, > > > I don't think there are any blockers. > > > > > > WDYT? > > > > > > Best, > > > > > > Tim > > > > > > -- Forwarded message - > > > From: Andreas Lehmkuehler > > > Date: Thu, Apr 7, 2022 at 1:41 AM > > > Subject: 2.0.26 release > > > To: > > > > > > > > > Hi, > > > > > > sorry for the delay. I'm planning to cut the 2.0.26 release next > > > Saturday, the > > > day after tomorrow, if nobody objects. > > > > > > Andreas > > > > > > P.S.: I'm targeting a new 3.0.0 alpha release once the 2.0.26 release is > > > out > > > > > > - > > > To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org > > > For additional commands, e-mail: dev-h...@pdfbox.apache.org
[jira] [Commented] (TIKA-3743) github actions -- we should install
[ https://issues.apache.org/jira/browse/TIKA-3743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17529551#comment-17529551 ] Hudson commented on TIKA-3743: -- SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk8 #533 (See [https://ci-builds.apache.org/job/Tika/job/tika-main-jdk8/533/]) TIKA-3743 -- install (tallison: [https://github.com/apache/tika/commit/7d3911eceb87162947bd77a56250cc5532e38fb8]) * (edit) .github/workflows/main-jdk11-build.yml * (edit) .github/workflows/branch_1x-jdk11-build.yml * (edit) .github/workflows/branch_1x-jdk8-build.yml * (edit) .github/workflows/main-jdk17-build.yml * (edit) .github/workflows/main-jdk8-build.yml > github actions -- we should install > --- > > Key: TIKA-3743 > URL: https://issues.apache.org/jira/browse/TIKA-3743 > Project: Tika > Issue Type: Improvement >Reporter: Tim Allison >Priority: Trivial > Attachments: Screenshot from 2022-04-28 11-39-16.png > > > We're calling {{mvn clean javadoc:aggregate test}}. This requires github to > pull dependencies from the snapshot repo. We should add {{install}} so that > the builds use the dependencies that were just built. -- This message was sent by Atlassian Jira (v8.20.7#820007)
Re: [VOTE] Release Apache Tika 1.28.2 Candidate #2
+1

Tilman

On 28.04.2022 at 16:54, Tim Allison wrote:
> A candidate for the Tika 1.28.2 release is available at:
> https://dist.apache.org/repos/dist/dev/tika/1.28.2
>
> The release candidate is a zip archive of the sources in:
> https://github.com/apache/tika/tree/1.28.2-rc2/
>
> The SHA-512 checksum of the archive is
> 035f3643a302e2a88f99ca549c4d5c5c6eecd7736d03e4a686b17028f519f6a7a40229e48f2aac0bdf1653391e0bd7d34d0c7d099a2e5a2cb6141df00a4181bf.
>
> In addition, a staged maven repository is available here:
> https://repository.apache.org/content/repositories/orgapachetika-1083/org/apache/tika
>
> Please vote on releasing this package as Apache Tika 1.28.2. The vote is open for the next 72 hours and passes if a majority of at least three +1 Tika PMC votes are cast.
>
> [ ] +1 Release this package as Apache Tika 1.28.2
> [ ] -1 Do not release this package because...
>
> Here's my +1.
>
> Best,
> Tim
[jira] [Commented] (TIKA-3743) github actions -- we should install
[ https://issues.apache.org/jira/browse/TIKA-3743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17529503#comment-17529503 ]
Tim Allison commented on TIKA-3743:
Hahahahaha. That didn't work.
{noformat}
[INFO] Error: Failed to execute goal on project tika-parsers: Could not resolve dependencies for project org.apache.tika:tika-parsers:pom:2.4.1-SNAPSHOT: Could not find artifact org.apache.tika:tika-core:jar:tests:2.4.1-SNAPSHOT in apache.snapshots (https://repository.apache.org/snapshots) -> [Help 1]
{noformat}
[jira] [Updated] (TIKA-3743) github actions -- we should install
[ https://issues.apache.org/jira/browse/TIKA-3743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tim Allison updated TIKA-3743:
Attachment: Screenshot from 2022-04-28 11-39-16.png
Re: How to deal with the recursive content in Tika 2
Great, will give it a try asap

Cheers, Sergey

On Thu, Apr 28, 2022 at 4:22 PM Tim Allison wrote:
> Give this a try:
> https://github.com/apache/tika/blob/main/tika-core/src/main/java/org/apache/tika/metadata/TikaCoreProperties.java#L60
>
> On Thu, Apr 28, 2022 at 11:12 AM Sergey Beryozkin wrote:
> >
> > Hi Tim, All
> >
> > We have a pending issue in Quarkus Tika to upgrade to Tika 2.
> > One of the problems is that according to a user's comment the recursive content is treated somehow differently in Tika2, specifically, this code:
> >
> > https://github.com/quarkiverse/quarkus-tika/blob/main/runtime/src/main/java/io/quarkus/tika/TikaParser.java#L95
> >
> > attempts to get a collection of the parsed outer and embedded documents by accessing them as
> >
> > metadata.get(AbstractRecursiveParserWrapperHandler.TIKA_CONTENT);
> >
> > What is the equivalent way to achieve the same with Tika 2 ?
> >
> > Thanks, Sergey
[jira] [Created] (TIKA-3743) github actions -- we should install
Tim Allison created TIKA-3743:
Summary: github actions -- we should install
Key: TIKA-3743
URL: https://issues.apache.org/jira/browse/TIKA-3743
Project: Tika
Issue Type: Improvement
Reporter: Tim Allison

We're calling {{mvn clean javadoc:aggregate test}}. This requires github to pull dependencies from the snapshot repo. We should add {{install}} so that the builds use the dependencies that were just built.
-- This message was sent by Atlassian Jira (v8.20.7#820007)
Re: How to deal with the recursive content in Tika 2
Give this a try: https://github.com/apache/tika/blob/main/tika-core/src/main/java/org/apache/tika/metadata/TikaCoreProperties.java#L60 On Thu, Apr 28, 2022 at 11:12 AM Sergey Beryozkin wrote: > > Hi Tim, All > > We have a pending issue in Quarkus Tika to upgrade to Tika 2. > One of the problems is that according to a user's comment the recursive > content is treated somehow differently in Tika2, specifically, this code: > > https://github.com/quarkiverse/quarkus-tika/blob/main/runtime/src/main/java/io/quarkus/tika/TikaParser.java#L95 > > attempts to get a collection of the parsed outer and embedded documents by > accessing them as > > metadata.get(AbstractRecursiveParserWrapperHandler.TIKA_CONTENT); > > What is the equivalent way to achieve the same with Tika 2 ? > > Thanks, Sergey
How to deal with the recursive content in Tika 2
Hi Tim, All

We have a pending issue in Quarkus Tika to upgrade to Tika 2. One of the problems is that according to a user's comment the recursive content is treated somehow differently in Tika2, specifically, this code:

https://github.com/quarkiverse/quarkus-tika/blob/main/runtime/src/main/java/io/quarkus/tika/TikaParser.java#L95

attempts to get a collection of the parsed outer and embedded documents by accessing them as

metadata.get(AbstractRecursiveParserWrapperHandler.TIKA_CONTENT);

What is the equivalent way to achieve the same with Tika 2 ?

Thanks, Sergey
[jira] [Commented] (TIKA-3740) Update junrar > 7.5.0 when available
[ https://issues.apache.org/jira/browse/TIKA-3740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17529482#comment-17529482 ] Hudson commented on TIKA-3740: -- SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk8 #531 (See [https://ci-builds.apache.org/job/Tika/job/tika-main-jdk8/531/]) TIKA-3740 -- upgrade junrar (tallison: [https://github.com/apache/tika/commit/403b7aef24c2cfaa77e7069fc341a91b1d948c49]) * (edit) tika-parent/pom.xml > Update junrar > 7.5.0 when available > > > Key: TIKA-3740 > URL: https://issues.apache.org/jira/browse/TIKA-3740 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Minor > Fix For: 1.28.2, 2.4.0 > > > Many thanks to [~tilman] for identifying this regression as we were prepping > for our 1.28.2 release. > I've opened: https://github.com/junrar/junrar/issues/86 -- This message was sent by Atlassian Jira (v8.20.7#820007)
[VOTE] Release Apache Tika 1.28.2 Candidate #2
A candidate for the Tika 1.28.2 release is available at:
https://dist.apache.org/repos/dist/dev/tika/1.28.2

The release candidate is a zip archive of the sources in:
https://github.com/apache/tika/tree/1.28.2-rc2/

The SHA-512 checksum of the archive is
035f3643a302e2a88f99ca549c4d5c5c6eecd7736d03e4a686b17028f519f6a7a40229e48f2aac0bdf1653391e0bd7d34d0c7d099a2e5a2cb6141df00a4181bf.

In addition, a staged maven repository is available here:
https://repository.apache.org/content/repositories/orgapachetika-1083/org/apache/tika

Please vote on releasing this package as Apache Tika 1.28.2. The vote is open for the next 72 hours and passes if a majority of at least three +1 Tika PMC votes are cast.

[ ] +1 Release this package as Apache Tika 1.28.2
[ ] -1 Do not release this package because...

Here's my +1.

Best,
Tim
[jira] [Commented] (TIKA-3740) Update junrar > 7.5.0 when available
[ https://issues.apache.org/jira/browse/TIKA-3740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17529468#comment-17529468 ]
Hudson commented on TIKA-3740:
UNSTABLE: Integrated in Jenkins build Tika » tika-branch1x-jdk8 #193 (See [https://ci-builds.apache.org/job/Tika/job/tika-branch1x-jdk8/193/])
TIKA-3740 -- upgrade junrar (tallison: [https://github.com/apache/tika/commit/c322ec6cdee98c34d050ef6d20db43e9eec80b75])
* (edit) tika-parsers/pom.xml
[jira] [Commented] (TIKA-3571) Add an interface for rendering engines
[ https://issues.apache.org/jira/browse/TIKA-3571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17529465#comment-17529465 ] Tim Allison commented on TIKA-3571: --- The other thing we need to account for is multiple renderings per page. I'd rather not add this complexity from the beginning, but the API should be able to handle this. > Add an interface for rendering engines > -- > > Key: TIKA-3571 > URL: https://issues.apache.org/jira/browse/TIKA-3571 > Project: Tika > Issue Type: Wish >Reporter: Tim Allison >Priority: Major > > We've now seen a few requests for extracting text _and_ rendering PDFs, and > certainly it might be useful to have alternatives for rendering files (e.g. > this [Alfresco > study|https://hub.alfresco.com/t5/alfresco-content-services-blog/pdf-rendering-engine-performance-and-fidelity-comparison/ba-p/287618]), > including MSOffice or at least PPTx... > And there are cases where users don't want the rendered images, but they do > want OCR to be run against the rendered images. > I doubt I'll have a chance to work on this for a while, but I wanted to open > an issue for discussion. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Commented] (TIKA-3742) Advice around DGN7 parser and whether to add to TIKA
[ https://issues.apache.org/jira/browse/TIKA-3742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17529459#comment-17529459 ]
Tim Allison commented on TIKA-3742:
[~nick] your gist looks great! [~monkmachine], I'm passing the baton to you on this one. In general, please use readFully and skipFully and ensure that the parse stops if the file is truncated -- check every read for EOF.
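To illustrate the "check every read for EOF" advice, a minimal JDK-only sketch follows. The class `StrictReads` and the length-prefixed demo record are hypothetical and are not taken from the gist or from Tika. Commons IO's `IOUtils` offers `readFully` and `skipFully` methods of the same names, and those would presumably be used in a real parser instead.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper for illustration only; not the Tika or Commons IO code.
class StrictReads {

    // Reads exactly len bytes, or throws EOFException if the stream ends early.
    static byte[] readFully(InputStream in, int len) throws IOException {
        byte[] buf = new byte[len];
        // DataInputStream does no buffering of its own, so wrapping per call is safe.
        new DataInputStream(in).readFully(buf);
        return buf;
    }

    // Skips exactly n bytes, or throws EOFException if the stream ends early.
    static void skipFully(InputStream in, long n) throws IOException {
        long remaining = n;
        while (remaining > 0) {
            long skipped = in.skip(remaining);
            if (skipped > 0) {
                remaining -= skipped;
                continue;
            }
            // skip() may return 0 without being at EOF; one read() tells the cases apart.
            if (in.read() == -1) {
                throw new EOFException("truncated: " + remaining + " bytes left to skip");
            }
            remaining--;
        }
    }

    public static void main(String[] args) throws IOException {
        // Made-up record: one length byte (5) followed by only 3 payload bytes.
        InputStream in = new ByteArrayInputStream(new byte[] {5, 'a', 'b', 'c'});
        int len = readFully(in, 1)[0];
        try {
            readFully(in, len);
        } catch (EOFException e) {
            System.out.println("truncated input, stopping parse");
        }
    }
}
```

Both helpers fail loudly on a short stream, so a parser built on them stops at the first truncated element rather than carrying on with partial data.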
[jira] [Resolved] (TIKA-3740) Update junrar > 7.5.0 when available
[ https://issues.apache.org/jira/browse/TIKA-3740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tim Allison resolved TIKA-3740.
Fix Version/s: 1.28.2, 2.4.0
Resolution: Fixed
Many thanks to [~gotson] and colleagues on junrar for a blazingly fast fix and release!
Re: Next releases WAS: Re: 2.4.0 release?
I've upgraded junrar in both branches, and the regression results look good. I'll start 1.28.2-rc2 shortly, and then follow up with 2.4.0-rc1 if there aren't any objections. On Tue, Apr 26, 2022 at 9:10 AM Tim Allison wrote: > > All, > > I'm prepping rc1 for 1.28.2 now. > > I'm running the regression tests for 2.4.0, and I hope to have results > today with possibly an rc later today or early tomorrow if there are > no surprises. > > Please let me know if there are any blockers. > > Best, > > Tim > > On Thu, Apr 7, 2022 at 9:50 AM Tim Allison wrote: > > > > All, > > Once the new PDFBox is out, we should probably kick off the 2.4.0 > > release. If I'm release manager, given my schedule, that'll probably > > be the week of April 18th. > > I want to fix TIKA-3711 (embedded file names), but other than that, > > I don't think there are any blockers. > > > > WDYT? > > > > Best, > > > > Tim > > > > -- Forwarded message - > > From: Andreas Lehmkuehler > > Date: Thu, Apr 7, 2022 at 1:41 AM > > Subject: 2.0.26 release > > To: > > > > > > Hi, > > > > sorry for the delay. I'm planning to cut the 2.0.26 release next Saturday, > > the > > day after tomorrow, if nobody objects. > > > > Andreas > > > > P.S.: I'm targeting a new 3.0.0 alpha release once the 2.0.26 release is out > > > > - > > To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org > > For additional commands, e-mail: dev-h...@pdfbox.apache.org
[jira] [Comment Edited] (TIKA-3740) Update junrar > 7.5.0 when available
[ https://issues.apache.org/jira/browse/TIKA-3740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17529444#comment-17529444 ]
Tim Allison edited comment on TIKA-3740 at 4/28/22 1:43 PM:
Regression results on 1.x branch on full set of rar files look good. These compare 1.28.1 with 1.28.2-SNAPSHOT with 7.5.1.
https://corpora.tika.apache.org/base/reports/tika-1.28.2-rar-reports.tgz
was (Author: talli...@mitre.org):
Regression results on 1.x branch on full set of rar files looks good. These compare 1.28.1 with 1.28.2-SNAPSHOT with 7.5.1.
https://corpora.tika.apache.org/base/reports/tika-1.28.2-rar-reports.tgz
[jira] [Commented] (TIKA-3740) Update junrar > 7.5.0 when available
[ https://issues.apache.org/jira/browse/TIKA-3740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17529444#comment-17529444 ]
Tim Allison commented on TIKA-3740:
Regression results on 1.x branch on full set of rar files looks good. These compare 1.28.1 with 1.28.2-SNAPSHOT with 7.5.1.
https://corpora.tika.apache.org/base/reports/tika-1.28.2-rar-reports.tgz
[jira] [Commented] (TIKA-3742) Advice around DGN7 parser and whether to add to TIKA
[ https://issues.apache.org/jira/browse/TIKA-3742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17529431#comment-17529431 ]
Tim Allison commented on TIKA-3742:
IOUtils.readFully()?
[jira] [Commented] (TIKA-3742) Advice around DGN7 parser and whether to add to TIKA
[ https://issues.apache.org/jira/browse/TIKA-3742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17529417#comment-17529417 ]
Nick Burch commented on TIKA-3742:
I believe {{readNBytes}} only came in with Java 9, and the particular {{readNBytes(int)}} in Java 11, so you'll need to use a newer JVM. Should be able to replace it with Commons IO calls once we're happy with the general logic + approach
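Until the call is swapped for Commons IO, a Java 8 compatible stand-in for {{readNBytes(int)}} could look roughly like the sketch below. The class name `ReadNBytesCompat` is made up for illustration. The helper mirrors the documented behaviour of returning fewer bytes when the stream ends early, so the caller still has to compare the returned length against what it asked for.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical Java 8 stand-in for InputStream.readNBytes(int), which needs Java 11.
class ReadNBytesCompat {

    // Reads up to len bytes; the result is shorter than len only if EOF was reached.
    static byte[] readNBytes(InputStream in, int len) throws IOException {
        if (len < 0) {
            throw new IllegalArgumentException("len < 0");
        }
        // Note: allocates len bytes up front, so cap len when it comes from untrusted input.
        byte[] buf = new byte[len];
        int total = 0;
        while (total < len) {
            int n = in.read(buf, total, len - total);
            if (n == -1) {
                break; // stream ended before len bytes arrived
            }
            total += n;
        }
        return total == len ? buf : Arrays.copyOf(buf, total);
    }

    public static void main(String[] args) throws IOException {
        InputStream is = new ByteArrayInputStream("hello".getBytes(StandardCharsets.US_ASCII));
        int len = 8; // declared length longer than the data, as in a truncated file
        byte[] str = readNBytes(is, len);
        if (str.length < len) {
            System.out.println("short read: wanted " + len + ", got " + str.length);
        }
    }
}
```

Because a short result is returned silently, a strict parser would treat `str.length < len` as a truncated file and stop there.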
[jira] [Commented] (TIKA-3742) Advice around DGN7 parser and whether to add to TIKA
[ https://issues.apache.org/jira/browse/TIKA-3742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17529409#comment-17529409 ]
Dan Coldrick commented on TIKA-3742:
[~nick] I can have a go, although I can't get the following line to compile in Eclipse:
byte[] str = is.readNBytes(len);
Re: 1.28.2 regression results
Tilman,

Thank you for looking carefully at the reports!

> commoncrawl3/OR/ORTIXLZEFH4QC5RJTV3L5XBNOVW42KPH

1Sonig is what we're getting in 2.3.0 and in the 2.4.0-soon-to-be-candidate, and it looks correct based on the underlying xml and when I open it in LibreOffice. It looks like it was incorrectly put in a different cell or at least incorrectly separated by a tab in 1.28.1.

> "file not fully read from stream"

This is a new exception in branch_1x because we made the ICNS parser more strict than it was (https://github.com/apache/tika/commit/ab709a5299be867c0e603116491faaa6546ed889#diff-6a7cb1f54ca026509b1eed5dabc7556d7e67fdfc2e68737d82f7e10f2550069a). Note that the files are ~1MB, which means they are likely CommonCrawlTruncated(TM). I confirmed that they are truncated. This exception is the behavior in the 2.x branch.

On Thu, Apr 28, 2022 at 2:31 AM Tilman Hausherr wrote:
>
> On 28.04.2022 at 00:25, Tim Allison wrote:
> > Are available here:
> > https://corpora.tika.apache.org/base/reports/tika-1.28.2-reports-20220427.tgz
> >
> > I haven't taken a look yet.
> >
> > Let me know if you find anything.
>
> commoncrawl3/OR/ORTIXLZEFH4QC5RJTV3L5XBNOVW42KPH
>
> this is minor and is related to superscript, I don't know if this is wanted or not.
>
> The two "file not fully read from stream" exceptions, am I correct to assume that these are problems in the batch itself?
>
> Tilman
Re: 1.28.2 regression results
On 28.04.2022 at 00:25, Tim Allison wrote:
> Are available here:
> https://corpora.tika.apache.org/base/reports/tika-1.28.2-reports-20220427.tgz
>
> I haven't taken a look yet.
>
> Let me know if you find anything.

commoncrawl3/OR/ORTIXLZEFH4QC5RJTV3L5XBNOVW42KPH

this is minor and is related to superscript, I don't know if this is wanted or not.

The two "file not fully read from stream" exceptions, am I correct to assume that these are problems in the batch itself?

Tilman