Re: [VOTE] Release Apache Tika 2.0.0-ALPHA Candidate #1
Here's my +1

On 1 15 2021, at 2:44, Tilman Hausherr wrote:
> +1
>
> Tilman
>
> On 14.01.2021 at 02:19, Tim Allison wrote:
> > All,
> >
> > A candidate for the Tika 2.0.0-ALPHA release is available at:
> > https://dist.apache.org/repos/dist/dev/tika/
> >
> > The release candidate is a zip archive of the sources in:
> > https://github.com/apache/tika/tree/2.0.0-ALPHA-rc1/
> >
> > The SHA-512 checksum of the archive is
> >
> > ae018f4384d2cd63281422cc82ec71a5b6f5d64ac29b343d714737e6b35fee6e5d0190cd065bf069948eadeeea831c5d74a6da6a554f049d3075f40eeb984f13.
> >
> > In addition, a staged maven repository is available here:
> >
> > https://repository.apache.org/content/repositories/orgapachetika-1065/org/apache/tika
> >
> > Please vote on releasing this package as Apache Tika 2.0.0-ALPHA.
> > The vote is open for the next 72 hours and passes if a majority of at
> > least three +1 Tika PMC votes are cast.
> >
> > Note: there may be still breaking changes before the formal release of
> > 2.0.0.
> >
> > Here's my +1.
> >
> > Best,
> >
> > Tim
> >
> > [ ] +1 Release this package as Apache Tika 2.0.0-ALPHA
> > [ ] -1 Do not release this package because...
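For anyone who wants to check the SHA-512 checksum quoted in the vote mail before voting, here is a minimal sketch in plain Java. The class name Sha512Check and its method names are made up for illustration; only the JDK's MessageDigest and file APIs are used, and the path of the downloaded archive is whatever you saved it as.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of Tika: computes SHA-512 digests as lowercase hex.
final class Sha512Check {

    private static MessageDigest newDigest() {
        try {
            return MessageDigest.getInstance("SHA-512");
        } catch (NoSuchAlgorithmException e) {
            // SHA-512 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    private static String toHex(byte[] bytes) {
        StringBuilder hex = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    /** Digest of an in-memory byte array. */
    static String sha512Hex(byte[] data) {
        return toHex(newDigest().digest(data));
    }

    /** Digest of a file, streamed so that large archives are not loaded into memory. */
    static String sha512Hex(Path file) throws IOException {
        MessageDigest digest = newDigest();
        byte[] buffer = new byte[8192];
        try (InputStream in = Files.newInputStream(file)) {
            int read;
            while ((read = in.read(buffer)) != -1) {
                digest.update(buffer, 0, read);
            }
        }
        return toHex(digest.digest());
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.out.println("usage: Sha512Check <archive> <expected-hex>");
            return;
        }
        String actual = sha512Hex(Paths.get(args[0]));
        // Compare case-insensitively; published checksums vary in letter case.
        System.out.println(actual.equalsIgnoreCase(args[1]) ? "checksum matches" : "checksum MISMATCH");
    }
}
```

Run it with the downloaded source archive as the first argument and the checksum from the mail (without the trailing period) as the second.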
[jira] [Commented] (TIKA-3180) Tika 2.0.0 -- Modularize tika-server
[ https://issues.apache.org/jira/browse/TIKA-3180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17252556#comment-17252556 ]

Peter Lee commented on TIKA-3180:
---------------------------------

It works now. :)

> Tika 2.0.0 -- Modularize tika-server
> ------------------------------------
>
> Key: TIKA-3180
> URL: https://issues.apache.org/jira/browse/TIKA-3180
> Project: Tika
> Issue Type: Improvement
> Reporter: Tim Allison
> Priority: Major
> Labels: 2.0.0
> Fix For: 2.0.0
>
> What do fellow devs think about having a tika-server-core which would be just
> the server/wrapper code without any dependencies on {{tika-parsers}} and then
> a tika-server with the usual dependency on {{tika-parsers}}.
> As we move to more modularity, I'd think some users might want the server
> {{tika-server-core}}, but then maybe only include a subset of the parsers.
> WDYT?

--
This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (TIKA-3180) Tika 2.0.0 -- Modularize tika-server
[ https://issues.apache.org/jira/browse/TIKA-3180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17251612#comment-17251612 ]

Peter Lee commented on TIKA-3180:
---------------------------------

It seems some tests failed, see
[https://ci-builds.apache.org/job/Tika/job/tika-main-jdk8/org.apache.tika$tika-server-core/lastBuild/console]

It looks like some test files are missing from the last commit. :)

Some of the files can be found in the git log:
tika-server/tika-server-core/src/test/resources/test-documents/mock/fake_oom.xml
tika-server/tika-server-core/src/test/resources/test-documents/mock/heavy_hang_100.xml
tika-server/tika-server-core/src/test/resources/test-documents/mock/heavy_hang_3.xml
tika-server/tika-server-core/src/test/resources/test-documents/mock/null_pointer.xml
tika-server/tika-server-core/src/test/resources/test-documents/mock/real_oom.xml
tika-server/tika-server-core/src/test/resources/test-documents/mock/system_exit.xml
tika-server/tika-server-core/src/test/resources/test-documents/mock/testStaticStdOutErr.xml
tika-server/tika-server-core/src/test/resources/test-documents/mock/testStdOutErr.xml
tika-server/tika-server-core/src/test/resources/test-documents/mock/thread_interrupt.xml

The others look like newly added files:
tika-server/tika-server-core/src/test/resources/test-documents/mock/hello_world.xml
tika-server/tika-server-core/src/test/resources/test-documents/mock/encrypted_document_exception.xml

Not sure why Jenkins only reports an unstable build result. Since some tests failed, it should be reported as a failed build.

> Tika 2.0.0 -- Modularize tika-server
> ------------------------------------
>
> Key: TIKA-3180
> URL: https://issues.apache.org/jira/browse/TIKA-3180
> Project: Tika
> Issue Type: Improvement
> Reporter: Tim Allison
> Priority: Major
> Labels: 2.0.0
>
> What do fellow devs think about having a tika-server-core which would be just
> the server/wrapper code without any dependencies on {{tika-parsers}} and then
> a tika-server with the usual dependency on {{tika-parsers}}.
> As we move to more modularity, I'd think some users might want the server
> {{tika-server-core}}, but then maybe only include a subset of the parsers.
> WDYT?
Re: accidental merge
> That one (0810700) I wanted to commit.

I see. Everything looks good now. :)
Lee

On 12 14 2020, at 4:35, Tilman Hausherr wrote:
> On 14.12.2020 at 08:48, Peter Lee wrote:
> > Seems the latest commit 7f65d61 is exactly the same as dd85c73:
> > https://github.com/apache/tika/compare/dd85c73..7f65d61
> > https://github.com/apache/tika/compare/3571725..7f65d61
> > which means the commit 0810700 is not reverted yet.
>
> That one (0810700) I wanted to commit.
> My mistake is that I hadn't pulled all the changes before making the
> commit. Things then got bad.
>
> The next commit was apparently made implicitly after I had pushed the
> changes. I thought that I had reverted your changes but now I suspect
> that it had repeated your changes, but maybe an earlier point in (my?)
> history.
>
> Tilman
>
> > The 10 lines commits is not a big change. Maybe just modify them without
> > using git revert is a good idea?
> > BTW we can use "git pull --rebase" to update local repo, then we can avoid
> > the merge commit
> > cheers,
> > Lee
> > On 12 14 2020, at 2:08, Tilman Hausherr wrote:
> >> I think I got it now.
> >>
> >> Someone please verify this:
> >> the last good commit is from Peter Lee "Simplify init code of some Set
> >> and List".
> >> then I made a small commit "TIKA-3248: avoid ClassCastException" of
> >> about 10 lines.
> >>
> >> then "bad" things happened.
> >> Ideally, the last three commits on top shouldn't have happened.
> >> Tilman
> >> On 14.12.2020 at 06:51, Tilman Hausherr wrote:
> >>> Hi all,
> >>>
> >>> I made an accidental merge and I tried to revert it, but I suspect I
> >>> made it worse. Still working on it...
> >>>
> >>> Tilman
Re: accidental merge
It seems the latest commit 7f65d61 is exactly the same as dd85c73:
https://github.com/apache/tika/compare/dd85c73..7f65d61
https://github.com/apache/tika/compare/3571725..7f65d61
which means the commit 0810700 is not reverted yet.

The 10-line commit is not a big change. Maybe it would be a good idea to just modify those lines by hand instead of using git revert?

BTW we can use "git pull --rebase" to update the local repo; that way we avoid the merge commit.

cheers,
Lee

On 12 14 2020, at 2:08, Tilman Hausherr wrote:
> I think I got it now.
>
> Someone please verify this:
> the last good commit is from Peter Lee "Simplify init code of some Set
> and List".
> then I made a small commit "TIKA-3248: avoid ClassCastException" of
> about 10 lines.
>
> then "bad" things happened.
> Ideally, the last three commits on top shouldn't have happened.
> Tilman
> On 14.12.2020 at 06:51, Tilman Hausherr wrote:
> > Hi all,
> >
> > I made an accidental merge and I tried to revert it, but I suspect I
> > made it worse. Still working on it...
> >
> > Tilman
[jira] [Resolved] (TIKA-3218) Wrong comment for method sortLoadedClasses in ServiceLoaderUtils
[ https://issues.apache.org/jira/browse/TIKA-3218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Lee resolved TIKA-3218.
-----------------------------
Fix Version/s: 2.0.0
Resolution: Fixed

> Wrong comment for method sortLoadedClasses in ServiceLoaderUtils
> ----------------------------------------------------------------
>
> Key: TIKA-3218
> URL: https://issues.apache.org/jira/browse/TIKA-3218
> Project: Tika
> Issue Type: Bug
> Components: core
> Affects Versions: 2.0.0
> Reporter: Peter Lee
> Priority: Minor
> Fix For: 2.0.0
>
> Here is the comment on method sortLoadedClasses:
>
> {code:java}
> /**
>  * Sorts a list of loaded classes, so that non-Tika ones come
>  * before Tika ones, and otherwise in reverse alphabetical order
>  */
> {code}
> But if you check the code, you will find that the method does the opposite.
> See [1].
> Also, if you run this test, you can see that the Tika class comes before
> the non-Tika class in the sorted list.
>
> {code:java}
> @Test
> public void test() {
>     List<Object> list = new ArrayList<>();
>     list.add(new Object());
>     list.add(new TikaException("abcd"));
>     ServiceLoaderUtils.sortLoadedClasses(list);
>     assertEquals("org.apache.tika.exception.TikaException",
>             list.get(0).getClass().getName());
>     assertEquals("java.lang.Object", list.get(1).getClass().getName());
> }
> {code}
>
> I think the code is right and we need to modify the comment.
>
> [1]https://github.com/apache/tika/blob/6d2312a98cb4d9698c73158c2e28d296756ef959/tika-core/src/main/java/org/apache/tika/utils/ServiceLoaderUtils.java#L30
[jira] [Commented] (TIKA-3218) Wrong comment for method sortLoadedClasses in ServiceLoaderUtils
[ https://issues.apache.org/jira/browse/TIKA-3218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17244377#comment-17244377 ]

Peter Lee commented on TIKA-3218:
---------------------------------

Thank you for fixing this (y)

> Wrong comment for method sortLoadedClasses in ServiceLoaderUtils
> ----------------------------------------------------------------
>
> Key: TIKA-3218
> URL: https://issues.apache.org/jira/browse/TIKA-3218
> Project: Tika
> Issue Type: Bug
> Components: core
> Affects Versions: 2.0.0
> Reporter: Peter Lee
> Priority: Minor
>
> Here is the comment on method sortLoadedClasses:
>
> {code:java}
> /**
>  * Sorts a list of loaded classes, so that non-Tika ones come
>  * before Tika ones, and otherwise in reverse alphabetical order
>  */
> {code}
> But if you check the code, you will find that the method does the opposite.
> See [1].
> Also, if you run this test, you can see that the Tika class comes before
> the non-Tika class in the sorted list.
>
> {code:java}
> @Test
> public void test() {
>     List<Object> list = new ArrayList<>();
>     list.add(new Object());
>     list.add(new TikaException("abcd"));
>     ServiceLoaderUtils.sortLoadedClasses(list);
>     assertEquals("org.apache.tika.exception.TikaException",
>             list.get(0).getClass().getName());
>     assertEquals("java.lang.Object", list.get(1).getClass().getName());
> }
> {code}
>
> I think the code is right and we need to modify the comment.
>
> [1]https://github.com/apache/tika/blob/6d2312a98cb4d9698c73158c2e28d296756ef959/tika-core/src/main/java/org/apache/tika/utils/ServiceLoaderUtils.java#L30
Re: [ANNOUNCE] Welcome Peter Lee as Tika PMC member and committer
Many thanks to you, Tim. :)

Hi all,

I'm Peter Lee and I was an Apache Commons committer. I'm familiar with many archivers and compressors. Feel free to ask me if you have any problems with compression.

I'm honored to be part of Tika. Tika is great and it has helped me a lot. Besides, Tika is a great community and it has helped a lot of users. I hope I can help Tika a little bit.

Once again, thank you all for making such a great community!

cheers,
Lee

On 11 25 2020, at 9:27, Tim Allison wrote:
> All,
>
> The Tika PMC has elected to add Peter Lee to our ranks.
> Lee,
> Please introduce yourself, and welcome aboard!
>
> Cheers,
> Tim
Re: branch_1x tika-bundle issues
Got the same problem.

After some investigation I believe it's caused by the version of maven-bundle-plugin: I can successfully build branch_1x with version 4.1.0, but the build fails with versions 4.2.0, 4.2.1 and 5.1.1.

Still working on finding out what's wrong here. Hope this helps.

cheers,
Lee

On 11 18 2020, at 4:18, Tim Allison wrote:
> Is anyone else having problems building branch_1x? I'm having problems on
> ubuntu with openjdk8 -- I'm getting this error when the build tries to test
> the tika-bundle module:
>
> org.apache.tika.bundle
> org.osgi.framework.BundleException: Unable to resolve
> org.apache.tika.bundle [19](R 19.0): missing requirement
> [org.apache.tika.bundle [19](R 19.0)] osgi.wiring.package;
> (&(osgi.wiring.package=org.apache.tika.config)(version>=1.25.0)(!(version>=2.0.0)))
> [caused by: Unable to resolve org.apache.tika.core [13](R 13.0): missing
> requirement [org.apache.tika.core [13](R 13.0)] osgi.wiring.package;
> (osgi.wiring.package=org.apache.xerces.util)] Unresolved requirements:
> [[org.apache.tika.bundle [19](R 19.0)] osgi.wiring.package;
> (&(osgi.wiring.package=org.apache.tika.config)(version>=1.25.0)(!(version>=2.0.0)))]
>
> Any recommendations?
> Thank you!
> Cheers,
> Tim
[jira] [Commented] (TIKA-3218) Wrong comment for method sortLoadedClasses in ServiceLoaderUtils
[ https://issues.apache.org/jira/browse/TIKA-3218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17227108#comment-17227108 ]

Peter Lee commented on TIKA-3218:
---------------------------------

_so that user-provided ones would come first and would be able to override built-in Tika ones_

It seems the built-in Tika ones come first - you can check this via my test.

> Wrong comment for method sortLoadedClasses in ServiceLoaderUtils
> ----------------------------------------------------------------
>
> Key: TIKA-3218
> URL: https://issues.apache.org/jira/browse/TIKA-3218
> Project: Tika
> Issue Type: Bug
> Components: core
> Affects Versions: 2.0.0
> Reporter: Peter Lee
> Priority: Minor
>
> Here is the comment on method sortLoadedClasses:
>
> {code:java}
> /**
>  * Sorts a list of loaded classes, so that non-Tika ones come
>  * before Tika ones, and otherwise in reverse alphabetical order
>  */
> {code}
> But if you check the code, you will find that the method does the opposite.
> See [1].
> Also, if you run this test, you can see that the Tika class comes before
> the non-Tika class in the sorted list.
>
> {code:java}
> @Test
> public void test() {
>     List<Object> list = new ArrayList<>();
>     list.add(new Object());
>     list.add(new TikaException("abcd"));
>     ServiceLoaderUtils.sortLoadedClasses(list);
>     assertEquals("org.apache.tika.exception.TikaException",
>             list.get(0).getClass().getName());
>     assertEquals("java.lang.Object", list.get(1).getClass().getName());
> }
> {code}
>
> I think the code is right and we need to modify the comment.
>
> [1]https://github.com/apache/tika/blob/6d2312a98cb4d9698c73158c2e28d296756ef959/tika-core/src/main/java/org/apache/tika/utils/ServiceLoaderUtils.java#L30
[jira] [Created] (TIKA-3218) Wrong comment for method sortLoadedClasses in ServiceLoaderUtils
Peter Lee created TIKA-3218:
-------------------------------

Summary: Wrong comment for method sortLoadedClasses in ServiceLoaderUtils
Key: TIKA-3218
URL: https://issues.apache.org/jira/browse/TIKA-3218
Project: Tika
Issue Type: Bug
Components: core
Affects Versions: 2.0.0
Reporter: Peter Lee

Here is the comment on method sortLoadedClasses:

{code:java}
/**
 * Sorts a list of loaded classes, so that non-Tika ones come
 * before Tika ones, and otherwise in reverse alphabetical order
 */
{code}
But if you check the code, you will find that the method does the opposite. See [1].
Also, if you run this test, you can see that the Tika class comes before the non-Tika class in the sorted list.

{code:java}
@Test
public void test() {
    List<Object> list = new ArrayList<>();
    list.add(new Object());
    list.add(new TikaException("abcd"));
    ServiceLoaderUtils.sortLoadedClasses(list);
    // The Tika class ends up first, contrary to the Javadoc comment.
    assertEquals("org.apache.tika.exception.TikaException",
            list.get(0).getClass().getName());
    assertEquals("java.lang.Object", list.get(1).getClass().getName());
}
{code}

I think the code is right and we need to modify the comment.

[1]https://github.com/apache/tika/blob/6d2312a98cb4d9698c73158c2e28d296756ef959/tika-core/src/main/java/org/apache/tika/utils/ServiceLoaderUtils.java#L30
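To make the ordering reported above concrete without needing Tika on the classpath, here is a small sketch of a comparator that puts Tika class names first. It is not the ServiceLoaderUtils code: the class TikaFirstOrder, the prefix constant and the alphabetical tie-break are assumptions chosen only to reproduce the order seen in the test.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration, not Tika's implementation.
final class TikaFirstOrder {

    // Assumed package prefix used to recognise Tika's own classes.
    static final String TIKA_PREFIX = "org.apache.tika.";

    // Tika class names sort before all others; within each group the
    // tie-break is plain alphabetical order (an assumption for this sketch).
    static final Comparator<String> TIKA_FIRST = (a, b) -> {
        boolean tikaA = a.startsWith(TIKA_PREFIX);
        boolean tikaB = b.startsWith(TIKA_PREFIX);
        if (tikaA != tikaB) {
            return tikaA ? -1 : 1;
        }
        return a.compareTo(b);
    };

    /** Returns a sorted copy and leaves the input list untouched. */
    static List<String> sorted(List<String> classNames) {
        List<String> copy = new ArrayList<>(classNames);
        copy.sort(TIKA_FIRST);
        return copy;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList(
                "java.lang.Object",
                "org.apache.tika.exception.TikaException");
        System.out.println(sorted(names));
    }
}
```

With this ordering the Tika name lands at index 0, which is the behaviour the test in the issue observes and the Javadoc comment contradicts.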
[jira] [Commented] (TIKA-3213) Consider migrating universalcharsetdetector to a live fork
[ https://issues.apache.org/jira/browse/TIKA-3213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17220030#comment-17220030 ]

Peter Lee commented on TIKA-3213:
---------------------------------

This fork has not supported Chinese charset detection since version 2.0.0. See this issue: [https://github.com/albfernandez/juniversalchardet/issues/34]

It might be a problem.

> Consider migrating universalcharsetdetector to a live fork
> ----------------------------------------------------------
>
> Key: TIKA-3213
> URL: https://issues.apache.org/jira/browse/TIKA-3213
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Major
>
> I just came across this living fork of the aged juniversalchardet (2011!!!):
> https://github.com/albfernandez/juniversalchardet
> It has a mozilla license, has decent star count and is published on maven
> central.
> Obv, we'll want to run a comparison on our corpus before making this change,
> but I wanted to open this issue for discussion.
[jira] [Commented] (TIKA-3209) Different between PictureRunMapper in POI and PicturesSource in Tika
[ https://issues.apache.org/jira/browse/TIKA-3209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17217212#comment-17217212 ]

Peter Lee commented on TIKA-3209:
---------------------------------

Hi [~nick]

I just replaced PicturesSource in Tika with PictureRunMapper from POI in my fork (commit 232643b, see [5]) and got these test failures (see [6]):
{code:java}
Error: Failures:
Error: POIContainerExtractionTest.testEmbeddedImages:90 expected:<1> but was:<0>
Error: POIContainerExtractionTest.testEmbeddedStorageId:137 expected:<{F4754C9B-64F5-4B40-8AF4-679732AC0607}> but was:
Error: OOXMLContainerExtractionTest.testEmbeddedOfficeFiles:170 expected:<24> but was:<22>
Error: SXWPFExtractorTest.testEmbedded:761 expected:<16> but was:<15>
{code}
[5] [https://github.com/PeterAlfredLee/tika/commit/232643b27bdd7798f94b64931b5070d667f8dc29]
[6] [https://github.com/PeterAlfredLee/tika/runs/1278568384?check_suite_focus=true]

> Different between PictureRunMapper in POI and PicturesSource in Tika
> ---------------------------------------------------------------------
>
> Key: TIKA-3209
> URL: https://issues.apache.org/jira/browse/TIKA-3209
> Project: Tika
> Issue Type: Bug
> Components: parser
> Reporter: Peter Lee
> Priority: Minor
>
> 1. According to the git log of POI, class PictureRunMapper was copied from
> class PicturesSource in Tika. See [1]
> 2. This TODO in Tika suggests replacing PicturesSource with PictureRunMapper
> once POI 3.18 is out. See [2]
> So I tried the replacement but got a test failure.
> I think it may be because of the difference in method nextUnclaimed between
> these two classes. See [3][4]
>
> Can we remove this line in POI? See [4]
>
> [1] [https://github.com/apache/poi/commit/bdb0e8199bb6891b068e97da69d6410870e8066b]
> [2] [https://github.com/apache/tika/blob/172d40322f5662e428850ad7a8fb4113e453a51c/tika-parser-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/WordExtractor.java#L641]
> [3] [https://github.com/apache/tika/blob/172d40322f5662e428850ad7a8fb4113e453a51c/tika-parser-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/WordExtractor.java#L709]
> [4] [https://github.com/apache/poi/blob/f509d1deae86866ed531f10f2eba7db17e098473/src/scratchpad/src/org/apache/poi/hwpf/usermodel/PictureRunMapper.java#L130]
[jira] [Commented] (TIKA-3209) Different between PictureRunMapper in POI and PicturesSource in Tika
[ https://issues.apache.org/jira/browse/TIKA-3209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17216391#comment-17216391 ]

Peter Lee commented on TIKA-3209:
---------------------------------

[~nick] Could you give some advice? Can we remove that line in POI? See [4]

> Different between PictureRunMapper in POI and PicturesSource in Tika
> ---------------------------------------------------------------------
>
> Key: TIKA-3209
> URL: https://issues.apache.org/jira/browse/TIKA-3209
> Project: Tika
> Issue Type: Bug
> Components: parser
> Reporter: Peter Lee
> Priority: Minor
>
> 1. According to the git log of POI, class PictureRunMapper was copied from
> class PicturesSource in Tika. See [1]
> 2. This TODO in Tika suggests replacing PicturesSource with PictureRunMapper
> once POI 3.18 is out. See [2]
> So I tried the replacement but got a test failure.
> I think it may be because of the difference in method nextUnclaimed between
> these two classes. See [3][4]
>
> Can we remove this line in POI? See [4]
>
> [1] [https://github.com/apache/poi/commit/bdb0e8199bb6891b068e97da69d6410870e8066b]
> [2] [https://github.com/apache/tika/blob/172d40322f5662e428850ad7a8fb4113e453a51c/tika-parser-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/WordExtractor.java#L641]
> [3] [https://github.com/apache/tika/blob/172d40322f5662e428850ad7a8fb4113e453a51c/tika-parser-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/WordExtractor.java#L709]
> [4] [https://github.com/apache/poi/blob/f509d1deae86866ed531f10f2eba7db17e098473/src/scratchpad/src/org/apache/poi/hwpf/usermodel/PictureRunMapper.java#L130]
[jira] [Created] (TIKA-3209) Different between PictureRunMapper in POI and PicturesSource in Tika
Peter Lee created TIKA-3209:
-------------------------------

Summary: Different between PictureRunMapper in POI and PicturesSource in Tika
Key: TIKA-3209
URL: https://issues.apache.org/jira/browse/TIKA-3209
Project: Tika
Issue Type: Bug
Components: parser
Reporter: Peter Lee

1. According to the git log of POI, class PictureRunMapper was copied from class PicturesSource in Tika. See [1]
2. This TODO in Tika suggests replacing PicturesSource with PictureRunMapper once POI 3.18 is out. See [2]

So I tried the replacement but got a test failure.
I think it may be because of the difference in method nextUnclaimed between these two classes. See [3][4]

Can we remove this line in POI? See [4]

[1] [https://github.com/apache/poi/commit/bdb0e8199bb6891b068e97da69d6410870e8066b]
[2] [https://github.com/apache/tika/blob/172d40322f5662e428850ad7a8fb4113e453a51c/tika-parser-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/WordExtractor.java#L641]
[3] [https://github.com/apache/tika/blob/172d40322f5662e428850ad7a8fb4113e453a51c/tika-parser-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/WordExtractor.java#L709]
[4] [https://github.com/apache/poi/blob/f509d1deae86866ed531f10f2eba7db17e098473/src/scratchpad/src/org/apache/poi/hwpf/usermodel/PictureRunMapper.java#L130]
[jira] [Comment Edited] (TIKA-3196) PackageParser should attempt to parse entries from zip files with STORED entries with data descriptor
[ https://issues.apache.org/jira/browse/TIKA-3196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17200448#comment-17200448 ]

Peter Lee edited comment on TIKA-3196 at 9/23/20, 2:13 AM:
-----------------------------------------------------------

Hi [~tallison]

I wrote a test here : [https://github.com/apache/tika/pull/356#issuecomment-696721537]

I forged a zip in memory that's small enough (~100 b) and we do not need to attach a zip archive to reproduce this.

Hope this helps.

was (Author: peterlee):
Hi [~tallison]

I wrote a test here : [https://github.com/apache/tika/pull/356#issuecomment-696721537]

I forged a zip in memory that's small enough (~100 kb) and we do not need to attach a zip archive to reproduce this.

Hope this helps.

> PackageParser should attempt to parse entries from zip files with STORED entries with data descriptor
> -----------------------------------------------------------------------------------------------------
>
> Key: TIKA-3196
> URL: https://issues.apache.org/jira/browse/TIKA-3196
> Project: Tika
> Issue Type: Bug
> Components: parser
> Reporter: Trevor Bentley
> Priority: Major
> Attachments: OOO-107047-0.oxt-145.zip
>
> We are currently using tika for text extraction. Currently some sites are
> returning zips that have entries with stored data descriptors which fail to
> extract due to the ZipArchiveInputStream (in commons-compress) defaulting to
> false for 'allowStoredEntriesWithDataDescriptor'.
> Since ZipArchiveInputStream has support for reading zips with data
> descriptors we should attempt to read the zip with that feature enabled when
> we get a data descriptor UnsupportedZipFeatureException.
> Pull Request:
> [https://github.com/apache/tika/pull/356|https://github.com/apache/tika/pull/355]
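As a rough illustration of the "zip forged in memory" idea from the comment above, the following sketch builds a tiny archive with a single STORED entry using only java.util.zip. It is not the test from the pull request, and it does not set the data-descriptor flag that TIKA-3196 is about (java.util.zip requires size and CRC up front for STORED entries); the class and method names are made up for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper for tests: creates and inspects zips without touching disk.
final class InMemoryZip {

    /** Builds a zip holding one uncompressed (STORED) entry. */
    static byte[] storedZip(String entryName, byte[] content) {
        // STORED entries need size and CRC-32 set before putNextEntry.
        CRC32 crc = new CRC32();
        crc.update(content);
        ZipEntry entry = new ZipEntry(entryName);
        entry.setMethod(ZipEntry.STORED);
        entry.setSize(content.length);
        entry.setCompressedSize(content.length);
        entry.setCrc(crc.getValue());

        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(entry);
            zip.write(content);
            zip.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Reads the archive back and returns the name of its first entry, or null if empty. */
    static String firstEntryName(byte[] zipBytes) {
        try (ZipInputStream in = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            ZipEntry entry = in.getNextEntry();
            return entry == null ? null : entry.getName();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] zip = storedZip("hello.txt", "hello".getBytes(StandardCharsets.US_ASCII));
        System.out.println(zip.length + " bytes, first entry: " + firstEntryName(zip));
    }
}
```

An archive built this way stays small enough to live inside a unit test, which is the point the comment makes about not needing an attachment.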
[jira] [Commented] (TIKA-3196) PackageParser should attempt to parse entries from zip files with STORED entries with data descriptor
[ https://issues.apache.org/jira/browse/TIKA-3196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17200448#comment-17200448 ]

Peter Lee commented on TIKA-3196:
---------------------------------

Hi [~tallison]

I wrote a test here : [https://github.com/apache/tika/pull/356#issuecomment-696721537]

I forged a zip in memory that's small enough (~100 kb) and we do not need to attach a zip archive to reproduce this.

Hope this helps.

> PackageParser should attempt to parse entries from zip files with STORED entries with data descriptor
> -----------------------------------------------------------------------------------------------------
>
> Key: TIKA-3196
> URL: https://issues.apache.org/jira/browse/TIKA-3196
> Project: Tika
> Issue Type: Bug
> Components: parser
> Reporter: Trevor Bentley
> Priority: Major
> Attachments: OOO-107047-0.oxt-145.zip
>
> We are currently using tika for text extraction. Currently some sites are
> returning zips that have entries with stored data descriptors which fail to
> extract due to the ZipArchiveInputStream (in commons-compress) defaulting to
> false for 'allowStoredEntriesWithDataDescriptor'.
> Since ZipArchiveInputStream has support for reading zips with data
> descriptors we should attempt to read the zip with that feature enabled when
> we get a data descriptor UnsupportedZipFeatureException.
> Pull Request:
> [https://github.com/apache/tika/pull/356|https://github.com/apache/tika/pull/355]
[jira] [Resolved] (TIKA-3197) TikaInputStream may not be closed
[ https://issues.apache.org/jira/browse/TIKA-3197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Lee resolved TIKA-3197.
-----------------------------
Resolution: Not A Problem

> TikaInputStream may not be closed
> ---------------------------------
>
> Key: TIKA-3197
> URL: https://issues.apache.org/jira/browse/TIKA-3197
> Project: Tika
> Issue Type: Bug
> Components: parser
> Reporter: Peter Lee
> Priority: Minor
>
> This is TikaInputStream's close method:
> {code:java}
> public void close() throws IOException {
>     path = null;
>     mark = -1;
>     tmp.addResource(in);
>     tmp.close();
> }
> {code}
> It cleans up the TemporaryResources and closes the InputStream that was
> originally passed in as a parameter.
>
> This is TikaInputStream's get method:
> {code:java}
> public static TikaInputStream get(InputStream stream, TemporaryResources tmp) {
>     if (stream == null) {
>         throw new NullPointerException("The Stream must not be null");
>     }
>     if (stream instanceof TikaInputStream) {
>         return (TikaInputStream) stream;
>     } else {
>         // Make sure that the stream is buffered and that it
>         // (properly) supports the mark feature
>         if (!(stream.markSupported())) {
>             stream = new BufferedInputStream(stream);
>         }
>         return new TikaInputStream(stream, tmp, -1);
>     }
> }
> {code}
> If stream is not an instance of TikaInputStream, *it will create and return a
> new instance of TikaInputStream.*
> And as you can see, *this new instance is not closed by the close method in
> this case*. Only the InputStream that was originally passed in as a parameter
> is closed.
[jira] [Created] (TIKA-3197) TikaInputStream may not be closed
Peter Lee created TIKA-3197:
-------------------------------

Summary: TikaInputStream may not be closed
Key: TIKA-3197
URL: https://issues.apache.org/jira/browse/TIKA-3197
Project: Tika
Issue Type: Bug
Components: parser
Reporter: Peter Lee

This is TikaInputStream's close method:
{code:java}
public void close() throws IOException {
    path = null;
    mark = -1;
    tmp.addResource(in);
    tmp.close();
}
{code}
It cleans up the TemporaryResources and closes the InputStream that was originally passed in as a parameter.

This is TikaInputStream's get method:
{code:java}
public static TikaInputStream get(InputStream stream, TemporaryResources tmp) {
    if (stream == null) {
        throw new NullPointerException("The Stream must not be null");
    }
    if (stream instanceof TikaInputStream) {
        return (TikaInputStream) stream;
    } else {
        // Make sure that the stream is buffered and that it
        // (properly) supports the mark feature
        if (!(stream.markSupported())) {
            stream = new BufferedInputStream(stream);
        }
        return new TikaInputStream(stream, tmp, -1);
    }
}
{code}
If stream is not an instance of TikaInputStream, *it will create and return a new instance of TikaInputStream.*
And as you can see, *this new instance is not closed by the close method in this case*. Only the InputStream that was originally passed in as a parameter is closed.
Re: release planning?
Everything looks good enough to me.

BTW I'm willing to participate in the development of Tika 2.0. Is there any work that could be assigned to me?

Lee

On 9 10 2020, at 12:31, Tim Allison wrote:
> Hi Lee,
> Thank you for those PRs. I merged them into main and
> cherry-picked+resolved in branch_1x. Anything else?
>
> On Tue, Sep 8, 2020 at 9:56 PM Peter Lee wrote:
> > Hi Tim,
> >
> > I pushed some bugfix PRs in github and maybe we could have a look if they
> > should be merged into branch_1x :
> > #330 : URLs update
> > #340 : some minor fix TikaCLI
> > #347 : minor fix for BatchProcessBuilder
> > #353 : fix for tests failure for those developers whose default language
> > is not English
> >
> > And I think these 2 improvements could also be considered :
> > #332 : fix for tests failure in Windows caused by the failing delete of
> > temp files
> > #334 : add empty check for TikaConfig.
> >
> > BTW these PRs are pushed to the main branch. I can rebase and push them to
> > branch_1x if needed.
> > WDYT?
> > cheers,
> > Lee
> >
> > On 9 9 2020, at 3:38, Tim Allison wrote:
> > > Hi All,
> > >
> > > What would you say to kicking off the release process for 1.25 in the
> > > next couple of weeks? I want to fix TIKA-3186 before that release.
> > > Anything else?
> > >
> > > For 2.0.0-ALPHA, I think once we get the OSGi bundles back in, we should
> > > be good to go? Do we want to convert to gradle before release? I've
> > > punted on multi-parsers. What else do we need before an alpha release?
> > >
> > > Cheers,
> > > Tim
Re: release planning?
Hi Tim,

I pushed some bugfix PRs on GitHub; maybe we could check whether they should be merged into branch_1x:
#330 : URL updates
#340 : some minor fixes for TikaCLI
#347 : minor fix for BatchProcessBuilder
#353 : fix for test failures for developers whose default language is not English

And I think these 2 improvements could also be considered:
#332 : fix for test failures on Windows caused by temp files that fail to be deleted
#334 : add an empty check for TikaConfig

BTW these PRs are pushed to the main branch. I can rebase and push them to branch_1x if needed.
WDYT?

cheers,
Lee

On 9 9 2020, at 3:38, Tim Allison wrote:
> Hi All,
>
> What would you say to kicking off the release process for 1.25 in the
> next couple of weeks? I want to fix TIKA-3186 before that release.
> Anything else?
>
> For 2.0.0-ALPHA, I think once we get the OSGi bundles back in, we should
> be good to go? Do we want to convert to gradle before release? I've
> punted on multi-parsers. What else do we need before an alpha release?
>
> Cheers,
> Tim
Re: Tests failed in windows but not in linux
Hi Bob,

I think I have found out what's wrong. It seems there's an infinite loop.
I have pushed a PR, please have a look at:
https://github.com/apache/tika/pull/343

cheers,
Lee

On 8 24 2020, at 8:54, Bob Paulin wrote:
> Hi Lee,
>
> I get the same error on windows with GeoParser and SentimentAnalysisParser on
> the main branch. Removing the Logger fixes both and it builds cleanly. Still
> not sure what the exact issue is but I can recreate the issue and your
> solution.
> - Bob
> On 8/24/2020 4:02 AM, Peter Lee wrote:
> > Update :
> >
> > It works after I removed the loggers in GeoParser and GeoParserConfig. But
> > I'm still not clear what exactly the problem is. :(
> > Lee
> > On 8 24 2020, at 3:27, Peter Lee wrote:
> > > Hi all,
> > >
> > > The tests are failing on my windows : the GeoParserTest are failing cause
> > > the class org.apache.tika.parser.geo.GeoParser cloud not be found. But
> > > everything works fine on my Ubuntu.
> > > The error is wired. I did some googling but couldn't figure out what's
> > > the problem.
> > > Anyone who got same error in Windows?
> > > Lee
Re: Tests failed in windows but not in linux
Update:

It works after I removed the loggers in GeoParser and GeoParserConfig. But I'm still not clear what exactly the problem is. :(

Lee

On 8 24 2020, at 3:27 , Peter Lee wrote:
> Hi all,
>
> The tests are failing on my windows : the GeoParserTest are failing cause the
> class org.apache.tika.parser.geo.GeoParser could not be found. But everything
> works fine on my Ubuntu.
> The error is weird. I did some googling but couldn't figure out what's the
> problem.
> Anyone who got same error in Windows?
> Lee
Tests failed in windows but not in linux
Hi all,

The tests are failing on my Windows machine: GeoParserTest fails because the class org.apache.tika.parser.geo.GeoParser could not be found. But everything works fine on my Ubuntu machine.

The error is weird. I did some googling but couldn't figure out what the problem is.

Has anyone else seen the same error on Windows?

Lee
Re: Windows build errors
Hi Tilman,

> expected: but was: charset=[windows-1252]>

I think this problem is caused by the charset detection strategy, which depends on the line separator (CRLF or LF), together with the git autocrlf config. I ran into the same problem and solved it like this:

1. Set autocrlf to false with: git config --global core.autocrlf false
2. Delete everything except the .git directory.
3. Restore all the code with: git reset --hard

This works for me. Hope it helps.

cheers,
Lee

On 8 20 2020, at 1:30 , Tilman Hausherr wrote:
> After many weeks I checked out the "main" branch, and get these build
> errors:
>
> Failures:
> TestMimeTypes.testArchiveDetection:395->assertTypeByData:1275
> expected:<[application/x-archive]> but was:<[text/plain]>
> AutoDetectParserTest.testHTML:192->assertAutoDetect:145->assertAutoDetect:128->assertAutoDetect:99
> Bad content type: Test parameters:
> resourceRealName = /test-documents/testHTML.html
> resourceStatedName = /test-documents/testHTML.html
> realType = text/html; charset=ISO-8859-1
> statedType = null
> expectedContentFragment = Test Indexation Html
> expected: but was: charset=[windows-1252]>
> AutoDetectParserTest.testText:224->assertAutoDetect:145->assertAutoDetect:128->assertAutoDetect:99
> Bad content type: Test parameters:
> resourceRealName = /test-documents/testTXT.txt
> resourceStatedName = /test-documents/testTXT.txt
> realType = text/plain; charset=ISO-8859-1
> statedType = null
> expectedContentFragment = indexation de Txt
> expected: but was: charset=[windows-1252]>
> TextAndCSVParserTest.testSubclassingMimeTypesRemain:217
> expected:<...-vcalendar; charset=[ISO-8859-1]> but was:<...-vcalendar; charset=[windows-1252]>
> ArParserTest.testArParsing:43 expected:<[application/x-archive]> but was:<[text/plain; charset=windows-1252]>
> ArParserTest.testEmbedded:75 expected:<1> but was:<0>
> TXTParserTest.testSubclassingMimeTypesRemain:299 expected:<...-vcalendar; charset=[ISO-8859-1]> but was:<...-vcalendar; charset=[windows-1252]>
> Errors:
> TestContainerAwareDetector.testBPList:571->assertTypeByData:66->assertTypeByNameAndData:76->assertTypeByNameAndData:96
> » ArrayIndexOutOfBounds
> PListParserTest.testWebArchive:48->TikaTest.getRecursiveMetadata:232->TikaTest.getRecursiveMetadata:306
> » ArrayIndexOutOfBounds
> PDFParserTest.testFileInAnnotationExtractedIfNoContents:1491->TikaTest.assertContains:112
> » NullPointer
[jira] [Commented] (TIKA-1770) AutoDetectParser wrongly detects plain text as images/audio
[ https://issues.apache.org/jira/browse/TIKA-1770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17178184#comment-17178184 ]

Peter Lee commented on TIKA-1770:
---------------------------------

I tested the 3 attached files with tika-1.24.1. Here is Tika's content-type detection result:

||File Name||Content Type||
|the-acl-rd-tec_chunk_15.txt|audio/mpeg|
|the-acl-rd-tec_chunk_9113.txt|image/x-portable-bitmap|
|the-acl-rd-tec_chunk_10228.txt|image/x-portable-bitmap|

Reason:
The content of `the-acl-rd-tec_chunk_15.txt` starts with the string "ID3", which is the magic bytes of audio/mpeg.
The content of `the-acl-rd-tec_chunk_9113.txt` starts with the string "P1", which is the magic bytes of image/x-portable-bitmap.
The content of `the-acl-rd-tec_chunk_10228.txt` starts with the string "P4", which is the magic bytes of image/x-portable-bitmap.

After googling these two formats, I couldn't find a way to make their magic-byte matching rules more specific. Maybe we should set up a rule: some formats must match on both the file extension and the magic bytes.

> AutoDetectParser wrongly detects plain text as images/audio
> -----------------------------------------------------------
>
> Key: TIKA-1770
> URL: https://issues.apache.org/jira/browse/TIKA-1770
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.10
> Environment: OS independent (tested on both Windows, MAC OS)
> Reporter: Ziqi
> Priority: Minor
> Attachments: the-acl-rd-tec_chunk_10228.txt,
> the-acl-rd-tec_chunk_15.txt, the-acl-rd-tec_chunk_9113.txt
>
>
> AutoDetectParser fails to recognize certain plain-text files as plain text.
> In the attachment are three testing files, as you can see they are all plain
> text.
> The following code is used for testing:
>
> AutoDetectParser parser = new AutoDetectParser();
> for (File f : new File("path").listFiles()) {
>     InputStream in = new BufferedInputStream(new FileInputStream(f.toString()));
>     BodyContentHandler handler = new BodyContentHandler(-1);
>     Metadata metadata = new Metadata();
>     try {
>         parser.parse(in, handler, metadata);
>         String content = handler.toString();
>         System.out.println(metadata); //line A
>     } catch (Exception e) {
>         e.printStackTrace();
>     }
> }
>
> for the three testing files, line A prints the following:
> X-Parsed-By=org.apache.tika.parser.EmptyParser
> Content-Type=image/x-portable-bitmap
> X-Parsed-By=org.apache.tika.parser.DefaultParser
> X-Parsed-By=org.apache.tika.parser.mp3.Mp3Parser xmpDM:audioCompressor=MP3
> Content-Type=audio/mpeg
> X-Parsed-By=org.apache.tika.parser.EmptyParser
> Content-Type=image/x-portable-bitmap
> And as a result, variable "content" is always empty.

--
This message was sent by Atlassian Jira (v8.3.4#803005)
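The rule proposed in the comment above (a short ASCII magic only counts when the file extension agrees) could look roughly like the sketch below. It uses only the JDK and is not Tika's detector API: the class `StrictMagicSketch`, the method `detect`, the hard-coded magic/extension pairs, and the choice to return null when nothing is confirmed are all assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

final class StrictMagicSketch {

    /**
     * Hypothetical rule: "ID3", "P1" and "P4" are so short that they also occur
     * at the start of ordinary text, so they are only accepted when the file
     * extension agrees. Returns null when no rule is confirmed, meaning
     * "let other detectors decide".
     */
    static String detect(String fileName, byte[] head) {
        String name = fileName.toLowerCase(Locale.ROOT);
        if (startsWith(head, "ID3") && name.endsWith(".mp3")) {
            return "audio/mpeg";
        }
        if ((startsWith(head, "P1") || startsWith(head, "P4")) && name.endsWith(".pbm")) {
            return "image/x-portable-bitmap";
        }
        return null;
    }

    /** True if data begins with the ASCII bytes of magic. */
    static boolean startsWith(byte[] data, String magic) {
        byte[] m = magic.getBytes(StandardCharsets.US_ASCII);
        if (data.length < m.length) {
            return false;
        }
        for (int i = 0; i < m.length; i++) {
            if (data[i] != m[i]) {
                return false;
            }
        }
        return true;
    }
}
```

With this rule, a file named the-acl-rd-tec_chunk_15.txt that happens to start with "ID3" would no longer be reported as audio/mpeg. The trade-off is that a real MP3 file with a wrong or missing extension would not be matched by this rule either.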
[jira] [Commented] (TIKA-3155) Parse Error while extracting CSV files
[ https://issues.apache.org/jira/browse/TIKA-3155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17176206#comment-17176206 ]

Peter Lee commented on TIKA-3155:
---------------------------------

As I understand it, this is how Tika handles a CSV file:

1. Try to parse it with commons-csv first.
2. If an IllegalStateException is thrown, parse the remaining data in the InputStream as plain text.

Unfortunately, in this case commons-csv has already consumed all the data in the InputStream before it throws the IllegalStateException, so there is nothing left in the InputStream to parse.

If we don't try commons-csv first, we can't know whether it will throw an IllegalStateException. But if we try and it does throw, there is nothing we can do because all the data has been consumed.

Maybe we could read all the data from the InputStream into a byte array, so that we can try several parsing strategies on it. But this may cost a lot of memory, and I don't think it is a smart approach.

Maybe it's better to change nothing and just let users fix their CSV files when they encounter the IllegalStateException.

> Parse Error while extracting CSV files
> --------------------------------------
>
> Key: TIKA-3155
> URL: https://issues.apache.org/jira/browse/TIKA-3155
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.24.1
> Reporter: Akash
> Priority: Major
> Attachments: UTF-8_chars.csv
>
>
> We are getting parse error while trying to extract csv files.
> This was working in version 1.9, but exception coming in 1.24.1
>
> {code:java}
> /Exception in thread "main" org.apache.tika.exception.TikaException: exception parsing the csv
> at org.apache.tikar.csv.TextAndCSVParser.parse.parse(TextAndCSVParser.java:198 undefined)
> at org.apache.tikar.CompositeParser.parse.parse(CompositeParser.java:280 undefined)
> at org.apache.tikar.CompositeParser.parse.parse(CompositeParser.java:280 undefined)
> at org.apache.tikar.AutoDetectParser.parse.parse(AutoDetectParser.java:143 undefined)
> at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:209 undefined)
> at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:496 undefined)
> at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:149 undefined)
> Caused by: java.lang.IllegalStateException: IOException reading next record: java.io.IOException: (startline 39) EOF reached before encapsulated token finished
> at org.apache.commons.csv.CSVParser$CSVRecordIterator.getNextRecord(CSVParser.java:145 undefined)
> at org.apache.commons.csv.CSVParser$CSVRecordIterator.hasNext(CSVParser.java:155 undefined)
> at org.apache.tikar.csv.TextAndCSVParser.parse.parse(TextAndCSVParser.java:178 undefined)
> ... 6 more
> Caused by: java.io.IOException: (startline 39) EOF reached before encapsulated token finished
> at org.apache.commons.csv.Lexer.parseEncapsulatedToken(Lexer.java:288 undefined)
> at org.apache.commons.csv.Lexer.nextToken(Lexer.java:158 undefined)
> at org.apache.commons.csv.CSVParser.nextRecord(CSVParser.java:674 undefined)
> at org.apache.commons.csv.CSVParser$CSVRecordIterator.getNextRecord(CSVParser.java:142 undefined)/
> {code}
> Issue is coming when we encounter double quotes in one of the cells.

--
This message was sent by Atlassian Jira (v8.3.4#803005)
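For reference, the "read everything into a byte array and retry" idea from the comment above could be sketched as follows. This only illustrates the trade-off being discussed and is not how TextAndCSVParser is implemented: `BufferedRetrySketch`, `StreamParser`, and `parseWithFallback` are hypothetical names, and the two parsers are passed in as stand-ins for "commons-csv" and "plain text".

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

final class BufferedRetrySketch {

    /** Hypothetical parser abstraction; stands in for "commons-csv" or "plain text". */
    interface StreamParser {
        String parse(InputStream in) throws IOException;
    }

    /** Copies the whole stream into memory; this is the memory cost mentioned above. */
    static byte[] readAll(InputStream in) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Tries the first parser; on IllegalStateException, reruns the fallback on a fresh copy. */
    static String parseWithFallback(InputStream in, StreamParser first, StreamParser fallback) {
        byte[] data = readAll(in);
        try {
            try {
                return first.parse(new ByteArrayInputStream(data));
            } catch (IllegalStateException e) {
                // The first parser may have consumed its stream; start over on a new one.
                return fallback.parse(new ByteArrayInputStream(data));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because every attempt gets its own ByteArrayInputStream over the same array, a failed first attempt cannot leave the fallback with an empty stream. The whole file is held in memory, which is exactly the cost the comment is worried about.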
[jira] [Commented] (TIKA-3155) Parse Error while extracting CSV files
[ https://issues.apache.org/jira/browse/TIKA-3155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17175290#comment-17175290 ]

Peter Lee commented on TIKA-3155:
---------------------------------

We can do it in _TextAndCSVParser_ like this:

{code:java}
CSVFormat csvFormat = CSVFormat.EXCEL.withDelimiter(params.getDelimiter()).withQuote(null);
{code}

I tested with Quote Mode off and it works.

> Parse Error while extracting CSV files
> --------------------------------------
>
> Key: TIKA-3155
> URL: https://issues.apache.org/jira/browse/TIKA-3155
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.24.1
> Reporter: Akash
> Priority: Major
> Attachments: UTF-8_chars.csv
>
>
> We are getting parse error while trying to extract csv files.
> This was working in version 1.9, but exception coming in 1.24.1
>
> {code:java}
> /Exception in thread "main" org.apache.tika.exception.TikaException: exception parsing the csv
> at org.apache.tikar.csv.TextAndCSVParser.parse.parse(TextAndCSVParser.java:198 undefined)
> at org.apache.tikar.CompositeParser.parse.parse(CompositeParser.java:280 undefined)
> at org.apache.tikar.CompositeParser.parse.parse(CompositeParser.java:280 undefined)
> at org.apache.tikar.AutoDetectParser.parse.parse(AutoDetectParser.java:143 undefined)
> at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:209 undefined)
> at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:496 undefined)
> at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:149 undefined)
> Caused by: java.lang.IllegalStateException: IOException reading next record: java.io.IOException: (startline 39) EOF reached before encapsulated token finished
> at org.apache.commons.csv.CSVParser$CSVRecordIterator.getNextRecord(CSVParser.java:145 undefined)
> at org.apache.commons.csv.CSVParser$CSVRecordIterator.hasNext(CSVParser.java:155 undefined)
> at org.apache.tikar.csv.TextAndCSVParser.parse.parse(TextAndCSVParser.java:178 undefined)
> ... 6 more
> Caused by: java.io.IOException: (startline 39) EOF reached before encapsulated token finished
> at org.apache.commons.csv.Lexer.parseEncapsulatedToken(Lexer.java:288 undefined)
> at org.apache.commons.csv.Lexer.nextToken(Lexer.java:158 undefined)
> at org.apache.commons.csv.CSVParser.nextRecord(CSVParser.java:674 undefined)
> at org.apache.commons.csv.CSVParser$CSVRecordIterator.getNextRecord(CSVParser.java:142 undefined)/
> {code}
> Issue is coming when we encounter double quotes in one of the cells.

--
This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (TIKA-3155) Parse Error while extracting CSV files
[ https://issues.apache.org/jira/browse/TIKA-3155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17175286#comment-17175286 ]

Peter Lee commented on TIKA-3155:
---------------------------------

Hey. I think it's caused by the Quote Mode of Apache Commons CSV. We can simply fix this by turning the Quote Mode off.

> Parse Error while extracting CSV files
> --------------------------------------
>
> Key: TIKA-3155
> URL: https://issues.apache.org/jira/browse/TIKA-3155
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.24.1
> Reporter: Akash
> Priority: Major
> Attachments: UTF-8_chars.csv
>
>
> We are getting parse error while trying to extract csv files.
> This was working in version 1.9, but exception coming in 1.24.1
>
> {code:java}
> /Exception in thread "main" org.apache.tika.exception.TikaException: exception parsing the csv
> at org.apache.tikar.csv.TextAndCSVParser.parse.parse(TextAndCSVParser.java:198 undefined)
> at org.apache.tikar.CompositeParser.parse.parse(CompositeParser.java:280 undefined)
> at org.apache.tikar.CompositeParser.parse.parse(CompositeParser.java:280 undefined)
> at org.apache.tikar.AutoDetectParser.parse.parse(AutoDetectParser.java:143 undefined)
> at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:209 undefined)
> at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:496 undefined)
> at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:149 undefined)
> Caused by: java.lang.IllegalStateException: IOException reading next record: java.io.IOException: (startline 39) EOF reached before encapsulated token finished
> at org.apache.commons.csv.CSVParser$CSVRecordIterator.getNextRecord(CSVParser.java:145 undefined)
> at org.apache.commons.csv.CSVParser$CSVRecordIterator.hasNext(CSVParser.java:155 undefined)
> at org.apache.tikar.csv.TextAndCSVParser.parse.parse(TextAndCSVParser.java:178 undefined)
> ... 6 more
> Caused by: java.io.IOException: (startline 39) EOF reached before encapsulated token finished
> at org.apache.commons.csv.Lexer.parseEncapsulatedToken(Lexer.java:288 undefined)
> at org.apache.commons.csv.Lexer.nextToken(Lexer.java:158 undefined)
> at org.apache.commons.csv.CSVParser.nextRecord(CSVParser.java:674 undefined)
> at org.apache.commons.csv.CSVParser$CSVRecordIterator.getNextRecord(CSVParser.java:142 undefined)/
> {code}
> Issue is coming when we encounter double quotes in one of the cells.

--
This message was sent by Atlassian Jira (v8.3.4#803005)
Should we add Apache Commons Lang to tika-core as a dependency?
Hi all,

I've been working on TIKA-3141 recently and pushed a PR on GitHub. As Keith suggested in the PR, maybe we should add Commons Lang to tika-core, since Commons Lang is already used elsewhere in Tika but not in tika-core.

Ideas?

cheers,
Lee
PRs on github need reviews
Hi all,

I've been using Tika recently and find it fascinating! I pushed some PRs on GitHub, but it seems no one is reviewing them (the same goes for some other PRs on GitHub). Maybe somebody could give me a hand?

Here are the PRs:

https://github.com/apache/tika/pull/334
https://github.com/apache/tika/pull/333
https://github.com/apache/tika/pull/332
https://github.com/apache/tika/pull/331
https://github.com/apache/tika/pull/330

cheers,
Lee
[jira] [Commented] (TIKA-3141) LINUX - Tika shouldn't throw an exception for an empty TIKA_CONFIG environment variable value
[ https://issues.apache.org/jira/browse/TIKA-3141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17167845#comment-17167845 ]

Peter Lee commented on TIKA-3141:
---------------------------------

Hi [~nick],

I've been working on Tika recently and I'm interested in this issue. I'll try to improve it.

> LINUX - Tika shouldn't throw an exception for an empty TIKA_CONFIG
> environment variable value
> -------------------------------------------------------------------
>
> Key: TIKA-3141
> URL: https://issues.apache.org/jira/browse/TIKA-3141
> Project: Tika
> Issue Type: Bug
> Components: config
> Affects Versions: 1.20
> Environment: Any Linux distro. I'm running the bash shell. Not sure
> about other platforms.
> Reporter: Josh Burchard
> Priority: Trivial
> Original Estimate: 24h
> Remaining Estimate: 24h
>
> On my Linux box I configure Tika using the TIKA_CONFIG environment variable
> to point the Tika server at my config.xml file. Sometimes, however, I want
> to clear this variable to use the default config and I noticed that Tika will
> throw an exception and abort if I do the following:
> export TIKA_CONFIG=''
> Seems like a case that should be handled just by ignoring the empty value
> (i.e., there's no config to be used so go with the default) or at the most,
> log a warning that the variable was detected but its value is empty, but
> still carry on using the default config.
> {{Exception in thread "main" java.lang.RuntimeException: Unable to access default configuration}}
> {{ at org.apache.tika.config.TikaConfig.getDefaultConfig(TikaConfig.java:410)}}
> {{ at org.apache.tika.Tika.(Tika.java:116)}}
> {{ at org.apache.tika.server.TikaServerCli.main(TikaServerCli.java:125)}}
> {{*Caused by: org.apache.tika.exception.TikaException: Specified Tika configuration not found:*}}
> {{ at org.apache.tika.config.TikaConfig.getConfigInputStream(TikaConfig.java:317)}}
> {{ at org.apache.tika.config.TikaConfig.(TikaConfig.java:254)}}
> {{ at org.apache.tika.config.TikaConfig.getDefaultConfig(TikaConfig.java:405)}}

--
This message was sent by Atlassian Jira (v8.3.4#803005)
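The behaviour requested in the issue above (treat an empty TIKA_CONFIG as if it were unset) can be sketched with plain JDK code. This is only an illustration of the empty check, not the actual change to TikaConfig: `ConfigEnvSketch` and `resolveConfigLocation` are hypothetical names, and treating whitespace-only values as empty is an extra assumption.

```java
final class ConfigEnvSketch {

    /**
     * Returns the trimmed config location, or null when the variable is unset,
     * empty, or whitespace-only. A null result means "use the default config".
     */
    static String resolveConfigLocation(String envValue) {
        if (envValue == null || envValue.trim().isEmpty()) {
            return null;
        }
        return envValue.trim();
    }

    public static void main(String[] args) {
        String location = resolveConfigLocation(System.getenv("TIKA_CONFIG"));
        if (location == null) {
            System.out.println("TIKA_CONFIG is unset or empty, using the default config");
        } else {
            System.out.println("Using config at " + location);
        }
    }
}
```

Running this after `export TIKA_CONFIG=''` takes the default-config branch instead of failing. A caller could also log a warning in that branch, as the reporter suggests.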