Re: [VOTE] Release Apache Tika 2.0.0-ALPHA Candidate #1

2021-01-15 Thread Peter Lee
Here's my +1

On 1 15 2021, at 2:44, Tilman Hausherr  wrote:
> +1
>
> Tilman
> Am 14.01.2021 um 02:19 schrieb Tim Allison:
> > All,
> >
> > A candidate for the Tika 2.0.0-ALPHA release is available at:
> > https://dist.apache.org/repos/dist/dev/tika/
> >
> > The release candidate is a zip archive of the sources in:
> > https://github.com/apache/tika/tree/2.0.0-ALPHA-rc1/
> >
> > The SHA-512 checksum of the archive is
> >
> > ae018f4384d2cd63281422cc82ec71a5b6f5d64ac29b343d714737e6b35fee6e5d0190cd065bf069948eadeeea831c5d74a6da6a554f049d3075f40eeb984f13.
> >
> > In addition, a staged maven repository is available here:
> >
> > https://repository.apache.org/content/repositories/orgapachetika-1065/org/apache/tika
> >
> > Please vote on releasing this package as Apache Tika 2.0.0-ALPHA.
> > The vote is open for the next 72 hours and passes if a majority of at
> > least three +1 Tika PMC votes are cast.
> >
> > Note: there may still be breaking changes before the formal release of
> > 2.0.0.
> >
> > Here's my +1.
> >
> > Best,
> >
> > Tim
> >
> > [ ] +1 Release this package as Apache Tika 2.0.0-ALPHA
> > [ ] -1 Do not release this package because...
> >
>
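
For anyone checking the candidate before voting, the SHA-512 comparison mentioned in the vote mail can be sketched in Java as below. This is only an illustration under assumptions: the class name Sha512Check and its two command-line arguments (archive path, expected checksum) are made up for this example and are not part of the Tika release tooling.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of Tika: compares a file's SHA-512 with an expected value.
public class Sha512Check {

    // Streams the input through SHA-512 and returns the digest as lowercase hex.
    static String sha512Hex(InputStream in) throws IOException, NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance("SHA-512");
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            digest.update(buffer, 0, n);
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.out.println("usage: Sha512Check <archive> <expected-sha512>");
            return;
        }
        // args[0]: downloaded source archive; args[1]: checksum copied from the vote mail
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            String actual = sha512Hex(in);
            System.out.println(actual.equalsIgnoreCase(args[1])
                    ? "checksum OK"
                    : "checksum MISMATCH: " + actual);
        }
    }
}
```

The digest is computed while streaming, so the archive never has to fit in memory. Checking the signature of the archive is a separate step that this sketch does not cover.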



[jira] [Commented] (TIKA-3180) Tika 2.0.0 -- Modularize tika-server

2020-12-20 Thread Peter Lee (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17252556#comment-17252556
 ] 

Peter Lee commented on TIKA-3180:
-

It works now. :)

> Tika 2.0.0 -- Modularize tika-server
> 
>
> Key: TIKA-3180
> URL: https://issues.apache.org/jira/browse/TIKA-3180
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Priority: Major
>  Labels: 2.0.0
> Fix For: 2.0.0
>
>
> What do fellow devs think about having a tika-server-core which would be just 
> the server/wrapper code without any dependencies on {{tika-parsers}} and then 
> a tika-server with the usual dependency on {{tika-parsers}}.
> As we move to more modularity, I'd think some users might want the server 
> {{tika-server-core}}, but then maybe only include a subset of the parsers.
> WDYT?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (TIKA-3180) Tika 2.0.0 -- Modularize tika-server

2020-12-18 Thread Peter Lee (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17251612#comment-17251612
 ] 

Peter Lee commented on TIKA-3180:
-

It seems some tests failed; see

[https://ci-builds.apache.org/job/Tika/job/tika-main-jdk8/org.apache.tika$tika-server-core/lastBuild/console]

It looks like some test files are missing from the last commit. :)

Some of the files can be found in the git log:

   tika-server/tika-server-core/src/test/resources/test-documents/mock/fake_oom.xml
   tika-server/tika-server-core/src/test/resources/test-documents/mock/heavy_hang_100.xml
   tika-server/tika-server-core/src/test/resources/test-documents/mock/heavy_hang_3.xml
   tika-server/tika-server-core/src/test/resources/test-documents/mock/null_pointer.xml
   tika-server/tika-server-core/src/test/resources/test-documents/mock/real_oom.xml
   tika-server/tika-server-core/src/test/resources/test-documents/mock/system_exit.xml
   tika-server/tika-server-core/src/test/resources/test-documents/mock/testStaticStdOutErr.xml
   tika-server/tika-server-core/src/test/resources/test-documents/mock/testStdOutErr.xml
   tika-server/tika-server-core/src/test/resources/test-documents/mock/thread_interrupt.xml

The others look like newly added files:

   tika-server/tika-server-core/src/test/resources/test-documents/mock/hello_world.xml
   tika-server/tika-server-core/src/test/resources/test-documents/mock/encrypted_document_exception.xml

I'm not sure why Jenkins only reports an unstable build result. Since some 
tests failed, it should be reported as a failed build.

> Tika 2.0.0 -- Modularize tika-server
> 
>
> Key: TIKA-3180
> URL: https://issues.apache.org/jira/browse/TIKA-3180
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Priority: Major
>  Labels: 2.0.0
>
> What do fellow devs think about having a tika-server-core which would be just 
> the server/wrapper code without any dependencies on {{tika-parsers}} and then 
> a tika-server with the usual dependency on {{tika-parsers}}.
> As we move to more modularity, I'd think some users might want the server 
> {{tika-server-core}}, but then maybe only include a subset of the parsers.
> WDYT?





Re: accidental merge

2020-12-14 Thread Peter Lee
> That one (0810700) I wanted to commit.
>

I see. Everything looks good now. :)
Lee
On 12 14 2020, at 4:35, Tilman Hausherr  wrote:
> Am 14.12.2020 um 08:48 schrieb Peter Lee:
> > Seems the latest commit 7f65d61 is exactly the same as dd85c73:
> > https://github.com/apache/tika/compare/dd85c73..7f65d61 
> > (https://github.com/apache/tika/compare/3571725..7f65d61)
> > https://github.com/apache/tika/compare/3571725..7f65d61
> > which means the commit 0810700 is not reverted yet.
>
>
> That one (0810700) I wanted to commit.
> My mistake is that I hadn't pulled all the changes before making the
> commit. Things then got bad.
>
> The next commit was apparently made implicitly after I had pushed the
> changes. I thought that I had reverted your changes but now I suspect
> that it had repeated your changes, but maybe an earlier point in (my?)
> history.
>
> Tilman
> > The 10-line commit is not a big change. Maybe it would be a good idea to 
> > just re-apply it by hand instead of using git revert?
> > BTW we can use "git pull --rebase" to update the local repo; that way we 
> > avoid the merge commit
> > cheers,
> > Lee
> > On 12 14 2020, at 2:08, Tilman Hausherr  wrote:
> >> I think I got it now.
> >>
> >> Someone please verify this:
> >> the last good commit is from Peter Lee "Simplify init code of some Set
> >> and List".
> >> then I made a small commit "TIKA-3248: avoid ClassCastException" of
> >> about 10 lines.
> >>
> >> then "bad" things happened.
> >> Ideally, the last three commits on top shouldn't have happened.
> >> Tilman
> >> Am 14.12.2020 um 06:51 schrieb Tilman Hausherr:
> >>> Hi all,
> >>>
> >>> I made an accidental merge and I tried to revert it, but I suspect I
> >>> made it worse. Still working on it...
> >>>
> >>> Tilman
> >>>
> >
>



Re: accidental merge

2020-12-13 Thread Peter Lee
Seems the latest commit 7f65d61 is exactly the same as dd85c73:
https://github.com/apache/tika/compare/dd85c73..7f65d61 
(https://github.com/apache/tika/compare/3571725..7f65d61)
https://github.com/apache/tika/compare/3571725..7f65d61
which means the commit 0810700 is not reverted yet.
The 10-line commit is not a big change. Maybe it would be a good idea to just 
re-apply it by hand instead of using git revert?
BTW we can use "git pull --rebase" to update the local repo; that way we avoid 
the merge commit
cheers,
Lee
On 12 14 2020, at 2:08, Tilman Hausherr  wrote:
> I think I got it now.
>
> Someone please verify this:
> the last good commit is from Peter Lee "Simplify init code of some Set
> and List".
> then I made a small commit "TIKA-3248: avoid ClassCastException" of
> about 10 lines.
>
> then "bad" things happened.
> Ideally, the last three commits on top shouldn't have happened.
> Tilman
> Am 14.12.2020 um 06:51 schrieb Tilman Hausherr:
> > Hi all,
> >
> > I made an accidental merge and I tried to revert it, but I suspect I
> > made it worse. Still working on it...
> >
> > Tilman
> >
>



[jira] [Resolved] (TIKA-3218) Wrong comment for method sortLoadedClasses in ServiceLoaderUtils

2020-12-04 Thread Peter Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Lee resolved TIKA-3218.
-
Fix Version/s: 2.0.0
   Resolution: Fixed

> Wrong comment for method sortLoadedClasses in ServiceLoaderUtils
> 
>
> Key: TIKA-3218
> URL: https://issues.apache.org/jira/browse/TIKA-3218
> Project: Tika
>  Issue Type: Bug
>  Components: core
>Affects Versions: 2.0.0
>    Reporter: Peter Lee
>Priority: Minor
> Fix For: 2.0.0





[jira] [Commented] (TIKA-3218) Wrong comment for method sortLoadedClasses in ServiceLoaderUtils

2020-12-04 Thread Peter Lee (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17244377#comment-17244377
 ] 

Peter Lee commented on TIKA-3218:
-

Thank you for fixing this (y)

> Wrong comment for method sortLoadedClasses in ServiceLoaderUtils
> 
>
> Key: TIKA-3218
> URL: https://issues.apache.org/jira/browse/TIKA-3218
> Project: Tika
>  Issue Type: Bug
>  Components: core
>Affects Versions: 2.0.0
>    Reporter: Peter Lee
>Priority: Minor





Re: [ANNOUNCE] Welcome Peter Lee as Tika PMC member and committer

2020-11-25 Thread Peter Lee
Many thanks to you, Tim. :)

Hi, all
I'm Peter Lee and I was an Apache Commons committer. I'm familiar with many 
archivers and compressors. Feel free to ask me if you have any problems with 
compression.
I'm honored to be part of Tika. Tika is great and it has helped me a lot. Besides, 
Tika is a great community and it has helped a lot of users. I hope I can help 
Tika a little bit.
Once again, thank you all for making such a great community!
cheers,
Lee
On 11 25 2020, at 9:27, Tim Allison  wrote:
> All,
>
> The Tika PMC has elected to add Peter Lee to our ranks.
> Lee,
> Please introduce yourself, and welcome aboard!
>
> Cheers,
> Tim

Re: branch_1x tika-bundle issues

2020-11-17 Thread Peter Lee
Got the same problem.

After some investigation, I believe it's caused by the version of 
maven-bundle-plugin:
I can successfully build branch_1x with version 4.1.0, but the build fails with 
versions 4.2.0, 4.2.1 and 5.1.1.

I'm still working on finding out what's wrong here. Hope this helps.
cheers,
Lee

On 11 18 2020, at 4:18 , Tim Allison  wrote:
> Is anyone else having problems building branch_1x? I'm having problems on
> ubuntu with openjdk8 -- I'm getting this error when the build tries to test
> the tika-bundle module:
>
> org.apache.tika.bundle
> org.osgi.framework.BundleException: Unable to resolve
> org.apache.tika.bundle [19](R 19.0): missing requirement
> [org.apache.tika.bundle [19](R 19.0)] osgi.wiring.package;
> (&(osgi.wiring.package=org.apache.tika.config)(version>=1.25.0)(!(version>=2.0.0)))
> [caused by: Unable to resolve org.apache.tika.core [13](R 13.0): missing
> requirement [org.apache.tika.core [13](R 13.0)] osgi.wiring.package;
> (osgi.wiring.package=org.apache.xerces.util)] Unresolved requirements:
> [[org.apache.tika.bundle [19](R 19.0)] osgi.wiring.package;
> (&(osgi.wiring.package=org.apache.tika.config)(version>=1.25.0)(!(version>=2.0.0)))]
>
> Any recommendations?
> Thank you!
> Cheers,
> Tim

[jira] [Commented] (TIKA-3218) Wrong comment for method sortLoadedClasses in ServiceLoaderUtils

2020-11-05 Thread Peter Lee (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17227108#comment-17227108
 ] 

Peter Lee commented on TIKA-3218:
-

_so that user-provided ones would come first and would be able to override 
built-in Tika ones_

 

It seems the built-in Tika ones come first - you can check this via my test.

> Wrong comment for method sortLoadedClasses in ServiceLoaderUtils
> 
>
> Key: TIKA-3218
> URL: https://issues.apache.org/jira/browse/TIKA-3218
> Project: Tika
>  Issue Type: Bug
>  Components: core
>Affects Versions: 2.0.0
>    Reporter: Peter Lee
>Priority: Minor





[jira] [Created] (TIKA-3218) Wrong comment for method sortLoadedClasses in ServiceLoaderUtils

2020-10-30 Thread Peter Lee (Jira)
Peter Lee created TIKA-3218:
---

 Summary: Wrong comment for method sortLoadedClasses in 
ServiceLoaderUtils
 Key: TIKA-3218
 URL: https://issues.apache.org/jira/browse/TIKA-3218
 Project: Tika
  Issue Type: Bug
  Components: core
Affects Versions: 2.0.0
Reporter: Peter Lee


 

Here is the comment on method sortLoadedClasses:

{code:java}
/**
 * Sorts a list of loaded classes, so that non-Tika ones come
 * before Tika ones, and otherwise in reverse alphabetical order
 */
{code}
But if you check the code, you will find that the method does the opposite. See 
[1]

Also, if you run this test, you can see that the Tika class comes before the 
non-Tika class in the sorted list.

{code:java}
import static org.junit.Assert.assertEquals;
import java.util.ArrayList;
import java.util.List;
import org.apache.tika.exception.TikaException;
import org.apache.tika.utils.ServiceLoaderUtils;
import org.junit.Test;

@Test
public void test() {
    // Add the non-Tika object first, then the Tika one.
    List<Object> list = new ArrayList<>();
    list.add(new Object());
    list.add(new TikaException("abcd"));
    ServiceLoaderUtils.sortLoadedClasses(list);

    // After sorting, the Tika class is first.
    assertEquals("org.apache.tika.exception.TikaException",
            list.get(0).getClass().getName());
    assertEquals("java.lang.Object", list.get(1).getClass().getName());
}
{code}

I think the code is right and we need to modify the comment.

[1] https://github.com/apache/tika/blob/6d2312a98cb4d9698c73158c2e28d296756ef959/tika-core/src/main/java/org/apache/tika/utils/ServiceLoaderUtils.java#L30
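
The ordering reported above (Tika classes ahead of non-Tika ones) can be illustrated with a small stand-alone comparator. This is a sketch under assumptions, not the code from ServiceLoaderUtils: the class name TikaFirstOrdering, the package prefix check, and the alphabetical tie-break inside each group are all choices made for this example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration of a "Tika classes first" ordering on class names.
public class TikaFirstOrdering {

    // Assumed marker for built-in Tika classes.
    static final String TIKA_PREFIX = "org.apache.tika.";

    // Tika-prefixed names sort before all others; ties fall back to plain
    // alphabetical order (an arbitrary choice for this sketch).
    static final Comparator<String> TIKA_FIRST = (n1, n2) -> {
        boolean t1 = n1.startsWith(TIKA_PREFIX);
        boolean t2 = n2.startsWith(TIKA_PREFIX);
        if (t1 == t2) {
            return n1.compareTo(n2);
        }
        return t1 ? -1 : 1;
    };

    // Returns a sorted copy and leaves the input list untouched.
    static List<String> sorted(List<String> names) {
        List<String> copy = new ArrayList<>(names);
        copy.sort(TIKA_FIRST);
        return copy;
    }

    public static void main(String[] args) {
        List<String> names = new ArrayList<>();
        names.add("java.lang.Object");
        names.add("org.apache.tika.exception.TikaException");
        // prints [org.apache.tika.exception.TikaException, java.lang.Object]
        System.out.println(sorted(names));
    }
}
```

With a comparator of this shape, a user-provided class can only come first if the two return values in the last branch are swapped, which is the behaviour the original comment describes.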





[jira] [Commented] (TIKA-3213) Consider migrating universalcharsetdetector to a live fork

2020-10-24 Thread Peter Lee (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17220030#comment-17220030
 ] 

Peter Lee commented on TIKA-3213:
-

This fork has not supported Chinese charset detection since version 2.0.0.

See this issue : [https://github.com/albfernandez/juniversalchardet/issues/34]

It might be a problem.

> Consider migrating universalcharsetdetector to a live fork
> --
>
> Key: TIKA-3213
> URL: https://issues.apache.org/jira/browse/TIKA-3213
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
>
> I just came across this living fork of the aged juniversalchardet (2011!!!): 
> https://github.com/albfernandez/juniversalchardet
> It has a mozilla license, has decent star count and is published on maven 
> central.
> Obv, we'll want to run a comparison on our corpus before making this change, 
> but I wanted to open this issue for discussion.





[jira] [Commented] (TIKA-3209) Different between PictureRunMapper in POI and PicturesSource in Tika

2020-10-19 Thread Peter Lee (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17217212#comment-17217212
 ] 

Peter Lee commented on TIKA-3209:
-

Hi [~nick]

I just replaced PicturesSource in Tika with PictureRunMapper from POI in my fork 
repo, in commit 232643b; see [5].

I got these test failures; see [6]:
{code:java}
Error:  Failures: 
Error:    POIContainerExtractionTest.testEmbeddedImages:90 expected:<1> but was:<0>
Error:    POIContainerExtractionTest.testEmbeddedStorageId:137 expected:<{F4754C9B-64F5-4B40-8AF4-679732AC0607}> but was:
Error:    OOXMLContainerExtractionTest.testEmbeddedOfficeFiles:170 expected:<24> but was:<22>
Error:    SXWPFExtractorTest.testEmbedded:761 expected:<16> but was:<15>
{code}
 

[5] 
[https://github.com/PeterAlfredLee/tika/commit/232643b27bdd7798f94b64931b5070d667f8dc29]

[6] 
[https://github.com/PeterAlfredLee/tika/runs/1278568384?check_suite_focus=true]

 

> Different between PictureRunMapper in POI and PicturesSource in Tika
> 
>
> Key: TIKA-3209
> URL: https://issues.apache.org/jira/browse/TIKA-3209
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Reporter: Peter Lee
>Priority: Minor
>
> 1. According to POI's git log, class PictureRunMapper was copied from class 
> PicturesSource in Tika; see [1]
> 2. A TODO in Tika suggests replacing PicturesSource with PictureRunMapper 
> once POI 3.18 is out; see [2]
> So I tried the replacement, but got a test failure.
> I think the cause may be the difference between the nextUnclaimed methods of 
> these two classes; see [3][4]
>  
> Can we remove this line in POI? See [4]
>  
> [1] 
> [https://github.com/apache/poi/commit/bdb0e8199bb6891b068e97da69d6410870e8066b]
> [2] 
> [https://github.com/apache/tika/blob/172d40322f5662e428850ad7a8fb4113e453a51c/tika-parser-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/WordExtractor.java#L641]
> [3]
> [https://github.com/apache/tika/blob/172d40322f5662e428850ad7a8fb4113e453a51c/tika-parser-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/WordExtractor.java#L709]
>  
> [4] 
> [https://github.com/apache/poi/blob/f509d1deae86866ed531f10f2eba7db17e098473/src/scratchpad/src/org/apache/poi/hwpf/usermodel/PictureRunMapper.java#L130]
>  





[jira] [Commented] (TIKA-3209) Different between PictureRunMapper in POI and PicturesSource in Tika

2020-10-18 Thread Peter Lee (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17216391#comment-17216391
 ] 

Peter Lee commented on TIKA-3209:
-

 [~nick]

Could you give some advice?

Can we remove that line in POI? See [4]

> Different between PictureRunMapper in POI and PicturesSource in Tika
> 
>
> Key: TIKA-3209
> URL: https://issues.apache.org/jira/browse/TIKA-3209
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>    Reporter: Peter Lee
>Priority: Minor
>
> 1. According to POI's git log, class PictureRunMapper was copied from class 
> PicturesSource in Tika; see [1]
> 2. A TODO in Tika suggests replacing PicturesSource with PictureRunMapper 
> once POI 3.18 is out; see [2]
> So I tried the replacement, but got a test failure.
> I think the cause may be the difference between the nextUnclaimed methods of 
> these two classes; see [3][4]
>  
> Can we remove this line in POI? See [4]
>  
> [1] 
> [https://github.com/apache/poi/commit/bdb0e8199bb6891b068e97da69d6410870e8066b]
> [2] 
> [https://github.com/apache/tika/blob/172d40322f5662e428850ad7a8fb4113e453a51c/tika-parser-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/WordExtractor.java#L641]
> [3]
> [https://github.com/apache/tika/blob/172d40322f5662e428850ad7a8fb4113e453a51c/tika-parser-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/WordExtractor.java#L709]
>  
> [4] 
> [https://github.com/apache/poi/blob/f509d1deae86866ed531f10f2eba7db17e098473/src/scratchpad/src/org/apache/poi/hwpf/usermodel/PictureRunMapper.java#L130]
>  





[jira] [Created] (TIKA-3209) Different between PictureRunMapper in POI and PicturesSource in Tika

2020-10-13 Thread Peter Lee (Jira)
Peter Lee created TIKA-3209:
---

 Summary: Different between PictureRunMapper in POI and 
PicturesSource in Tika
 Key: TIKA-3209
 URL: https://issues.apache.org/jira/browse/TIKA-3209
 Project: Tika
  Issue Type: Bug
  Components: parser
Reporter: Peter Lee


1. According to POI's git log, class PictureRunMapper was copied from class 
PicturesSource in Tika; see [1]

2. A TODO in Tika suggests replacing PicturesSource with PictureRunMapper once 
POI 3.18 is out; see [2]

So I tried the replacement, but got a test failure.

I think the cause may be the difference between the nextUnclaimed methods of 
these two classes; see [3][4]

Can we remove this line in POI? See [4]

 

[1] 
[https://github.com/apache/poi/commit/bdb0e8199bb6891b068e97da69d6410870e8066b]


[2] 
[https://github.com/apache/tika/blob/172d40322f5662e428850ad7a8fb4113e453a51c/tika-parser-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/WordExtractor.java#L641]


[3]

[https://github.com/apache/tika/blob/172d40322f5662e428850ad7a8fb4113e453a51c/tika-parser-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/WordExtractor.java#L709]

 

[4] 

[https://github.com/apache/poi/blob/f509d1deae86866ed531f10f2eba7db17e098473/src/scratchpad/src/org/apache/poi/hwpf/usermodel/PictureRunMapper.java#L130]

 





[jira] [Comment Edited] (TIKA-3196) PackageParser should attempt to parse entries from zip files with STORED entries with data descriptor

2020-09-22 Thread Peter Lee (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17200448#comment-17200448
 ] 

Peter Lee edited comment on TIKA-3196 at 9/23/20, 2:13 AM:
---

Hi [~tallison]

I wrote a test here: 
[https://github.com/apache/tika/pull/356#issuecomment-696721537]

I forged a zip in memory that's small enough (~100 b), so we do not need to 
attach a zip archive to reproduce this. Hope this helps.


was (Author: peterlee):
Hi [~tallison]

I wrote a test here : 
[https://github.com/apache/tika/pull/356#issuecomment-696721537]

I forged a zip in memory that's small enough(~100 kb) and we do not need to 
attach a zip archive to reproduce this. Hope this helps.

> PackageParser should attempt to parse entries from zip files with STORED 
> entries with data descriptor
> -
>
> Key: TIKA-3196
> URL: https://issues.apache.org/jira/browse/TIKA-3196
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Reporter: Trevor Bentley
>Priority: Major
> Attachments: OOO-107047-0.oxt-145.zip
>
>
> We are currently using tika for text extraction. Currently some sites are 
> returning zips that have entries with stored data descriptors which fail to 
> extract due to the ZipArchiveInputStream (in commons-compress) defaulting to 
> false for 'allowStoredEntriesWithDataDescriptor'.
> Since ZipArchiveInputStream has support for reading zips with data 
> descriptors we should attempt to read the zip with that feature enabled when 
> we get a data descriptor UnsupportedZipFeatureException.
> Pull Request: 
> [https://github.com/apache/tika/pull/356|https://github.com/apache/tika/pull/355]
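
The kind of entry described in the issue above (method STORED, sizes deferred to a trailing data descriptor) can be forged in a few dozen bytes. The sketch below is not the test from the pull request and does not use commons-compress; it hand-writes one local file header plus data descriptor and feeds it to java.util.zip.ZipInputStream, which is used here only as a stand-in streaming reader. The class and method names are hypothetical, and the rejection is what I expect from the JDK versions I know of, not a documented guarantee.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;
import java.util.zip.ZipException;
import java.util.zip.ZipInputStream;

// Hypothetical demo: a STORED zip entry whose sizes live in a data descriptor.
public class StoredDataDescriptorDemo {

    // Writes a little-endian value using the given number of bytes.
    static void le(ByteArrayOutputStream out, long value, int bytes) {
        for (int i = 0; i < bytes; i++) {
            out.write((int) ((value >>> (8 * i)) & 0xFF));
        }
    }

    // Builds local file header + raw data + data descriptor (no central directory).
    static byte[] forgeZip(String name, byte[] data) {
        byte[] nameBytes = name.getBytes(StandardCharsets.UTF_8);
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        le(out, 0x04034b50L, 4);       // local file header signature
        le(out, 20, 2);                // version needed to extract
        le(out, 0x0008, 2);            // flags: bit 3 set, data descriptor follows
        le(out, 0, 2);                 // compression method 0 = STORED
        le(out, 0, 2);                 // last modification time
        le(out, 0, 2);                 // last modification date
        le(out, 0, 4);                 // CRC-32, deferred to the descriptor
        le(out, 0, 4);                 // compressed size, deferred
        le(out, 0, 4);                 // uncompressed size, deferred
        le(out, nameBytes.length, 2);  // file name length
        le(out, 0, 2);                 // extra field length
        out.write(nameBytes, 0, nameBytes.length);
        out.write(data, 0, data.length);
        le(out, 0x08074b50L, 4);       // data descriptor signature
        le(out, crc.getValue(), 4);
        le(out, data.length, 4);
        le(out, data.length, 4);
        return out.toByteArray();
    }

    // Returns true if the JDK streaming reader refuses the first entry.
    static boolean jdkReaderRejects(byte[] zip) {
        try (ZipInputStream in = new ZipInputStream(new ByteArrayInputStream(zip))) {
            in.getNextEntry();
            return false;
        } catch (ZipException e) {
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] zip = forgeZip("a.txt", "hello".getBytes(StandardCharsets.UTF_8));
        System.out.println(zip.length + " bytes, rejected: " + jdkReaderRejects(zip));
    }
}
```

A streaming reader cannot know where a STORED entry ends when the header carries no size, which is why such archives need special handling. For commons-compress, the issue text names the relevant switch: the 'allowStoredEntriesWithDataDescriptor' option of ZipArchiveInputStream.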





[jira] [Commented] (TIKA-3196) PackageParser should attempt to parse entries from zip files with STORED entries with data descriptor

2020-09-22 Thread Peter Lee (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17200448#comment-17200448
 ] 

Peter Lee commented on TIKA-3196:
-

Hi [~tallison]

I wrote a test here: 
[https://github.com/apache/tika/pull/356#issuecomment-696721537]

I forged a zip in memory that's small enough (~100 kb), so we do not need to 
attach a zip archive to reproduce this. Hope this helps.

> PackageParser should attempt to parse entries from zip files with STORED 
> entries with data descriptor
> -
>
> Key: TIKA-3196
> URL: https://issues.apache.org/jira/browse/TIKA-3196
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Reporter: Trevor Bentley
>Priority: Major
> Attachments: OOO-107047-0.oxt-145.zip
>
>
> We are currently using tika for text extraction. Currently some sites are 
> returning zips that have entries with stored data descriptors which fail to 
> extract due to the ZipArchiveInputStream (in commons-compress) defaulting to 
> false for 'allowStoredEntriesWithDataDescriptor'.
> Since ZipArchiveInputStream has support for reading zips with data 
> descriptors we should attempt to read the zip with that feature enabled when 
> we get a data descriptor UnsupportedZipFeatureException.
> Pull Request: 
> [https://github.com/apache/tika/pull/356|https://github.com/apache/tika/pull/355]





[jira] [Resolved] (TIKA-3197) TikaInputStream may not be closed

2020-09-14 Thread Peter Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Lee resolved TIKA-3197.
-
Resolution: Not A Problem

> TikaInputStream may not be closed
> -
>
> Key: TIKA-3197
> URL: https://issues.apache.org/jira/browse/TIKA-3197
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>    Reporter: Peter Lee
>Priority: Minor





[jira] [Created] (TIKA-3197) TikaInputStream may not be closed

2020-09-14 Thread Peter Lee (Jira)
Peter Lee created TIKA-3197:
---

 Summary: TikaInputStream may not be closed
 Key: TIKA-3197
 URL: https://issues.apache.org/jira/browse/TIKA-3197
 Project: Tika
  Issue Type: Bug
  Components: parser
Reporter: Peter Lee


This is TikaInputStream's close method: 
{code:java}
public void close() throws IOException {
    path = null;
    mark = -1;
    // Register the wrapped stream, then close all registered resources.
    tmp.addResource(in);
    tmp.close();
}
{code}
It cleans up the TemporaryResources and closes the InputStream that was 
originally passed in as a parameter.

This is TikaInputStream's get method: 
{code:java}
public static TikaInputStream get(InputStream stream, TemporaryResources tmp) {
    if (stream == null) {
        throw new NullPointerException("The Stream must not be null");
    }
    if (stream instanceof TikaInputStream) {
        return (TikaInputStream) stream;
    } else {
        // Make sure that the stream is buffered and that it
        // (properly) supports the mark feature
        if (!(stream.markSupported())) {
            stream = new BufferedInputStream(stream);
        }
        return new TikaInputStream(stream, tmp, -1);
    }
}
{code}
If stream is not an instance of TikaInputStream, *a new instance of 
TikaInputStream is created and returned.* 

And as you can see, *this new instance is not closed by the close method in 
this case*. Only the InputStream that was originally passed in as a parameter 
is closed.
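
The wrap-or-reuse pattern in get and the close path described above can be tried out with plain java.io classes. The sketch below is a simplified stand-in, not Tika's implementation: WrappingStream and TrackingStream are hypothetical names, and the wrapper relies on FilterInputStream.close() rather than on TemporaryResources.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical stand-in for a "reuse if already wrapped, else wrap" stream factory.
public class WrapperCloseDemo {

    // Test double that records whether close() reached it.
    static class TrackingStream extends ByteArrayInputStream {
        boolean closed = false;

        TrackingStream(byte[] data) {
            super(data);
        }

        @Override
        public void close() throws IOException {
            closed = true;
            super.close();
        }
    }

    // Wrapper whose inherited close() closes the wrapped stream.
    static class WrappingStream extends FilterInputStream {
        private WrappingStream(InputStream in) {
            super(in);
        }

        static WrappingStream get(InputStream stream) {
            if (stream == null) {
                throw new NullPointerException("The Stream must not be null");
            }
            if (stream instanceof WrappingStream) {
                return (WrappingStream) stream; // already wrapped, reuse it
            }
            if (!stream.markSupported()) {
                stream = new BufferedInputStream(stream);
            }
            return new WrappingStream(stream);
        }
    }

    public static void main(String[] args) throws IOException {
        TrackingStream raw = new TrackingStream(new byte[] {1, 2, 3});
        WrappingStream wrapped = WrappingStream.get(raw);
        System.out.println(WrappingStream.get(wrapped) == wrapped); // prints true
        wrapped.close();
        System.out.println(raw.closed); // prints true
    }
}
```

In this stand-in the wrapper holds nothing besides the wrapped stream, so closing the wrapper releases everything there is to release. Whether the same holds for TikaInputStream depends on what else it registers in TemporaryResources, which this sketch does not model.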

 





Re: release planning?

2020-09-10 Thread Peter Lee
Everything looks good enough to me.

BTW I'm willing to participate in the development of Tika 2.0. Is there any work 
that could be assigned to me?
Lee
On 9 10 2020, at 12:31, Tim Allison  wrote:
> Hi Lee,
> Thank you for those PRs. I merged them into main and
> cherry-picked+resolved in branch_1x. Anything else?
>
> On Tue, Sep 8, 2020 at 9:56 PM Peter Lee  wrote:
> > Hi Tim,
> >
> > I pushed some bugfix PRs on GitHub; maybe we could check whether they
> > should be merged into branch_1x:
> > #330 : URL updates
> > #340 : some minor fixes for TikaCLI
> > #347 : minor fix for BatchProcessBuilder
> > #353 : fix for test failures for developers whose default language
> > is not English
> >
> > And I think these 2 improvements could also be considered:
> > #332 : fix for test failures on Windows caused by temp files failing to
> > be deleted
> > #334 : add an empty check for TikaConfig.
> >
> > BTW these PRs were pushed to the main branch. I can rebase and push them to
> > branch_1x if needed.
> > WDYT?
> > cheers,
> > Lee
> >
> > On 9 9 2020, at 3:38, Tim Allison  wrote:
> > > Hi All,
> > >
> > > What would you say to kicking off the release process for 1.25 in the
> > > next couple of weeks? I want to fix TIKA-3186 before that release.
> > > Anything else?
> > >
> > > For 2.0.0-ALPHA, I think once we get the OSGi bundles back in, we should
> > > be good to go? Do we want to convert to gradle before release? I've
> > > punted on multi-parsers. What else do we need before an alpha release?
> > >
> > > Cheers,
> > > Tim
>



Re: release planning?

2020-09-08 Thread Peter Lee
Hi Tim,

I pushed some bugfix PRs on GitHub; maybe we could check whether they 
should be merged into branch_1x:
#330 : URL updates
#340 : some minor fixes for TikaCLI
#347 : minor fix for BatchProcessBuilder
#353 : fix for test failures for developers whose default language is not 
English

And I think these 2 improvements could also be considered:
#332 : fix for test failures on Windows caused by temp files failing to be 
deleted
#334 : add an empty check for TikaConfig.

BTW these PRs were pushed to the main branch. I can rebase and push them to 
branch_1x if needed.
WDYT?
cheers,
Lee

On 9 9 2020, at 3:38, Tim Allison  wrote:
> Hi All,
>
> What would you say to kicking off the release process for 1.25 in the
> next couple of weeks? I want to fix TIKA-3186 before that release.
> Anything else?
>
> For 2.0.0-ALPHA, I think once we get the OSGi bundles back in, we should
> be good to go? Do we want to convert to gradle before release? I've
> punted on multi-parsers. What else do we need before an alpha release?
>
> Cheers,
> Tim

Re: Tests failed in windows but not in linux

2020-08-24 Thread Peter Lee
Hi Bob,

I think I have found out what's wrong. It seems there's an infinite loop. I have 
pushed a PR, please have a look at:
https://github.com/apache/tika/pull/343

cheers,
Lee

On 8 24 2020, at 8:54 , Bob Paulin  wrote:
>
> Hi Lee,
>
> I get the same error on windows with GeoParser and SentimentAnalysisParser on 
> the main branch. Removing the Logger fixes both and it builds cleanly. Still 
> not sure what the exact issue is but I can recreate the issue and your 
> solution.
> - Bob
> On 8/24/2020 4:02 AM, Peter Lee wrote:
> >
> > Update :
> >
> > It works after I removed the loggers in GeoParser and GeoParserConfig. But 
> > I'm still not clear what exactly the problem is. :(
> > Lee
> > On 8 24 2020, at 3:27 , Peter Lee  
> > (mailto:peter...@apache.org) wrote:
> > >
> > > Hi all,
> > >
> > > The tests are failing on my Windows machine: GeoParserTest fails because 
> > > the class org.apache.tika.parser.geo.GeoParser could not be found. But 
> > > everything works fine on my Ubuntu machine.
> > > The error is weird. I did some googling but couldn't figure out what 
> > > the problem is.
> > > Does anyone else get the same error on Windows?
> > > Lee
> >
>



Re: Tests failed in windows but not in linux

2020-08-24 Thread Peter Lee
Update :

It works after I removed the loggers in GeoParser and GeoParserConfig. But I'm 
still not clear what exactly the problem is. :(
Lee
On 8 24 2020, at 3:27 , Peter Lee  wrote:
> Hi all,
>
> The tests are failing on my Windows machine: GeoParserTest fails because the 
> class org.apache.tika.parser.geo.GeoParser could not be found. But everything 
> works fine on my Ubuntu machine.
> The error is weird. I did some googling but couldn't figure out what 
> the problem is.
> Does anyone else get the same error on Windows?
> Lee

Tests failed in windows but not in linux

2020-08-24 Thread Peter Lee
Hi all,

The tests are failing on my Windows machine: GeoParserTest fails because the 
class org.apache.tika.parser.geo.GeoParser could not be found. But everything 
works fine on my Ubuntu machine.
The error is weird. I did some googling but couldn't figure out what 
the problem is.
Does anyone else get the same error on Windows?
Lee

Re: Windows build errors

2020-08-19 Thread Peter Lee
Hi Tilman,

> expected: but was: charset=[windows-1252]>
I think this problem is caused by the charset detection strategy, which depends 
on the line separator (CRLF or LF), combined with the git autocrlf config. I 
also ran into this problem and solved it like this:
1. Set autocrlf to false: git config --global core.autocrlf false
2. Delete everything except for the .git directory
3. Restore all the code: git reset --hard

This works for me. Hope it helps.
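
If you want to check whether a checkout was affected, counting CRLF pairs in one of the test documents before and after the reset should tell you. Here is a rough sketch using only the JDK; the class and method names (LineEndings, countCrlf) are made up for this mail and are not part of Tika.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of Tika: counts CRLF pairs in a file so you
// can tell whether git converted LF to CRLF on checkout.
final class LineEndings {

    // Returns the number of "\r\n" sequences in the given bytes.
    static int countCrlf(byte[] data) {
        int count = 0;
        for (int i = 0; i + 1 < data.length; i++) {
            if (data[i] == '\r' && data[i + 1] == '\n') {
                count++;
                i++; // skip the '\n' that was just matched
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java LineEndings.java <file>");
            return;
        }
        // Pass the path of any text test resource from your working copy.
        Path file = Paths.get(args[0]);
        int crlf = countCrlf(Files.readAllBytes(file));
        System.out.println(file + ": " + crlf + " CRLF line ending(s)");
    }
}
```

A non-zero count on a file that is stored with LF in the repository would suggest autocrlf rewrote it on checkout.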

cheers,
Lee

On 8 20 2020, at 1:30 , Tilman Hausherr  wrote:
> After many weeks I checked out the "main" branch, and get these build
> errors:
>
> Failures:
> TestMimeTypes.testArchiveDetection:395->assertTypeByData:1275 
> expected:<[application/x-archive]> but was:<[text/plain]>
> AutoDetectParserTest.testHTML:192->assertAutoDetect:145->assertAutoDetect:128->assertAutoDetect:99
>  Bad content type: Test parameters:
> resourceRealName = /test-documents/testHTML.html
> resourceStatedName = /test-documents/testHTML.html
> realType = text/html; charset=ISO-8859-1
> statedType = null
> expectedContentFragment = Test Indexation Html
> expected: but was: charset=[windows-1252]>
> AutoDetectParserTest.testText:224->assertAutoDetect:145->assertAutoDetect:128->assertAutoDetect:99
>  Bad content type: Test parameters:
> resourceRealName = /test-documents/testTXT.txt
> resourceStatedName = /test-documents/testTXT.txt
> realType = text/plain; charset=ISO-8859-1
> statedType = null
> expectedContentFragment = indexation de Txt
> expected: but was: charset=[windows-1252]>
> TextAndCSVParserTest.testSubclassingMimeTypesRemain:217 
> expected:<...-vcalendar; charset=[ISO-8859-1]> but was:<...-vcalendar; 
> charset=[windows-1252]>
> ArParserTest.testArParsing:43 expected:<[application/x-archive]> but 
> was:<[text/plain; charset=windows-1252]>
> ArParserTest.testEmbedded:75 expected:<1> but was:<0>
> TXTParserTest.testSubclassingMimeTypesRemain:299 expected:<...-vcalendar; 
> charset=[ISO-8859-1]> but was:<...-vcalendar; charset=[windows-1252]>
> Errors:
> TestContainerAwareDetector.testBPList:571->assertTypeByData:66->assertTypeByNameAndData:76->assertTypeByNameAndData:96
>  » ArrayIndexOutOfBounds
> PListParserTest.testWebArchive:48->TikaTest.getRecursiveMetadata:232->TikaTest.getRecursiveMetadata:306
>  » ArrayIndexOutOfBounds
> PDFParserTest.testFileInAnnotationExtractedIfNoContents:1491->TikaTest.assertContains:112
>  » NullPointer

[jira] [Commented] (TIKA-1770) AutoDetectParser wrongly detects plain text as images/audio

2020-08-15 Thread Peter Lee (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-1770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17178184#comment-17178184
 ] 

Peter Lee commented on TIKA-1770:
-

I tested the 3 given files with tika-1.24.1. Here are Tika's content-type detection results:

 
||File Name||Content Type||
|the-acl-rd-tec_chunk_15.txt|audio/mpeg|
|the-acl-rd-tec_chunk_9113.txt|image/x-portable-bitmap|
|the-acl-rd-tec_chunk_10228.txt|image/x-portable-bitmap|

 

Reason:

The content of `the-acl-rd-tec_chunk_15.txt` starts with the string "ID3", which 
is the magic bytes of audio/mpeg.

The content of `the-acl-rd-tec_chunk_9113.txt` starts with the string "P1", which 
is the magic bytes of image/x-portable-bitmap.

The content of `the-acl-rd-tec_chunk_10228.txt` starts with the string "P4", which 
is the magic bytes of image/x-portable-bitmap.

 

After googling these two formats, I couldn't find a way to make their 
magic-byte matching stricter.

Maybe we should set up a rule: for some formats, both the file extension and 
the magic bytes must match.
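
To illustrate the rule I have in mind, here is a minimal sketch in plain Java. It is not Tika's detector: the class name, the rule table, the required extensions (.mp3, .pbm) and the fallback to text/plain are all assumptions made for this example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Illustration only, NOT Tika's detector. It shows why short ASCII magics
// collide with plain text and how a combined magic + extension rule behaves.
final class StrictMagicSketch {

    // magic prefix, media type, required extension (extensions are assumed)
    private static final String[][] RULES = {
        {"ID3", "audio/mpeg", ".mp3"},
        {"P1", "image/x-portable-bitmap", ".pbm"},
        {"P4", "image/x-portable-bitmap", ".pbm"},
    };

    // Magic-only detection: the first matching prefix wins.
    static String detectByMagic(byte[] head) {
        String start = new String(head, StandardCharsets.ISO_8859_1);
        for (String[] rule : RULES) {
            if (start.startsWith(rule[0])) {
                return rule[1];
            }
        }
        return "text/plain"; // simplified fallback for this sketch
    }

    // Proposed rule: accept a weak magic only if the file extension agrees.
    static String detectStrict(byte[] head, String fileName) {
        String start = new String(head, StandardCharsets.ISO_8859_1);
        String name = fileName.toLowerCase(Locale.ROOT);
        for (String[] rule : RULES) {
            if (start.startsWith(rule[0]) && name.endsWith(rule[2])) {
                return rule[1];
            }
        }
        return "text/plain"; // simplified fallback for this sketch
    }
}
```

With this sketch, a .txt file whose text happens to begin with "ID3" is reported as audio/mpeg by detectByMagic but stays text/plain with detectStrict. The downside is that a real MP3 or PBM file with a missing or wrong extension would no longer be detected.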

> AutoDetectParser wrongly detects plain text as images/audio
> ---
>
> Key: TIKA-1770
> URL: https://issues.apache.org/jira/browse/TIKA-1770
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.10
> Environment: OS independent (tested on both Windows, MAC OS)
>Reporter: Ziqi
>Priority: Minor
> Attachments: the-acl-rd-tec_chunk_10228.txt, 
> the-acl-rd-tec_chunk_15.txt, the-acl-rd-tec_chunk_9113.txt
>
>
> AutoDetectParser fails to recognize certain plain-text files as plain text.
> In the attachment are three testing files, as you can see they are all plain 
> text.
> The following code is used for testing:
> 
> AutoDetectParser parser = new AutoDetectParser();
> for (File f : new File("path").listFiles()) {
>     InputStream in = new BufferedInputStream(new FileInputStream(f.toString()));
>     BodyContentHandler handler = new BodyContentHandler(-1);
>     Metadata metadata = new Metadata();
>     try {
>         parser.parse(in, handler, metadata);
>         String content = handler.toString();
>         System.out.println(metadata); //line A
>     } catch (Exception e) {
>         e.printStackTrace();
>     }
> }
> 
> for the three testing files, line A prints the following:
> X-Parsed-By=org.apache.tika.parser.EmptyParser 
> Content-Type=image/x-portable-bitmap 
> X-Parsed-By=org.apache.tika.parser.DefaultParser 
> X-Parsed-By=org.apache.tika.parser.mp3.Mp3Parser xmpDM:audioCompressor=MP3 
> Content-Type=audio/mpeg 
> X-Parsed-By=org.apache.tika.parser.EmptyParser 
> Content-Type=image/x-portable-bitmap 
> And as a result, variable "content" is always empty.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (TIKA-3155) Parse Error while extracting CSV files

2020-08-12 Thread Peter Lee (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17176206#comment-17176206
 ] 

Peter Lee commented on TIKA-3155:
-

As I understand it, here is how Tika handles a CSV file:
1. Try to parse it with commons-csv first.
2. If an IllegalStateException is thrown, parse the rest of the data in the 
InputStream as plain text.
 
Unfortunately, in this case commons-csv has consumed all the data in the 
InputStream before it throws the IllegalStateException, so there is nothing 
left in the InputStream to parse.
 
If we don't try commons-csv first, we can't know whether it is going to throw 
an IllegalStateException.
But if we try and it throws, there is nothing we can do because all the data 
has been consumed.
 
Maybe we could read all the data from the InputStream into a byte array, so 
that we can retry as many times and in as many ways as we like. 
But this may cost a lot of memory, and I don't think it is a smart approach.
 
Maybe it's better if we change nothing and just let users fix their CSV file 
when they hit an IllegalStateException.
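
To make the trade-off concrete, here is a rough sketch of the byte-array idea using only the JDK. The StreamParser interface and parseWithFallback method are hypothetical names for this comment, not Tika API, and the sketch deliberately reads the whole input into memory, which is exactly the cost mentioned above.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Sketch of "buffer once, retry from the start". Not Tika code.
final class RetryFromBuffer {

    // Hypothetical stand-in for "something that parses a stream".
    interface StreamParser {
        String parse(InputStream in) throws IOException;
    }

    // Copies the whole stream into memory so it can be replayed.
    static byte[] readFully(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }

    // Tries the first parser; on IllegalStateException, replays the same
    // bytes from the beginning through the fallback parser.
    static String parseWithFallback(InputStream in, StreamParser first,
                                    StreamParser fallback) throws IOException {
        byte[] data = readFully(in);
        try {
            return first.parse(new ByteArrayInputStream(data));
        } catch (IllegalStateException e) {
            // The first attempt consumed only its own view of the buffer,
            // so the fallback still sees every byte.
            return fallback.parse(new ByteArrayInputStream(data));
        }
    }
}
```

Memory use grows with the file size here, so for large inputs a spooled temporary file would probably be needed instead of a byte array.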

> Parse Error while extracting CSV files
> --
>
> Key: TIKA-3155
> URL: https://issues.apache.org/jira/browse/TIKA-3155
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.24.1
>Reporter: Akash
>Priority: Major
> Attachments: UTF-8_chars.csv
>
>
> We are getting a parse error while trying to extract CSV files.
> This was working in version 1.9, but an exception is thrown in 1.24.1.
>  
> {code:java}
> /Exception in thread "main" org.apache.tika.exception.TikaException: 
> exception parsing the csv
>   at 
> org.apache.tikar.csv.TextAndCSVParser.parse.parse(TextAndCSVParser.java:198 
> undefined)
>   at 
> org.apache.tikar.CompositeParser.parse.parse(CompositeParser.java:280 
> undefined)
>   at 
> org.apache.tikar.CompositeParser.parse.parse(CompositeParser.java:280 
> undefined)
>   at 
> org.apache.tikar.AutoDetectParser.parse.parse(AutoDetectParser.java:143 
> undefined)
>   at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:209 
> undefined)
>   at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:496 undefined)
>   at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:149 undefined)
> Caused by: java.lang.IllegalStateException: IOException reading next record: 
> java.io.IOException: (startline 39) EOF reached before encapsulated token 
> finished
>   at 
> org.apache.commons.csv.CSVParser$CSVRecordIterator.getNextRecord(CSVParser.java:145
>  undefined)
>   at 
> org.apache.commons.csv.CSVParser$CSVRecordIterator.hasNext(CSVParser.java:155 
> undefined)
>   at 
> org.apache.tikar.csv.TextAndCSVParser.parse.parse(TextAndCSVParser.java:178 
> undefined)
>   ... 6 more
> Caused by: java.io.IOException: (startline 39) EOF reached before 
> encapsulated token finished
>   at org.apache.commons.csv.Lexer.parseEncapsulatedToken(Lexer.java:288 
> undefined)
>   at org.apache.commons.csv.Lexer.nextToken(Lexer.java:158 undefined)
>   at org.apache.commons.csv.CSVParser.nextRecord(CSVParser.java:674 
> undefined)
>   at 
> org.apache.commons.csv.CSVParser$CSVRecordIterator.getNextRecord(CSVParser.java:142
>  undefined)/ 
> {code}
> The issue occurs when we encounter double quotes in one of the cells.





[jira] [Commented] (TIKA-3155) Parse Error while extracting CSV files

2020-08-11 Thread Peter Lee (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17175290#comment-17175290
 ] 

Peter Lee commented on TIKA-3155:
-

We can do it in _TextAndCSVParser_ like this
{code:java}
CSVFormat csvFormat = 
CSVFormat.EXCEL.withDelimiter(params.getDelimiter()).withQuote(null);
{code}
 

I tested with Quote Mode off and it works.
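
To show what turning quoting off changes, here is a toy single-line splitter. It is written only for this comment and is NOT Commons CSV: the class and method names are hypothetical, and it ignores escapes, doubled quotes and multi-line records. It only mimics the failure mode from the stack trace.

```java
import java.util.ArrayList;
import java.util.List;

// Toy splitter for illustration; not the Commons CSV implementation.
final class QuoteModeSketch {

    // Splits one record on the delimiter. If quote is non-null, delimiters
    // inside a quoted section are kept and an unterminated quote is an error.
    // If quote is null, the quote character is treated as ordinary data.
    static List<String> split(String line, char delimiter, Character quote) {
        List<String> fields = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (quote != null && c == quote) {
                inQuotes = !inQuotes; // toggle and drop the quote char itself
            } else if (c == delimiter && !inQuotes) {
                fields.add(field.toString());
                field.setLength(0);
            } else {
                field.append(c);
            }
        }
        if (inQuotes) {
            // Same situation as in the reported trace: input ended inside a quote.
            throw new IllegalStateException("EOF reached before encapsulated token finished");
        }
        fields.add(field.toString());
        return fields;
    }
}
```

The catch with quoting off is that a correctly quoted cell containing the delimiter is split into several fields and keeps its quote characters, so this trades one kind of bad output for another.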

 

 






[jira] [Commented] (TIKA-3155) Parse Error while extracting CSV files

2020-08-11 Thread Peter Lee (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17175286#comment-17175286
 ] 

Peter Lee commented on TIKA-3155:
-

Hey. I think it's caused by the Quote Mode of Apache Commons CSV. We can simply 
fix this by turning the Quote Mode off.






Should we add Apache Commons Lang to tika-core as a dependency?

2020-08-03 Thread Peter Lee
Hi all,

I've been working on TIKA-3141 recently and pushed a PR on GitHub. As Keith 
suggested in the PR, maybe we should add Commons Lang to tika-core, as it seems 
Commons Lang is already used elsewhere in Tika but not in tika-core.
Any ideas?
cheers,
Lee


PRs on github need reviews

2020-07-30 Thread Peter Lee
Hi all,

I've been using Tika recently and find it fascinating!
I pushed some PRs on GitHub, but it seems no one is reviewing them (the same 
goes for some other PRs there). Maybe somebody could give me a hand?
Here are the PRs:
https://github.com/apache/tika/pull/334
https://github.com/apache/tika/pull/333
https://github.com/apache/tika/pull/332
https://github.com/apache/tika/pull/331
https://github.com/apache/tika/pull/330
cheers,
Lee



[jira] [Commented] (TIKA-3141) LINUX - Tika shouldn't throw an exception for an empty TIKA_CONFIG environment variable value

2020-07-30 Thread Peter Lee (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17167845#comment-17167845
 ] 

Peter Lee commented on TIKA-3141:
-

Hi [~nick], I've been working on Tika recently and I'm interested in this 
issue. I'm trying to improve it.
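
The behaviour requested in the issue could look roughly like the sketch below: treat a null or blank value as "not set" and fall back to the default. This uses only the JDK; the class and method names are hypothetical and do not mirror TikaConfig internals.

```java
import java.util.Optional;

// Sketch only: shows the "treat empty as unset" idea with plain JDK calls.
final class ConfigLocation {

    // Returns the configured location, or empty if the value is null or blank.
    static Optional<String> fromValue(String raw) {
        if (raw == null || raw.trim().isEmpty()) {
            return Optional.empty();
        }
        return Optional.of(raw.trim());
    }

    public static void main(String[] args) {
        Optional<String> location = fromValue(System.getenv("TIKA_CONFIG"));
        if (location.isPresent()) {
            System.out.println("Using config at " + location.get());
        } else {
            // An empty value is handled like a missing one instead of failing.
            System.out.println("TIKA_CONFIG is unset or empty, using the default config");
        }
    }
}
```

Doing the blank check with trim() and isEmpty() avoids pulling in an extra dependency just for a single string test.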

> LINUX - Tika shouldn't throw an exception for an empty TIKA_CONFIG 
> environment variable value
> -
>
> Key: TIKA-3141
> URL: https://issues.apache.org/jira/browse/TIKA-3141
> Project: Tika
>  Issue Type: Bug
>  Components: config
>Affects Versions: 1.20
> Environment: Any Linux distro.  I'm running the bash shell.  Not sure 
> about other platforms.
>Reporter: Josh Burchard
>Priority: Trivial
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> On my Linux box I configure Tika using the TIKA_CONFIG environment variable 
> to point the Tika server at my config.xml file.   Sometimes, however, I want 
> to clear this variable to use the default config and I noticed that Tika will 
> throw an exception and abort if I do the following:
> export TIKA_CONFIG=''
> Seems like a case that should be handled just by ignoring the empty value 
> (i.e., there's no config to be used so go with the default)  or at the most, 
> log a warning that the variable was detected but its value is empty, but 
> still carry on using the default config.
> Exception in thread "main" java.lang.RuntimeException: Unable to access 
> default configuration
>   at org.apache.tika.config.TikaConfig.getDefaultConfig(TikaConfig.java:410)
>   at org.apache.tika.Tika.<init>(Tika.java:116)
>   at org.apache.tika.server.TikaServerCli.main(TikaServerCli.java:125)
> Caused by: org.apache.tika.exception.TikaException: Specified Tika 
> configuration not found:
>   at org.apache.tika.config.TikaConfig.getConfigInputStream(TikaConfig.java:317)
>   at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:254)
>   at org.apache.tika.config.TikaConfig.getDefaultConfig(TikaConfig.java:405)


