[jira] [Commented] (TIKA-1706) Bring back commons-io to tika-core

2021-07-19 Thread Yaniv Kunda (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17383397#comment-17383397
 ] 

Yaniv Kunda commented on TIKA-1706:
---

What a blast from the past...

Thanks!

> Bring back commons-io to tika-core
> --
>
> Key: TIKA-1706
> URL: https://issues.apache.org/jira/browse/TIKA-1706
> Project: Tika
>  Issue Type: Improvement
>  Components: core
>Reporter: Yaniv Kunda
>Priority: Minor
> Fix For: 2.0.0
>
> Attachments: TIKA-1706-1.patch, TIKA-1706-2.patch
>
>
> TIKA-249 inlined select commons-io classes in order to simplify the 
> dependency tree and save some space.
> I believe these arguments are weaker nowadays due to the following concerns:
> - Most of the non-core modules already use commons-io, and since tika-core is 
> usually not used by itself, commons-io is already included with it
> - Since some modules use both tika-core and commons-io, it's not clear which 
> code should be used
> - Having the inlined classes causes more maintenance and/or technology debt 
> (which in turn causes more maintenance)
> - Newer commons-io code utilizes newer platform code, e.g. using Charset 
> objects instead of encoding names, being able to use StringBuilder instead of 
> StringBuffer, and so on.
> I'll be happy to provide a patch to replace usages of the inlined classes 
> with commons-io classes if this is accepted.
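
To make the last bullet of the description concrete, here is a minimal sketch of the difference between passing an encoding name and passing a Charset object, and of StringBuilder in place of StringBuffer. It uses only JDK calls; the class and method names (CharsetVersusName, decodeByName, decodeByCharset, join) are made up for illustration and do not come from Tika or commons-io.

```java
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;

// Illustrative only: hypothetical helper names, JDK calls throughout.
class CharsetVersusName {

    // Older style: the encoding is a plain string, so a typo only shows up
    // at run time and callers must handle a checked exception.
    static String decodeByName(byte[] data, String encodingName)
            throws UnsupportedEncodingException {
        return new String(data, encodingName);
    }

    // Newer style: the Charset constant is resolved at compile time and
    // no checked exception is declared.
    static String decodeByCharset(byte[] data) {
        return new String(data, StandardCharsets.UTF_8);
    }

    // StringBuilder is the unsynchronized counterpart of StringBuffer and is
    // the usual choice for a buffer that never leaves one method.
    static String join(String... parts) {
        StringBuilder sb = new StringBuilder();
        for (String part : parts) {
            sb.append(part);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        byte[] data = "tika".getBytes(StandardCharsets.UTF_8);
        System.out.println(decodeByName(data, "UTF-8"));
        System.out.println(decodeByCharset(data));
        System.out.println(join("commons", "-", "io"));
    }
}
```

Both decode methods return the same text for valid input; the difference is in what the compiler can check and what the caller has to catch.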



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (TIKA-1706) Bring back commons-io to tika-core

2016-03-30 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15219308#comment-15219308
 ] 

Hudson commented on TIKA-1706:
--

FAILURE: Integrated in tika-2.x #65 (See 
[https://builds.apache.org/job/tika-2.x/65/])
TIKA-1915 and TIKA-1706 - Remove POI. Replace with commons-io+tika-core (bob: 
rev 05f4af3002f1f376095f6b4810d505ea50d08b3c)
* tika-parser-modules/tika-parser-cad-module/pom.xml
* tika-core/src/main/java/org/apache/tika/io/IOExceptionWithCause.java
* tika-parser-modules/tika-parser-multimedia-module/src/main/java/org/apache/tika/parser/image/PSDParser.java
* tika-core/src/main/java/org/apache/tika/io/ClosedInputStream.java
* tika-core/pom.xml
* tika-parser-modules/tika-parser-cad-module/src/main/java/org/apache/tika/parser/prt/PRTParser.java
* tika-langdetect/src/test/java/org/apache/tika/langdetect/OptimaizeLangDetectorTest.java
* tika-core/src/main/java/org/apache/tika/io/TaggedIOException.java
* tika-core/src/main/java/org/apache/tika/io/IOUtils.java
* tika-core/src/test/java/org/apache/tika/TypeDetectionBenchmark.java
* tika-core/src/main/java/org/apache/tika/parser/NetworkParser.java
* tika-core/src/main/java/org/apache/tika/parser/external/ExternalParser.java
* tika-core/src/main/java/org/apache/tika/extractor/ParsingEmbeddedDocumentExtractor.java
* tika-core/src/main/java/org/apache/tika/io/StringUtil.java
* tika-parser-modules/tika-parser-multimedia-module/src/main/java/org/apache/tika/parser/image/BPGParser.java
* tika-core/src/main/java/org/apache/tika/Tika.java
* tika-core/src/test/java/org/apache/tika/sax/SecureContentHandlerTest.java
* tika-parser-modules/tika-parser-multimedia-module/pom.xml
* tika-parser-modules/tika-parser-advanced-module/src/main/java/org/apache/tika/parser/ner/opennlp/OpenNLPNameFinder.java
* tika-parser-modules/tika-parser-cad-module/src/main/java/org/apache/tika/parser/dwg/DWGParser.java
* tika-app/src/test/java/org/apache/tika/parser/mock/MockParserTest.java
* tika-parser-bundles/tika-parser-cad-bundle/pom.xml
* tika-core/src/main/java/org/apache/tika/detect/XmlRootExtractor.java
* tika-parser-modules/tika-parser-advanced-module/src/main/java/org/apache/tika/parser/ner/corenlp/CoreNLPNERecogniser.java
* tika-core/src/test/java/org/apache/tika/TikaTest.java
* tika-core/src/test/java/org/apache/tika/io/TikaInputStreamTest.java
* tika-parser-bundles/tika-parser-multimedia-bundle/pom.xml
* tika-parser-modules/tika-parser-multimedia-module/src/main/java/org/apache/tika/parser/image/ImageMetadataExtractor.java
* tika-core/src/main/java/org/apache/tika/io/NullInputStream.java
* tika-core/src/main/java/org/apache/tika/embedder/ExternalEmbedder.java
* tika-langdetect/src/test/java/org/apache/tika/langdetect/LanguageDetectorTest.java
* tika-core/src/main/java/org/apache/tika/io/CloseShieldInputStream.java
* tika-core/src/main/java/org/apache/tika/io/NullOutputStream.java
* tika-core/src/main/java/org/apache/tika/sax/OfflineContentHandler.java
* tika-core/src/main/java/org/apache/tika/fork/ForkClient.java
* tika-core/src/main/java/org/apache/tika/io/CountingInputStream.java


> Bring back commons-io to tika-core
> --
>
> Key: TIKA-1706
> URL: https://issues.apache.org/jira/browse/TIKA-1706
> Project: Tika
>  Issue Type: Improvement
>  Components: core
>Reporter: Yaniv Kunda
>Priority: Minor
> Fix For: 1.13
>
> Attachments: TIKA-1706-1.patch, TIKA-1706-2.patch



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1706) Bring back commons-io to tika-core

2016-03-30 Thread Bob Paulin (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15219299#comment-15219299
 ] 

Bob Paulin commented on TIKA-1706:
--

Performed removal of duplicated tika.io classes and replaced with commons-io 
2.4 in the Tika 2.0 branch.

> Bring back commons-io to tika-core
> --
>
> Key: TIKA-1706
> URL: https://issues.apache.org/jira/browse/TIKA-1706
> Project: Tika
>  Issue Type: Improvement
>  Components: core
>Reporter: Yaniv Kunda
>Priority: Minor
> Fix For: 1.13
>
> Attachments: TIKA-1706-1.patch, TIKA-1706-2.patch


[jira] [Commented] (TIKA-1706) Bring back commons-io to tika-core

2015-11-27 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15029955#comment-15029955
 ] 

Tim Allison commented on TIKA-1706:
---

+1

> Bring back commons-io to tika-core
> --
>
> Key: TIKA-1706
> URL: https://issues.apache.org/jira/browse/TIKA-1706
> Project: Tika
>  Issue Type: Improvement
>  Components: core
>Reporter: Yaniv Kunda
>Priority: Minor
> Fix For: 1.12
>
> Attachments: TIKA-1706-1.patch, TIKA-1706-2.patch


[jira] [Commented] (TIKA-1706) Bring back commons-io to tika-core

2015-11-26 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15028642#comment-15028642
 ] 

Nick Burch commented on TIKA-1706:
--

Does anyone have any objections to us going ahead with this for Tika 1.12?

If no objections are raised within one week (by 2015-12-03), I think we should 
go ahead and commit.

> Bring back commons-io to tika-core
> --
>
> Key: TIKA-1706
> URL: https://issues.apache.org/jira/browse/TIKA-1706
> Project: Tika
>  Issue Type: Improvement
>  Components: core
>Reporter: Yaniv Kunda
>Priority: Minor
> Fix For: 1.12
>
> Attachments: TIKA-1706-1.patch, TIKA-1706-2.patch


[jira] [Commented] (TIKA-1706) Bring back commons-io to tika-core

2015-08-27 Thread Yaniv Kunda (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14717349#comment-14717349
 ] 

Yaniv Kunda commented on TIKA-1706:
---

The fact that o.a.tika.io contains public classes is a problem I didn't think 
about -
these files are strictly meant as internal utility/support classes and 
shouldn't really be used by users.
In fact, I would say that although these are public classes, they should not be 
considered part of the public API of tika-core.
And since we don't know which commons-io-cloned classes users rely on (probably 
by accident), letting these go is indeed a problem.

I also think that the no-dependencies principle is more romantic than it is 
useful, as these days much of the Java ecosystem is built on external 
libraries. The exception is when space is critical, such as in mobile 
applications (and even these are getting bigger and bigger).
As the vast majority of tika-core usages come transitively from tika-parsers, 
I think this is not the case here.
I haven't crawled the Maven repository deeply enough to find how many 
tika-core-only usages have few or no other dependencies, but I suspect that 
number is not very high.
So the absolute worst case here - and remember that this is the extreme case of 
a library that uses tika-core and no other library - is a 30% footprint 
increase!

o.a.tika.io is a mess - it contains:
- classes from commons-io-1.4
- partial classes from commons-io-1.4
- modified classes from commons-io-1.4
- classes from commons-io-2.0 (or later unknown version/s)
- tika original classes

It's really hard to go over all the changes - and I've shown just a few 
examples - but simply making the switch is easier, not so costly even in the 
worst case, and would bring upstream improvements to us (today and in future 
releases) faster than maintaining copied code.

My suggestion is:
- bring commons-io back to tika-core
- change all usages of the copied classes to commons-io
- deprecate (do not delete) the copied classes, probably until tika-2.0




> Bring back commons-io to tika-core
> --
>
> Key: TIKA-1706
> URL: https://issues.apache.org/jira/browse/TIKA-1706
> Project: Tika
>  Issue Type: Improvement
>  Components: core
>Reporter: Yaniv Kunda
>Priority: Minor
> Fix For: 1.11




[jira] [Commented] (TIKA-1706) Bring back commons-io to tika-core

2015-08-27 Thread Yaniv Kunda (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14717519#comment-14717519
 ] 

Yaniv Kunda commented on TIKA-1706:
---

That's why I suggested just adding commons-io to tika-core, using it 
internally, and deprecating the copied classes.
Is that ok for 1.x?

> Bring back commons-io to tika-core
> --
>
> Key: TIKA-1706
> URL: https://issues.apache.org/jira/browse/TIKA-1706
> Project: Tika
>  Issue Type: Improvement
>  Components: core
>Reporter: Yaniv Kunda
>Priority: Minor
> Fix For: 1.11




[jira] [Commented] (TIKA-1706) Bring back commons-io to tika-core

2015-08-27 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14717510#comment-14717510
 ] 

Nick Burch commented on TIKA-1706:
--

There is a non-zero number of parsers out there which are maintained 
externally. Many of those will make use of the org.apache.tika.io classes. Not 
all of them will be visible in Maven Central. We can't break those in the 1.x 
series, so we'll need to retain the classes + methods + behaviours. Whether 
that's through keeping the classes, or keep+upgrade, or just making them pass 
things through to a new Commons IO release, I'm neutral, but we can't just 
remove them in the short term.

> Bring back commons-io to tika-core
> --
>
> Key: TIKA-1706
> URL: https://issues.apache.org/jira/browse/TIKA-1706
> Project: Tika
>  Issue Type: Improvement
>  Components: core
>Reporter: Yaniv Kunda
>Priority: Minor
> Fix For: 1.11




[jira] [Commented] (TIKA-1706) Bring back commons-io to tika-core

2015-08-27 Thread Jukka Zitting (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14716318#comment-14716318
 ] 

Jukka Zitting commented on TIKA-1706:
-

Note that o.a.tika.io is a part of the public API of tika-core, so even if we 
restore the commons-io dependency we should keep these classes for backwards 
compatibility (perhaps as dummies that just inherit the relevant commons-io 
classes or redirect static calls to there).
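
As a rough sketch of what such backwards-compatibility dummies could look like: one legacy class that merely inherits from its upstream counterpart, and one legacy utility that redirects a static call. Everything here is hypothetical. The "Upstream" classes are simplified stand-ins written so the example compiles on its own; they are not the real commons-io implementations, and the "Legacy" names are not the actual org.apache.tika.io classes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

class CompatShimSketch {

    // Simplified stand-in for a class that would come from commons-io.
    static class UpstreamCloseShieldInputStream extends FilterInputStream {
        UpstreamCloseShieldInputStream(InputStream in) {
            super(in);
        }

        @Override
        public void close() {
            // Intentionally does not close the wrapped stream.
        }
    }

    /** @deprecated kept only so existing callers keep compiling. */
    @Deprecated
    static class LegacyCloseShieldInputStream extends UpstreamCloseShieldInputStream {
        LegacyCloseShieldInputStream(InputStream in) {
            super(in);
        }
    }

    // Simplified stand-in for an upstream static utility.
    static class UpstreamIOUtils {
        static byte[] toByteArray(InputStream in) throws IOException {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            return out.toByteArray();
        }
    }

    /** @deprecated forwards to the upstream utility. */
    @Deprecated
    static class LegacyIOUtils {
        static byte[] toByteArray(InputStream in) throws IOException {
            return UpstreamIOUtils.toByteArray(in);
        }
    }

    // Test helper: records whether close() ever reached the source stream.
    static class CloseTrackingStream extends ByteArrayInputStream {
        boolean closed = false;

        CloseTrackingStream(byte[] data) {
            super(data);
        }

        @Override
        public void close() throws IOException {
            closed = true;
            super.close();
        }
    }

    // The legacy name still shields the source from being closed.
    static boolean shieldKeepsSourceOpen() {
        CloseTrackingStream source = new CloseTrackingStream(new byte[] {1, 2, 3});
        try (InputStream shield = new LegacyCloseShieldInputStream(source)) {
            shield.read();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return !source.closed;
    }

    // The legacy static method returns whatever the upstream one returns.
    static int forwardedLength() {
        try {
            return LegacyIOUtils.toByteArray(new ByteArrayInputStream(new byte[] {1, 2, 3})).length;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("source still open: " + shieldKeepsSourceOpen());
        System.out.println("bytes read via legacy name: " + forwardedLength());
    }
}
```

The point of the pattern is that the legacy type names and method signatures stay in place for external callers, while the behaviour lives in one place only.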

I don't have a strong opinion here. I do think that the no dependencies 
principle of tika-core is useful and worth the overhead of a dozen duplicated 
classes. And a 30% increase in the tika-core footprint because of the added 
dependency would still be non-trivial. On the other hand the argument about 
missing out on improvements in commons-io is valid.

Personally I'd start here by checking what exactly has changed in the classes 
we duplicate from commons-io. If it's just a few lines then I'd just merge 
those changes to Tika and be happy with that for the next five years. If there 
are more substantial improvements, switching back to a dependency is probably 
worth it.

> Bring back commons-io to tika-core
> --
>
> Key: TIKA-1706
> URL: https://issues.apache.org/jira/browse/TIKA-1706
> Project: Tika
>  Issue Type: Improvement
>  Components: core
>Reporter: Yaniv Kunda
>Priority: Minor
> Fix For: 1.11




[jira] [Commented] (TIKA-1706) Bring back commons-io to tika-core

2015-08-15 Thread Yaniv Kunda (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14698477#comment-14698477
 ] 

Yaniv Kunda commented on TIKA-1706:
---

I've separated out all the related changes other than adding commons-io to 
tika-core, and opened them under TIKA-1710.
In addition, the recently added commons-io-unsafe check has now found a couple 
more default encoding usages:
tika-core:   src\main\java\org\apache\tika\Tika.java
tika-server: src\test\java\org\apache\tika\server\CXFTestBase.java


> Bring back commons-io to tika-core
> --
>
> Key: TIKA-1706
> URL: https://issues.apache.org/jira/browse/TIKA-1706
> Project: Tika
>  Issue Type: Improvement
>  Components: core
>Reporter: Yaniv Kunda
>Priority: Minor
> Fix For: 1.11




[jira] [Commented] (TIKA-1706) Bring back commons-io to tika-core

2015-08-15 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14698401#comment-14698401
 ] 

Hudson commented on TIKA-1706:
--

SUCCESS: Integrated in tika-trunk-jdk1.7 #826 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/826/])
Use a consistent version of Commons IO everywhere, enable the Forbidden APIs 
check for it, and fix problems it found TIKA-1706 (nick: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1696079)
* /tika/trunk/tika-app/pom.xml
* /tika/trunk/tika-batch/pom.xml
* /tika/trunk/tika-example/pom.xml
* /tika/trunk/tika-example/src/main/java/org/apache/tika/example/DirListParser.java
* /tika/trunk/tika-example/src/main/java/org/apache/tika/example/MyFirstTika.java
* /tika/trunk/tika-example/src/main/java/org/apache/tika/example/RollbackSoftware.java
* /tika/trunk/tika-example/src/test/java/org/apache/tika/example/SimpleTextExtractorTest.java
* /tika/trunk/tika-parent/pom.xml
* /tika/trunk/tika-parsers/pom.xml
* /tika/trunk/tika-server/pom.xml
* /tika/trunk/tika-server/src/main/java/org/apache/tika/server/resource/TranslateResource.java


> Bring back commons-io to tika-core
> --
>
> Key: TIKA-1706
> URL: https://issues.apache.org/jira/browse/TIKA-1706
> Project: Tika
>  Issue Type: Improvement
>  Components: core
>Reporter: Yaniv Kunda
>Priority: Minor
> Fix For: 1.11
>
> Attachments: TIKA-1706.patch




[jira] [Commented] (TIKA-1706) Bring back commons-io to tika-core

2015-08-15 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14698307#comment-14698307
 ] 

Nick Burch commented on TIKA-1706:
--

[~thetaphi] We currently have the forbidden apis check defined in the 
tika-parent pom. I've just tried adding 
{{<bundledSignature>commons-io-unsafe-2.4</bundledSignature>}} there too, but 
that then causes the build of {{tika-core}} to fail, as core doesn't (yet) 
have commons-io available. Is there a way to make it skip the check if the 
classes aren't found, but do it if they are?

> Bring back commons-io to tika-core
> --
>
> Key: TIKA-1706
> URL: https://issues.apache.org/jira/browse/TIKA-1706
> Project: Tika
>  Issue Type: Improvement
>  Components: core
>Reporter: Yaniv Kunda
>Priority: Minor
> Fix For: 1.11
>
> Attachments: TIKA-1706.patch




[jira] [Commented] (TIKA-1706) Bring back commons-io to tika-core

2015-08-15 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14698313#comment-14698313
 ] 

Uwe Schindler commented on TIKA-1706:
-

Yes, you can add the Maven property 
{{<failOnUnresolvableSignatures>false</failOnUnresolvableSignatures>}} to the 
plugin configuration: 
[http://jenkins.thetaphi.de/job/Forbidden-APIs/javadoc/check-mojo.html#failOnUnresolvableSignatures]

An alternative is to enable commons-io-unsafe-2.4 only for those modules 
where it's used. Unfortunately this is not so easy, because submodules cannot 
inherit only some of the array values; you must reconfigure all 
bundledSignatures in the submodules.

> Bring back commons-io to tika-core
> --
>
> Key: TIKA-1706
> URL: https://issues.apache.org/jira/browse/TIKA-1706
> Project: Tika
>  Issue Type: Improvement
>  Components: core
>Reporter: Yaniv Kunda
>Priority: Minor
> Fix For: 1.11
>
> Attachments: TIKA-1706.patch




[jira] [Commented] (TIKA-1706) Bring back commons-io to tika-core

2015-08-14 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14697961#comment-14697961
 ] 

Uwe Schindler commented on TIKA-1706:
-

If you bring in commons-io, you should also add the corresponding 
forbidden-apis signatures to the POM. commons-io makes it easy to choose the 
wrong IOUtils/FileUtils method, and then you are dependent on the default 
charset again...

https://github.com/policeman-tools/forbidden-apis/wiki/BundledSignatures
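
To illustrate the default-charset dependency in plain JDK terms, here is a small sketch. It does not call commons-io; decodeWithPlatformDefault is a hypothetical helper that mimics what a charset-less overload (for example, the single-argument IOUtils.toString(InputStream) in commons-io 2.4) effectively does, namely decode with whatever the JVM default happens to be.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustrative only: hypothetical helper names, JDK calls throughout.
class DefaultCharsetPitfall {

    // Result depends on the JVM's default charset, so it can differ
    // between machines for the same input bytes.
    static String decodeWithPlatformDefault(byte[] data) {
        return new String(data, Charset.defaultCharset());
    }

    // Result depends only on the arguments.
    static String decodeExplicit(byte[] data, Charset charset) {
        return new String(data, charset);
    }

    public static void main(String[] args) {
        // "café" encoded as UTF-8: the last character takes two bytes.
        byte[] utf8 = "caf\u00e9".getBytes(StandardCharsets.UTF_8);

        System.out.println("default charset: " + Charset.defaultCharset());
        System.out.println("platform default: " + decodeWithPlatformDefault(utf8));
        System.out.println("explicit UTF-8: " + decodeExplicit(utf8, StandardCharsets.UTF_8));
        // Decoding the same bytes as ISO-8859-1 yields five characters instead of four.
        System.out.println("explicit ISO-8859-1 length: "
                + decodeExplicit(utf8, StandardCharsets.ISO_8859_1).length());
    }
}
```

A signature check of the kind described above is meant to catch the first style at build time, so that only the explicit style remains.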

> Bring back commons-io to tika-core
> --
>
> Key: TIKA-1706
> URL: https://issues.apache.org/jira/browse/TIKA-1706
> Project: Tika
>  Issue Type: Improvement
>  Components: core
>Reporter: Yaniv Kunda
>Priority: Minor
> Fix For: 1.11
>
> Attachments: TIKA-1706.patch




[jira] [Commented] (TIKA-1706) Bring back commons-io to tika-core

2015-08-14 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14697852#comment-14697852
 ] 

Nick Burch commented on TIKA-1706:
--

Since tika-parsers already depends on Commons IO, would you be able to split 
your patch into two?

We can probably apply the tika-parsers changes / tidy-ups straight away, the 
first few at least look perfectly sensible to me. 

However, having the tika-core related changes independently will help with the 
review there, as that's the one I think will likely need more oversight and 
thinking. Especially from [~jukkaz], who made the original inlining changes, 
and might be best placed to comment on the updated plan now we're a few years 
later on

> Bring back commons-io to tika-core
> --
>
> Key: TIKA-1706
> URL: https://issues.apache.org/jira/browse/TIKA-1706
> Project: Tika
>  Issue Type: Improvement
>  Components: core
>Reporter: Yaniv Kunda
>Priority: Minor
> Fix For: 1.11
>
> Attachments: TIKA-1706.patch


 TIKA-249 inlined select commons-io classes in order to simplify the 
 dependency tree and save some space.
 I believe these arguments are weaker nowadays due to the following concerns:
 - Most of the non-core modules already use commons-io, and since tika-core is 
 usually not used by itself, commons-io is already included with it
 - Since some modules use both tika-core and commons-io, it's not clear which 
 code should be used
 - Having the inlined classes causes more maintenance and/or technology debt 
 (which in turn causes more maintenance)
 - Newer commons-io code utilizes newer platform code, e.g. using Charset 
 objects instead of encoding names, being able to use StringBuilder instead of 
 StringBuffer, and so on.
 I'll be happy to provide a patch to replace usages of the inlined classes 
 with commons-io classes if this is accepted.





[jira] [Commented] (TIKA-1706) Bring back commons-io to tika-core

2015-08-13 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14695256#comment-14695256
 ] 

Nick Burch commented on TIKA-1706:
--

The latest Commons IO jar is 180kb, while the inlined classes in Tika come to 
about 20kb when in a jar, so it would increase the minimum Tika install size. 

Back when most of these classes were inlined, in Tika 0.4, the Tika Core jar 
was only 129kb. These days it's coming in at just over 560kb, so the size of 
the Commons IO jar is relatively less of an issue. However, we do currently 
manage without any required dependencies, which this would change.

While most people do use Tika Core with Tika Parsers, not everyone does, so it 
will have an impact on them.

We'd also need to check if there are any enhancements or fixes that have been 
made to the inlined classes, and if so, work to get them upstream before any 
changes. Would you be able to check that?

Also, do you have any cases where having all of a newer Commons IO would 
improve/simplify/fix current Tika Core code?






[jira] [Commented] (TIKA-1706) Bring back commons-io to tika-core

2015-08-13 Thread Yaniv Kunda (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696025#comment-14696025
 ] 

Yaniv Kunda commented on TIKA-1706:
---

I agree that, in general, adding an external dependency to a core module might 
have an impact,
but consider that unlike tika-core, commons-io is a true low-level library:
it has no compile-time dependencies and is used by 2500 projects in Maven 
Central alone.

I believe that copying the code of another library, frozen in time (in this 
case since 2008), hinders innovation and reduces the chance that anyone will 
utilize improvements and fixes in newer commons-io, since:
# it is disconnected from Tika and requires manual discovery and research (if 
commons-io is used as an external dependency, it's easy to find deprecated 
methods and their replacements using static analysis)
# keeping the copies current requires manually copying select classes/code

It's not easy to sum up more than 7 years of changes in commons-io, but here 
are some beneficial changes I found along the way:
- Use org.apache.commons.io.output.ByteArrayOutputStream instead of 
java.io.ByteArrayOutputStream (this class is actually not that new, but can 
benefit many uses and save a lot of byte-copying) - this has been further 
improved by providing an optimized InputStream from an 
org.apache.commons.io.output.ByteArrayOutputStream (IO-137)
- Allow using Charset instead of String encoding (IO-318)
- Use StringBuilderWriter instead of StringWriter to avoid unnecessary 
synchronization (IO-140)
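
To make the last two points concrete, here is a minimal sketch that uses only 
the JDK, so it does not depend on any particular commons-io version. It 
contrasts decoding by encoding name with decoding by Charset, and shows a 
StringBuilder-backed Writer in the spirit of StringBuilderWriter. The names 
IoStyleSketch, decodeByName, decodeByCharset and BuilderWriter are made up for 
illustration; none of this is Tika or commons-io code.

```java
import java.io.UnsupportedEncodingException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class IoStyleSketch {

    // Older style: the encoding is a name, so the caller must handle a
    // checked exception even for encodings every JVM supports.
    static String decodeByName(byte[] data, String encodingName)
            throws UnsupportedEncodingException {
        return new String(data, encodingName);
    }

    // Newer style: a Charset object is already resolved, no checked exception.
    static String decodeByCharset(byte[] data, Charset charset) {
        return new String(data, charset);
    }

    // Hypothetical writer backed by StringBuilder. java.io.StringWriter is
    // backed by StringBuffer, whose methods are synchronized; this sketch
    // overrides the write methods so no locking is involved.
    static final class BuilderWriter extends Writer {
        private final StringBuilder builder = new StringBuilder();

        @Override
        public void write(int c) {
            builder.append((char) c);
        }

        @Override
        public void write(char[] cbuf, int off, int len) {
            builder.append(cbuf, off, len);
        }

        @Override
        public void write(String str) {
            builder.append(str);
        }

        @Override
        public void write(String str, int off, int len) {
            builder.append(str, off, off + len);
        }

        @Override
        public void flush() {
            // Nothing to flush: everything is held in memory.
        }

        @Override
        public void close() {
            // Nothing to release.
        }

        @Override
        public String toString() {
            return builder.toString();
        }
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        byte[] data = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        String byName = decodeByName(data, "UTF-8");
        String byCharset = decodeByCharset(data, StandardCharsets.UTF_8);
        System.out.println(byName.equals(byCharset)); // prints "true"

        BuilderWriter writer = new BuilderWriter();
        writer.write("commons");
        writer.write('-');
        writer.write("io");
        System.out.println(writer); // prints "commons-io"
    }
}
```

With commons-io available as a dependency, the library's own 
StringBuilderWriter would take the place of the hypothetical BuilderWriter 
above, and callers could pass Charset objects rather than encoding names.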

Obviously, I did not propose this change just for the sake of disrupting the 
peace: I have planned, and already started, a series of patches to utilize 
newer commons-io. They will follow, each in its own issue, if and when 
commons-io is added as a dependency to tika-core.




