[jira] [Commented] (TIKA-1706) Bring back commons-io to tika-core
[ https://issues.apache.org/jira/browse/TIKA-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17383397#comment-17383397 ]

Yaniv Kunda commented on TIKA-1706:
-----------------------------------

What a blast from the past... Thanks!

> Bring back commons-io to tika-core
> ----------------------------------
>
>                 Key: TIKA-1706
>                 URL: https://issues.apache.org/jira/browse/TIKA-1706
>             Project: Tika
>          Issue Type: Improvement
>          Components: core
>            Reporter: Yaniv Kunda
>            Priority: Minor
>             Fix For: 2.0.0
>
>         Attachments: TIKA-1706-1.patch, TIKA-1706-2.patch
>
>
> TIKA-249 inlined select commons-io classes in order to simplify the
> dependency tree and save some space.
> I believe these arguments are weaker nowadays due to the following concerns:
> - Most of the non-core modules already use commons-io, and since tika-core is
> usually not used by itself, commons-io is already included with it
> - Since some modules use both tika-core and commons-io, it's not clear which
> code should be used
> - Having the inlined classes causes more maintenance and/or technology debt
> (which in turn causes more maintenance)
> - Newer commons-io code utilizes newer platform code, e.g. using Charset
> objects instead of encoding names, being able to use StringBuilder instead of
> StringBuffer, and so on.
> I'll be happy to provide a patch to replace usages of the inlined classes
> with commons-io classes if this is accepted.

--
This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (TIKA-1706) Bring back commons-io to tika-core
[ https://issues.apache.org/jira/browse/TIKA-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15219308#comment-15219308 ]

Hudson commented on TIKA-1706:
------------------------------

FAILURE: Integrated in tika-2.x #65 (See [https://builds.apache.org/job/tika-2.x/65/])
TIKA-1915 and TIKA-1706 - Remove POI. Replace with commons-io+tika-core (bob: rev 05f4af3002f1f376095f6b4810d505ea50d08b3c)
* tika-parser-modules/tika-parser-cad-module/pom.xml
* tika-core/src/main/java/org/apache/tika/io/IOExceptionWithCause.java
* tika-parser-modules/tika-parser-multimedia-module/src/main/java/org/apache/tika/parser/image/PSDParser.java
* tika-core/src/main/java/org/apache/tika/io/ClosedInputStream.java
* tika-core/pom.xml
* tika-parser-modules/tika-parser-cad-module/src/main/java/org/apache/tika/parser/prt/PRTParser.java
* tika-langdetect/src/test/java/org/apache/tika/langdetect/OptimaizeLangDetectorTest.java
* tika-core/src/main/java/org/apache/tika/io/TaggedIOException.java
* tika-core/src/main/java/org/apache/tika/io/IOUtils.java
* tika-core/src/test/java/org/apache/tika/TypeDetectionBenchmark.java
* tika-core/src/main/java/org/apache/tika/parser/NetworkParser.java
* tika-core/src/main/java/org/apache/tika/parser/external/ExternalParser.java
* tika-core/src/main/java/org/apache/tika/extractor/ParsingEmbeddedDocumentExtractor.java
* tika-core/src/main/java/org/apache/tika/io/StringUtil.java
* tika-parser-modules/tika-parser-multimedia-module/src/main/java/org/apache/tika/parser/image/BPGParser.java
* tika-core/src/main/java/org/apache/tika/Tika.java
* tika-core/src/test/java/org/apache/tika/sax/SecureContentHandlerTest.java
* tika-parser-modules/tika-parser-multimedia-module/pom.xml
* tika-parser-modules/tika-parser-advanced-module/src/main/java/org/apache/tika/parser/ner/opennlp/OpenNLPNameFinder.java
* tika-parser-modules/tika-parser-cad-module/src/main/java/org/apache/tika/parser/dwg/DWGParser.java
* tika-app/src/test/java/org/apache/tika/parser/mock/MockParserTest.java
* tika-parser-bundles/tika-parser-cad-bundle/pom.xml
* tika-core/src/main/java/org/apache/tika/detect/XmlRootExtractor.java
* tika-parser-modules/tika-parser-advanced-module/src/main/java/org/apache/tika/parser/ner/corenlp/CoreNLPNERecogniser.java
* tika-core/src/test/java/org/apache/tika/TikaTest.java
* tika-core/src/test/java/org/apache/tika/io/TikaInputStreamTest.java
* tika-parser-bundles/tika-parser-multimedia-bundle/pom.xml
* tika-parser-modules/tika-parser-multimedia-module/src/main/java/org/apache/tika/parser/image/ImageMetadataExtractor.java
* tika-core/src/main/java/org/apache/tika/io/NullInputStream.java
* tika-core/src/main/java/org/apache/tika/embedder/ExternalEmbedder.java
* tika-langdetect/src/test/java/org/apache/tika/langdetect/LanguageDetectorTest.java
* tika-core/src/main/java/org/apache/tika/io/CloseShieldInputStream.java
* tika-core/src/main/java/org/apache/tika/io/NullOutputStream.java
* tika-core/src/main/java/org/apache/tika/sax/OfflineContentHandler.java
* tika-core/src/main/java/org/apache/tika/fork/ForkClient.java
* tika-core/src/main/java/org/apache/tika/io/CountingInputStream.java
[jira] [Commented] (TIKA-1706) Bring back commons-io to tika-core
[ https://issues.apache.org/jira/browse/TIKA-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15219299#comment-15219299 ]

Bob Paulin commented on TIKA-1706:
----------------------------------

Performed removal of duplicated tika.io classes and replaced with commons-io 2.4 in the Tika 2.0 branch.
[jira] [Commented] (TIKA-1706) Bring back commons-io to tika-core
[ https://issues.apache.org/jira/browse/TIKA-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15029955#comment-15029955 ]

Tim Allison commented on TIKA-1706:
-----------------------------------

+1
[jira] [Commented] (TIKA-1706) Bring back commons-io to tika-core
[ https://issues.apache.org/jira/browse/TIKA-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15028642#comment-15028642 ]

Nick Burch commented on TIKA-1706:
----------------------------------

Does anyone have any objections to us going ahead with this for Tika 1.12?

If no objections are raised in 1 week (by 2015-12-03), then I think we should go ahead and commit.
[jira] [Commented] (TIKA-1706) Bring back commons-io to tika-core
[ https://issues.apache.org/jira/browse/TIKA-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14717349#comment-14717349 ]

Yaniv Kunda commented on TIKA-1706:
-----------------------------------

The fact that o.a.tika.io contains public classes is a problem I didn't think about - these files are strictly meant as internal utility/support classes and shouldn't really be used by users. In fact, I would say that although these are public classes, they should not be considered a part of the public API of tika-core. And since we don't know which commons-io-cloned classes users use (probably by accident), it is indeed a problem letting these go.

I also think that the no-dependencies principle is more romantic than it is useful, as these days a lot of the Java ecosystem is built on using external libraries, unless space is critical such as in mobile applications (and even these are getting bigger and bigger). As the vast majority of tika-core usages come transitively from tika-parsers, I think this is not the case. I haven't crawled the maven repo (deeply enough) to find how many tika-core-exclusive usages have few or no other dependencies, but I suspect that number is not very high.

So the absolute worst case here - and remember that this is the extreme case of a library that uses tika-core and no other library - is a 30% footprint increase!

o.a.tika.io is a mess - it contains:
- classes from commons-io-1.4
- partial classes from commons-io-1.4
- modified classes from commons-io-1.4
- classes from commons-io-2.0 (or later unknown version/s)
- tika original classes

It's really hard going over all the changes - and I've shown just a few examples - but just doing the switch is simply easier, not so costly even in the worst case, and would let us discover and adopt improvements (today and in future changes) faster than maintaining copied code.

My suggestion is:
- bring commons-io back to tika-core
- change all usages of the copied classes to commons-io
- deprecate (do not delete) the copied classes, probably until tika-2.0
[jira] [Commented] (TIKA-1706) Bring back commons-io to tika-core
[ https://issues.apache.org/jira/browse/TIKA-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14717519#comment-14717519 ]

Yaniv Kunda commented on TIKA-1706:
-----------------------------------

That's why I suggested to just add commons-io to tika-core, use it internally, and just deprecate the copied classes. Is that ok for 1.x?
[jira] [Commented] (TIKA-1706) Bring back commons-io to tika-core
[ https://issues.apache.org/jira/browse/TIKA-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14717510#comment-14717510 ]

Nick Burch commented on TIKA-1706:
----------------------------------

There are a non-zero number of parsers out there which are maintained externally. Many of those will make use of the org.apache.tika.io classes. Not all of them will be in maven central to see.

We can't break those in the 1.x series, so we'll need to retain the classes + methods + behaviours. Whether that's through keeping the classes, or keep+upgrade, or just making them pass things through to a new Commons IO release, I'm neutral on, but we can't just remove them in the short term.
[jira] [Commented] (TIKA-1706) Bring back commons-io to tika-core
[ https://issues.apache.org/jira/browse/TIKA-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14716318#comment-14716318 ]

Jukka Zitting commented on TIKA-1706:
-------------------------------------

Note that o.a.tika.io is a part of the public API of tika-core, so even if we restore the commons-io dependency we should keep these classes for backwards compatibility (perhaps as dummies that just inherit the relevant commons-io classes or redirect static calls to there).

I don't have a strong opinion here. I do think that the no dependencies principle of tika-core is useful and worth the overhead of a dozen duplicated classes. And a 30% increase in the tika-core footprint because of the added dependency would still be non-trivial. On the other hand the argument about missing out on improvements in commons-io is valid.

Personally I'd start here by checking what exactly has changed in the classes we duplicate from commons-io. If it's just a few lines then I'd just merge those changes to Tika and be happy with that for the next five years. If there are more substantial improvements, switching back to a dependency is probably worth it.
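Jukka's idea of keeping the o.a.tika.io classes as dummies that either inherit from the upstream class or redirect static calls to it could look roughly like the sketch below. It is only an illustration: `UpstreamIOUtils` and `UpstreamClosedInputStream` are stand-ins written here so the example compiles on its own, not the real commons-io classes, and the `Legacy*` names are hypothetical placeholders for the copies in org.apache.tika.io.

```java
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ShimSketch {

    /** Stand-in for an upstream utility class (NOT the real commons-io code). */
    static class UpstreamIOUtils {
        static String toString(byte[] input, Charset charset) {
            return new String(input, charset);
        }
    }

    /** Stand-in for an upstream stream class that always reports end of stream. */
    static class UpstreamClosedInputStream extends InputStream {
        @Override
        public int read() {
            return -1;
        }
    }

    /** Hypothetical legacy copy: keeps the old signature, redirects the static call. */
    @Deprecated
    static class LegacyIOUtils {
        static String toString(byte[] input, String encoding) {
            return UpstreamIOUtils.toString(input, Charset.forName(encoding));
        }
    }

    /** Hypothetical legacy copy: an empty subclass that only preserves the old type name. */
    @Deprecated
    static class LegacyClosedInputStream extends UpstreamClosedInputStream {
    }

    public static void main(String[] args) {
        byte[] data = "commons-io".getBytes(StandardCharsets.UTF_8);
        System.out.println(LegacyIOUtils.toString(data, "UTF-8"));
        System.out.println(new LegacyClosedInputStream().read());
    }
}
```

With this shape, external parsers compiled against the old names keep working, while the behaviour lives in one place and the deprecation warning points callers at the replacement.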
[jira] [Commented] (TIKA-1706) Bring back commons-io to tika-core
[ https://issues.apache.org/jira/browse/TIKA-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14698477#comment-14698477 ]

Yaniv Kunda commented on TIKA-1706:
-----------------------------------

I've separated out all the related changes besides adding commons-io to tika-core, and opened them under TIKA-1710.

In addition, the recently added commons-io-unsafe check has now found a couple more default encoding usages:
tika-core: src\main\java\org\apache\tika\Tika.java
tika-server: src\test\java\org\apache\tika\server\CXFTestBase.java
[jira] [Commented] (TIKA-1706) Bring back commons-io to tika-core
[ https://issues.apache.org/jira/browse/TIKA-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14698401#comment-14698401 ]

Hudson commented on TIKA-1706:
------------------------------

SUCCESS: Integrated in tika-trunk-jdk1.7 #826 (See [https://builds.apache.org/job/tika-trunk-jdk1.7/826/])
Use a consistent version of Commons IO everywhere, enable the Forbidden APIs check for it, and fix problems it found TIKA-1706 (nick: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1696079)
* /tika/trunk/tika-app/pom.xml
* /tika/trunk/tika-batch/pom.xml
* /tika/trunk/tika-example/pom.xml
* /tika/trunk/tika-example/src/main/java/org/apache/tika/example/DirListParser.java
* /tika/trunk/tika-example/src/main/java/org/apache/tika/example/MyFirstTika.java
* /tika/trunk/tika-example/src/main/java/org/apache/tika/example/RollbackSoftware.java
* /tika/trunk/tika-example/src/test/java/org/apache/tika/example/SimpleTextExtractorTest.java
* /tika/trunk/tika-parent/pom.xml
* /tika/trunk/tika-parsers/pom.xml
* /tika/trunk/tika-server/pom.xml
* /tika/trunk/tika-server/src/main/java/org/apache/tika/server/resource/TranslateResource.java
[jira] [Commented] (TIKA-1706) Bring back commons-io to tika-core
[ https://issues.apache.org/jira/browse/TIKA-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14698307#comment-14698307 ]

Nick Burch commented on TIKA-1706:
----------------------------------

[~thetaphi] We currently have the forbidden apis check defined in the tika-parent pom. I've just tried adding {{{<bundledSignature>commons-io-unsafe-2.4</bundledSignature>}}} there too, but that then causes the build of {{{tika-core}}} to fail, as core doesn't (yet) have commons-io available. Is there a way to make it skip the check if the classes aren't found, but do it if they are?
[jira] [Commented] (TIKA-1706) Bring back commons-io to tika-core
[ https://issues.apache.org/jira/browse/TIKA-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14698313#comment-14698313 ]

Uwe Schindler commented on TIKA-1706:
-------------------------------------

Yes, you can add the maven property {{<failOnUnresolvableSignatures>false</failOnUnresolvableSignatures>}} to the plugin configuration: [http://jenkins.thetaphi.de/job/Forbidden-APIs/javadoc/check-mojo.html#failOnUnresolvableSignatures]

An alternative is to enable commons-io-unsafe-2.4 only for those modules where it's used. Unfortunately this is not so easy, because you cannot inherit only some array values to submodules; you must reconfigure all bundledSignatures in submodules.
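For readers unfamiliar with the plugin, the configuration Uwe describes might look roughly like the POM fragment below. It is a sketch rather than the actual tika-parent configuration: the plugin version, executions and any other bundled signatures Tika already enables are deliberately left out, and only the two elements named in this thread are shown.

```xml
<plugin>
  <groupId>de.thetaphi</groupId>
  <artifactId>forbiddenapis</artifactId>
  <configuration>
    <!-- Do not fail modules (such as tika-core at this point) whose
         classpath lacks the classes a signature file refers to. -->
    <failOnUnresolvableSignatures>false</failOnUnresolvableSignatures>
    <bundledSignatures>
      <!-- Existing signatures omitted; this is the one discussed here. -->
      <bundledSignature>commons-io-unsafe-2.4</bundledSignature>
    </bundledSignatures>
  </configuration>
</plugin>
```

The trade-off is the one Uwe mentions: a single setting in the parent POM is simple, whereas enabling the signature per module means repeating the whole bundledSignatures list in each submodule.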
[jira] [Commented] (TIKA-1706) Bring back commons-io to tika-core
[ https://issues.apache.org/jira/browse/TIKA-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14697961#comment-14697961 ]

Uwe Schindler commented on TIKA-1706:
-------------------------------------

If you bring in commons-io, you should also add the corresponding forbidden-apis signatures to the POM. commons-io makes it easy to choose the wrong IOUtils/FileUtils method and then you are dependent on the default charset again...

https://github.com/policeman-tools/forbidden-apis/wiki/BundledSignatures
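The default-charset trap Uwe refers to can be shown with plain JDK calls; the commons-io overloads that omit the charset (for example IOUtils.toString(InputStream)) behave analogously. This is a minimal sketch with hypothetical method names, not code from Tika or commons-io.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DefaultCharsetPitfall {

    /** Decodes with whatever charset the JVM defaults to on this machine. */
    static String decodeImplicit(byte[] data) {
        // The kind of call a forbidden-apis "unsafe" signature is meant to catch.
        return new String(data);
    }

    /** Decodes with an explicitly chosen charset, independent of the platform. */
    static String decodeExplicit(byte[] data) {
        return new String(data, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "caf\u00e9";
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);

        System.out.println("default charset: " + Charset.defaultCharset());
        // Always true: the charset used for decoding matches the one used for encoding.
        System.out.println("explicit round-trips: " + decodeExplicit(utf8).equals(original));
        // Depends on the platform default, so it can differ between machines.
        System.out.println("implicit round-trips: " + decodeImplicit(utf8).equals(original));
    }
}
```

The implicit variant compiles and often works on a developer's machine, which is why a build-time check is more reliable than code review for catching it.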
[jira] [Commented] (TIKA-1706) Bring back commons-io to tika-core
[ https://issues.apache.org/jira/browse/TIKA-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14697852#comment-14697852 ]

Nick Burch commented on TIKA-1706:
----------------------------------

Since tika-parsers already depends on Commons IO, would you be able to split your patch into two? We can probably apply the tika-parsers changes / tidy-ups straight away, the first few at least look perfectly sensible to me.

However, having the tika-core related changes independently will help with the review there, as that's the one I think will likely need more oversight and thinking. Especially from [~jukkaz], who made the original inlining changes, and might be best placed to comment on the updated plan now we're a few years later on.
[jira] [Commented] (TIKA-1706) Bring back commons-io to tika-core
[ https://issues.apache.org/jira/browse/TIKA-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14695256#comment-14695256 ]

Nick Burch commented on TIKA-1706:
----------------------------------

The latest Commons IO jar is 180kb, the inlined classes in Tika when in a jar are about 20kb, so it would increase the minimum Tika install size. Back when most of these classes were inlined, in Tika 0.4, the size of the Tika Core jar was only 129kb. These days, it's coming in at just over 560kb, so the size of the Commons IO jar is no longer such an issue relatively. However, we do currently manage without any required dependencies, which this would change.

While most people do use Tika Core with Tika Parsers, not all people do, so it will have an impact on them.

We'd also need to check if there are any enhancements or fixes that have been made to the inlined classes, and if so, work to get them upstream before any changes. Would you be able to check that?

Also, do you have any cases where having all of a newer Commons IO would improve/simplify/fix current Tika Core code?
[jira] [Commented] (TIKA-1706) Bring back commons-io to tika-core
[ https://issues.apache.org/jira/browse/TIKA-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696025#comment-14696025 ]

Yaniv Kunda commented on TIKA-1706:
-----------------------------------

I agree that generally adding an external dependency to a core module might have an impact, but consider that unlike tika-core, commons-io is a true low-level library: it has no compile-time dependencies and is used by 2500 projects in maven central alone.

I believe that copying the code of another library, frozen in time (in this case since 2008), hinders innovation and reduces the chance that anyone will utilize new improvements and fixes in newer commons-io, since:
# it is disconnected from tika and requires manual discovery and research (if commons-io is used as an external dependency it's easy to find deprecated methods and their replacements using static analysis)
# it requires manual maintenance of copying select classes/code

It's not easy to sum up more than 7 years of changes in commons-io, but here are some beneficial changes I found along the way:
- Use org.apache.commons.io.output.ByteArrayOutputStream instead of java.io.ByteArrayOutputStream (this class is actually not that new, but can benefit many uses and save a lot of byte-copying)
-- this has been further improved by providing an optimized InputStream from an org.apache.commons.io.output.ByteArrayOutputStream (IO-137)
- Allow using Charset instead of String encoding (IO-318)
- Use StringBuilderWriter instead of StringWriter to avoid unnecessary synchronization (IO-140)

Obviously, I did not propose this change just for the sake of disrupting the peace; I have planned and started a series of patches to utilize newer commons-io, which will follow, each in its own issue, once and if commons-io is added as a dependency to tika-core.
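As a rough illustration of two of the points above (Charset objects instead of encoding names, and unsynchronized builders instead of synchronized buffers), here is a minimal JDK-only sketch. It deliberately does not call commons-io, so it says nothing about the actual commons-io or Tika code; the class name EncodingSketch and its methods are hypothetical and exist only for this example.

```java
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class, not part of Tika or commons-io.
final class EncodingSketch {

    // Older style: the encoding is passed as a name, so the lookup happens
    // at run time and callers must handle a checked exception.
    static String decodeByName(byte[] data, String encodingName)
            throws UnsupportedEncodingException {
        return new String(data, encodingName);
    }

    // Newer style: the encoding is a Charset object, so there is no name
    // lookup and no checked exception to handle.
    static String decodeByCharset(byte[] data, Charset charset) {
        return new String(data, charset);
    }

    // StringBuilder is the unsynchronized counterpart of StringBuffer,
    // which is sufficient when the builder is confined to one thread.
    static String joinUnsynchronized(List<String> parts) {
        StringBuilder sb = new StringBuilder();
        for (String part : parts) {
            sb.append(part);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        byte[] utf8 = "tika".getBytes(StandardCharsets.UTF_8);
        System.out.println(decodeByName(utf8, "UTF-8"));
        System.out.println(decodeByCharset(utf8, StandardCharsets.UTF_8));
        System.out.println(joinUnsynchronized(Arrays.asList("commons", "-", "io")));
    }
}
```

The Charset variant moves the "is this encoding supported?" question out of every call site, which is the kind of simplification the Charset-accepting overloads mentioned under IO-318 are meant to allow.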