[jira] [Commented] (TIKA-1706) Bring back commons-io to tika-core
[ https://issues.apache.org/jira/browse/TIKA-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17383397#comment-17383397 ] Yaniv Kunda commented on TIKA-1706: --- What a blast from the past... Thanks! > Bring back commons-io to tika-core > -- > > Key: TIKA-1706 > URL: https://issues.apache.org/jira/browse/TIKA-1706 > Project: Tika > Issue Type: Improvement > Components: core >Reporter: Yaniv Kunda >Priority: Minor > Fix For: 2.0.0 > > Attachments: TIKA-1706-1.patch, TIKA-1706-2.patch > > > TIKA-249 inlined select commons-io classes in order to simplify the > dependency tree and save some space. > I believe these arguments are weaker nowadays due to the following concerns: > - Most of the non-core modules already use commons-io, and since tika-core is > usually not used by itself, commons-io is already included with it > - Since some modules use both tika-core and commons-io, it's not clear which > code should be used > - Having the inlined classes causes more maintenance and/or technology debt > (which in turn causes more maintenance) > - Newer commons-io code utilizes newer platform code, e.g. using Charset > objects instead of encoding names, being able to use StringBuilder instead of > StringBuffer, and so on. > I'll be happy to provide a patch to replace usages of the inlined classes > with commons-io classes if this is accepted. -- This message was sent by Atlassian Jira (v8.3.4#803005)
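The point in the description about Charset objects versus encoding names can be illustrated with plain JDK calls. This is only a minimal sketch: the class and method names (CharsetSketch, decodeByName, decode) are made up for illustration and are not taken from tika-core or commons-io.

```java
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of tika-core or commons-io.
class CharsetSketch {

    // Older style: the encoding is passed by name, so a misspelled name only
    // surfaces at runtime as a checked UnsupportedEncodingException.
    static String decodeByName(byte[] bytes, String encodingName)
            throws UnsupportedEncodingException {
        return new String(bytes, encodingName);
    }

    // Newer style: the caller passes a Charset object, so there is no name
    // lookup and no checked exception to handle.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        byte[] utf8 = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        System.out.println(decode(utf8, StandardCharsets.UTF_8));
    }
}
```

The Charset-based signature moves the "does this encoding exist" question to the caller, which is one reason the newer commons-io overloads are more convenient to use.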
[jira] [Commented] (TIKA-1672) Integrate tika-java7 component
[ https://issues.apache.org/jira/browse/TIKA-1672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14962596#comment-14962596 ] Yaniv Kunda commented on TIKA-1672: --- Here are some names I suggested: - tika-java7-spi - tika-java7-filetypedetector - tika-java7-detector-spi > Integrate tika-java7 component > -- > > Key: TIKA-1672 > URL: https://issues.apache.org/jira/browse/TIKA-1672 > Project: Tika > Issue Type: Improvement >Reporter: Tyler Palsulich > Fix For: 1.12 > > > Code requiring Java 7 doesn't need to be in a separate module now that > TIKA-1536 (upgrade to Java 7) is done. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1706) Bring back commons-io to tika-core
[ https://issues.apache.org/jira/browse/TIKA-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yaniv Kunda updated TIKA-1706: -- Attachment: TIKA-1706-2.patch TIKA-1706-1.patch A proposed patch per [~grossws]'s suggestion from the dev mailing list - The first patch contains the following: - creation of the secondary jar using maven-shade-plugin: -- used the *uber* classifier; alternatives: shaded, nodep, all, etc. Which one is best? -- commons-io shaded under {{shaded.commons-io.$\{commons.io.version\}.org.apache.commons.io}} to avoid potential conflicts with other commons-io-shading dependencies, e.g. as in org.ops4j.pax.url:pax-url-aether:2.3.0 -- automatic removal of unused classes using - deprecated all classes that were copied from commons-io and modified them to extend their new counterparts - deprecated all constructors - removed all identical or functionally identical methods - modified all remaining methods to call alternative existing jdk/commons-io methods, deprecated them and referred to the alternatives used _*Note: this was done only in IOUtils, where many methods that have the same signature as the ones in commons-io were modified along the way to use UTF-8 instead of the platform default._ - all things should remain backward-compatible, except one: org.apache.tika.io.TaggedIOException(IOException, Object) will now throw a ClassCastException if the Object is not Serializable The second patch contains trivial import changes in tika-core from org.apache.tika.io to org.apache.commons.io > Bring back commons-io to tika-core > -- > > Key: TIKA-1706 > URL: https://issues.apache.org/jira/browse/TIKA-1706 > Project: Tika > Issue Type: Improvement > Components: core >Reporter: Yaniv Kunda >Priority: Minor > Fix For: 1.11 > > Attachments: TIKA-1706-1.patch, TIKA-1706-2.patch > > > TIKA-249 inlined select commons-io classes in order to simplify the > dependency tree and save some space.
> I believe these arguments are weaker nowadays due to the following concerns: > - Most of the non-core modules already use commons-io, and since tika-core is > usually not used by itself, commons-io is already included with it > - Since some modules use both tika-core and commons-io, it's not clear which > code should be used > - Having the inlined classes causes more maintenance and/or technology debt > (which in turn causes more maintenance) > - Newer commons-io code utilizes newer platform code, e.g. using Charset > objects instead of encoding names, being able to use StringBuilder instead of > StringBuffer, and so on. > I'll be happy to provide a patch to replace usages of the inlined classes > with commons-io classes if this is accepted. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1744) Use java.nio.file.Path in TikaInputStream
[ https://issues.apache.org/jira/browse/TIKA-1744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yaniv Kunda updated TIKA-1744: -- Attachment: TIKA-1744-2.patch Additional minor patch: - Corrected javadoc links - Added {{@Deprecated}} annotations to methods where {{@deprecated}} javadoc tags were added > Use java.nio.file.Path in TikaInputStream > - > > Key: TIKA-1744 > URL: https://issues.apache.org/jira/browse/TIKA-1744 > Project: Tika > Issue Type: Sub-task > Components: core >Reporter: Yaniv Kunda >Assignee: Tim Allison >Priority: Minor > Labels: java7 > Fix For: 1.11 > > Attachments: TIKA-1744-2.patch, TIKA-1744.patch > > > This will provide support for the new api for users who need it, and provide > better information in I/O operations, e.g. detailed exception if file cannot > be read. > - used Path and methods in java.nio.file.Files internally > - add getPath() method as the counterpart to getFile() > - modified test to use -- This message was sent by Atlassian JIRA (v6.3.4#6332)
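As a rough illustration of pairing the {{@deprecated}} javadoc tag with the {{@Deprecated}} annotation, here is a small sketch. The class DeprecationSketch and its isDeprecated helper are hypothetical and only mirror the getFile()/getPath() naming from the issue; this is not Tika's TikaInputStream. The annotation has runtime retention, so unlike the javadoc tag it can be checked through reflection.

```java
import java.io.File;
import java.nio.file.Path;

// Hypothetical class for illustration; not Tika's TikaInputStream.
class DeprecationSketch {
    private final Path path;

    DeprecationSketch(Path path) {
        this.path = path;
    }

    /** Returns the underlying path. */
    public Path getPath() {
        return path;
    }

    /**
     * Returns the underlying file.
     *
     * @deprecated use {@link #getPath()} instead
     */
    @Deprecated
    public File getFile() {
        return path.toFile();
    }

    // Returns true when the named public no-arg method carries @Deprecated.
    static boolean isDeprecated(String methodName) {
        try {
            return DeprecationSketch.class.getMethod(methodName)
                    .isAnnotationPresent(Deprecated.class);
        } catch (NoSuchMethodException e) {
            throw new IllegalArgumentException("No such method: " + methodName, e);
        }
    }
}
```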
[jira] [Commented] (TIKA-1757) tika-batch tests fail on systems with whitespace or special chars in folder name
[ https://issues.apache.org/jira/browse/TIKA-1757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14938914#comment-14938914 ] Yaniv Kunda commented on TIKA-1757: --- Also, regarding the badness of {{URL#getFile()}} - on Windows machines it returns a String starting with a slash - e.g. {{/C:\File.txt}}. For some reason, when this is passed to a {{File}} constructor, it is handled leniently and the leading slash disappears - unlike {{Paths.get(String)}}, which fails with an {{InvalidPathException}}. > tika-batch tests fail on systems with whitespace or special chars in folder > name > > > Key: TIKA-1757 > URL: https://issues.apache.org/jira/browse/TIKA-1757 > Project: Tika > Issue Type: Bug >Reporter: Uwe Schindler >Assignee: Tim Allison > Attachments: TIKA-1757.patch > > > This is one problem that forbiddenapis does not catch, because the method > affected has valid use cases: {{URL#getFile()}} or {{URL#getPath()}} both > return the URL path, which should never be treated as a file system path (for > file: URLs). This breaks as soon as the path contains special characters > which may not be part of a URL. getFile() and getPath() return the encoded path. > The correct way to transform a file URL to a file is: {{new > File(url.toURI())}}. See also the list of "bad stuff" as listed by the Maven > community for Mojos/Plugins. > In fact the affected test should not use a file at all. Instead it should use > {{Class#getResourceAsStream()}}.
[jira] [Commented] (TIKA-1757) tika-batch tests fail on systems with whitespace or special chars in folder name
[ https://issues.apache.org/jira/browse/TIKA-1757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14938908#comment-14938908 ] Yaniv Kunda commented on TIKA-1757: --- If one needs a java.nio.file.Path, {{Paths.get(url.toURI())}} can be used instead. > tika-batch tests fail on systems with whitespace or special chars in folder > name > > > Key: TIKA-1757 > URL: https://issues.apache.org/jira/browse/TIKA-1757 > Project: Tika > Issue Type: Bug >Reporter: Uwe Schindler >Assignee: Tim Allison > Attachments: TIKA-1757.patch > > > This is one problem that forbiddenapis does not catch, because the method > affected has valid use cases: {{URL#getFile()}} or {{URL#getPath()}} both > return the URL path, which should never be treated as a file system path (for > file: URLs). This breaks as soon as the path contains special characters > which may not be part of a URL. getFile() and getPath() return the encoded path. > The correct way to transform a file URL to a file is: {{new > File(url.toURI())}}. See also the list of "bad stuff" as listed by the Maven > community for Mojos/Plugins. > In fact the affected test should not use a file at all. Instead it should use > {{Class#getResourceAsStream()}}.
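The difference between the encoded URL path and the decoded file system path can be shown with a short sketch. The class FileUrlSketch and its helpers toUrl and toPath are hypothetical names used only for this illustration; the directory name is an arbitrary example and the file does not need to exist.

```java
import java.net.MalformedURLException;
import java.net.URISyntaxException;
import java.net.URL;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper for illustration; not taken from the TIKA-1757 patch.
class FileUrlSketch {

    // Builds a file: URL for a path; the checked exception is wrapped for brevity.
    static URL toUrl(Path path) {
        try {
            return path.toUri().toURL();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Cannot build URL for " + path, e);
        }
    }

    // Converts a file: URL to a Path by going through URI, which decodes
    // percent-escapes such as %20 instead of treating them as literal text.
    static Path toPath(URL url) {
        try {
            return Paths.get(url.toURI());
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Not a valid URI: " + url, e);
        }
    }

    public static void main(String[] args) {
        Path original = Paths.get("dir with space", "file.txt").toAbsolutePath();
        URL url = toUrl(original);
        System.out.println(url.getPath());   // encoded URL path, spaces appear as %20
        System.out.println(toPath(url));     // decoded file system path
    }
}
```

Passing {{url.getPath()}} straight to a file API would look for a directory literally named with {{%20}} in it, which is the failure mode the issue describes.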
[jira] [Updated] (TIKA-1751) Use java.nio.file.Path in TikaConfig
[ https://issues.apache.org/jira/browse/TIKA-1751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yaniv Kunda updated TIKA-1751: -- Attachment: TIKA-1751.patch Updated patch to latest changes. > Use java.nio.file.Path in TikaConfig > > > Key: TIKA-1751 > URL: https://issues.apache.org/jira/browse/TIKA-1751 > Project: Tika > Issue Type: Sub-task > Components: config >Reporter: Yaniv Kunda >Priority: Minor > Labels: java7 > Fix For: 1.11 > > Attachments: TIKA-1751.patch > > > Provide constructors accepting java.nio.file.Path -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1751) Use java.nio.file.Path in TikaConfig
[ https://issues.apache.org/jira/browse/TIKA-1751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yaniv Kunda updated TIKA-1751: -- Attachment: (was: TIKA-1751.patch) > Use java.nio.file.Path in TikaConfig > > > Key: TIKA-1751 > URL: https://issues.apache.org/jira/browse/TIKA-1751 > Project: Tika > Issue Type: Sub-task > Components: config >Reporter: Yaniv Kunda >Priority: Minor > Labels: java7 > Fix For: 1.11 > > Attachments: TIKA-1751.patch > > > Provide constructors accepting java.nio.file.Path -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1758) BatchCommandLineBuilder fails on systems with whitespace in path
[ https://issues.apache.org/jira/browse/TIKA-1758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14938972#comment-14938972 ] Yaniv Kunda commented on TIKA-1758: --- Not a hard requirement - can be avoided by converting a Path back to a File (or to a String). > BatchCommandLineBuilder fails on systems with whitespace in path > > > Key: TIKA-1758 > URL: https://issues.apache.org/jira/browse/TIKA-1758 > Project: Tika > Issue Type: Bug > Components: cli >Reporter: Uwe Schindler > Attachments: TIKA-1758.patch > > > All tests for CLI module fail with errors like that: > {noformat} > Tests run: 6, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.048 sec <<< > FAILURE! - in org.apache.tika.cli.TikaCLIBatchCommandL > ineTest > testTwoDirsNoFlags(org.apache.tika.cli.TikaCLIBatchCommandLineTest) Time > elapsed: 0.026 sec <<< ERROR! > java.nio.file.InvalidPathException: Illegal char <"> at index 0: > "C:\Users\Uwe Schindler\Projects\TIKA\svn\tika-app\testInput" > at sun.nio.fs.WindowsPathParser.normalize(WindowsPathParser.java:182) > at sun.nio.fs.WindowsPathParser.parse(WindowsPathParser.java:153) > at sun.nio.fs.WindowsPathParser.parse(WindowsPathParser.java:77) > at sun.nio.fs.WindowsPath.parse(WindowsPath.java:94) > at sun.nio.fs.WindowsFileSystem.getPath(WindowsFileSystem.java:255) > at java.nio.file.Paths.get(Paths.java:84) > at > org.apache.tika.cli.BatchCommandLineBuilder.translateCommandLine(BatchCommandLineBuilder.java:137) > at > org.apache.tika.cli.BatchCommandLineBuilder.build(BatchCommandLineBuilder.java:51) > at > org.apache.tika.cli.TikaCLIBatchCommandLineTest.testTwoDirsNoFlags(TikaCLIBatchCommandLineTest.java:127) > {noformat} > The reason is that BatchCommandLineBuilder adds quotes for unknown reasons!? > If you use ProcessBuilder you don't need that! Not sure what this should do, > but the problem is: The first argument (the executable) contains quotes after > the method transformed it and breaks the test. 
> I have no idea how to fix this, but the quotes should not be in a String[] > command line at all. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1758) BatchCommandLineBuilder fails on systems with whitespace in path
[ https://issues.apache.org/jira/browse/TIKA-1758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yaniv Kunda updated TIKA-1758: -- Attachment: TIKA-1758.patch A patch containing a fix (and more File->Path migration), requires TIKA-1751. > BatchCommandLineBuilder fails on systems with whitespace in path > > > Key: TIKA-1758 > URL: https://issues.apache.org/jira/browse/TIKA-1758 > Project: Tika > Issue Type: Bug > Components: cli >Reporter: Uwe Schindler > Attachments: TIKA-1758.patch > > > All tests for CLI module fail with errors like that: > {noformat} > Tests run: 6, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.048 sec <<< > FAILURE! - in org.apache.tika.cli.TikaCLIBatchCommandL > ineTest > testTwoDirsNoFlags(org.apache.tika.cli.TikaCLIBatchCommandLineTest) Time > elapsed: 0.026 sec <<< ERROR! > java.nio.file.InvalidPathException: Illegal char <"> at index 0: > "C:\Users\Uwe Schindler\Projects\TIKA\svn\tika-app\testInput" > at sun.nio.fs.WindowsPathParser.normalize(WindowsPathParser.java:182) > at sun.nio.fs.WindowsPathParser.parse(WindowsPathParser.java:153) > at sun.nio.fs.WindowsPathParser.parse(WindowsPathParser.java:77) > at sun.nio.fs.WindowsPath.parse(WindowsPath.java:94) > at sun.nio.fs.WindowsFileSystem.getPath(WindowsFileSystem.java:255) > at java.nio.file.Paths.get(Paths.java:84) > at > org.apache.tika.cli.BatchCommandLineBuilder.translateCommandLine(BatchCommandLineBuilder.java:137) > at > org.apache.tika.cli.BatchCommandLineBuilder.build(BatchCommandLineBuilder.java:51) > at > org.apache.tika.cli.TikaCLIBatchCommandLineTest.testTwoDirsNoFlags(TikaCLIBatchCommandLineTest.java:127) > {noformat} > The reason is that BatchCommandLineBuilder adds quotes for unknown reasons!? > If you use ProcessBuilder you don't need that! Not sure what this should do, > but the problem is: The first argument (the executable) contains quotes after > the method transformed it and breaks the test. 
> I have no idea how to fix this, but the quotes should not be in a String[] > command line at all. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1750) CachedTranslator.isAvailable() throws NPE when underlying translator is null
[ https://issues.apache.org/jira/browse/TIKA-1750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yaniv Kunda updated TIKA-1750: -- Attachment: TIKA-1750.patch > CachedTranslator.isAvailable() throws NPE when underlying translator is null > > > Key: TIKA-1750 > URL: https://issues.apache.org/jira/browse/TIKA-1750 > Project: Tika > Issue Type: Bug > Components: translation >Reporter: Yaniv Kunda >Priority: Minor > Fix For: 1.11 > > Attachments: TIKA-1750.patch > > > When initialized with no underlying translator, CachedTranslator throws an NPE > when isAvailable() is called. Although a user should initialize the translator > (as the default constructor's javadoc says), this doesn't always happen, > and since CachedTranslator is defined as a registered service in > tika-translate\src\main\resources\META-INF\services\org.apache.tika.language.translate.Translator, > it normally doesn't (causing DumpTikaConfigExampleTest to fail). > Since CachedTranslator returns the source text from > translate(String, String, String) when the translator is null, it makes sense > that isAvailable() returns false under the same condition.
[jira] [Created] (TIKA-1750) CachedTranslator.isAvailable() throws NPE when underlying translator is null
Yaniv Kunda created TIKA-1750: - Summary: CachedTranslator.isAvailable() throws NPE when underlying translator is null Key: TIKA-1750 URL: https://issues.apache.org/jira/browse/TIKA-1750 Project: Tika Issue Type: Bug Components: translation Reporter: Yaniv Kunda Priority: Minor Fix For: 1.11 When initialized with no underlying translator, CachedTranslator throws an NPE when isAvailable() is called. Although a user should initialize the translator (as the default constructor's javadoc says), this doesn't always happen, and since CachedTranslator is defined as a registered service in tika-translate\src\main\resources\META-INF\services\org.apache.tika.language.translate.Translator, it normally doesn't (causing DumpTikaConfigExampleTest to fail). Since CachedTranslator returns the source text from translate(String, String, String) when the translator is null, it makes sense that isAvailable() returns false under the same condition.
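The proposed behavior (isAvailable() returning false when no underlying translator is set, matching the pass-through behavior of translate) might look roughly like the following. This is a sketch under stated assumptions: SimpleTranslator and NullSafeCachingTranslator are hypothetical stand-ins, not Tika's Translator interface or the actual CachedTranslator, and the cache key format is invented.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a translator interface; not Tika's Translator.
interface SimpleTranslator {
    String translate(String text, String sourceLanguage, String targetLanguage);

    boolean isAvailable();
}

// Hypothetical caching wrapper; not the actual CachedTranslator.
class NullSafeCachingTranslator implements SimpleTranslator {
    // May be null, e.g. when the class is instantiated through its no-arg constructor.
    private final SimpleTranslator delegate;
    private final Map<String, String> cache = new HashMap<>();

    NullSafeCachingTranslator() {
        this(null);
    }

    NullSafeCachingTranslator(SimpleTranslator delegate) {
        this.delegate = delegate;
    }

    @Override
    public String translate(String text, String sourceLanguage, String targetLanguage) {
        if (delegate == null) {
            return text; // no underlying translator: return the source text unchanged
        }
        String key = sourceLanguage + ':' + targetLanguage + ':' + text;
        return cache.computeIfAbsent(key,
                k -> delegate.translate(text, sourceLanguage, targetLanguage));
    }

    @Override
    public boolean isAvailable() {
        // Guard against the missing delegate instead of dereferencing it.
        return delegate != null && delegate.isAvailable();
    }
}
```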
[jira] [Created] (TIKA-1751) Use java.nio.file.Path in TikaConfig
Yaniv Kunda created TIKA-1751: - Summary: Use java.nio.file.Path in TikaConfig Key: TIKA-1751 URL: https://issues.apache.org/jira/browse/TIKA-1751 Project: Tika Issue Type: Sub-task Components: config Reporter: Yaniv Kunda Priority: Minor Fix For: 1.11 Provide constructors accepting java.nio.file.Path -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1751) Use java.nio.file.Path in TikaConfig
[ https://issues.apache.org/jira/browse/TIKA-1751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yaniv Kunda updated TIKA-1751: -- Attachment: TIKA-1751.patch > Use java.nio.file.Path in TikaConfig > > > Key: TIKA-1751 > URL: https://issues.apache.org/jira/browse/TIKA-1751 > Project: Tika > Issue Type: Sub-task > Components: config >Reporter: Yaniv Kunda >Priority: Minor > Fix For: 1.11 > > Attachments: TIKA-1751.patch > > > Provide constructors accepting java.nio.file.Path -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (TIKA-1752) Use java.nio.file.Path in org.apache.tika.detect
Yaniv Kunda created TIKA-1752: - Summary: Use java.nio.file.Path in org.apache.tika.detect Key: TIKA-1752 URL: https://issues.apache.org/jira/browse/TIKA-1752 Project: Tika Issue Type: Sub-task Components: detector Reporter: Yaniv Kunda Priority: Minor Fix For: 1.11 Add constructors and methods accepting java.nio.file.Path to TrainedModelDetector & Son. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1734) Use java.nio.file.Path in TemporaryResources
[ https://issues.apache.org/jira/browse/TIKA-1734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yaniv Kunda updated TIKA-1734: -- Labels: java7 (was: ) > Use java.nio.file.Path in TemporaryResources > > > Key: TIKA-1734 > URL: https://issues.apache.org/jira/browse/TIKA-1734 > Project: Tika > Issue Type: Sub-task > Components: core >Reporter: Yaniv Kunda >Priority: Minor > Labels: java7 > Fix For: 1.11 > > Attachments: TIKA-1734.patch > > > This will provide support for the new api for users who need it, and provide > better information in I/O operations, e.g. detailed exception if temporary > file deletion fails. > - used Path and methods in java.nio.file.Files internally > - add setTemporaryFileDirectory(Path) method > - add createTempFile() method (mimicking Files.createTempFile) > - add unit test for proper deletion of temp files
[jira] [Updated] (TIKA-1752) Use java.nio.file.Path in org.apache.tika.detect
[ https://issues.apache.org/jira/browse/TIKA-1752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yaniv Kunda updated TIKA-1752: -- Labels: java7 (was: ) > Use java.nio.file.Path in org.apache.tika.detect > > > Key: TIKA-1752 > URL: https://issues.apache.org/jira/browse/TIKA-1752 > Project: Tika > Issue Type: Sub-task > Components: detector >Reporter: Yaniv Kunda >Priority: Minor > Labels: java7 > Fix For: 1.11 > > Attachments: TIKA-1752.patch > > > Add constructors and methods accepting java.nio.file.Path to > TrainedModelDetector & Son. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1746) modify TikaFileTypeDetector to use new detect method accepting java.nio.file.Path
[ https://issues.apache.org/jira/browse/TIKA-1746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yaniv Kunda updated TIKA-1746: -- Labels: java7 (was: ) > modify TikaFileTypeDetector to use new detect method accepting > java.nio.file.Path > - > > Key: TIKA-1746 > URL: https://issues.apache.org/jira/browse/TIKA-1746 > Project: Tika > Issue Type: Sub-task > Components: detector >Reporter: Yaniv Kunda >Priority: Minor > Labels: java7 > Fix For: 1.11 > > Attachments: TIKA-1746.patch > > > Utilize the new org.apache.tika.Tika.detect(Path) method -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1744) Use java.nio.file.Path in TikaInputStream
[ https://issues.apache.org/jira/browse/TIKA-1744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yaniv Kunda updated TIKA-1744: -- Attachment: TIKA-1744.patch > Use java.nio.file.Path in TikaInputStream > - > > Key: TIKA-1744 > URL: https://issues.apache.org/jira/browse/TIKA-1744 > Project: Tika > Issue Type: Sub-task > Components: core >Reporter: Yaniv Kunda >Priority: Minor > Fix For: 1.11 > > Attachments: TIKA-1744.patch > > > This will provide support for the new api for users who need it, and provide > better information in I/O operations, e.g. detailed exception if file cannot > be read. > - used Path and methods in java.nio.file.Files internally > - add getPath() method as the counterpart to getFile() > - modified test to use -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1745) Add methods accepting java.nio.file.Path to org.apache.tika.Tika and org.apache.tika.parser.ParsingReader
[ https://issues.apache.org/jira/browse/TIKA-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yaniv Kunda updated TIKA-1745: -- Attachment: TIKA-1745.patch > Add methods accepting java.nio.file.Path to org.apache.tika.Tika and > org.apache.tika.parser.ParsingReader > - > > Key: TIKA-1745 > URL: https://issues.apache.org/jira/browse/TIKA-1745 > Project: Tika > Issue Type: Sub-task > Components: core >Reporter: Yaniv Kunda >Priority: Minor > Fix For: 1.11 > > Attachments: TIKA-1745.patch > > > Add methods accepting java.nio.file.Path to complement those accepting > java.io.File, using the new methods in TikaInputStream or java.nio.file.Files -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1746) modify TikaFileTypeDetector to use new detect method accepting java.nio.file.Path
[ https://issues.apache.org/jira/browse/TIKA-1746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yaniv Kunda updated TIKA-1746: -- Attachment: TIKA-1746.patch > modify TikaFileTypeDetector to use new detect method accepting > java.nio.file.Path > - > > Key: TIKA-1746 > URL: https://issues.apache.org/jira/browse/TIKA-1746 > Project: Tika > Issue Type: Sub-task > Components: detector >Reporter: Yaniv Kunda >Priority: Minor > Fix For: 1.11 > > Attachments: TIKA-1746.patch > > > Utilize the new org.apache.tika.Tika.detect(Path) method -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1738) ForkClient does not always delete temporary bootstrap jar
[ https://issues.apache.org/jira/browse/TIKA-1738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yaniv Kunda updated TIKA-1738: -- Attachment: TIKA-1738.patch This patch makes the bootstrap jar creation static, so that it happens only once, during class initialization. Deletion is done using a single shutdown hook, which will *probably* do its job, if no handle created by a forked process still references the file - i.e. if enough time has passed between the destruction of the last forked process and the JVM shutdown. It also uses java.nio.file instead of the old java.io package. Added benefit: performance is better since forked processes do not need to create the bootstrap jar all over again. Added drawback: if the temp jar is deleted between forks, future forks will fail. > ForkClient does not always delete temporary bootstrap jar > - > > Key: TIKA-1738 > URL: https://issues.apache.org/jira/browse/TIKA-1738 > Project: Tika > Issue Type: Bug > Components: core > Environment: Windows 10 >Reporter: Yaniv Kunda >Priority: Minor > Fix For: 1.11 > > Attachments: TIKA-1738.patch > > > ForkClient creates a new temporary bootstrap jar each time it's instantiated, > and tries to delete it in the {{close()}} method, after destroying the > process. > Possibly a Windows-specific behavior: the OS seems to still hold a handle to > the file for a bit after the process is destroyed, causing the delete() method to > do nothing. > This is reproduced by simply running ForkParserTest on my machine. > In a long-running process, this could fill the temp folder with many bootstrap > jars that will never be deleted.
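The approach described in the comment (create the temporary file once during class initialization and delete it from a single shutdown hook) could be sketched as follows. This is not the code from TIKA-1738.patch; SharedTempFileSketch, the file prefix, and the sharedFile accessor are all assumptions made for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch; not the actual ForkClient code from the patch.
class SharedTempFileSketch {

    // Created once, during class initialization, and shared by all users.
    private static final Path SHARED_FILE = createSharedFile();

    private static Path createSharedFile() {
        try {
            Path file = Files.createTempFile("bootstrap-sketch-", ".jar");
            // A single hook attempts deletion at JVM shutdown. This is best
            // effort: the delete can still fail if something holds the file open.
            Runtime.getRuntime().addShutdownHook(new Thread(() -> {
                try {
                    Files.deleteIfExists(file);
                } catch (IOException e) {
                    System.err.println("Could not delete " + file + ": " + e);
                }
            }));
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Every caller gets the same file instead of creating a new one.
    static Path sharedFile() {
        return SHARED_FILE;
    }
}
```

As the comment notes, the trade-off is that anything deleting the shared file while the JVM is still running breaks later users of it.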
[jira] [Created] (TIKA-1734) Use java.nio.file.Path in TemporaryResources
Yaniv Kunda created TIKA-1734: - Summary: Use java.nio.file.Path in TemporaryResources Key: TIKA-1734 URL: https://issues.apache.org/jira/browse/TIKA-1734 Project: Tika Issue Type: Sub-task Components: core Reporter: Yaniv Kunda Priority: Minor Fix For: 1.11 This will provide support for the new api for users who need it, and provide better information in I/O operations, e.g. detailed exception if temporary file deletion fails. - used Path and methods in java.nio.file.Files internally - add setTemporaryFileDirectory(Path) method - add createTempFile() method (mimicking Files.createTempFile) - add unit test for proper deletion of temp files
[jira] [Updated] (TIKA-1734) Use java.nio.file.Path in TemporaryResources
[ https://issues.apache.org/jira/browse/TIKA-1734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yaniv Kunda updated TIKA-1734: -- Attachment: TIKA-1734.patch > Use java.nio.file.Path in TemporaryResources > > > Key: TIKA-1734 > URL: https://issues.apache.org/jira/browse/TIKA-1734 > Project: Tika > Issue Type: Sub-task > Components: core >Reporter: Yaniv Kunda >Priority: Minor > Fix For: 1.11 > > Attachments: TIKA-1734.patch > > > This will provide support for the new api for users who need it, and provide > better information in I/O operations, e.g. detailed exception if temporary > file deletion fails. > - used Path and methods in java.nio.file.Files internally > - add setTemporaryFileDirectory(Path) method > - add createTempFile() method (mimicking Files.createTempFile) > - add unit test for proper deletion of temp files
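The "better information in I/O operations" argument can be seen by comparing the two delete calls in the JDK. This is a minimal sketch; DeleteSketch and its two methods are hypothetical names and do not come from TemporaryResources.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper for illustration; not part of TemporaryResources.
class DeleteSketch {

    // java.io.File#delete only reports success or failure as a boolean.
    static boolean legacyDelete(File file) {
        return file.delete();
    }

    // java.nio.file.Files#delete throws an exception whose type names the
    // cause, for example NoSuchFileException or DirectoryNotEmptyException.
    static String describeDelete(Path path) {
        try {
            Files.delete(path);
            return "deleted";
        } catch (IOException e) {
            return e.getClass().getSimpleName();
        }
    }
}
```

With the File API a caller only learns that deletion failed; with the Files API the exception type and message say why, which is what makes a meaningful error report for a failed temporary-file cleanup possible.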
[jira] [Updated] (TIKA-1726) Augment public methods that use a java.io.File with methods that use a java.nio.file.Path
[ https://issues.apache.org/jira/browse/TIKA-1726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yaniv Kunda updated TIKA-1726: -- Description: In light of Java 7 already EOL, it's high time we add support for the new java.nio.file.Path class introduced with it, which, together with support methods in java.nio.file.Files and others, provide a better file I/O framework than java.io.File. In just two cases, we have public methods in tika that only return a File object, and cannot be overloaded, so a different name for the new method must be created: - {{org.apache.tika.io.TemporaryResources#createTemporaryFile()}} _Suggestions:_ -- addTemporaryFile -- addTempFile -- createTempFile - {{org.apache.tika.io.TikaInputStream#getFile()}} _Suggestions:_ -- asFile -- toPath -- getPath In other cases, the methods accept a File as an argument, and should remain as tika users might be using them - so an overloaded method that accepts a Path instead should be added, referencing the new method from the old one (using the @see tag) and deprecating the old method until an unknown tika major release. 
Here is the full list of other methods: _tika-app:_ - {{org.apache.tika.gui.TikaGUI#openFile(File)}} _tika-batch:_ - {{org.apache.tika.batch.fs.FSUtil#getOutputFile(File, String, HANDLE_EXISTING, String)}} - {{org.apache.tika.util.PropsUtil#getFile(String, File)}} - {{org.apache.tika.batch.fs.FSDirectoryCrawler}} constructors - {{org.apache.tika.batch.fs.FSDirectoryCrawler#handleFirstFileInDirectory(File)}} - {{org.apache.tika.batch.fs.FSFileResource}} constructor - {{org.apache.tika.batch.fs.FSListCrawler}} constructor - {{org.apache.tika.batch.fs.FSOutputStreamFactory}} constructor - {{org.apache.tika.batch.fs.FSUtil#checkThisIsAncestorOfOrSameAsThat(File, File)}} - {{org.apache.tika.batch.fs.FSUtil#checkThisIsAncestorOfThat(File, File)}} - {{org.apache.tika.batch.fs.strawman.StrawManTikaAppDriver}} constructor _tika-core:_ - {{org.apache.tika.Tika#detect(File)}} - {{org.apache.tika.Tika#parse(File)}} - {{org.apache.tika.Tika#parseToString(File)}} - {{org.apache.tika.config.TikaConfig}} constructors - {{org.apache.tika.detect.NNExampleModelDetector}} constructor - {{org.apache.tika.detect.TrainedModelDetector#loadDefaultModels(File)}} - {{org.apache.tika.io.TemporaryResources#setTemporaryFileDirectory(File)}} - {{org.apache.tika.io.TikaInputStream#get(File)}} - {{org.apache.tika.io.TikaInputStream#get(File, Metadata)}} _tika-parsers:_ - {{org.apache.tika.parser.ParsingReader}} constructor - {{org.apache.tika.parser.image.ImageMetadataExtractor#parseJpeg(File)}} - {{org.apache.tika.parser.image.ImageMetadataExtractor#parseWebP(File)}} - {{org.apache.tika.parser.mp4.DirectFileReadDataSource}} constructor _tika-translate:_ - {{org.apache.tika.language.translate.ExternalTranslator#runAndGetOutput(String, String[], File)}} Due to lack of evidence, all public methods in public non-test classes (and not in tika-example) are deemed part of a public API - although there's no formal definition of such. 
If anyone knows of a public method which isn't accessed publicly and can be defined as package-private, or for another reason, please comment. was: In light of Java 7 already EOL, it's high time we add support for the new java.nio.file.Path class introduced with it, which, together with support methods in java.nio.file.Files and others, provide a better file I/O framework than java.io.File. In just two cases, we have public methods in tika that only return a File object, and cannot be overloaded, so a different name for the new method must be created: - {{org.apache.tika.io.TemporaryResources#createTemporaryFile()}} _Suggestions:_ -- addTemporaryFile -- addTempFile -- createTempFile - {{org.apache.tika.io.TikaInputStream#getFile()}} _Suggestions:_ -- asFile -- toPath -- getPath In other cases, the methods accept a File as an argument, and should remain as tika users might be using them - so an overloaded method that accepts a Path instead should be added, deprecating the old method until an unknown tika major release. Here is the full list of other methods: _tika-app:_ - {{org.apache.tika.gui.TikaGUI#openFile(File)}} _tika-batch:_ - {{org.apache.tika.batch.fs.FSUtil#getOutputFile(File, String, HANDLE_EXISTING, String)}} - {{org.apache.tika.util.PropsUtil#getFile(String, File)}} - {{org.apache.tika.batch.fs.FSDirectoryCrawler}} constructors - {{org.apache.tika.batch.fs.FSDirectoryCrawler#handleFirstFileInDirectory(File)}} - {{org.apache.tika.batch.fs.FSFileResource}} constructor - {{org.apache.tika.batch.fs.FSListCrawler}} constructor - {{org.apache.tika.batch.fs.FSOutputStreamFactory}} constructor - {{org.apache.tika.batch.fs.FSUtil#checkThisIsAncestorOfOrSameAsThat(File, File)}} - {{org.apache.tika.batch.fs.FSUtil#checkThisIsAncestorOfThat(File, File)}} - {{org.apache.tika.batch.fs.strawman.StrawManTikaAppDriver}} constructor
[jira] [Updated] (TIKA-1726) Augment public methods that use a java.io.File with methods that use a java.nio.file.Path
[ https://issues.apache.org/jira/browse/TIKA-1726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yaniv Kunda updated TIKA-1726: -- Description: In light of Java 7 already EOL, it's high time we add support for the new java.nio.file.Path class introduced with it, which, together with support methods in java.nio.file.Files and others, provide a better file I/O framework than java.io.File. In just two cases, we have public methods in tika that only return a File object, and cannot be overloaded, so a different name for the new method must be created: - {{org.apache.tika.io.TemporaryResources#createTemporaryFile()}} _Suggestions:_ -- addTemporaryFile -- addTempFile -- createTempFile -- createTemporaryPath - {{org.apache.tika.io.TikaInputStream#getFile()}} _Suggestions:_ -- asFile -- toPath -- getPath In other cases, the methods accept a File as an argument, and should remain as tika users might be using them - so an overloaded method that accepts a Path instead should be added, referencing the new method from the old one (using the @see tag) until java.io.File itself is deprecated or otherwise becomes obsolete. 
Here is the full list of other methods: _tika-app:_ - {{org.apache.tika.gui.TikaGUI#openFile(File)}} _tika-batch:_ - {{org.apache.tika.batch.fs.FSUtil#getOutputFile(File, String, HANDLE_EXISTING, String)}} - {{org.apache.tika.util.PropsUtil#getFile(String, File)}} - {{org.apache.tika.batch.fs.FSDirectoryCrawler}} constructors - {{org.apache.tika.batch.fs.FSDirectoryCrawler#handleFirstFileInDirectory(File)}} - {{org.apache.tika.batch.fs.FSFileResource}} constructor - {{org.apache.tika.batch.fs.FSListCrawler}} constructor - {{org.apache.tika.batch.fs.FSOutputStreamFactory}} constructor - {{org.apache.tika.batch.fs.FSUtil#checkThisIsAncestorOfOrSameAsThat(File, File)}} - {{org.apache.tika.batch.fs.FSUtil#checkThisIsAncestorOfThat(File, File)}} - {{org.apache.tika.batch.fs.strawman.StrawManTikaAppDriver}} constructor _tika-core:_ - {{org.apache.tika.Tika#detect(File)}} - {{org.apache.tika.Tika#parse(File)}} - {{org.apache.tika.Tika#parseToString(File)}} - {{org.apache.tika.config.TikaConfig}} constructors - {{org.apache.tika.detect.NNExampleModelDetector}} constructor - {{org.apache.tika.detect.TrainedModelDetector#loadDefaultModels(File)}} - {{org.apache.tika.io.TemporaryResources#setTemporaryFileDirectory(File)}} - {{org.apache.tika.io.TikaInputStream#get(File)}} - {{org.apache.tika.io.TikaInputStream#get(File, Metadata)}} _tika-parsers:_ - {{org.apache.tika.parser.ParsingReader}} constructor - {{org.apache.tika.parser.image.ImageMetadataExtractor#parseJpeg(File)}} - {{org.apache.tika.parser.image.ImageMetadataExtractor#parseWebP(File)}} - {{org.apache.tika.parser.mp4.DirectFileReadDataSource}} constructor _tika-translate:_ - {{org.apache.tika.language.translate.ExternalTranslator#runAndGetOutput(String, String[], File)}} Due to lack of evidence, all public methods in public non-test classes (and not in tika-example) are deemed part of a public API - although there's no formal definition of such. 
If anyone knows of a public method which isn't accessed publicly and can be defined as package-private, or for another reason, please comment. was: In light of Java 7 already EOL, it's high time we add support for the new java.nio.file.Path class introduced with it, which, together with support methods in java.nio.file.Files and others, provide a better file I/O framework than java.io.File. In just two cases, we have public methods in tika that only return a File object, and cannot be overloaded, so a different name for the new method must be created: - {{org.apache.tika.io.TemporaryResources#createTemporaryFile()}} _Suggestions:_ -- addTemporaryFile -- addTempFile -- createTempFile - {{org.apache.tika.io.TikaInputStream#getFile()}} _Suggestions:_ -- asFile -- toPath -- getPath In other cases, the methods accept a File as an argument, and should remain as tika users might be using them - so an overloaded method that accepts a Path instead should be added, referencing the new method from the old one using (using the @see tag) deprecating the old method until an unknown tika major release. Here is the full list of other methods: _tika-app:_ - {{org.apache.tika.gui.TikaGUI#openFile(File)}} _tika-batch:_ - {{org.apache.tika.batch.fs.FSUtil#getOutputFile(File, String, HANDLE_EXISTING, String)}} - {{org.apache.tika.util.PropsUtil#getFile(String, File)}} - {{org.apache.tika.batch.fs.FSDirectoryCrawler}} constructors - {{org.apache.tika.batch.fs.FSDirectoryCrawler#handleFirstFileInDirectory(File)}} - {{org.apache.tika.batch.fs.FSFileResource}} constructor - {{org.apache.tika.batch.fs.FSListCrawler}} constructor - {{org.apache.tika.batch.fs.FSOutputStreamFactory}} constructor - {{org.apache.tika.batch.fs.FSUtil#checkThisIsAncestorOfOrSameAsThat(File, File)}} -
[jira] [Commented] (TIKA-1726) Augment public methods that use a java.io.File with methods that use a java.nio.file.Path
[ https://issues.apache.org/jira/browse/TIKA-1726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14725769#comment-14725769 ] Yaniv Kunda commented on TIKA-1726: --- Funny you proposed those two alternatives - exactly what I started with... But compared to the methods in the Java platform it seems partly incorrect as these mostly deal with files, located using paths, e.g. Files.createTempFile. So for createTemporaryFile I think that createTemporaryPath is problematic in the sense that an actual file is created, not just a path. I suggested the add* variants to hint that the file is added to the list of resources to close, as in addResource. For getFile, getPath is actually pretty ok but I think both are problematic in that they look like a getter - I wanted to signify its write-to-file functionality. How about save/store/persist? Regarding deprecation, not a problem - I'll drop it and add a @see tag from the old method to the new one (but not the other way round?). In both questions, my suggestions are only suggestions, and my reservations are only reservations - but if you can take a call and make any decision I'd be happy to accept it and move this forward. > Augment public methods that use a java.io.File with methods that use a > java.nio.file.Path > - > > Key: TIKA-1726 > URL: https://issues.apache.org/jira/browse/TIKA-1726 > Project: Tika > Issue Type: Improvement > Components: batch, core, gui, parser, translation >Reporter: Yaniv Kunda >Priority: Minor > Fix For: 1.11 > > > In light of Java 7 already EOL, it's high time we add support for the new > java.nio.file.Path class introduced with it, which, together with support > methods in java.nio.file.Files and others, provide a better file I/O > framework than java.io.File. 
> In just two cases, we have public methods in tika that only return a File > object, and cannot be overloaded, so a different name for the new method must > be created: > - {{org.apache.tika.io.TemporaryResources#createTemporaryFile()}} > _Suggestions:_ > -- addTemporaryFile > -- addTempFile > -- createTempFile > - {{org.apache.tika.io.TikaInputStream#getFile()}} > _Suggestions:_ > -- asFile > -- toPath > -- getPath > In other cases, the methods accept a File as an argument, and should remain > as tika users might be using them - so an overloaded method that accepts a > Path instead should be added, deprecating the old method until an unknown > tika major release. > Here is the full list of other methods: > _tika-app:_ > - {{org.apache.tika.gui.TikaGUI#openFile(File)}} > _tika-batch:_ > - {{org.apache.tika.batch.fs.FSUtil#getOutputFile(File, String, > HANDLE_EXISTING, String)}} > - {{org.apache.tika.util.PropsUtil#getFile(String, File)}} > - {{org.apache.tika.batch.fs.FSDirectoryCrawler}} constructors > - > {{org.apache.tika.batch.fs.FSDirectoryCrawler#handleFirstFileInDirectory(File)}} > - {{org.apache.tika.batch.fs.FSFileResource}} constructor > - {{org.apache.tika.batch.fs.FSListCrawler}} constructor > - {{org.apache.tika.batch.fs.FSOutputStreamFactory}} constructor > - {{org.apache.tika.batch.fs.FSUtil#checkThisIsAncestorOfOrSameAsThat(File, > File)}} > - {{org.apache.tika.batch.fs.FSUtil#checkThisIsAncestorOfThat(File, File)}} > - {{org.apache.tika.batch.fs.strawman.StrawManTikaAppDriver}} constructor > _tika-core:_ > - {{org.apache.tika.Tika#detect(File)}} > - {{org.apache.tika.Tika#parse(File)}} > - {{org.apache.tika.Tika#parseToString(File)}} > - {{org.apache.tika.config.TikaConfig}} constructors > - {{org.apache.tika.detect.NNExampleModelDetector}} constructor > - {{org.apache.tika.detect.TrainedModelDetector#loadDefaultModels(File)}} > - {{org.apache.tika.io.TemporaryResources#setTemporaryFileDirectory(File)}} > - 
{{org.apache.tika.io.TikaInputStream#get(File)}} > - {{org.apache.tika.io.TikaInputStream#get(File, Metadata)}} > _tika-parsers:_ > - {{org.apache.tika.parser.ParsingReader}} constructor > - {{org.apache.tika.parser.image.ImageMetadataExtractor#parseJpeg(File)}} > - {{org.apache.tika.parser.image.ImageMetadataExtractor#parseWebP(File)}} > - {{org.apache.tika.parser.mp4.DirectFileReadDataSource}} constructor > _tika-translate:_ > - > {{org.apache.tika.language.translate.ExternalTranslator#runAndGetOutput(String, > String[], File)}} > Due to lack of evidence, all public methods in public non-test classes (and > not in tika-example) are deemed part of a public API - although there's no > formal definition of such. > If anyone knows of a public method which isn't accessed publicly and can be > defined as package-private, or for another reason, please comment. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
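For illustration, a minimal sketch of the pattern discussed in this thread, using JDK calls only. It is not taken from any attached patch: the class {{PathOverloadSketch}} and its method names are hypothetical. It assumes the old File method stays in place, carries a @see tag pointing at the new Path overload, and delegates to it, while a method that only returns a File gets a differently named Path counterpart.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical class; names are illustrative and not part of Tika's API.
final class PathOverloadSketch {

    /**
     * Legacy entry point, kept for existing callers.
     *
     * @see #size(Path)
     */
    static long size(File file) throws IOException {
        return size(file.toPath()); // delegate to the Path-based overload
    }

    /** New overload that accepts a Path instead of a File. */
    static long size(Path path) throws IOException {
        return Files.size(path);
    }

    /** Returns a File, so it cannot be overloaded by return type alone. */
    static File createTemporaryFile() throws IOException {
        return createTempFile().toFile();
    }

    /** Differently named counterpart; an actual file is created, not just a path. */
    static Path createTempFile() throws IOException {
        return Files.createTempFile("sketch-", ".tmp");
    }
}
```

Delegating from the File variant to the Path variant keeps a single implementation, so the two cannot drift apart.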
[jira] [Created] (TIKA-1726) Augment public methods that use a java.io.File with methods that use a java.nio.file.Path
Yaniv Kunda created TIKA-1726: - Summary: Augment public methods that use a java.io.File with methods that use a java.nio.file.Path Key: TIKA-1726 URL: https://issues.apache.org/jira/browse/TIKA-1726 Project: Tika Issue Type: Improvement Components: batch, core, gui, parser, translation Reporter: Yaniv Kunda Priority: Minor Fix For: 1.11 In light of Java 7 already EOL, it's high time we add support for the new java.nio.file.Path class introduced with it, which, together with support methods in java.nio.file.Files and others, provide a better file I/O framework than java.io.File. In just two cases, we have public methods in tika that only return a File object, and cannot be overloaded, so a different name for the new method must be created: - {{org.apache.tika.io.TemporaryResources#createTemporaryFile()}} _Suggestions:_ -- addTemporaryFile -- addTempFile -- createTempFile - {{org.apache.tika.io.TikaInputStream#getFile()}} _Suggestions:_ -- asFile -- toPath -- getPath In other cases, the methods accept a File as an argument, and should remain as tika users might be using them - so an overloaded method that accepts a Path instead should be added, deprecating the old method until an unknown tika major release. 
Here is the full list of other methods: _tika-app:_ - {{org.apache.tika.gui.TikaGUI#openFile(File)}} _tika-batch:_ - {{org.apache.tika.batch.fs.FSUtil#getOutputFile(File, String, HANDLE_EXISTING, String)}} - {{org.apache.tika.util.PropsUtil#getFile(String, File)}} - {{org.apache.tika.batch.fs.FSDirectoryCrawler}} constructors - {{org.apache.tika.batch.fs.FSDirectoryCrawler#handleFirstFileInDirectory(File)}} - {{org.apache.tika.batch.fs.FSFileResource}} constructor - {{org.apache.tika.batch.fs.FSListCrawler}} constructor - {{org.apache.tika.batch.fs.FSOutputStreamFactory}} constructor - {{org.apache.tika.batch.fs.FSUtil#checkThisIsAncestorOfOrSameAsThat(File, File)}} - {{org.apache.tika.batch.fs.FSUtil#checkThisIsAncestorOfThat(File, File)}} - {{org.apache.tika.batch.fs.strawman.StrawManTikaAppDriver}} constructor _tika-core:_ - {{org.apache.tika.Tika#detect(File)}} - {{org.apache.tika.Tika#parse(File)}} - {{org.apache.tika.Tika#parseToString(File)}} - {{org.apache.tika.config.TikaConfig}} constructors - {{org.apache.tika.detect.NNExampleModelDetector}} constructor - {{org.apache.tika.detect.TrainedModelDetector#loadDefaultModels(File)}} - {{org.apache.tika.io.TemporaryResources#setTemporaryFileDirectory(File)}} - {{org.apache.tika.io.TikaInputStream#get(File)}} - {{org.apache.tika.io.TikaInputStream#get(File, Metadata)}} _tika-parsers:_ - {{org.apache.tika.parser.ParsingReader}} constructor - {{org.apache.tika.parser.image.ImageMetadataExtractor#parseJpeg(File)}} - {{org.apache.tika.parser.image.ImageMetadataExtractor#parseWebP(File)}} - {{org.apache.tika.parser.mp4.DirectFileReadDataSource}} constructor _tika-translate:_ - {{org.apache.tika.language.translate.ExternalTranslator#runAndGetOutput(String, String[], File)}} Due to lack of evidence, all public methods in public non-test classes (and not in tika-example) are deemed part of a public API - although there's no formal definition of such. 
If anyone knows of a public method which isn't accessed publicly and can be defined as package-private, or for another reason, please comment. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1706) Bring back commons-io to tika-core
[ https://issues.apache.org/jira/browse/TIKA-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14717349#comment-14717349 ] Yaniv Kunda commented on TIKA-1706: --- The fact that o.a.tika.io contains public classes is a problem I didn't think about - these files are strictly meant as internal utility/support classes and shouldn't really be used by users. In fact, I would say although these are public classes, they should not be considered a part of the public API of tika-core. And since we don't know what commons-io-cloned classes users use (probably by accident), it is indeed a problem letting these go. I also think that the no-dependencies principle is more romantic than it is useful, as these days a lot of the Java ecosystem is built on using external libraries, unless space is critical such as in mobile applications (and even these are getting bigger and bigger). As the vast majority of tika-core usages comes transitively from tika-parsers, I think this is not the case. I haven't crawled the maven repo (deeply enough) to find how many tika-core exclusive usages have few or no other dependencies, but I suspect that number is not very high. So the absolute worst case here - and remember that this is the extreme case of a library that uses tika-core and no other library - is a 30% footprint increase! o.a.tika.io is a mess - it contains: - classes from commons-io-1.4 - partial classes from commons-io-1.4 - modified classes from commons-io-1.4 - classes from commons-io-2.0 (or later unknown version/s) - tika original classes It's really hard going over all changes - and I've shown just a few examples - but just doing the switch is simply easier, not so costly even in the worst case, and would bring progress to our doorstep (today and in future changes) by exploration faster than maintaining copied code. 
My suggestion is: - bring commons-io back to tika-core - change all usages of the copied classes to commons-io - deprecate (do not delete) the copied classes, probably until tika-2.0 Bring back commons-io to tika-core -- Key: TIKA-1706 URL: https://issues.apache.org/jira/browse/TIKA-1706 Project: Tika Issue Type: Improvement Components: core Reporter: Yaniv Kunda Priority: Minor Fix For: 1.11 TIKA-249 inlined select commons-io classes in order to simplify the dependency tree and save some space. I believe these arguments are weaker nowadays due to the following concerns: - Most of the non-core modules already use commons-io, and since tika-core is usually not used by itself, commons-io is already included with it - Since some modules use both tika-core and commons-io, it's not clear which code should be used - Having the inlined classes causes more maintenance and/or technology debt (which in turn causes more maintenance) - Newer commons-io code utilizes newer platform code, e.g. using Charset objects instead of encoding names, being able to use StringBuilder instead of StringBuffer, and so on. I'll be happy to provide a patch to replace usages of the inlined classes with commons-io classes if this is accepted. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1706) Bring back commons-io to tika-core
[ https://issues.apache.org/jira/browse/TIKA-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14717519#comment-14717519 ] Yaniv Kunda commented on TIKA-1706: --- That's why I suggested to just add commons-io to tika-core, use it internally, and just deprecate the copied classes. Is that ok for 1.x? Bring back commons-io to tika-core -- Key: TIKA-1706 URL: https://issues.apache.org/jira/browse/TIKA-1706 Project: Tika Issue Type: Improvement Components: core Reporter: Yaniv Kunda Priority: Minor Fix For: 1.11 TIKA-249 inlined select commons-io classes in order to simplify the dependency tree and save some space. I believe these arguments are weaker nowadays due to the following concerns: - Most of the non-core modules already use commons-io, and since tika-core is usually not used by itself, commons-io is already included with it - Since some modules use both tika-core and commons-io, it's not clear which code should be used - Having the inlined classes causes more maintenance and/or technology debt (which in turn causes more maintenance) - Newer commons-io code utilizes newer platform code, e.g. using Charset objects instead of encoding names, being able to use StringBuilder instead of StringBuffer, and so on. I'll be happy to provide a patch to replace usages of the inlined classes with commons-io classes if this is accepted. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (TIKA-1720) Collect multiple exceptions in TemporaryResources.close() using Throwable.addSuppressed()
Yaniv Kunda created TIKA-1720: - Summary: Collect multiple exceptions in TemporaryResources.close() using Throwable.addSuppressed() Key: TIKA-1720 URL: https://issues.apache.org/jira/browse/TIKA-1720 Project: Tika Issue Type: Improvement Components: core Reporter: Yaniv Kunda Priority: Minor Fix For: 1.11 TemporaryResources.close() currently collects exceptions thrown by trying to close its resources in a list. When the time to propagate an exception comes, information is lost - the thrown exception contains a message with the string descriptions of all exceptions, and the first exception as the cause - there is no stack trace describing what went wrong closing a resource. In addition, the thrown exception is IOExceptionWithCause, copied from commons-io, which is redundant since Java 6. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
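A minimal sketch of the idea, assuming a simple holder of Closeable resources. {{SuppressedCloseSketch}} is a hypothetical stand-in, not Tika's TemporaryResources and not the attached patch. The first failure is rethrown as-is and every later failure is attached with Throwable.addSuppressed(), so each keeps its own stack trace.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a resource holder; not Tika's TemporaryResources.
final class SuppressedCloseSketch implements Closeable {
    private final List<Closeable> resources = new ArrayList<>();

    void add(Closeable resource) {
        resources.add(resource);
    }

    @Override
    public void close() throws IOException {
        IOException first = null;
        // Close in reverse order of registration, as try-with-resources does.
        for (int i = resources.size() - 1; i >= 0; i--) {
            try {
                resources.get(i).close();
            } catch (IOException e) {
                if (first == null) {
                    first = e;              // the first failure is the one thrown
                } else {
                    first.addSuppressed(e); // later failures keep their stack traces
                }
            }
        }
        resources.clear();
        if (first != null) {
            throw first;
        }
    }
}
```

A caller can then inspect the remaining failures through getSuppressed() on the caught exception, and they also show up in a printed stack trace.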
[jira] [Created] (TIKA-1721) Replace IOExceptionWithCause in ForkClient
Yaniv Kunda created TIKA-1721: - Summary: Replace IOExceptionWithCause in ForkClient Key: TIKA-1721 URL: https://issues.apache.org/jira/browse/TIKA-1721 Project: Tika Issue Type: Improvement Components: core Reporter: Yaniv Kunda Priority: Minor Fix For: 1.11 IOExceptionWithCause (copied from commons-io) is redundant since Java 6. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
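As a short illustration of why the copied class is redundant: since Java 6, java.io.IOException itself has a constructor that takes a cause. The class {{IoCauseSketch}} below is hypothetical and only demonstrates that constructor; it is not the change made in ForkClient.

```java
import java.io.IOException;

// Hypothetical helper; shows the JDK constructor that makes IOExceptionWithCause unnecessary.
final class IoCauseSketch {
    static IOException wrap(String message, Throwable cause) {
        // IOException(String, Throwable) has been available since Java 6.
        return new IOException(message, cause);
    }
}
```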
[jira] [Updated] (TIKA-1720) Collect multiple exceptions in TemporaryResources.close() using Throwable.addSuppressed()
[ https://issues.apache.org/jira/browse/TIKA-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yaniv Kunda updated TIKA-1720: -- Attachment: TIKA-1720.patch Collect multiple exceptions in TemporaryResources.close() using Throwable.addSuppressed() - Key: TIKA-1720 URL: https://issues.apache.org/jira/browse/TIKA-1720 Project: Tika Issue Type: Improvement Components: core Reporter: Yaniv Kunda Priority: Minor Fix For: 1.11 Attachments: TIKA-1720.patch TemporaryResources.close() currently collects exceptions thrown by trying to close its resources in a list. When the time to propagate an exception comes, information is lost - the thrown exception contains a message with the string descriptions of all exceptions, and the first exception as the cause - there is no stack trace describing what went wrong closing a resource. In addition, the thrown exception is IOExceptionWithCause, copied from commons-io, which is redundant since Java 6. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1722) Tika methods that accept a File needlessly convert it to a URL
[ https://issues.apache.org/jira/browse/TIKA-1722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yaniv Kunda updated TIKA-1722: -- Attachment: TIKA-1722.patch Tika methods that accept a File needlessly convert it to a URL -- Key: TIKA-1722 URL: https://issues.apache.org/jira/browse/TIKA-1722 Project: Tika Issue Type: Improvement Components: core Reporter: Yaniv Kunda Priority: Minor Fix For: 1.11 Attachments: TIKA-1722.patch The following methods: - Tika.detect(File) - Tika.parse(File) - Tika.parseToString(File) Convert the given File to a URL and use the corresponding overloaded method that accepts a URL. This seems like a shortcut, but essentially does the following: # Converts the file to a URI # Converts the URI to a URL # Calls TikaInputStream.get(URL, Metadata), which then performs the following special handling: # Checks if the protocol is file # Tries to convert the URL (back) to a URI # Creates a File around the URI # Checks if file.isFile() # Calls TikaInputStream.get(File, Metadata) The special handling in TikaInputStream.get(URL/URI) is a good optimization for in-the-wild file resources, but for internal uses it can be skipped - making Tika call TikaInputStream.get(File, Metadata) directly. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (TIKA-1722) Tika methods that accept a File needlessly convert it to a URL
Yaniv Kunda created TIKA-1722: - Summary: Tika methods that accept a File needlessly convert it to a URL Key: TIKA-1722 URL: https://issues.apache.org/jira/browse/TIKA-1722 Project: Tika Issue Type: Improvement Components: core Reporter: Yaniv Kunda Priority: Minor Fix For: 1.11 The following methods: - Tika.detect(File) - Tika.parse(File) - Tika.parseToString(File) Convert the given File to a URL and use the corresponding overloaded method that accepts a URL. This seems like a shortcut, but essentially does the following: # Converts the file to a URI # Converts the URI to a URL # Calls TikaInputStream.get(URL, Metadata), which then performs the following special handling: # Checks if the protocol is file # Tries to convert the URL (back) to a URI # Creates a File around the URI # Checks if file.isFile() # Calls TikaInputStream.get(File, Metadata) The special handling in TikaInputStream.get(URL/URI) is a good optimization for in-the-wild file resources, but for internal uses it can be skipped - making Tika call TikaInputStream.get(File, Metadata) directly. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
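The detour described above can be reproduced with JDK calls alone. This is only a sketch of the conversions involved: {{FileUrlRoundTripSketch}} is a hypothetical class and does not call or reproduce TikaInputStream.

```java
import java.io.File;
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;

// Hypothetical class; mirrors the File -> URI -> URL -> URI -> File round trip.
final class FileUrlRoundTripSketch {
    static File viaUrl(File file) throws MalformedURLException, URISyntaxException {
        URI uri = file.toURI();                    // convert the file to a URI
        URL url = uri.toURL();                     // convert the URI to a URL
        if ("file".equalsIgnoreCase(url.getProtocol())) { // check the protocol
            File back = new File(url.toURI());     // URL back to URI, then to File
            if (back.isFile()) {                   // check that it is a regular file
                return back;                       // same file we started with
            }
        }
        return null; // anything else would have to be opened as a stream
    }
}
```

For an existing regular file, the result is the same file the caller already had in hand, which is why passing the File straight through skips all of these steps.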
[jira] [Updated] (TIKA-1711) Remove java6-activated profile from tika-bundle and move its plugins to default build
[ https://issues.apache.org/jira/browse/TIKA-1711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yaniv Kunda updated TIKA-1711: -- Attachment: (was: TIKA-1711.patch) Remove java6-activated profile from tika-bundle and move its plugins to default build - Key: TIKA-1711 URL: https://issues.apache.org/jira/browse/TIKA-1711 Project: Tika Issue Type: Bug Components: general Reporter: Yaniv Kunda Priority: Minor Fix For: 1.11 Since the project now requires Java 7, there's no point in allowing Java 6+ since the build would fail. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1711) Remove java6-activated profile from tika-bundle and move its plugins to default build
[ https://issues.apache.org/jira/browse/TIKA-1711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yaniv Kunda updated TIKA-1711: -- Summary: Remove java6-activated profile from tika-bundle and move its plugins to default build (was: Modify tika-bundle profile activation to require Java 7) Remove java6-activated profile from tika-bundle and move its plugins to default build - Key: TIKA-1711 URL: https://issues.apache.org/jira/browse/TIKA-1711 Project: Tika Issue Type: Bug Components: general Reporter: Yaniv Kunda Priority: Minor Fix For: 1.11 Since the project now requires Java 7, there's no point in allowing Java 6+ since the build would fail. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1711) Remove java6-activated profile from tika-bundle and move its plugins to default build
[ https://issues.apache.org/jira/browse/TIKA-1711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yaniv Kunda updated TIKA-1711: -- Attachment: TIKA-1711.patch Revised patch for the revised purpose Remove java6-activated profile from tika-bundle and move its plugins to default build - Key: TIKA-1711 URL: https://issues.apache.org/jira/browse/TIKA-1711 Project: Tika Issue Type: Bug Components: general Reporter: Yaniv Kunda Priority: Minor Fix For: 1.11 Attachments: TIKA-1711.patch Since the project now requires Java 7, there's no point in allowing Java 6+ since the build would fail. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (TIKA-1719) Utilize try-with-resources where it is trivial
Yaniv Kunda created TIKA-1719: - Summary: Utilize try-with-resources where it is trivial Key: TIKA-1719 URL: https://issues.apache.org/jira/browse/TIKA-1719 Project: Tika Issue Type: Improvement Components: cli, core, example, gui, packaging, parser, server Reporter: Yaniv Kunda Priority: Minor Fix For: 1.11
The following types of resource usage:
{code}
AutoCloseable resource = ...;
try {
    // do something with resource
} finally {
    resource.close();
}
{code}
{code}
AutoCloseable resource = null;
try {
    resource = ...;
    // do something with resource
} finally {
    if (resource != null) {
        resource.close();
    }
}
{code}
and similar constructs can be trivially replaced with Java 7's try-with-resources statement:
{code}
try (AutoCloseable resource = ...) {
    // do something with resource
}
{code}
This brings more concise code with less chance of causing resource leaks. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
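To make the behaviour concrete, here is a small self-contained sketch (the class {{TryWithResourcesSketch}} and its nested {{Tracked}} resource are hypothetical). It records when resources are opened and closed, showing that they are closed in reverse order, and before the catch block runs, even when the body throws.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; records events so the closing order can be observed.
final class TryWithResourcesSketch {

    static final class Tracked implements AutoCloseable {
        private final String name;
        private final List<String> log;

        Tracked(String name, List<String> log) {
            this.name = name;
            this.log = log;
            log.add("open " + name);
        }

        @Override
        public void close() {
            log.add("close " + name);
        }
    }

    static List<String> run() {
        List<String> log = new ArrayList<>();
        try (Tracked a = new Tracked("a", log); Tracked b = new Tracked("b", log)) {
            log.add("body");
            throw new IllegalStateException("boom"); // the body fails
        } catch (IllegalStateException e) {
            log.add("caught " + e.getMessage());     // both resources are already closed here
        }
        return log;
    }
}
```

Calling {{TryWithResourcesSketch.run()}} yields the events in this order: open a, open b, body, close b, close a, caught boom.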
[jira] [Commented] (TIKA-1710) Replace usages of classes in org.apache.tika.io with current alternatives
[ https://issues.apache.org/jira/browse/TIKA-1710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14705085#comment-14705085 ] Yaniv Kunda commented on TIKA-1710: --- As much as I like Guava (the library, not the fruit), its only use was its com.google.common.base.Charsets class, containing constants for the Charset instances of the standard charsets - same as in Java's StandardCharsets. When I replaced this with the static imports of StandardCharsets, there was no use left. Regarding TaggedInputStream, I wasn't sure what to do - this wrap/cast method was a modification of the original commons-io code, and it was used only once - in RFC822Parser. I think it's a nice-to-have optimization helper method but nothing more - as it only saves the cost of a new TaggedInputStream when the source InputStream is already a TaggedInputStream: the checked tag will behave the same way in the same wrap-try-catch flow. The only other usage of TaggedInputStream in tika (besides TikaInputStream) is in RTFParser, which uses the constructor directly, and it is actually an empty usage - the TaggedInputStream is constructed and checked in the catch clause, but it is not used in the try block at all: the underlying stream is! Since almost all of tika uses TikaInputStream (which has an advanced version of this helper that also ensures buffering), my opinion is to refrain from adding a helper method and simply use the constructor directly, for simplicity. 
Replace usages of classes in org.apache.tika.io with current alternatives - Key: TIKA-1710 URL: https://issues.apache.org/jira/browse/TIKA-1710 Project: Tika Issue Type: Improvement Components: batch, cli, core, example, gui, parser, server, translation Reporter: Yaniv Kunda Priority: Minor Fix For: 1.11 Attachments: TIKA-1710.patch Many of the classes in org.apache.tika.io were inlined from commons-io in TIKA-249, but these days most components use commons-io anyway, so in order to clean the dependencies on org.apache.tika.io in preparation for adding commons-io to tika-core, the following can be done: - Replace usages of classes in org.apache.tika.io within non-core components with the corresponding classes in commons-io - Replace usages of org.apache.tika.io.IOUtils.UTF_8 with java.nio.charset.StandardCharsets.UTF_8 (in all components, including tika-core) - Replace other uses of String encoding names of standard charsets with their corresponding Charset instances from StandardCharsets (this is logically related to IOUtils as these constants should have been there as UTF_8 was before Java 7) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
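A brief sketch of the third point, using only the JDK (the class {{CharsetSketch}} is hypothetical and not part of the patch). Passing an encoding name forces callers to handle a checked exception for a charset that is guaranteed to exist, while the StandardCharsets constant does not.

```java
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;

// Hypothetical class contrasting encoding names with Charset constants.
final class CharsetSketch {

    // Before: the name is resolved at run time and a checked exception must be declared.
    static byte[] byName(String text) throws UnsupportedEncodingException {
        return text.getBytes("UTF-8");
    }

    // After: the Charset constant is checked at compile time, no checked exception.
    static byte[] byConstant(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }
}
```

Both methods produce the same bytes; only the call-site ergonomics differ.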
[jira] [Updated] (TIKA-1719) Utilize try-with-resources where it is trivial
[ https://issues.apache.org/jira/browse/TIKA-1719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yaniv Kunda updated TIKA-1719: -- Attachment: TIKA-1719.patch Utilize try-with-resources where it is trivial -- Key: TIKA-1719 URL: https://issues.apache.org/jira/browse/TIKA-1719 Project: Tika Issue Type: Improvement Components: cli, core, example, gui, packaging, parser, server Reporter: Yaniv Kunda Priority: Minor Labels: easyfix Fix For: 1.11 Attachments: TIKA-1719.patch
The following types of resource usage:
{code}
AutoCloseable resource = ...;
try {
    // do something with resource
} finally {
    resource.close();
}
{code}
{code}
AutoCloseable resource = null;
try {
    resource = ...;
    // do something with resource
} finally {
    if (resource != null) {
        resource.close();
    }
}
{code}
and similar constructs can be trivially replaced with Java 7's try-with-resources statement:
{code}
try (AutoCloseable resource = ...) {
    // do something with resource
}
{code}
This brings more concise code with less chance of causing resource leaks. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1710) Replace usages of classes in org.apache.tika.io with current alternatives
[ https://issues.apache.org/jira/browse/TIKA-1710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yaniv Kunda updated TIKA-1710: -- Attachment: (was: TIKA-1710.patch) Replace usages of classes in org.apache.tika.io with current alternatives - Key: TIKA-1710 URL: https://issues.apache.org/jira/browse/TIKA-1710 Project: Tika Issue Type: Improvement Components: batch, cli, core, example, gui, parser, server, translation Reporter: Yaniv Kunda Priority: Minor Fix For: 1.11 Attachments: TIKA-1710.patch Many of the classes in org.apache.tika.io were inlined from commons-io in TIKA-249, but these days most components use commons-io anyway, so in order to clean the dependencies on org.apache.tika.io in preparation for adding commons-io to tika-core, the following can be done: - Replace usages of classes in org.apache.tika.io within non-core components with the corresponding classes in commons-io - Replace usages of org.apache.tika.io.IOUtils.UTF_8 with java.nio.charset.StandardCharsets.UTF_8 (in all components, including tika-core) - Replace other uses of String encoding names of standard charsets with their corresponding Charset instances from StandardCharsets (this is logically related to IOUtils as these constants should have been there as UTF_8 was before Java 7) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1710) Replace usages of classes in org.apache.tika.io with current alternatives
[ https://issues.apache.org/jira/browse/TIKA-1710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yaniv Kunda updated TIKA-1710:
------------------------------
Attachment: TIKA-1710.patch

Replace usages of classes in org.apache.tika.io with current alternatives
-------------------------------------------------------------------------
Key: TIKA-1710
URL: https://issues.apache.org/jira/browse/TIKA-1710
Project: Tika
Issue Type: Improvement
Components: batch, cli, core, example, gui, parser, server, translation
Reporter: Yaniv Kunda
Priority: Minor
Fix For: 1.11
Attachments: TIKA-1710.patch

Many of the classes in org.apache.tika.io were inlined from commons-io in TIKA-249, but these days most components use commons-io anyway. To clean up the dependencies on org.apache.tika.io in preparation for adding commons-io to tika-core, the following can be done:
- Replace usages of classes in org.apache.tika.io within non-core components with the corresponding classes in commons-io
- Replace usages of org.apache.tika.io.IOUtils.UTF_8 with java.nio.charset.StandardCharsets.UTF_8 (in all components, including tika-core)
- Replace other uses of String encoding names of standard charsets with their corresponding Charset instances from StandardCharsets (this is logically related to IOUtils, since before Java 7 these constants would have belonged there alongside UTF_8)
[jira] [Created] (TIKA-1711) Modify tika-bundle profile activation to require Java 7
Yaniv Kunda created TIKA-1711:
------------------------------
Summary: Modify tika-bundle profile activation to require Java 7
Key: TIKA-1711
URL: https://issues.apache.org/jira/browse/TIKA-1711
Project: Tika
Issue Type: Bug
Components: general
Reporter: Yaniv Kunda
Priority: Minor
Fix For: 1.11

Since the project now requires Java 7, there's no point in the profile activation allowing Java 6+, as the build would fail on Java 6.
[jira] [Updated] (TIKA-1710) Replace usages of classes in org.apache.tika.io with current alternatives
[ https://issues.apache.org/jira/browse/TIKA-1710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yaniv Kunda updated TIKA-1710:
------------------------------
Attachment: (was: TIKA-1710.patch)

Replace usages of classes in org.apache.tika.io with current alternatives
-------------------------------------------------------------------------
Key: TIKA-1710
URL: https://issues.apache.org/jira/browse/TIKA-1710
Project: Tika
Issue Type: Improvement
Components: batch, cli, core, example, gui, parser, server, translation
Reporter: Yaniv Kunda
Priority: Minor
Fix For: 1.11

Many of the classes in org.apache.tika.io were inlined from commons-io in TIKA-249, but these days most components use commons-io anyway. To clean up the dependencies on org.apache.tika.io in preparation for adding commons-io to tika-core, the following can be done:
- Replace usages of classes in org.apache.tika.io within non-core components with the corresponding classes in commons-io
- Replace usages of org.apache.tika.io.IOUtils.UTF_8 with java.nio.charset.StandardCharsets.UTF_8 (in all components, including tika-core)
- Replace other uses of String encoding names of standard charsets with their corresponding Charset instances from StandardCharsets (this is logically related to IOUtils, since before Java 7 these constants would have belonged there alongside UTF_8)
[jira] [Updated] (TIKA-1710) Replace usages of classes in org.apache.tika.io with current alternatives
[ https://issues.apache.org/jira/browse/TIKA-1710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yaniv Kunda updated TIKA-1710:
------------------------------
Attachment: TIKA-1710.patch

Revised patch without StandardCharsets wildcard static imports

Replace usages of classes in org.apache.tika.io with current alternatives
-------------------------------------------------------------------------
Key: TIKA-1710
URL: https://issues.apache.org/jira/browse/TIKA-1710
Project: Tika
Issue Type: Improvement
Components: batch, cli, core, example, gui, parser, server, translation
Reporter: Yaniv Kunda
Priority: Minor
Fix For: 1.11
Attachments: TIKA-1710.patch

Many of the classes in org.apache.tika.io were inlined from commons-io in TIKA-249, but these days most components use commons-io anyway. To clean up the dependencies on org.apache.tika.io in preparation for adding commons-io to tika-core, the following can be done:
- Replace usages of classes in org.apache.tika.io within non-core components with the corresponding classes in commons-io
- Replace usages of org.apache.tika.io.IOUtils.UTF_8 with java.nio.charset.StandardCharsets.UTF_8 (in all components, including tika-core)
- Replace other uses of String encoding names of standard charsets with their corresponding Charset instances from StandardCharsets (this is logically related to IOUtils, since before Java 7 these constants would have belonged there alongside UTF_8)
[jira] [Issue Comment Deleted] (TIKA-1706) Bring back commons-io to tika-core
[ https://issues.apache.org/jira/browse/TIKA-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yaniv Kunda updated TIKA-1706:
------------------------------
Comment: was deleted

(was: A patch to bring back commons-io to tika-core and replace all formerly inlined classes.)

Bring back commons-io to tika-core
----------------------------------
Key: TIKA-1706
URL: https://issues.apache.org/jira/browse/TIKA-1706
Project: Tika
Issue Type: Improvement
Components: core
Reporter: Yaniv Kunda
Priority: Minor
Fix For: 1.11

TIKA-249 inlined select commons-io classes in order to simplify the dependency tree and save some space.
I believe these arguments are weaker nowadays due to the following concerns:
- Most of the non-core modules already use commons-io, and since tika-core is usually not used by itself, commons-io is already included with it
- Since some modules use both tika-core and commons-io, it's not clear which code should be used
- Having the inlined classes causes more maintenance and/or technical debt (which in turn causes more maintenance)
- Newer commons-io code utilizes newer platform code, e.g. using Charset objects instead of encoding names, being able to use StringBuilder instead of StringBuffer, and so on.
I'll be happy to provide a patch to replace usages of the inlined classes with commons-io classes if this is accepted.
[jira] [Commented] (TIKA-1706) Bring back commons-io to tika-core
[ https://issues.apache.org/jira/browse/TIKA-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14698477#comment-14698477 ]

Yaniv Kunda commented on TIKA-1706:
-----------------------------------
I've separated out all the related changes besides adding commons-io to tika-core, and opened them under TIKA-1710.
In addition, the recently added commons-io-unsafe check has now found a couple more default encoding usages:
tika-core: src\main\java\org\apache\tika\Tika.java
tika-server: src\test\java\org\apache\tika\server\CXFTestBase.java

Bring back commons-io to tika-core
----------------------------------
Key: TIKA-1706
URL: https://issues.apache.org/jira/browse/TIKA-1706
Project: Tika
Issue Type: Improvement
Components: core
Reporter: Yaniv Kunda
Priority: Minor
Fix For: 1.11

TIKA-249 inlined select commons-io classes in order to simplify the dependency tree and save some space.
I believe these arguments are weaker nowadays due to the following concerns:
- Most of the non-core modules already use commons-io, and since tika-core is usually not used by itself, commons-io is already included with it
- Since some modules use both tika-core and commons-io, it's not clear which code should be used
- Having the inlined classes causes more maintenance and/or technical debt (which in turn causes more maintenance)
- Newer commons-io code utilizes newer platform code, e.g. using Charset objects instead of encoding names, being able to use StringBuilder instead of StringBuffer, and so on.
I'll be happy to provide a patch to replace usages of the inlined classes with commons-io classes if this is accepted.
[jira] [Created] (TIKA-1710) Replace usages of classes in org.apache.tika.io with current alternatives
Yaniv Kunda created TIKA-1710:
------------------------------
Summary: Replace usages of classes in org.apache.tika.io with current alternatives
Key: TIKA-1710
URL: https://issues.apache.org/jira/browse/TIKA-1710
Project: Tika
Issue Type: Improvement
Components: batch, cli, core, example, gui, parser, server, translation
Reporter: Yaniv Kunda
Priority: Minor
Fix For: 1.11

Many of the classes in org.apache.tika.io were inlined from commons-io in TIKA-249, but these days most components use commons-io anyway. To clean up the dependencies on org.apache.tika.io in preparation for adding commons-io to tika-core, the following can be done:
- Replace usages of classes in org.apache.tika.io within non-core components with the corresponding classes in commons-io
- Replace usages of org.apache.tika.io.IOUtils.UTF_8 with java.nio.charset.StandardCharsets.UTF_8 (in all components, including tika-core)
- Replace other uses of String encoding names of standard charsets with their corresponding Charset instances from StandardCharsets (this is logically related to IOUtils, since before Java 7 these constants would have belonged there alongside UTF_8)
[jira] [Updated] (TIKA-1706) Bring back commons-io to tika-core
[ https://issues.apache.org/jira/browse/TIKA-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yaniv Kunda updated TIKA-1706:
------------------------------
Attachment: TIKA-1706.patch

A patch to bring back commons-io to tika-core and replace all formerly inlined classes.

Bring back commons-io to tika-core
----------------------------------
Key: TIKA-1706
URL: https://issues.apache.org/jira/browse/TIKA-1706
Project: Tika
Issue Type: Improvement
Components: core
Reporter: Yaniv Kunda
Priority: Minor
Fix For: 1.11
Attachments: TIKA-1706.patch

TIKA-249 inlined select commons-io classes in order to simplify the dependency tree and save some space.
I believe these arguments are weaker nowadays due to the following concerns:
- Most of the non-core modules already use commons-io, and since tika-core is usually not used by itself, commons-io is already included with it
- Since some modules use both tika-core and commons-io, it's not clear which code should be used
- Having the inlined classes causes more maintenance and/or technical debt (which in turn causes more maintenance)
- Newer commons-io code utilizes newer platform code, e.g. using Charset objects instead of encoding names, being able to use StringBuilder instead of StringBuffer, and so on.
I'll be happy to provide a patch to replace usages of the inlined classes with commons-io classes if this is accepted.
[jira] [Commented] (TIKA-1706) Bring back commons-io to tika-core
[ https://issues.apache.org/jira/browse/TIKA-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696025#comment-14696025 ]

Yaniv Kunda commented on TIKA-1706:
-----------------------------------
I agree that, generally, adding an external dependency to a core module might have an impact, but consider that unlike tika-core, commons-io is a true low-level library: it has no compile-time dependencies and is used by 2500 projects in Maven Central alone.
I believe that copying the code of another library, frozen in time (in this case since 2008), hinders innovation and reduces the chance that anyone will utilize new improvements and fixes in newer commons-io, since:
# it is disconnected from tika and requires manual discovery and research (if commons-io is used as an external dependency, it's easy to find deprecated methods and their replacements using static analysis)
# it requires manual maintenance of copying select classes/code
It's not easy to sum up more than 7 years of changes in commons-io, but here are some beneficial changes I found along the way:
- Use org.apache.commons.io.output.ByteArrayOutputStream instead of java.io.ByteArrayOutputStream (this class is actually not that new, but can benefit many uses and save a lot of byte-copying)
- this has been further improved by providing an optimized InputStream from an org.apache.commons.io.output.ByteArrayOutputStream (IO-137)
- Allow using Charset instead of String encoding (IO-318)
- Use StringBuilderWriter instead of StringWriter to avoid unnecessary synchronization (IO-140)
Obviously, I did not propose this change just for the sake of disrupting the peace: I have planned, and already started, a series of patches to utilize newer commons-io, which will follow - each in its own issue - once and if commons-io is added as a dependency to tika-core.

Bring back commons-io to tika-core
----------------------------------
Key: TIKA-1706
URL: https://issues.apache.org/jira/browse/TIKA-1706
Project: Tika
Issue Type: Improvement
Components: core
Reporter: Yaniv Kunda
Priority: Minor
Fix For: 1.11

TIKA-249 inlined select commons-io classes in order to simplify the dependency tree and save some space.
I believe these arguments are weaker nowadays due to the following concerns:
- Most of the non-core modules already use commons-io, and since tika-core is usually not used by itself, commons-io is already included with it
- Since some modules use both tika-core and commons-io, it's not clear which code should be used
- Having the inlined classes causes more maintenance and/or technical debt (which in turn causes more maintenance)
- Newer commons-io code utilizes newer platform code, e.g. using Charset objects instead of encoding names, being able to use StringBuilder instead of StringBuffer, and so on.
I'll be happy to provide a patch to replace usages of the inlined classes with commons-io classes if this is accepted.
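To illustrate the StringBuilderWriter point from the comment above without pulling in commons-io, here is a JDK-only sketch of the idea. BuilderWriter is a hypothetical stand-in written for this example; it is not the commons-io class and makes no claim to match its implementation. The relevant fact is that java.io.StringWriter accumulates its output in a synchronized StringBuffer, whereas a Writer backed by a StringBuilder keeps the buffer itself unsynchronized (some inherited Writer methods still take the stream lock).

```java
import java.io.PrintWriter;
import java.io.Writer;

class UnsyncWriterDemo {

    /** Hypothetical Writer that accumulates characters in an unsynchronized StringBuilder. */
    static final class BuilderWriter extends Writer {
        private final StringBuilder builder = new StringBuilder();

        @Override
        public void write(char[] cbuf, int off, int len) {
            builder.append(cbuf, off, len);
        }

        @Override
        public void write(String str) {
            builder.append(str);
        }

        @Override
        public void flush() {
            // Nothing to flush: all content already lives in the builder.
        }

        @Override
        public void close() {
            // Nothing to release: the writer holds no external resource.
        }

        @Override
        public String toString() {
            return builder.toString();
        }
    }

    /** Formats "name=count" through a PrintWriter layered on the builder-backed writer. */
    static String render(String name, int count) {
        BuilderWriter out = new BuilderWriter();
        try (PrintWriter printer = new PrintWriter(out)) {
            printer.print(name);
            printer.print('=');
            printer.print(count);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(render("pages", 3));
    }
}
```

The writer is only safe for single-threaded use, which is the usual case when assembling a string locally.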
[jira] [Created] (TIKA-1706) Bring back commons-io to tika-core
Yaniv Kunda created TIKA-1706:
------------------------------
Summary: Bring back commons-io to tika-core
Key: TIKA-1706
URL: https://issues.apache.org/jira/browse/TIKA-1706
Project: Tika
Issue Type: Improvement
Components: core
Reporter: Yaniv Kunda
Priority: Minor
Fix For: 1.11

TIKA-249 inlined select commons-io classes in order to simplify the dependency tree and save some space.
I believe these arguments are weaker nowadays due to the following concerns:
- Most of the non-core modules already use commons-io, and since tika-core is usually not used by itself, commons-io is already included with it
- Since some modules use both tika-core and commons-io, it's not clear which code should be used
- Having the inlined classes causes more maintenance and/or technical debt (which in turn causes more maintenance)
- Newer commons-io code utilizes newer platform code, e.g. using Charset objects instead of encoding names, being able to use StringBuilder instead of StringBuffer, and so on.
I'll be happy to provide a patch to replace usages of the inlined classes with commons-io classes if this is accepted.