[jira] [Commented] (TIKA-1706) Bring back commons-io to tika-core

2021-07-19 Thread Yaniv Kunda (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17383397#comment-17383397
 ] 

Yaniv Kunda commented on TIKA-1706:
---

What a blast from the past...

Thanks!

> Bring back commons-io to tika-core
> --
>
> Key: TIKA-1706
> URL: https://issues.apache.org/jira/browse/TIKA-1706
> Project: Tika
>  Issue Type: Improvement
>  Components: core
>Reporter: Yaniv Kunda
>Priority: Minor
> Fix For: 2.0.0
>
> Attachments: TIKA-1706-1.patch, TIKA-1706-2.patch
>
>
> TIKA-249 inlined select commons-io classes in order to simplify the 
> dependency tree and save some space.
> I believe these arguments are weaker nowadays due to the following concerns:
> - Most of the non-core modules already use commons-io, and since tika-core is 
> usually not used by itself, commons-io is already included with it
> - Since some modules use both tika-core and commons-io, it's not clear which 
> code should be used
> - Having the inlined classes causes more maintenance and/or technical debt 
> (which in turn causes more maintenance)
> - Newer commons-io code utilizes newer platform code, e.g. using Charset 
> objects instead of encoding names, being able to use StringBuilder instead of 
> StringBuffer, and so on.
> I'll be happy to provide a patch to replace usages of the inlined classes 
> with commons-io classes if this is accepted.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (TIKA-1672) Integrate tika-java7 component

2015-10-18 Thread Yaniv Kunda (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14962596#comment-14962596
 ] 

Yaniv Kunda commented on TIKA-1672:
---

Here are some names I suggested:
- tika-java7-spi
- tika-java7-filetypedetector
- tika-java7-detector-spi


> Integrate tika-java7 component
> --
>
> Key: TIKA-1672
> URL: https://issues.apache.org/jira/browse/TIKA-1672
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tyler Palsulich
> Fix For: 1.12
>
>
> Code requiring Java 7 doesn't need to be in a separate module now that 
> TIKA-1536 (upgrade to Java 7) is done.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TIKA-1706) Bring back commons-io to tika-core

2015-10-01 Thread Yaniv Kunda (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaniv Kunda updated TIKA-1706:
--
Attachment: TIKA-1706-2.patch
TIKA-1706-1.patch

A proposed patch per [~grossws]'s suggestion from the dev mailing list -
The first patch contains the following:
- creation of the secondary jar using maven-shade-plugin:
-- used the *uber* classifier using 
alternatives: shaded, nodep, all, etc.
Which one is best?
-- commons-io shaded under 
{{shaded.commons-io.$\{commons.io.version\}.org.apache.commons.io}} to avoid 
potential conflicts with other commons-io-shading dependencies e.g. as in 
org.ops4j.pax.url:pax-url-aether:2.3.0
-- automatic removal of unused classes using 
- deprecated all classes that were copied from commons-io and modified them to 
extend their new counterparts 
- deprecated all constructors
- removed all identical or functionally identical methods
- modified all remaining methods to call alternative existing jdk/commons-io 
methods, deprecated them and referred to the used alternatives
_*Note: this was done only in IOUtils, where many methods that have the same 
signature as the ones in commons-io were modified along the way to use UTF-8 
instead of the platform default._
- all things should remain backward-compatible, except one: 
org.apache.tika.io.TaggedIOException(IOException, Object) will now throw a 
ClassCastException if the Object is not Serializable

The second patch contains trivial import changes in tika-core from 
org.apache.tika.io to org.apache.commons.io
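
The deprecate-and-delegate approach described in this comment could look roughly like the sketch below. `NewIoUtils` and `LegacyIoUtils` are hypothetical stand-ins for a commons-io class and its inlined copy; they are not taken from the attached patches and only illustrate the pattern.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for the upstream (commons-io style) utility class.
class NewIoUtils {
    // Charset-aware variant: takes a Charset object, not an encoding name.
    static String toString(InputStream in, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder(); // StringBuilder rather than StringBuffer
        Reader reader = new InputStreamReader(in, charset);
        char[] buf = new char[4096];
        int n;
        while ((n = reader.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }
}

/**
 * Hypothetical stand-in for an inlined copy that is kept for compatibility.
 *
 * @deprecated use {@link NewIoUtils} instead
 */
@Deprecated
class LegacyIoUtils extends NewIoUtils {
    /**
     * Keeps the old signature but always decodes as UTF-8.
     *
     * @deprecated use {@link NewIoUtils#toString(InputStream, Charset)}
     */
    @Deprecated
    static String toString(InputStream in) throws IOException {
        return NewIoUtils.toString(in, StandardCharsets.UTF_8);
    }
}
```

Existing callers of the legacy method keep compiling, while the deprecation warning points them at the charset-aware replacement.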

> Bring back commons-io to tika-core
> --
>
> Key: TIKA-1706
> URL: https://issues.apache.org/jira/browse/TIKA-1706
> Project: Tika
>  Issue Type: Improvement
>  Components: core
>Reporter: Yaniv Kunda
>Priority: Minor
> Fix For: 1.11
>
> Attachments: TIKA-1706-1.patch, TIKA-1706-2.patch
>
>
> TIKA-249 inlined select commons-io classes in order to simplify the 
> dependency tree and save some space.
> I believe these arguments are weaker nowadays due to the following concerns:
> - Most of the non-core modules already use commons-io, and since tika-core is 
> usually not used by itself, commons-io is already included with it
> - Since some modules use both tika-core and commons-io, it's not clear which 
> code should be used
> - Having the inlined classes causes more maintenance and/or technical debt 
> (which in turn causes more maintenance)
> - Newer commons-io code utilizes newer platform code, e.g. using Charset 
> objects instead of encoding names, being able to use StringBuilder instead of 
> StringBuffer, and so on.
> I'll be happy to provide a patch to replace usages of the inlined classes 
> with commons-io classes if this is accepted.





[jira] [Updated] (TIKA-1744) Use java.nio.file.Path in TikaInputStream

2015-09-30 Thread Yaniv Kunda (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaniv Kunda updated TIKA-1744:
--
Attachment: TIKA-1744-2.patch

Additional minor patch:
- Corrected javadoc links
- Added {{@Deprecated}} annotations to methods where {{@deprecated}} javadoc 
tags were added
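
For readers unfamiliar with the distinction, the javadoc tag only documents the deprecation, while the annotation is what the compiler and reflection see. The sketch below uses a hypothetical `ResourceHolder` class (not a Tika class) to show both used together.

```java
import java.io.File;
import java.lang.reflect.Method;
import java.nio.file.Path;

// Hypothetical class; it only illustrates pairing the javadoc tag with the annotation.
class ResourceHolder {
    private final Path path;

    ResourceHolder(Path path) {
        this.path = path;
    }

    Path getPath() {
        return path;
    }

    /**
     * @deprecated use {@link #getPath()} instead
     */
    @Deprecated
    File getFile() {
        return path.toFile();
    }

    // True if the named no-arg method carries the @Deprecated annotation at runtime.
    static boolean isMarkedDeprecated(String methodName) {
        try {
            Method m = ResourceHolder.class.getDeclaredMethod(methodName);
            return m.isAnnotationPresent(Deprecated.class);
        } catch (NoSuchMethodException e) {
            throw new IllegalArgumentException(e);
        }
    }
}
```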

> Use java.nio.file.Path in TikaInputStream
> -
>
> Key: TIKA-1744
> URL: https://issues.apache.org/jira/browse/TIKA-1744
> Project: Tika
>  Issue Type: Sub-task
>  Components: core
>Reporter: Yaniv Kunda
>Assignee: Tim Allison
>Priority: Minor
>  Labels: java7
> Fix For: 1.11
>
> Attachments: TIKA-1744-2.patch, TIKA-1744.patch
>
>
> This will provide support for the new api for users who need it, and provide 
> better information in I/O operations, e.g. detailed exception if file cannot 
> be read.
> - used Path and methods in java.nio.file.Files internally 
> - add getPath() method as the counterpart to getFile()
> - modified test to use 





[jira] [Commented] (TIKA-1757) tika-batch tests fail on systems with whitespace or special chars in folder name

2015-09-30 Thread Yaniv Kunda (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14938914#comment-14938914
 ] 

Yaniv Kunda commented on TIKA-1757:
---

Also, regarding the badness of {{URL#getFile()}} - on Windows machines it 
returns a String starting with a slash - e.g. {{/C:\File.txt}}.
For some reason, when this is passed to a {{File}} constructor, it is handled in a 
lenient manner and the leading slash disappears, whereas 
{{Paths.get(String)}} fails with an {{InvalidPathException}}.


> tika-batch tests fail on systems with whitespace or special chars in folder 
> name
> 
>
> Key: TIKA-1757
> URL: https://issues.apache.org/jira/browse/TIKA-1757
> Project: Tika
>  Issue Type: Bug
>Reporter: Uwe Schindler
>Assignee: Tim Allison
> Attachments: TIKA-1757.patch
>
>
> This is one problem that forbiddenapis does not catch, because the method 
> affected has valid use cases: {{URL#getFile()}} or {{URL#getPath()}} both 
> return the URL path, which should never be treated as a file system path (for 
> file: URLs). This breaks as soon as the path contains special characters 
> which may not be part of a URL. getFile() and getPath() return the encoded path.
> The correct way to transform a file URL to a file is: {{new 
> File(url.toURI())}}. See also the list of "bad stuff" as listed by the Maven 
> community for Mojos/Plugins.
> In fact the affected test should not use a file at all. Instead it should use 
> {{Class#getResourceAsStream()}}.





[jira] [Commented] (TIKA-1757) tika-batch tests fail on systems with whitespace or special chars in folder name

2015-09-30 Thread Yaniv Kunda (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14938908#comment-14938908
 ] 

Yaniv Kunda commented on TIKA-1757:
---

If one needs a java.nio.file.Path, {{Paths.get(url.toURI())}} can be used 
instead.
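
A minimal sketch of the difference, assuming a file whose name contains a space: `URL#getFile()` returns the percent-encoded path, while going through `toURI()` decodes it. `UrlToPathDemo` and the temp file name are made up for illustration, and the demo leaves its temp directory behind.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URISyntaxException;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class UrlToPathDemo {
    // Creates "<tempdir>/with space.txt" and returns its file: URL.
    static URL urlOfFileWithSpace() {
        try {
            Path dir = Files.createTempDirectory("url-demo");
            Path file = Files.createFile(dir.resolve("with space.txt"));
            return file.toUri().toURL();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Converts via URI, so percent-encoding in the URL is decoded properly.
    static Path toPath(URL url) {
        try {
            return Paths.get(url.toURI());
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }
}
```

With such a URL, `url.getFile()` contains `%20`, so `new File(url.getFile())` points at a name that does not exist on disk, whereas `UrlToPathDemo.toPath(url)` resolves to the real file.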

> tika-batch tests fail on systems with whitespace or special chars in folder 
> name
> 
>
> Key: TIKA-1757
> URL: https://issues.apache.org/jira/browse/TIKA-1757
> Project: Tika
>  Issue Type: Bug
>Reporter: Uwe Schindler
>Assignee: Tim Allison
> Attachments: TIKA-1757.patch
>
>
> This is one problem that forbiddenapis does not catch, because the method 
> affected has valid use cases: {{URL#getFile()}} or {{URL#getPath()}} both 
> return the URL path, which should never be treated as a file system path (for 
> file: URLs). This breaks as soon as the path contains special characters 
> which may not be part of a URL. getFile() and getPath() return the encoded path.
> The correct way to transform a file URL to a file is: {{new 
> File(url.toURI())}}. See also the list of "bad stuff" as listed by the Maven 
> community for Mojos/Plugins.
> In fact the affected test should not use a file at all. Instead it should use 
> {{Class#getResourceAsStream()}}.





[jira] [Updated] (TIKA-1751) Use java.nio.file.Path in TikaConfig

2015-09-30 Thread Yaniv Kunda (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaniv Kunda updated TIKA-1751:
--
Attachment: TIKA-1751.patch

Updated patch to latest changes.

> Use java.nio.file.Path in TikaConfig
> 
>
> Key: TIKA-1751
> URL: https://issues.apache.org/jira/browse/TIKA-1751
> Project: Tika
>  Issue Type: Sub-task
>  Components: config
>Reporter: Yaniv Kunda
>Priority: Minor
>  Labels: java7
> Fix For: 1.11
>
> Attachments: TIKA-1751.patch
>
>
> Provide constructors accepting java.nio.file.Path





[jira] [Updated] (TIKA-1751) Use java.nio.file.Path in TikaConfig

2015-09-30 Thread Yaniv Kunda (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaniv Kunda updated TIKA-1751:
--
Attachment: (was: TIKA-1751.patch)

> Use java.nio.file.Path in TikaConfig
> 
>
> Key: TIKA-1751
> URL: https://issues.apache.org/jira/browse/TIKA-1751
> Project: Tika
>  Issue Type: Sub-task
>  Components: config
>Reporter: Yaniv Kunda
>Priority: Minor
>  Labels: java7
> Fix For: 1.11
>
> Attachments: TIKA-1751.patch
>
>
> Provide constructors accepting java.nio.file.Path





[jira] [Commented] (TIKA-1758) BatchCommandLineBuilder fails on systems with whitespace in path

2015-09-30 Thread Yaniv Kunda (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14938972#comment-14938972
 ] 

Yaniv Kunda commented on TIKA-1758:
---

Not a hard requirement - can be avoided by converting a Path back to a File (or 
to a String).

> BatchCommandLineBuilder fails on systems with whitespace in path
> 
>
> Key: TIKA-1758
> URL: https://issues.apache.org/jira/browse/TIKA-1758
> Project: Tika
>  Issue Type: Bug
>  Components: cli
>Reporter: Uwe Schindler
> Attachments: TIKA-1758.patch
>
>
> All tests for the CLI module fail with errors like this:
> {noformat}
> Tests run: 6, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.048 sec <<< FAILURE! - in org.apache.tika.cli.TikaCLIBatchCommandLineTest
> testTwoDirsNoFlags(org.apache.tika.cli.TikaCLIBatchCommandLineTest)  Time elapsed: 0.026 sec  <<< ERROR!
> java.nio.file.InvalidPathException: Illegal char <"> at index 0: "C:\Users\Uwe Schindler\Projects\TIKA\svn\tika-app\testInput"
> at sun.nio.fs.WindowsPathParser.normalize(WindowsPathParser.java:182)
> at sun.nio.fs.WindowsPathParser.parse(WindowsPathParser.java:153)
> at sun.nio.fs.WindowsPathParser.parse(WindowsPathParser.java:77)
> at sun.nio.fs.WindowsPath.parse(WindowsPath.java:94)
> at sun.nio.fs.WindowsFileSystem.getPath(WindowsFileSystem.java:255)
> at java.nio.file.Paths.get(Paths.java:84)
> at org.apache.tika.cli.BatchCommandLineBuilder.translateCommandLine(BatchCommandLineBuilder.java:137)
> at org.apache.tika.cli.BatchCommandLineBuilder.build(BatchCommandLineBuilder.java:51)
> at org.apache.tika.cli.TikaCLIBatchCommandLineTest.testTwoDirsNoFlags(TikaCLIBatchCommandLineTest.java:127)
> {noformat}
> The reason is that BatchCommandLineBuilder adds quotes for unknown reasons!? 
> If you use ProcessBuilder you don't need that! Not sure what this should do, 
> but the problem is: The first argument (the executable) contains quotes after 
> the method transformed it and breaks the test.
> I have no idea how to fix this, but the quotes should not be in a String[] 
> command line at all.
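
To illustrate the point about quoting, here is a small sketch of building a command as a list of arguments with no added quotes. `CommandLineSketch` is hypothetical and the `-inputDir` flag is used only as an illustrative argument name; this is not the actual BatchCommandLineBuilder code.

```java
import java.util.ArrayList;
import java.util.List;

class CommandLineSketch {
    // Builds the argument list without adding quotes. Each list element is
    // handed to ProcessBuilder as a single argument, spaces included.
    static List<String> buildCommand(String executable, String inputDir) {
        List<String> cmd = new ArrayList<>();
        cmd.add(executable);
        cmd.add("-inputDir"); // illustrative flag name
        cmd.add(inputDir);    // may contain whitespace; left unquoted
        return cmd;
    }
}
```

A caller would pass the result straight to `new ProcessBuilder(cmd)`; because the path stays a plain string, it can also be given to `Paths.get(String)` without tripping over a leading quote character.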





[jira] [Updated] (TIKA-1758) BatchCommandLineBuilder fails on systems with whitespace in path

2015-09-30 Thread Yaniv Kunda (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaniv Kunda updated TIKA-1758:
--
Attachment: TIKA-1758.patch

A patch containing a fix (and more File->Path migration), requires TIKA-1751.

> BatchCommandLineBuilder fails on systems with whitespace in path
> 
>
> Key: TIKA-1758
> URL: https://issues.apache.org/jira/browse/TIKA-1758
> Project: Tika
>  Issue Type: Bug
>  Components: cli
>Reporter: Uwe Schindler
> Attachments: TIKA-1758.patch
>
>
> All tests for the CLI module fail with errors like this:
> {noformat}
> Tests run: 6, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.048 sec <<< FAILURE! - in org.apache.tika.cli.TikaCLIBatchCommandLineTest
> testTwoDirsNoFlags(org.apache.tika.cli.TikaCLIBatchCommandLineTest)  Time elapsed: 0.026 sec  <<< ERROR!
> java.nio.file.InvalidPathException: Illegal char <"> at index 0: "C:\Users\Uwe Schindler\Projects\TIKA\svn\tika-app\testInput"
> at sun.nio.fs.WindowsPathParser.normalize(WindowsPathParser.java:182)
> at sun.nio.fs.WindowsPathParser.parse(WindowsPathParser.java:153)
> at sun.nio.fs.WindowsPathParser.parse(WindowsPathParser.java:77)
> at sun.nio.fs.WindowsPath.parse(WindowsPath.java:94)
> at sun.nio.fs.WindowsFileSystem.getPath(WindowsFileSystem.java:255)
> at java.nio.file.Paths.get(Paths.java:84)
> at org.apache.tika.cli.BatchCommandLineBuilder.translateCommandLine(BatchCommandLineBuilder.java:137)
> at org.apache.tika.cli.BatchCommandLineBuilder.build(BatchCommandLineBuilder.java:51)
> at org.apache.tika.cli.TikaCLIBatchCommandLineTest.testTwoDirsNoFlags(TikaCLIBatchCommandLineTest.java:127)
> {noformat}
> The reason is that BatchCommandLineBuilder adds quotes for unknown reasons!? 
> If you use ProcessBuilder you don't need that! Not sure what this should do, 
> but the problem is: The first argument (the executable) contains quotes after 
> the method transformed it and breaks the test.
> I have no idea how to fix this, but the quotes should not be in a String[] 
> command line at all.





[jira] [Updated] (TIKA-1750) CachedTranslator.isAvailable() throws NPE when underlying translator is null

2015-09-24 Thread Yaniv Kunda (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaniv Kunda updated TIKA-1750:
--
Attachment: TIKA-1750.patch

> CachedTranslator.isAvailable() throws NPE when underlying translator is null
> 
>
> Key: TIKA-1750
> URL: https://issues.apache.org/jira/browse/TIKA-1750
> Project: Tika
>  Issue Type: Bug
>  Components: translation
>Reporter: Yaniv Kunda
>Priority: Minor
> Fix For: 1.11
>
> Attachments: TIKA-1750.patch
>
>
> When initialized with no underlying translator, CachedTranslator throws an NPE 
> when isAvailable() is called. Although a user should initialize the translator 
> (as the default constructor's javadoc says), this doesn't always happen, 
> and since CachedTranslator is defined as a registered service in 
> tika-translate\src\main\resources\META-INF\services\org.apache.tika.language.translate.Translator,
>  it normally doesn't (causing DumpTikaConfigExampleTest to fail).
> Since CachedTranslator is returning the source text when calling 
> translate(String, String, String) when the translator is null, it makes sense 
> that isAvailable returns false under the same condition.





[jira] [Created] (TIKA-1750) CachedTranslator.isAvailable() throws NPE when underlying translator is null

2015-09-24 Thread Yaniv Kunda (JIRA)
Yaniv Kunda created TIKA-1750:
-

 Summary: CachedTranslator.isAvailable() throws NPE when underlying 
translator is null
 Key: TIKA-1750
 URL: https://issues.apache.org/jira/browse/TIKA-1750
 Project: Tika
  Issue Type: Bug
  Components: translation
Reporter: Yaniv Kunda
Priority: Minor
 Fix For: 1.11


When initialized with no underlying translator, CachedTranslator throws an NPE 
when isAvailable() is called. Although a user should initialize the translator 
(as the default constructor's javadoc says), this doesn't always happen, and 
since CachedTranslator is defined as a registered service in 
tika-translate\src\main\resources\META-INF\services\org.apache.tika.language.translate.Translator,
 it normally doesn't (causing DumpTikaConfigExampleTest to fail).

Since CachedTranslator is returning the source text when calling 
translate(String, String, String) when the translator is null, it makes sense 
that isAvailable returns false under the same condition.
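
The proposed behavior can be sketched as follows. `SimpleTranslator` and `NullSafeTranslatorWrapper` are hypothetical, simplified types written for this illustration (caching is omitted); they are not the real Tika interface or the attached patch.

```java
// Hypothetical, simplified interface; not the real Tika Translator interface.
interface SimpleTranslator {
    String translate(String text, String sourceLanguage, String targetLanguage);

    boolean isAvailable();
}

// Hypothetical wrapper illustrating the proposed null handling.
class NullSafeTranslatorWrapper implements SimpleTranslator {
    private final SimpleTranslator delegate; // may be null when not initialized

    NullSafeTranslatorWrapper(SimpleTranslator delegate) {
        this.delegate = delegate;
    }

    @Override
    public String translate(String text, String sourceLanguage, String targetLanguage) {
        if (delegate == null) {
            return text; // no translator: return the source text unchanged
        }
        return delegate.translate(text, sourceLanguage, targetLanguage);
    }

    @Override
    public boolean isAvailable() {
        // Same condition as translate(): unavailable rather than an NPE.
        return delegate != null && delegate.isAvailable();
    }
}
```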





[jira] [Created] (TIKA-1751) Use java.nio.file.Path in TikaConfig

2015-09-24 Thread Yaniv Kunda (JIRA)
Yaniv Kunda created TIKA-1751:
-

 Summary: Use java.nio.file.Path in TikaConfig
 Key: TIKA-1751
 URL: https://issues.apache.org/jira/browse/TIKA-1751
 Project: Tika
  Issue Type: Sub-task
  Components: config
Reporter: Yaniv Kunda
Priority: Minor
 Fix For: 1.11


Provide constructors accepting java.nio.file.Path





[jira] [Updated] (TIKA-1751) Use java.nio.file.Path in TikaConfig

2015-09-24 Thread Yaniv Kunda (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaniv Kunda updated TIKA-1751:
--
Attachment: TIKA-1751.patch

> Use java.nio.file.Path in TikaConfig
> 
>
> Key: TIKA-1751
> URL: https://issues.apache.org/jira/browse/TIKA-1751
> Project: Tika
>  Issue Type: Sub-task
>  Components: config
>Reporter: Yaniv Kunda
>Priority: Minor
> Fix For: 1.11
>
> Attachments: TIKA-1751.patch
>
>
> Provide constructors accepting java.nio.file.Path





[jira] [Created] (TIKA-1752) Use java.nio.file.Path in org.apache.tika.detect

2015-09-24 Thread Yaniv Kunda (JIRA)
Yaniv Kunda created TIKA-1752:
-

 Summary: Use java.nio.file.Path in org.apache.tika.detect
 Key: TIKA-1752
 URL: https://issues.apache.org/jira/browse/TIKA-1752
 Project: Tika
  Issue Type: Sub-task
  Components: detector
Reporter: Yaniv Kunda
Priority: Minor
 Fix For: 1.11


Add constructors and methods accepting java.nio.file.Path to 
TrainedModelDetector & Son.





[jira] [Updated] (TIKA-1734) Use java.nio.file.Path in TemporaryResources

2015-09-24 Thread Yaniv Kunda (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaniv Kunda updated TIKA-1734:
--
Labels: java7  (was: )

> Use java.nio.file.Path in TemporaryResources
> 
>
> Key: TIKA-1734
> URL: https://issues.apache.org/jira/browse/TIKA-1734
> Project: Tika
>  Issue Type: Sub-task
>  Components: core
>Reporter: Yaniv Kunda
>Priority: Minor
>  Labels: java7
> Fix For: 1.11
>
> Attachments: TIKA-1734.patch
>
>
> This will provide support for the new api for users who need it, and provide 
> better information in I/O operations, e.g. detailed exception if temporary 
> file deletion fails.
> - used Path and methods in java.nio.file.Files internally 
> - add setTemporaryFileDirectory(Path) method
> - add createTempFile() method (mimicking Files.createTempFile)
> - add unit test for proper deletion of temp files





[jira] [Updated] (TIKA-1752) Use java.nio.file.Path in org.apache.tika.detect

2015-09-24 Thread Yaniv Kunda (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaniv Kunda updated TIKA-1752:
--
Labels: java7  (was: )

> Use java.nio.file.Path in org.apache.tika.detect
> 
>
> Key: TIKA-1752
> URL: https://issues.apache.org/jira/browse/TIKA-1752
> Project: Tika
>  Issue Type: Sub-task
>  Components: detector
>Reporter: Yaniv Kunda
>Priority: Minor
>  Labels: java7
> Fix For: 1.11
>
> Attachments: TIKA-1752.patch
>
>
> Add constructors and methods accepting java.nio.file.Path to 
> TrainedModelDetector & Son.





[jira] [Updated] (TIKA-1746) modify TikaFileTypeDetector to use new detect method accepting java.nio.file.Path

2015-09-24 Thread Yaniv Kunda (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaniv Kunda updated TIKA-1746:
--
Labels: java7  (was: )

> modify TikaFileTypeDetector to use new detect method accepting 
> java.nio.file.Path
> -
>
> Key: TIKA-1746
> URL: https://issues.apache.org/jira/browse/TIKA-1746
> Project: Tika
>  Issue Type: Sub-task
>  Components: detector
>Reporter: Yaniv Kunda
>Priority: Minor
>  Labels: java7
> Fix For: 1.11
>
> Attachments: TIKA-1746.patch
>
>
> Utilize the new org.apache.tika.Tika.detect(Path) method





[jira] [Updated] (TIKA-1744) Use java.nio.file.Path in TikaInputStream

2015-09-22 Thread Yaniv Kunda (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaniv Kunda updated TIKA-1744:
--
Attachment: TIKA-1744.patch

> Use java.nio.file.Path in TikaInputStream
> -
>
> Key: TIKA-1744
> URL: https://issues.apache.org/jira/browse/TIKA-1744
> Project: Tika
>  Issue Type: Sub-task
>  Components: core
>Reporter: Yaniv Kunda
>Priority: Minor
> Fix For: 1.11
>
> Attachments: TIKA-1744.patch
>
>
> This will provide support for the new api for users who need it, and provide 
> better information in I/O operations, e.g. detailed exception if file cannot 
> be read.
> - used Path and methods in java.nio.file.Files internally 
> - add getPath() method as the counterpart to getFile()
> - modified test to use 





[jira] [Updated] (TIKA-1745) Add methods accepting java.nio.file.Path to org.apache.tika.Tika and org.apache.tika.parser.ParsingReader

2015-09-22 Thread Yaniv Kunda (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaniv Kunda updated TIKA-1745:
--
Attachment: TIKA-1745.patch

> Add methods accepting java.nio.file.Path to org.apache.tika.Tika and 
> org.apache.tika.parser.ParsingReader
> -
>
> Key: TIKA-1745
> URL: https://issues.apache.org/jira/browse/TIKA-1745
> Project: Tika
>  Issue Type: Sub-task
>  Components: core
>Reporter: Yaniv Kunda
>Priority: Minor
> Fix For: 1.11
>
> Attachments: TIKA-1745.patch
>
>
> Add methods accepting java.nio.file.Path to complement those accepting 
> java.io.File, using the new methods in TikaInputStream or java.nio.file.Files





[jira] [Updated] (TIKA-1746) modify TikaFileTypeDetector to use new detect method accepting java.nio.file.Path

2015-09-22 Thread Yaniv Kunda (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaniv Kunda updated TIKA-1746:
--
Attachment: TIKA-1746.patch

> modify TikaFileTypeDetector to use new detect method accepting 
> java.nio.file.Path
> -
>
> Key: TIKA-1746
> URL: https://issues.apache.org/jira/browse/TIKA-1746
> Project: Tika
>  Issue Type: Sub-task
>  Components: detector
>Reporter: Yaniv Kunda
>Priority: Minor
> Fix For: 1.11
>
> Attachments: TIKA-1746.patch
>
>
> Utilize the new org.apache.tika.Tika.detect(Path) method
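
As a rough sketch of the idea, assuming the detector is an implementation of the JDK's `java.nio.file.spi.FileTypeDetector`: the `Path` it receives can be handed directly to a Path-based detection function, with no conversion to `File`. `DelegatingFileTypeDetector` and its `detect` function are hypothetical stand-ins, not the Tika classes.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.spi.FileTypeDetector;
import java.util.function.Function;

// Hypothetical detector that forwards the Path to a Path-based detect function.
class DelegatingFileTypeDetector extends FileTypeDetector {
    private final Function<Path, String> detect; // stand-in for a detect(Path) method

    DelegatingFileTypeDetector(Function<Path, String> detect) {
        this.detect = detect;
    }

    @Override
    public String probeContentType(Path path) throws IOException {
        // The Path is passed through as-is; a null result means "type unknown".
        return detect.apply(path);
    }
}
```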





[jira] [Updated] (TIKA-1738) ForkClient does not always delete temporary bootstrap jar

2015-09-20 Thread Yaniv Kunda (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaniv Kunda updated TIKA-1738:
--
Attachment: TIKA-1738.patch

This patch moves the bootstrap jar creation to be static and happen only once 
in the class initialization.
Deletion is done using a single shutdown hook, which will *probably* do its 
job, if no handle created by a forked process still references the file - i.e. 
if enough time has passed between the destruction of the last forked process 
and the JVM shutdown.

It also uses java.nio.file instead of the old java.io package.

Added benefit: performance is better since forked processes do not need to create 
the bootstrap jar all over again.
Added drawback: if the temp jar is deleted between forks, future forks would fail.
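
The create-once, delete-at-shutdown pattern described in this comment might be sketched as below. `BootstrapJarHolder` is a hypothetical class, not the patched ForkClient, and the step that writes the jar contents is omitted.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical holder class illustrating the pattern only.
class BootstrapJarHolder {
    // Created once, when the class is initialized.
    private static final Path BOOTSTRAP_JAR = createOnce();

    private static Path createOnce() {
        try {
            final Path jar = Files.createTempFile("bootstrap-", ".jar");
            // A single shutdown hook attempts the deletion when the JVM exits.
            Runtime.getRuntime().addShutdownHook(new Thread(() -> {
                try {
                    Files.deleteIfExists(jar);
                } catch (IOException e) {
                    // Best effort: the file may still be held open elsewhere.
                }
            }));
            return jar;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Every caller gets the same file; nothing is re-created per fork.
    static Path bootstrapJar() {
        return BOOTSTRAP_JAR;
    }
}
```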

> ForkClient does not always delete temporary bootstrap jar
> -
>
> Key: TIKA-1738
> URL: https://issues.apache.org/jira/browse/TIKA-1738
> Project: Tika
>  Issue Type: Bug
>  Components: core
> Environment: Windows 10
>Reporter: Yaniv Kunda
>Priority: Minor
> Fix For: 1.11
>
> Attachments: TIKA-1738.patch
>
>
> ForkClient creates a new temporary bootstrap jar each time it's instantiated, 
> and tries to delete it in the {{close()}} method, after destroying the 
> process.
> Possibly a Windows-specific behavior, the OS seems to still hold a handle to 
> the file a bit after the process is destroyed, causing the delete() method to 
> do nothing.
> This is recreated by simply running ForkParserTest on my machine.
> In a long-running process, this could fill the temp folder with many bootstrap 
> jars that will never be deleted.





[jira] [Created] (TIKA-1734) Use java.nio.file.Path in TemporaryResources

2015-09-16 Thread Yaniv Kunda (JIRA)
Yaniv Kunda created TIKA-1734:
-

 Summary: Use java.nio.file.Path in TemporaryResources
 Key: TIKA-1734
 URL: https://issues.apache.org/jira/browse/TIKA-1734
 Project: Tika
  Issue Type: Sub-task
  Components: core
Reporter: Yaniv Kunda
Priority: Minor
 Fix For: 1.11


This will provide support for the new api for users who need it, and provide 
better information in I/O operations, e.g. detailed exception if temporary file 
deletion fails.

- used Path and methods in java.nio.file.Files internally 
- add setTemporaryFileDirectory(Path) method
- add createTempFile() method (mimicking Files.createTempFile)
- add unit test for proper deletion of temp files
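
As an illustration of the "better information" point, the sketch below contrasts `File#delete()`, which only returns `false` on failure, with `Files.delete(Path)`, which throws an exception naming the cause. `DeleteDiagnostics` is a hypothetical helper, and a non-empty directory is used simply because it is an easy way to make deletion fail.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

class DeleteDiagnostics {
    // Returns a temp directory containing one file, so it cannot be deleted directly.
    static Path nonEmptyTempDir() {
        try {
            Path dir = Files.createTempDirectory("delete-demo");
            Files.createFile(dir.resolve("child.tmp"));
            return dir;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Old API: failure is reported only as 'false', with no reason given.
    static boolean legacyDelete(Path p) {
        return p.toFile().delete();
    }

    // New API: returns null on success, or the exception type describing the failure.
    static String nioDelete(Path p) {
        try {
            Files.delete(p);
            return null;
        } catch (IOException e) {
            return e.getClass().getSimpleName();
        }
    }
}
```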





[jira] [Updated] (TIKA-1734) Use java.nio.file.Path in TemporaryResources

2015-09-16 Thread Yaniv Kunda (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaniv Kunda updated TIKA-1734:
--
Attachment: TIKA-1734.patch

> Use java.nio.file.Path in TemporaryResources
> 
>
> Key: TIKA-1734
> URL: https://issues.apache.org/jira/browse/TIKA-1734
> Project: Tika
>  Issue Type: Sub-task
>  Components: core
>Reporter: Yaniv Kunda
>Priority: Minor
> Fix For: 1.11
>
> Attachments: TIKA-1734.patch
>
>
> This will provide support for the new api for users who need it, and provide 
> better information in I/O operations, e.g. detailed exception if temporary 
> file deletion fails.
> - used Path and methods in java.nio.file.Files internally 
> - add setTemporaryFileDirectory(Path) method
> - add createTempFile() method (mimicking Files.createTempFile)
> - add unit test for proper deletion of temp files





[jira] [Updated] (TIKA-1726) Augment public methods that use a java.io.File with methods that use a java.nio.file.Path

2015-09-05 Thread Yaniv Kunda (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaniv Kunda updated TIKA-1726:
--
Description: 
In light of Java 7 already EOL, it's high time we add support for the new 
java.nio.file.Path class introduced with it, which, together with support 
methods in java.nio.file.Files and others, provide a better file I/O framework 
than java.io.File.

In just two cases, we have public methods in tika that only return a File 
object, and cannot be overloaded, so a different name for the new method must 
be created:
- {{org.apache.tika.io.TemporaryResources#createTemporaryFile()}}
_Suggestions:_
-- addTemporaryFile
-- addTempFile
-- createTempFile
- {{org.apache.tika.io.TikaInputStream#getFile()}}
_Suggestions:_
-- asFile
-- toPath
-- getPath

In other cases, the methods accept a File as an argument, and should remain as 
tika users might be using them - so an overloaded method that accepts a Path 
instead should be added, referencing the new method from the old one 
(using the @see tag) and deprecating the old method until an unknown tika major 
release.
Here is the full list of other methods:
_tika-app:_
- {{org.apache.tika.gui.TikaGUI#openFile(File)}}

_tika-batch:_
- {{org.apache.tika.batch.fs.FSUtil#getOutputFile(File, String, 
HANDLE_EXISTING, String)}}
- {{org.apache.tika.util.PropsUtil#getFile(String, File)}}
- {{org.apache.tika.batch.fs.FSDirectoryCrawler}} constructors
- 
{{org.apache.tika.batch.fs.FSDirectoryCrawler#handleFirstFileInDirectory(File)}}
- {{org.apache.tika.batch.fs.FSFileResource}} constructor
- {{org.apache.tika.batch.fs.FSListCrawler}} constructor
- {{org.apache.tika.batch.fs.FSOutputStreamFactory}} constructor
- {{org.apache.tika.batch.fs.FSUtil#checkThisIsAncestorOfOrSameAsThat(File, 
File)}}
- {{org.apache.tika.batch.fs.FSUtil#checkThisIsAncestorOfThat(File, File)}}
- {{org.apache.tika.batch.fs.strawman.StrawManTikaAppDriver}} constructor

_tika-core:_
- {{org.apache.tika.Tika#detect(File)}}
- {{org.apache.tika.Tika#parse(File)}}
- {{org.apache.tika.Tika#parseToString(File)}}
- {{org.apache.tika.config.TikaConfig}} constructors
- {{org.apache.tika.detect.NNExampleModelDetector}} constructor
- {{org.apache.tika.detect.TrainedModelDetector#loadDefaultModels(File)}}
- {{org.apache.tika.io.TemporaryResources#setTemporaryFileDirectory(File)}}
- {{org.apache.tika.io.TikaInputStream#get(File)}}
- {{org.apache.tika.io.TikaInputStream#get(File, Metadata)}}

_tika-parsers:_
- {{org.apache.tika.parser.ParsingReader}} constructor
- {{org.apache.tika.parser.image.ImageMetadataExtractor#parseJpeg(File)}}
- {{org.apache.tika.parser.image.ImageMetadataExtractor#parseWebP(File)}}
- {{org.apache.tika.parser.mp4.DirectFileReadDataSource}} constructor

_tika-translate:_
- 
{{org.apache.tika.language.translate.ExternalTranslator#runAndGetOutput(String, 
String[], File)}}

Due to lack of evidence, all public methods in public non-test classes (and not 
in tika-example) are deemed part of a public API - although there's no formal 
definition of such.
If anyone knows of a public method which isn't accessed publicly and can be 
defined as package-private, or for another reason, please comment.


  was:
In light of Java 7 already EOL, it's high time we add support for the new 
java.nio.file.Path class introduced with it, which, together with support 
methods in java.nio.file.Files and others, provide a better file I/O framework 
than java.io.File.

In just two cases, we have public methods in tika that only return a File 
object, and cannot be overloaded, so a different name for the new method must 
be created:
- {{org.apache.tika.io.TemporaryResources#createTemporaryFile()}}
_Suggestions:_
-- addTemporaryFile
-- addTempFile
-- createTempFile
- {{org.apache.tika.io.TikaInputStream#getFile()}}
_Suggestions:_
-- asFile
-- toPath
-- getPath

In other cases, the methods accept a File as an argument, and should remain as 
tika users might be using them - so an overloaded method that accepts a Path 
instead should be added, deprecating the old method until an unknown tika major 
release.
Here is the full list of other methods:
_tika-app:_
- {{org.apache.tika.gui.TikaGUI#openFile(File)}}

_tika-batch:_
- {{org.apache.tika.batch.fs.FSUtil#getOutputFile(File, String, 
HANDLE_EXISTING, String)}}
- {{org.apache.tika.util.PropsUtil#getFile(String, File)}}
- {{org.apache.tika.batch.fs.FSDirectoryCrawler}} constructors
- 
{{org.apache.tika.batch.fs.FSDirectoryCrawler#handleFirstFileInDirectory(File)}}
- {{org.apache.tika.batch.fs.FSFileResource}} constructor
- {{org.apache.tika.batch.fs.FSListCrawler}} constructor
- {{org.apache.tika.batch.fs.FSOutputStreamFactory}} constructor
- {{org.apache.tika.batch.fs.FSUtil#checkThisIsAncestorOfOrSameAsThat(File, 
File)}}
- {{org.apache.tika.batch.fs.FSUtil#checkThisIsAncestorOfThat(File, File)}}
- {{org.apache.tika.batch.fs.strawman.StrawManTikaAppDriver}} constructor


[jira] [Updated] (TIKA-1726) Augment public methods that use a java.io.File with methods that use a java.nio.file.Path

2015-09-05 Thread Yaniv Kunda (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaniv Kunda updated TIKA-1726:
--
Description: 
In light of Java 7 already being EOL, it's high time we add support for the 
java.nio.file.Path class introduced with it, which, together with support 
methods in java.nio.file.Files and others, provides a better file I/O framework 
than java.io.File.

In just two cases, we have public methods in tika that only return a File 
object, and cannot be overloaded, so a different name for the new method must 
be created:
- {{org.apache.tika.io.TemporaryResources#createTemporaryFile()}}
_Suggestions:_
-- addTemporaryFile
-- addTempFile
-- createTempFile
-- createTemporaryPath
- {{org.apache.tika.io.TikaInputStream#getFile()}}
_Suggestions:_
-- asFile
-- toPath
-- getPath

In other cases, the methods accept a File as an argument, and should remain as 
tika users might be using them - so an overloaded method that accepts a Path 
instead should be added, referencing the new method from the old one (using the 
@see tag) until java.io.File itself is deprecated or otherwise becomes obsolete.
Here is the full list of other methods:
_tika-app:_
- {{org.apache.tika.gui.TikaGUI#openFile(File)}}

_tika-batch:_
- {{org.apache.tika.batch.fs.FSUtil#getOutputFile(File, String, 
HANDLE_EXISTING, String)}}
- {{org.apache.tika.util.PropsUtil#getFile(String, File)}}
- {{org.apache.tika.batch.fs.FSDirectoryCrawler}} constructors
- 
{{org.apache.tika.batch.fs.FSDirectoryCrawler#handleFirstFileInDirectory(File)}}
- {{org.apache.tika.batch.fs.FSFileResource}} constructor
- {{org.apache.tika.batch.fs.FSListCrawler}} constructor
- {{org.apache.tika.batch.fs.FSOutputStreamFactory}} constructor
- {{org.apache.tika.batch.fs.FSUtil#checkThisIsAncestorOfOrSameAsThat(File, 
File)}}
- {{org.apache.tika.batch.fs.FSUtil#checkThisIsAncestorOfThat(File, File)}}
- {{org.apache.tika.batch.fs.strawman.StrawManTikaAppDriver}} constructor

_tika-core:_
- {{org.apache.tika.Tika#detect(File)}}
- {{org.apache.tika.Tika#parse(File)}}
- {{org.apache.tika.Tika#parseToString(File)}}
- {{org.apache.tika.config.TikaConfig}} constructors
- {{org.apache.tika.detect.NNExampleModelDetector}} constructor
- {{org.apache.tika.detect.TrainedModelDetector#loadDefaultModels(File)}}
- {{org.apache.tika.io.TemporaryResources#setTemporaryFileDirectory(File)}}
- {{org.apache.tika.io.TikaInputStream#get(File)}}
- {{org.apache.tika.io.TikaInputStream#get(File, Metadata)}}

_tika-parsers:_
- {{org.apache.tika.parser.ParsingReader}} constructor
- {{org.apache.tika.parser.image.ImageMetadataExtractor#parseJpeg(File)}}
- {{org.apache.tika.parser.image.ImageMetadataExtractor#parseWebP(File)}}
- {{org.apache.tika.parser.mp4.DirectFileReadDataSource}} constructor

_tika-translate:_
- 
{{org.apache.tika.language.translate.ExternalTranslator#runAndGetOutput(String, 
String[], File)}}

In the absence of evidence to the contrary, all public methods in public 
non-test classes (and not in tika-example) are deemed part of a public API - 
although there's no formal definition of such.
If anyone knows of a public method which isn't used externally and could be 
made package-private, or should be left out for another reason, please comment.


  was:
In light of Java 7 already EOL, it's high time we add support for the new 
java.nio.file.Path class introduced with it, which, together with support 
methods in java.nio.file.Files and others, provide a better file I/O framework 
than java.io.File.

In just two cases, we have public methods in tika that only return a File 
object, and cannot be overloaded, so a different name for the new method must 
be created:
- {{org.apache.tika.io.TemporaryResources#createTemporaryFile()}}
_Suggestions:_
-- addTemporaryFile
-- addTempFile
-- createTempFile
- {{org.apache.tika.io.TikaInputStream#getFile()}}
_Suggestions:_
-- asFile
-- toPath
-- getPath

In other cases, the methods accept a File as an argument, and should remain as 
tika users might be using them - so an overloaded method that accepts a Path 
instead should be added, referencing the new method from the old one using 
(using the @see tag) deprecating the old method until an unknown tika major 
release.
Here is the full list of other methods:
_tika-app:_
- {{org.apache.tika.gui.TikaGUI#openFile(File)}}

_tika-batch:_
- {{org.apache.tika.batch.fs.FSUtil#getOutputFile(File, String, 
HANDLE_EXISTING, String)}}
- {{org.apache.tika.util.PropsUtil#getFile(String, File)}}
- {{org.apache.tika.batch.fs.FSDirectoryCrawler}} constructors
- 
{{org.apache.tika.batch.fs.FSDirectoryCrawler#handleFirstFileInDirectory(File)}}
- {{org.apache.tika.batch.fs.FSFileResource}} constructor
- {{org.apache.tika.batch.fs.FSListCrawler}} constructor
- {{org.apache.tika.batch.fs.FSOutputStreamFactory}} constructor
- {{org.apache.tika.batch.fs.FSUtil#checkThisIsAncestorOfOrSameAsThat(File, 
File)}}
- 

[jira] [Commented] (TIKA-1726) Augment public methods that use a java.io.File with methods that use a java.nio.file.Path

2015-09-01 Thread Yaniv Kunda (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14725769#comment-14725769
 ] 

Yaniv Kunda commented on TIKA-1726:
---

Funny you proposed those two alternatives - exactly what I started with... But 
compared to the methods in the Java platform it seems partly incorrect as these 
mostly deal with files, located using paths, e.g. Files.createTempFile.
So for createTemporaryFile I think that createTemporaryPath is problematic in 
the sense that an actual file is created, not just a path.
I suggested the add* variants to hint that the file is added to the list of 
resources to close, as in addResource.
For getFile, getPath is actually pretty ok but I think both are problematic in 
that they look like a getter - I wanted to signify its write-to-file 
functionality.
How about save/store/persist?
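
A minimal sketch of the naming concern above, using only the JDK: Files.createTempFile returns a Path, yet a real file already exists on disk when it returns, which is why a name ending in "Path" could mislead. The class name, method name and file prefix are made up for illustration and are not Tika API; UncheckedIOException assumes Java 8 or later.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical demo class, not part of Tika.
final class TempFileNamingSketch {

    /** Creates a temp file and reports whether a regular file exists right after creation. */
    static boolean createsRealFile() {
        try {
            Path created = Files.createTempFile("naming-sketch-", ".tmp");
            try {
                // The returned Path already points at an existing, empty file.
                return Files.isRegularFile(created);
            } finally {
                Files.deleteIfExists(created); // clean up the demo file
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("file exists after createTempFile: " + createsRealFile());
    }
}
```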

Regarding deprecation, not a problem - I'll drop it and add a @see tag from the 
old method to the new one (but not the other way round?).
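
As a rough sketch of that arrangement (not the actual patch): the File-based method stays, is not deprecated, delegates to a Path-based overload and points to it with a @see tag. The class PathOverloadSketch and its sizeOf methods are hypothetical names chosen for illustration; UncheckedIOException assumes Java 8 or later.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical class and method names, for illustration only (not Tika API).
final class PathOverloadSketch {

    /**
     * Older File-based entry point, kept so existing callers keep compiling.
     *
     * @see #sizeOf(Path)
     */
    static long sizeOf(File file) {
        return sizeOf(file.toPath()); // delegate to the Path-based overload
    }

    /** Newer Path-based overload that does the actual work. */
    static long sizeOf(Path path) {
        try {
            return Files.size(path);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Writes five bytes to a temp file and checks that both overloads report that size. */
    static boolean overloadsAgree() {
        try {
            Path temp = Files.createTempFile("overload-sketch-", ".txt");
            try {
                Files.write(temp, "hello".getBytes(StandardCharsets.UTF_8));
                return sizeOf(temp) == 5 && sizeOf(temp.toFile()) == 5;
            } finally {
                Files.deleteIfExists(temp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("overloads agree: " + overloadsAgree());
    }
}
```

Delegating from the File overload to the Path overload keeps a single implementation, so the two cannot drift apart.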

On both questions, my suggestions are only suggestions, and my reservations are 
only reservations - if you can make the call either way, I'd be happy to accept 
it and move this forward.



> Augment public methods that use a java.io.File with methods that use a 
> java.nio.file.Path
> -
>
> Key: TIKA-1726
> URL: https://issues.apache.org/jira/browse/TIKA-1726
> Project: Tika
>  Issue Type: Improvement
>  Components: batch, core, gui, parser, translation
>Reporter: Yaniv Kunda
>Priority: Minor
> Fix For: 1.11
>
>
> In light of Java 7 already EOL, it's high time we add support for the new 
> java.nio.file.Path class introduced with it, which, together with support 
> methods in java.nio.file.Files and others, provide a better file I/O 
> framework than java.io.File.
> In just two cases, we have public methods in tika that only return a File 
> object, and cannot be overloaded, so a different name for the new method must 
> be created:
> - {{org.apache.tika.io.TemporaryResources#createTemporaryFile()}}
> _Suggestions:_
> -- addTemporaryFile
> -- addTempFile
> -- createTempFile
> - {{org.apache.tika.io.TikaInputStream#getFile()}}
> _Suggestions:_
> -- asFile
> -- toPath
> -- getPath
> In other cases, the methods accept a File as an argument, and should remain 
> as tika users might be using them - so an overloaded method that accepts a 
> Path instead should be added, deprecating the old method until an unknown 
> tika major release.
> Here is the full list of other methods:
> _tika-app:_
> - {{org.apache.tika.gui.TikaGUI#openFile(File)}}
> _tika-batch:_
> - {{org.apache.tika.batch.fs.FSUtil#getOutputFile(File, String, 
> HANDLE_EXISTING, String)}}
> - {{org.apache.tika.util.PropsUtil#getFile(String, File)}}
> - {{org.apache.tika.batch.fs.FSDirectoryCrawler}} constructors
> - 
> {{org.apache.tika.batch.fs.FSDirectoryCrawler#handleFirstFileInDirectory(File)}}
> - {{org.apache.tika.batch.fs.FSFileResource}} constructor
> - {{org.apache.tika.batch.fs.FSListCrawler}} constructor
> - {{org.apache.tika.batch.fs.FSOutputStreamFactory}} constructor
> - {{org.apache.tika.batch.fs.FSUtil#checkThisIsAncestorOfOrSameAsThat(File, 
> File)}}
> - {{org.apache.tika.batch.fs.FSUtil#checkThisIsAncestorOfThat(File, File)}}
> - {{org.apache.tika.batch.fs.strawman.StrawManTikaAppDriver}} constructor
> _tika-core:_
> - {{org.apache.tika.Tika#detect(File)}}
> - {{org.apache.tika.Tika#parse(File)}}
> - {{org.apache.tika.Tika#parseToString(File)}}
> - {{org.apache.tika.config.TikaConfig}} constructors
> - {{org.apache.tika.detect.NNExampleModelDetector}} constructor
> - {{org.apache.tika.detect.TrainedModelDetector#loadDefaultModels(File)}}
> - {{org.apache.tika.io.TemporaryResources#setTemporaryFileDirectory(File)}}
> - {{org.apache.tika.io.TikaInputStream#get(File)}}
> - {{org.apache.tika.io.TikaInputStream#get(File, Metadata)}}
> _tika-parsers:_
> - {{org.apache.tika.parser.ParsingReader}} constructor
> - {{org.apache.tika.parser.image.ImageMetadataExtractor#parseJpeg(File)}}
> - {{org.apache.tika.parser.image.ImageMetadataExtractor#parseWebP(File)}}
> - {{org.apache.tika.parser.mp4.DirectFileReadDataSource}} constructor
> _tika-translate:_
> - 
> {{org.apache.tika.language.translate.ExternalTranslator#runAndGetOutput(String,
>  String[], File)}}
> Due to lack of evidence, all public methods in public non-test classes (and 
> not in tika-example) are deemed part of a public API - although there's no 
> formal definition of such.
> If anyone knows of a public method which isn't accessed publicly and can be 
> defined as package-private, or for another reason, please comment.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (TIKA-1726) Augment public methods that use a java.io.File with methods that use a java.nio.file.Path

2015-09-01 Thread Yaniv Kunda (JIRA)
Yaniv Kunda created TIKA-1726:
-

 Summary: Augment public methods that use a java.io.File with 
methods that use a java.nio.file.Path
 Key: TIKA-1726
 URL: https://issues.apache.org/jira/browse/TIKA-1726
 Project: Tika
  Issue Type: Improvement
  Components: batch, core, gui, parser, translation
Reporter: Yaniv Kunda
Priority: Minor
 Fix For: 1.11


In light of Java 7 already being EOL, it's high time we add support for the 
java.nio.file.Path class introduced with it, which, together with support 
methods in java.nio.file.Files and others, provides a better file I/O framework 
than java.io.File.

In just two cases, we have public methods in tika that only return a File 
object, and cannot be overloaded, so a different name for the new method must 
be created:
- {{org.apache.tika.io.TemporaryResources#createTemporaryFile()}}
_Suggestions:_
-- addTemporaryFile
-- addTempFile
-- createTempFile
- {{org.apache.tika.io.TikaInputStream#getFile()}}
_Suggestions:_
-- asFile
-- toPath
-- getPath

In other cases, the methods accept a File as an argument, and should remain as 
tika users might be using them - so an overloaded method that accepts a Path 
instead should be added, deprecating the old method until an unknown tika major 
release.
Here is the full list of other methods:
_tika-app:_
- {{org.apache.tika.gui.TikaGUI#openFile(File)}}

_tika-batch:_
- {{org.apache.tika.batch.fs.FSUtil#getOutputFile(File, String, 
HANDLE_EXISTING, String)}}
- {{org.apache.tika.util.PropsUtil#getFile(String, File)}}
- {{org.apache.tika.batch.fs.FSDirectoryCrawler}} constructors
- 
{{org.apache.tika.batch.fs.FSDirectoryCrawler#handleFirstFileInDirectory(File)}}
- {{org.apache.tika.batch.fs.FSFileResource}} constructor
- {{org.apache.tika.batch.fs.FSListCrawler}} constructor
- {{org.apache.tika.batch.fs.FSOutputStreamFactory}} constructor
- {{org.apache.tika.batch.fs.FSUtil#checkThisIsAncestorOfOrSameAsThat(File, 
File)}}
- {{org.apache.tika.batch.fs.FSUtil#checkThisIsAncestorOfThat(File, File)}}
- {{org.apache.tika.batch.fs.strawman.StrawManTikaAppDriver}} constructor

_tika-core:_
- {{org.apache.tika.Tika#detect(File)}}
- {{org.apache.tika.Tika#parse(File)}}
- {{org.apache.tika.Tika#parseToString(File)}}
- {{org.apache.tika.config.TikaConfig}} constructors
- {{org.apache.tika.detect.NNExampleModelDetector}} constructor
- {{org.apache.tika.detect.TrainedModelDetector#loadDefaultModels(File)}}
- {{org.apache.tika.io.TemporaryResources#setTemporaryFileDirectory(File)}}
- {{org.apache.tika.io.TikaInputStream#get(File)}}
- {{org.apache.tika.io.TikaInputStream#get(File, Metadata)}}

_tika-parsers:_
- {{org.apache.tika.parser.ParsingReader}} constructor
- {{org.apache.tika.parser.image.ImageMetadataExtractor#parseJpeg(File)}}
- {{org.apache.tika.parser.image.ImageMetadataExtractor#parseWebP(File)}}
- {{org.apache.tika.parser.mp4.DirectFileReadDataSource}} constructor

_tika-translate:_
- 
{{org.apache.tika.language.translate.ExternalTranslator#runAndGetOutput(String, 
String[], File)}}

In the absence of evidence to the contrary, all public methods in public 
non-test classes (and not in tika-example) are deemed part of a public API - 
although there's no formal definition of such.
If anyone knows of a public method which isn't used externally and could be 
made package-private, or should be left out for another reason, please comment.






[jira] [Commented] (TIKA-1706) Bring back commons-io to tika-core

2015-08-27 Thread Yaniv Kunda (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14717349#comment-14717349
 ] 

Yaniv Kunda commented on TIKA-1706:
---

The fact that o.a.tika.io contains public classes is a problem I didn't think 
about -
these files are strictly meant as internal utility/support classes and 
shouldn't really be used by users.
In fact, I would say although these are public classes, they should not be 
considered a part of the public API of tika-core.
And since we don't know what commons-io-cloned classes users use (probably by 
accident), it is indeed a problem letting these go.

I also think that the no-dependencies principle is more romantic than it is 
useful, as these days a lot of the Java ecosystem is built on using external 
libraries, unless space is critical such as in mobile applications (and even 
these are getting bigger and bigger).
As the vast majority of tika-core usages comes transitively from tika-parsers, 
I think this is not the case.
I haven't crawled the maven repo (deeply enough) to find how many tika-core-only 
usages have few or no other dependencies, but I suspect that number is not 
very high.
So the absolute worst case here - and remember that this is the extreme case of 
a library that uses tika-core and no other library - is a 30% footprint 
increase!

o.a.tika.io is a mess - it contains:
- classes from commons-io-1.4
- partial classes from commons-io-1.4
- modified classes from commons-io-1.4
- classes from commons-io-2.0 (or later unknown version/s)
- tika original classes

It's really hard to go over all the changes - and I've shown just a few 
examples - but simply making the switch is easier, not so costly even in the 
worst case, and would bring upstream progress to our doorstep (today and in 
future changes) faster than maintaining copied code would.

My suggestion is:
- bring commons-io back to tika-core
- change all usages of the copied classes to commons-io
- deprecate (do not delete) the copied classes, probably until tika-2.0




 Bring back commons-io to tika-core
 --

 Key: TIKA-1706
 URL: https://issues.apache.org/jira/browse/TIKA-1706
 Project: Tika
  Issue Type: Improvement
  Components: core
Reporter: Yaniv Kunda
Priority: Minor
 Fix For: 1.11


 TIKA-249 inlined select commons-io classes in order to simplify the 
 dependency tree and save some space.
 I believe these arguments are weaker nowadays due to the following concerns:
 - Most of the non-core modules already use commons-io, and since tika-core is 
 usually not used by itself, commons-io is already included with it
 - Since some modules use both tika-core and commons-io, it's not clear which 
 code should be used
 - Having the inlined classes causes more maintenance and/or technology debt 
 (which in turn causes more maintenance)
 - Newer commons-io code utilizes newer platform code, e.g. using Charset 
 objects instead of encoding names, being able to use StringBuilder instead of 
 StringBuffer, and so on.
 I'll be happy to provide a patch to replace usages of the inlined classes 
 with commons-io classes if this is accepted.





[jira] [Commented] (TIKA-1706) Bring back commons-io to tika-core

2015-08-27 Thread Yaniv Kunda (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14717519#comment-14717519
 ] 

Yaniv Kunda commented on TIKA-1706:
---

That's why I suggested to just add commons-io to tika-core, use it internally, 
and just deprecate the copied classes.
Is that ok for 1.x?






[jira] [Created] (TIKA-1720) Collect multiple exceptions in TemporaryResources.close() using Throwable.addSuppressed()

2015-08-26 Thread Yaniv Kunda (JIRA)
Yaniv Kunda created TIKA-1720:
-

 Summary: Collect multiple exceptions in TemporaryResources.close() 
using Throwable.addSuppressed()
 Key: TIKA-1720
 URL: https://issues.apache.org/jira/browse/TIKA-1720
 Project: Tika
  Issue Type: Improvement
  Components: core
Reporter: Yaniv Kunda
Priority: Minor
 Fix For: 1.11


TemporaryResources.close() currently collects the exceptions thrown while 
closing its resources in a list.
When the time comes to propagate an exception, information is lost: the thrown 
exception carries a message with the string descriptions of all exceptions and 
only the first exception as its cause, so the stack traces describing what went 
wrong when closing the remaining resources are gone.
In addition, the thrown exception is IOExceptionWithCause, copied from 
commons-io, which is redundant since Java 6.
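
The general technique named in the summary can be sketched as follows. This is not the attached patch, only a minimal illustration under stated assumptions: the class SuppressedCloseSketch and its methods are hypothetical, and the lambdas assume Java 8 or later (Throwable.addSuppressed itself is available from Java 7).

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not the TemporaryResources implementation.
final class SuppressedCloseSketch {

    /** Closes every resource; the first failure is rethrown, later ones are attached as suppressed. */
    static void closeAll(List<? extends Closeable> resources) throws IOException {
        IOException first = null;
        for (Closeable resource : resources) {
            try {
                resource.close();
            } catch (IOException e) {
                if (first == null) {
                    first = e;              // remember the first failure
                } else {
                    first.addSuppressed(e); // keep later failures, stack traces included
                }
            }
        }
        if (first != null) {
            throw first;
        }
    }

    /** Closes three resources, two of which fail, and returns how many exceptions were suppressed. */
    static int demoSuppressedCount() {
        Closeable ok = () -> { };
        Closeable failA = () -> { throw new IOException("A"); };
        Closeable failB = () -> { throw new IOException("B"); };
        try {
            closeAll(Arrays.asList(failA, ok, failB));
            return -1; // not reached: failA always throws
        } catch (IOException e) {
            return e.getSuppressed().length;
        }
    }

    public static void main(String[] args) {
        System.out.println("suppressed exceptions: " + demoSuppressedCount());
    }
}
```

Each suppressed exception keeps its own stack trace and is printed by printStackTrace() under the primary one.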






[jira] [Created] (TIKA-1721) Replace IOExceptionWithCause in ForkClient

2015-08-26 Thread Yaniv Kunda (JIRA)
Yaniv Kunda created TIKA-1721:
-

 Summary: Replace IOExceptionWithCause in ForkClient
 Key: TIKA-1721
 URL: https://issues.apache.org/jira/browse/TIKA-1721
 Project: Tika
  Issue Type: Improvement
  Components: core
Reporter: Yaniv Kunda
Priority: Minor
 Fix For: 1.11


IOExceptionWithCause (copied from commons-io) is redundant since Java 6.
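
For context, java.io.IOException has had constructors that take a cause since Java 6, so a plain IOException can carry the cause directly. A minimal sketch; the class IoCauseSketch and its wrap method are hypothetical names, not code from ForkClient.

```java
import java.io.IOException;

// Hypothetical helper for illustration only.
final class IoCauseSketch {

    /** Wraps any failure in a plain IOException while preserving the original cause. */
    static IOException wrap(String message, Throwable cause) {
        return new IOException(message, cause); // constructor available since Java 6
    }

    public static void main(String[] args) {
        IOException e = wrap("failed to talk to the forked process", new IllegalStateException("boom"));
        System.out.println(e.getMessage() + " / cause: " + e.getCause());
    }
}
```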





[jira] [Updated] (TIKA-1720) Collect multiple exceptions in TemporaryResources.close() using Throwable.addSuppressed()

2015-08-26 Thread Yaniv Kunda (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaniv Kunda updated TIKA-1720:
--
Attachment: TIKA-1720.patch

 Collect multiple exceptions in TemporaryResources.close() using 
 Throwable.addSuppressed()
 -

 Key: TIKA-1720
 URL: https://issues.apache.org/jira/browse/TIKA-1720
 Project: Tika
  Issue Type: Improvement
  Components: core
Reporter: Yaniv Kunda
Priority: Minor
 Fix For: 1.11

 Attachments: TIKA-1720.patch


 TemporaryResources.close() currently collects the exceptions thrown while 
 closing its resources in a list.
 When the time to propagate an exception comes, information is lost - the 
 thrown exception contains a message with the string descriptions of all 
 exceptions, and the first exception as the cause - there is no stack trace 
 describing what went wrong closing a resource.
 In addition, the thrown exception is IOExceptionWithCause, copied from 
 commons-io, which is redundant since Java 6.





[jira] [Updated] (TIKA-1722) Tika methods that accept a File needlessly convert it to a URL

2015-08-26 Thread Yaniv Kunda (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaniv Kunda updated TIKA-1722:
--
Attachment: TIKA-1722.patch

 Tika methods that accept a File needlessly convert it to a URL
 --

 Key: TIKA-1722
 URL: https://issues.apache.org/jira/browse/TIKA-1722
 Project: Tika
  Issue Type: Improvement
  Components: core
Reporter: Yaniv Kunda
Priority: Minor
 Fix For: 1.11

 Attachments: TIKA-1722.patch


 The following methods:
 - Tika.detect(File)
 - Tika.parse(File)
 - Tika.parseToString(File)
 Convert the given File to a URL and use the corresponding overloaded method 
 that accepts a URL.
 This seems like a shortcut, but essentially does the following:
 # Converts the file to a URI
 # Converts the URI to a URL
 # Calls TikaInputStream.get(URL, Metadata), which then performs the following 
 special handling:
 # Checks if the protocol is file
 # Tries to convert the URL (back) to a URI
 # Creates a File around the URI
 # Checks if file.isFile() 
 # Calls TikaInputStream.get(File, Metadata)
 The special handling in TikaInputStream.get(URL/URI) is a good optimization 
 for in-the-wild file resources, but for internal uses it can be skipped - 
 making Tika call TikaInputStream.get(File, Metadata) directly.





[jira] [Created] (TIKA-1722) Tika methods that accept a File needlessly convert it to a URL

2015-08-26 Thread Yaniv Kunda (JIRA)
Yaniv Kunda created TIKA-1722:
-

 Summary: Tika methods that accept a File needlessly convert it to 
a URL
 Key: TIKA-1722
 URL: https://issues.apache.org/jira/browse/TIKA-1722
 Project: Tika
  Issue Type: Improvement
  Components: core
Reporter: Yaniv Kunda
Priority: Minor
 Fix For: 1.11


The following methods:
- Tika.detect(File)
- Tika.parse(File)
- Tika.parseToString(File)

Convert the given File to a URL and use the corresponding overloaded method 
that accepts a URL.
This seems like a shortcut, but essentially does the following:
# Converts the file to a URI
# Converts the URI to a URL
# Calls TikaInputStream.get(URL, Metadata), which then performs the following 
special handling:
## Checks if the protocol is file
## Tries to convert the URL (back) to a URI
## Creates a File around the URI
## Checks if file.isFile() 
## Calls TikaInputStream.get(File, Metadata)

The special handling in TikaInputStream.get(URL/URI) is a good optimization for 
in-the-wild file resources, but for internal uses it can be skipped - making 
Tika call TikaInputStream.get(File, Metadata) directly.
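
The round trip listed above can be reproduced with plain JDK calls. This is a sketch of the conversions only, not of Tika's code: FileUrlRoundTripSketch and roundTrip are hypothetical names, and the Tika-specific steps (the isFile() check and the TikaInputStream.get calls) are left out.

```java
import java.io.File;
import java.net.MalformedURLException;
import java.net.URISyntaxException;
import java.net.URL;

// Hypothetical demo class; it only mirrors the File -> URI -> URL -> URI -> File conversions.
final class FileUrlRoundTripSketch {

    /** Converts a File to a URL and back again, ending up with an equivalent File. */
    static File roundTrip(File file) {
        try {
            URL url = file.toURI().toURL(); // File -> URI -> URL
            if (!"file".equalsIgnoreCase(url.getProtocol())) {
                throw new IllegalStateException("unexpected protocol: " + url.getProtocol());
            }
            return new File(url.toURI());   // URL -> URI -> File
        } catch (MalformedURLException | URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        File original = new File("sample.txt").getAbsoluteFile();
        System.out.println(original + " -> " + roundTrip(original));
    }
}
```

Since the caller starts with a File and ends with an equivalent File, passing the original File straight through avoids four conversions.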





[jira] [Updated] (TIKA-1711) Remove java6-activated profile from tika-bundle and move its plugins to default build

2015-08-20 Thread Yaniv Kunda (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaniv Kunda updated TIKA-1711:
--
Attachment: (was: TIKA-1711.patch)

 Remove java6-activated profile from tika-bundle and move its plugins to 
 default build
 -

 Key: TIKA-1711
 URL: https://issues.apache.org/jira/browse/TIKA-1711
 Project: Tika
  Issue Type: Bug
  Components: general
Reporter: Yaniv Kunda
Priority: Minor
 Fix For: 1.11


 Since the project now requires Java 7, there's no point in a profile that 
 accepts Java 6 and later, since a build on Java 6 would fail anyway.





[jira] [Updated] (TIKA-1711) Remove java6-activated profile from tika-bundle and move its plugins to default build

2015-08-20 Thread Yaniv Kunda (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaniv Kunda updated TIKA-1711:
--
Summary: Remove java6-activated profile from tika-bundle and move its 
plugins to default build  (was: Modify tika-bundle profile activation to 
require Java 7)






[jira] [Updated] (TIKA-1711) Remove java6-activated profile from tika-bundle and move its plugins to default build

2015-08-20 Thread Yaniv Kunda (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaniv Kunda updated TIKA-1711:
--
Attachment: TIKA-1711.patch

Revised patch for the revised purpose






[jira] [Created] (TIKA-1719) Utilize try-with-resources where it is trivial

2015-08-20 Thread Yaniv Kunda (JIRA)
Yaniv Kunda created TIKA-1719:
-

 Summary: Utilize try-with-resources where it is trivial
 Key: TIKA-1719
 URL: https://issues.apache.org/jira/browse/TIKA-1719
 Project: Tika
  Issue Type: Improvement
  Components: cli, core, example, gui, packaging, parser, server
Reporter: Yaniv Kunda
Priority: Minor
 Fix For: 1.11


The following type of resource usages:
{code}
AutoCloseable resource = ...;
try {
// do something with resource
} finally {
resource.close();
}
{code}
{code}
AutoCloseable resource = null;
try {
resource = ...;
// do something with resource
} finally {
if (resource != null) {
resource.close();
}
}
{code}

and similar constructs can be trivially replaced with Java 7's 
try-with-resources statement:
{code}
try (AutoCloseable resource = ...) {
// do something with resource
}
{code}

This brings more concise code with less chance of causing resource leaks.
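
A small runnable sketch of the same idea, showing that the resource is closed even when the body throws. The Tracked resource and the class name are hypothetical and exist only for this demonstration.

```java
// Hypothetical demo class; Tracked stands in for any AutoCloseable resource.
final class TryWithResourcesSketch {

    /** A resource that simply remembers whether close() was called. */
    static final class Tracked implements AutoCloseable {
        boolean closed;

        @Override
        public void close() {
            closed = true;
        }
    }

    /** Throws inside the try body and reports whether the resource was closed anyway. */
    static boolean closedEvenOnFailure() {
        Tracked resource = new Tracked();
        try (Tracked r = resource) {
            throw new IllegalStateException("simulated failure");
        } catch (IllegalStateException expected) {
            // close() has already run by the time this catch block executes
        }
        return resource.closed;
    }

    public static void main(String[] args) {
        System.out.println("closed after failure: " + closedEvenOnFailure());
    }
}
```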





[jira] [Commented] (TIKA-1710) Replace usages of classes in org.apache.tika.io with current alternatives

2015-08-20 Thread Yaniv Kunda (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14705085#comment-14705085
 ] 

Yaniv Kunda commented on TIKA-1710:
---

As much as I like Guava (the library, not the fruit), its only use was the 
com.google.common.base.Charsets class, which contains constants for the Charset 
instances of the standard charsets - the same as Java's StandardCharsets.
Once I replaced it with static imports from StandardCharsets, there was no 
use left.
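
For illustration, here is the difference between an encoding name and a StandardCharsets constant in plain JDK code. The class and method names are hypothetical; both methods produce the same bytes, but only the name-based one has to handle a checked exception that cannot occur for UTF-8.

```java
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, not part of Tika.
final class CharsetConstantSketch {

    /** Older style: the charset is looked up by name, forcing a checked exception. */
    static byte[] encodeByName(String text) {
        try {
            return text.getBytes("UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new AssertionError("UTF-8 must be supported", e);
        }
    }

    /** Java 7 style: the constant is type-safe and needs no exception handling. */
    static byte[] encodeByConstant(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String sample = "tika \u00fc";
        System.out.println(Arrays.equals(encodeByName(sample), encodeByConstant(sample)));
    }
}
```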

Regarding TaggedInputStream, I wasn't sure what to do - this wrap/cast method 
was a modification of the original commons-io code, and it was used only once - 
in RFC822Parser.
I think it's a nice-to-have optimization helper method but nothing more - as it 
only saves the cost of a new TaggedInputStream when the source InputStream is 
already a TaggedInputStream: the checked tag will behave the same way in the 
same wrap-try-catch flow.
The only other usage of TaggedInputStream in tika (besides TikaInputStream) is 
in RTFParser, which uses the constructor directly. That is actually an empty 
usage - the TaggedInputStream is constructed and checked in the catch clause, 
but the try block never reads from it: it reads from the underlying stream!

Since almost all of tika uses TikaInputStream (which has an advanced version of 
this helper that also ensures buffering), my opinion is to refrain from adding 
a helper method and simply use the constructor directly, for simplicity. 

 Replace usages of classes in org.apache.tika.io with current alternatives
 -

 Key: TIKA-1710
 URL: https://issues.apache.org/jira/browse/TIKA-1710
 Project: Tika
  Issue Type: Improvement
  Components: batch, cli, core, example, gui, parser, server, 
 translation
Reporter: Yaniv Kunda
Priority: Minor
 Fix For: 1.11

 Attachments: TIKA-1710.patch


 Many of the classes in org.apache.tika.io were inlined from commons-io in 
 TIKA-249, but these days most components use commons-io anyway, so in order 
 to clean the dependencies on org.apache.tika.io in preparation of adding 
 commons-io to tika-core, the following can be done:
 - Replace usages of classes in org.apache.tika.io within non-core components 
 with the corresponding classes in commons-io
 - Replace usages of org.apache.tika.io.IOUtils.UTF_8 with 
 java.nio.charset.StandardCharsets.UTF_8 (in all components, including 
 tika-core)
 - Replace other uses of String encoding names of standard charsets with the 
 corresponding Charset constants from StandardCharsets (this is logically 
 related to IOUtils, since before Java 7 these constants should have lived 
 there alongside UTF_8)
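
 As a rough illustration of the last two points, the sketch below compares 
 encoding by charset name with encoding by a StandardCharsets constant. The 
 class and method names (CharsetDemo, encodeByName, encodeByConstant) are made 
 up for this example and are not taken from the patch.

```java
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class CharsetDemo {
    // Before: the encoding is looked up by name, which forces callers
    // to handle a checked exception that cannot occur for UTF-8.
    static byte[] encodeByName(String text) {
        try {
            return text.getBytes("UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 must be supported by every JVM", e);
        }
    }

    // After: a Charset constant, with no name lookup and no checked exception.
    static byte[] encodeByConstant(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String text = "Tika \u00e9";
        // Both variants produce the same bytes
        System.out.println(Arrays.equals(encodeByName(text), encodeByConstant(text)));
    }
}
```

 Both variants yield identical bytes; the difference is only in how much 
 exception handling the caller has to write.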



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TIKA-1719) Utilize try-with-resources where it is trivial

2015-08-20 Thread Yaniv Kunda (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaniv Kunda updated TIKA-1719:
--
Attachment: TIKA-1719.patch

 Utilize try-with-resources where it is trivial
 --

 Key: TIKA-1719
 URL: https://issues.apache.org/jira/browse/TIKA-1719
 Project: Tika
  Issue Type: Improvement
  Components: cli, core, example, gui, packaging, parser, server
Reporter: Yaniv Kunda
Priority: Minor
  Labels: easyfix
 Fix For: 1.11

 Attachments: TIKA-1719.patch


 The following types of resource usage:
 {code}
 AutoCloseable resource = ...;
 try {
     // do something with resource
 } finally {
     resource.close();
 }
 {code}
 {code}
 AutoCloseable resource = null;
 try {
     resource = ...;
     // do something with resource
 } finally {
     if (resource != null) {
         resource.close();
     }
 }
 {code}
 and similar constructs can be trivially replaced with Java 7's 
 try-with-resources statement:
 {code}
 try (AutoCloseable resource = ...) {
     // do something with resource
 }
 {code}
 This makes the code more concise and less prone to resource leaks.
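
 For readers who want to see the closing behavior, here is a small 
 self-contained sketch. Tracked, run and the log messages are invented for this 
 illustration and do not come from the patch.

```java
import java.util.ArrayList;
import java.util.List;

final class TryWithResourcesDemo {
    /** Hypothetical resource that records when it is opened and closed. */
    static final class Tracked implements AutoCloseable {
        private final String name;
        private final List<String> log;

        Tracked(String name, List<String> log) {
            this.name = name;
            this.log = log;
            log.add("open " + name);
        }

        @Override
        public void close() {
            log.add("close " + name);
        }
    }

    /** Runs a try-with-resources block and returns the recorded events. */
    static List<String> run(boolean fail) {
        List<String> log = new ArrayList<>();
        try (Tracked first = new Tracked("first", log);
             Tracked second = new Tracked("second", log)) {
            log.add("body");
            if (fail) {
                throw new IllegalStateException("boom");
            }
        } catch (IllegalStateException e) {
            // Both resources are already closed when this handler runs
            log.add("caught " + e.getMessage());
        }
        return log;
    }

    public static void main(String[] args) {
        System.out.println(run(false));
        System.out.println(run(true));
    }
}
```

 Resources are closed in reverse order of declaration, and they are closed 
 before the catch block runs, whether or not the body throws.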





[jira] [Updated] (TIKA-1710) Replace usages of classes in org.apache.tika.io with current alternatives

2015-08-17 Thread Yaniv Kunda (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaniv Kunda updated TIKA-1710:
--
Attachment: (was: TIKA-1710.patch)

 Replace usages of classes in org.apache.tika.io with current alternatives
 -

 Key: TIKA-1710
 URL: https://issues.apache.org/jira/browse/TIKA-1710
 Project: Tika
  Issue Type: Improvement
  Components: batch, cli, core, example, gui, parser, server, 
 translation
Reporter: Yaniv Kunda
Priority: Minor
 Fix For: 1.11

 Attachments: TIKA-1710.patch


 Many of the classes in org.apache.tika.io were inlined from commons-io in 
 TIKA-249, but these days most components use commons-io anyway, so in order 
 to clean the dependencies on org.apache.tika.io in preparation of adding 
 commons-io to tika-core, the following can be done:
 - Replace usages of classes in org.apache.tika.io within non-core components 
 with the corresponding classes in commons-io
 - Replace usages of org.apache.tika.io.IOUtils.UTF_8 with 
 java.nio.charset.StandardCharsets.UTF_8 (in all components, including 
 tika-core)
 - Replace other uses of String encoding names of standard charsets with the 
 corresponding Charset constants from StandardCharsets (this is logically 
 related to IOUtils, since before Java 7 these constants should have lived 
 there alongside UTF_8)





[jira] [Updated] (TIKA-1710) Replace usages of classes in org.apache.tika.io with current alternatives

2015-08-17 Thread Yaniv Kunda (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaniv Kunda updated TIKA-1710:
--
Attachment: TIKA-1710.patch

 Replace usages of classes in org.apache.tika.io with current alternatives
 -

 Key: TIKA-1710
 URL: https://issues.apache.org/jira/browse/TIKA-1710
 Project: Tika
  Issue Type: Improvement
  Components: batch, cli, core, example, gui, parser, server, 
 translation
Reporter: Yaniv Kunda
Priority: Minor
 Fix For: 1.11

 Attachments: TIKA-1710.patch


 Many of the classes in org.apache.tika.io were inlined from commons-io in 
 TIKA-249, but these days most components use commons-io anyway, so in order 
 to clean the dependencies on org.apache.tika.io in preparation of adding 
 commons-io to tika-core, the following can be done:
 - Replace usages of classes in org.apache.tika.io within non-core components 
 with the corresponding classes in commons-io
 - Replace usages of org.apache.tika.io.IOUtils.UTF_8 with 
 java.nio.charset.StandardCharsets.UTF_8 (in all components, including 
 tika-core)
 - Replace other uses of String encoding names of standard charsets with the 
 corresponding Charset constants from StandardCharsets (this is logically 
 related to IOUtils, since before Java 7 these constants should have lived 
 there alongside UTF_8)





[jira] [Created] (TIKA-1711) Modify tika-bundle profile activation to require Java 7

2015-08-16 Thread Yaniv Kunda (JIRA)
Yaniv Kunda created TIKA-1711:
-

 Summary: Modify tika-bundle profile activation to require Java 7
 Key: TIKA-1711
 URL: https://issues.apache.org/jira/browse/TIKA-1711
 Project: Tika
  Issue Type: Bug
  Components: general
Reporter: Yaniv Kunda
Priority: Minor
 Fix For: 1.11


The project now requires Java 7, so there's no point in letting the profile 
activate on Java 6+: the build would fail on Java 6 anyway.





[jira] [Updated] (TIKA-1710) Replace usages of classes in org.apache.tika.io with current alternatives

2015-08-16 Thread Yaniv Kunda (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaniv Kunda updated TIKA-1710:
--
Attachment: (was: TIKA-1710.patch)

 Replace usages of classes in org.apache.tika.io with current alternatives
 -

 Key: TIKA-1710
 URL: https://issues.apache.org/jira/browse/TIKA-1710
 Project: Tika
  Issue Type: Improvement
  Components: batch, cli, core, example, gui, parser, server, 
 translation
Reporter: Yaniv Kunda
Priority: Minor
 Fix For: 1.11


 Many of the classes in org.apache.tika.io were inlined from commons-io in 
 TIKA-249, but these days most components use commons-io anyway, so in order 
 to clean the dependencies on org.apache.tika.io in preparation of adding 
 commons-io to tika-core, the following can be done:
 - Replace usages of classes in org.apache.tika.io within non-core components 
 with the corresponding classes in commons-io
 - Replace usages of org.apache.tika.io.IOUtils.UTF_8 with 
 java.nio.charset.StandardCharsets.UTF_8 (in all components, including 
 tika-core)
 - Replace other uses of String encoding names of standard charsets with the 
 corresponding Charset constants from StandardCharsets (this is logically 
 related to IOUtils, since before Java 7 these constants should have lived 
 there alongside UTF_8)





[jira] [Updated] (TIKA-1710) Replace usages of classes in org.apache.tika.io with current alternatives

2015-08-16 Thread Yaniv Kunda (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaniv Kunda updated TIKA-1710:
--
Attachment: TIKA-1710.patch

Revised patch without StandardCharsets wildcard static imports

 Replace usages of classes in org.apache.tika.io with current alternatives
 -

 Key: TIKA-1710
 URL: https://issues.apache.org/jira/browse/TIKA-1710
 Project: Tika
  Issue Type: Improvement
  Components: batch, cli, core, example, gui, parser, server, 
 translation
Reporter: Yaniv Kunda
Priority: Minor
 Fix For: 1.11

 Attachments: TIKA-1710.patch


 Many of the classes in org.apache.tika.io were inlined from commons-io in 
 TIKA-249, but these days most components use commons-io anyway, so in order 
 to clean the dependencies on org.apache.tika.io in preparation of adding 
 commons-io to tika-core, the following can be done:
 - Replace usages of classes in org.apache.tika.io within non-core components 
 with the corresponding classes in commons-io
 - Replace usages of org.apache.tika.io.IOUtils.UTF_8 with 
 java.nio.charset.StandardCharsets.UTF_8 (in all components, including 
 tika-core)
 - Replace other uses of String encoding names of standard charsets with the 
 corresponding Charset constants from StandardCharsets (this is logically 
 related to IOUtils, since before Java 7 these constants should have lived 
 there alongside UTF_8)





[jira] [Issue Comment Deleted] (TIKA-1706) Bring back commons-io to tika-core

2015-08-15 Thread Yaniv Kunda (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaniv Kunda updated TIKA-1706:
--
Comment: was deleted

(was: A patch to bring back commons-io to tika-core and replace all formerly 
inlined classes.)

 Bring back commons-io to tika-core
 --

 Key: TIKA-1706
 URL: https://issues.apache.org/jira/browse/TIKA-1706
 Project: Tika
  Issue Type: Improvement
  Components: core
Reporter: Yaniv Kunda
Priority: Minor
 Fix For: 1.11


 TIKA-249 inlined select commons-io classes in order to simplify the 
 dependency tree and save some space.
 I believe these arguments are weaker nowadays due to the following concerns:
 - Most of the non-core modules already use commons-io, and since tika-core is 
 usually not used by itself, commons-io is already included with it
 - Since some modules use both tika-core and commons-io, it's not clear which 
 code should be used
 - Having the inlined classes causes more maintenance and/or technology debt 
 (which in turn causes more maintenance)
 - Newer commons-io code utilizes newer platform code, e.g. using Charset 
 objects instead of encoding names, being able to use StringBuilder instead of 
 StringBuffer, and so on.
 I'll be happy to provide a patch to replace usages of the inlined classes 
 with commons-io classes if this is accepted.





[jira] [Commented] (TIKA-1706) Bring back commons-io to tika-core

2015-08-15 Thread Yaniv Kunda (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14698477#comment-14698477
 ] 

Yaniv Kunda commented on TIKA-1706:
---

I've separated out all the related changes other than adding commons-io to 
tika-core, and opened them as TIKA-1710.
In addition, the recently added commons-io-unsafe check has now found a couple 
more default encoding usages:
tika-core:   src\main\java\org\apache\tika\Tika.java
tika-server: src\test\java\org\apache\tika\server\CXFTestBase.java


 Bring back commons-io to tika-core
 --

 Key: TIKA-1706
 URL: https://issues.apache.org/jira/browse/TIKA-1706
 Project: Tika
  Issue Type: Improvement
  Components: core
Reporter: Yaniv Kunda
Priority: Minor
 Fix For: 1.11


 TIKA-249 inlined select commons-io classes in order to simplify the 
 dependency tree and save some space.
 I believe these arguments are weaker nowadays due to the following concerns:
 - Most of the non-core modules already use commons-io, and since tika-core is 
 usually not used by itself, commons-io is already included with it
 - Since some modules use both tika-core and commons-io, it's not clear which 
 code should be used
 - Having the inlined classes causes more maintenance and/or technology debt 
 (which in turn causes more maintenance)
 - Newer commons-io code utilizes newer platform code, e.g. using Charset 
 objects instead of encoding names, being able to use StringBuilder instead of 
 StringBuffer, and so on.
 I'll be happy to provide a patch to replace usages of the inlined classes 
 with commons-io classes if this is accepted.





[jira] [Created] (TIKA-1710) Replace usages of classes in org.apache.tika.io with current alternatives

2015-08-15 Thread Yaniv Kunda (JIRA)
Yaniv Kunda created TIKA-1710:
-

 Summary: Replace usages of classes in org.apache.tika.io with 
current alternatives
 Key: TIKA-1710
 URL: https://issues.apache.org/jira/browse/TIKA-1710
 Project: Tika
  Issue Type: Improvement
  Components: batch, cli, core, example, gui, parser, server, 
translation
Reporter: Yaniv Kunda
Priority: Minor
 Fix For: 1.11


Many of the classes in org.apache.tika.io were inlined from commons-io in 
TIKA-249, but these days most components use commons-io anyway, so in order to 
clean the dependencies on org.apache.tika.io in preparation of adding 
commons-io to tika-core, the following can be done:
- Replace usages of classes in org.apache.tika.io within non-core components 
with the corresponding classes in commons-io
- Replace usages of org.apache.tika.io.IOUtils.UTF_8 with 
java.nio.charset.StandardCharsets.UTF_8 (in all components, including tika-core)
- Replace other uses of String encoding names of standard charsets with the 
corresponding Charset constants from StandardCharsets (this is logically 
related to IOUtils, since before Java 7 these constants should have lived 
there alongside UTF_8)





[jira] [Updated] (TIKA-1706) Bring back commons-io to tika-core

2015-08-14 Thread Yaniv Kunda (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaniv Kunda updated TIKA-1706:
--
Attachment: TIKA-1706.patch

A patch to bring back commons-io to tika-core and replace all formerly inlined 
classes.

 Bring back commons-io to tika-core
 --

 Key: TIKA-1706
 URL: https://issues.apache.org/jira/browse/TIKA-1706
 Project: Tika
  Issue Type: Improvement
  Components: core
Reporter: Yaniv Kunda
Priority: Minor
 Fix For: 1.11

 Attachments: TIKA-1706.patch


 TIKA-249 inlined select commons-io classes in order to simplify the 
 dependency tree and save some space.
 I believe these arguments are weaker nowadays due to the following concerns:
 - Most of the non-core modules already use commons-io, and since tika-core is 
 usually not used by itself, commons-io is already included with it
 - Since some modules use both tika-core and commons-io, it's not clear which 
 code should be used
 - Having the inlined classes causes more maintenance and/or technology debt 
 (which in turn causes more maintenance)
 - Newer commons-io code utilizes newer platform code, e.g. using Charset 
 objects instead of encoding names, being able to use StringBuilder instead of 
 StringBuffer, and so on.
 I'll be happy to provide a patch to replace usages of the inlined classes 
 with commons-io classes if this is accepted.





[jira] [Commented] (TIKA-1706) Bring back commons-io to tika-core

2015-08-13 Thread Yaniv Kunda (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696025#comment-14696025
 ] 

Yaniv Kunda commented on TIKA-1706:
---

I agree that generally adding an external dependency to a core module might 
have an impact,
but consider that unlike tika-core, commons-io is a true low-level library:
it has no compile-time dependencies and is used by 2500 projects in maven 
central alone.

I believe that copying the code of another library, frozen in time (in this 
case since 2008), hinders innovation and reduces the chance that anyone will 
utilize new improvements and fixes in newer commons-io since:
# it is disconnected from tika and requires manual discovery and research (if 
commons-io is used as an external dependency it's easy to find deprecated 
methods and their replacements using static analysis)
# it requires manual maintenance of copying select classes/code

It's not easy to summarize more than 7 years of changes in commons-io, but 
here are some beneficial changes I found along the way:
- Use org.apache.commons.io.output.ByteArrayOutputStream instead of 
java.io.ByteArrayOutputStream (this class is actually not that new, but it can 
benefit many usages and save a lot of byte-copying) - this has been further 
improved by providing an optimized InputStream from an 
org.apache.commons.io.output.ByteArrayOutputStream (IO-137)
- Allow using Charset instead of String encoding (IO-318)
- Use StringBuilderWriter instead of StringWriter to avoid unnecessary 
synchronization (IO-140)
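
To make the byte-copying and synchronization points concrete, the sketch below 
uses only JDK classes and shows the baseline behavior that the commons-io 
alternatives are meant to avoid. It does not call commons-io itself, and the 
class and method names (JdkBaselineDemo, toByteArrayCopies, writerBufferType) 
are made up for this illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.StringWriter;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class JdkBaselineDemo {
    // java.io.ByteArrayOutputStream hands out a fresh copy of its buffer on
    // every toByteArray() call, so turning its content into an InputStream
    // costs one extra full copy.
    static boolean toByteArrayCopies() {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] data = "tika".getBytes(StandardCharsets.UTF_8);
        out.write(data, 0, data.length);
        byte[] first = out.toByteArray();
        byte[] second = out.toByteArray();
        // Same content, but two distinct arrays
        return first != second && Arrays.equals(first, second);
    }

    // java.io.StringWriter is backed by a StringBuffer, whose methods are
    // synchronized even when only one thread ever touches the writer.
    static String writerBufferType() {
        return new StringWriter().getBuffer().getClass().getSimpleName();
    }

    public static void main(String[] args) {
        System.out.println(toByteArrayCopies());
        System.out.println(writerBufferType());
    }
}
```

The commons-io classes named in the list above target exactly these two costs: 
the extra array copy and the StringBuffer-based writer.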

Obviously, I did not propose this change just for the sake of disrupting the 
peace; I have planned and started a series of patches that utilize newer 
commons-io, which will follow - each in its own issue - if and when commons-io 
is added as a dependency to tika-core.


 Bring back commons-io to tika-core
 --

 Key: TIKA-1706
 URL: https://issues.apache.org/jira/browse/TIKA-1706
 Project: Tika
  Issue Type: Improvement
  Components: core
Reporter: Yaniv Kunda
Priority: Minor
 Fix For: 1.11


 TIKA-249 inlined select commons-io classes in order to simplify the 
 dependency tree and save some space.
 I believe these arguments are weaker nowadays due to the following concerns:
 - Most of the non-core modules already use commons-io, and since tika-core is 
 usually not used by itself, commons-io is already included with it
 - Since some modules use both tika-core and commons-io, it's not clear which 
 code should be used
 - Having the inlined classes causes more maintenance and/or technology debt 
 (which in turn causes more maintenance)
 - Newer commons-io code utilizes newer platform code, e.g. using Charset 
 objects instead of encoding names, being able to use StringBuilder instead of 
 StringBuffer, and so on.
 I'll be happy to provide a patch to replace usages of the inlined classes 
 with commons-io classes if this is accepted.





[jira] [Created] (TIKA-1706) Bring back commons-io to tika-core

2015-08-12 Thread Yaniv Kunda (JIRA)
Yaniv Kunda created TIKA-1706:
-

 Summary: Bring back commons-io to tika-core
 Key: TIKA-1706
 URL: https://issues.apache.org/jira/browse/TIKA-1706
 Project: Tika
  Issue Type: Improvement
  Components: core
Reporter: Yaniv Kunda
Priority: Minor
 Fix For: 1.11


TIKA-249 inlined select commons-io classes in order to simplify the dependency 
tree and save some space.
I believe these arguments are weaker nowadays due to the following concerns:
- Most of the non-core modules already use commons-io, and since tika-core is 
usually not used by itself, commons-io is already included with it
- Since some modules use both tika-core and commons-io, it's not clear which 
code should be used
- Having the inlined classes causes more maintenance and/or technology debt 
(which in turn causes more maintenance)
- Newer commons-io code utilizes newer platform code, e.g. using Charset 
objects instead of encoding names, being able to use StringBuilder instead of 
StringBuffer, and so on.

I'll be happy to provide a patch to replace usages of the inlined classes with 
commons-io classes if this is accepted.


