[jira] [Commented] (TIKA-1706) Bring back commons-io to tika-core

2021-07-19 Thread Yaniv Kunda (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17383397#comment-17383397
 ] 

Yaniv Kunda commented on TIKA-1706:
---

What a blast from the past...

Thanks!

> Bring back commons-io to tika-core
> --
>
> Key: TIKA-1706
> URL: https://issues.apache.org/jira/browse/TIKA-1706
> Project: Tika
> Issue Type: Improvement
> Components: core
> Reporter: Yaniv Kunda
> Priority: Minor
> Fix For: 2.0.0
>
> Attachments: TIKA-1706-1.patch, TIKA-1706-2.patch
>
>
> TIKA-249 inlined select commons-io classes in order to simplify the 
> dependency tree and save some space.
> I believe these arguments are weaker nowadays due to the following concerns:
> - Most of the non-core modules already use commons-io, and since tika-core is 
> usually not used by itself, commons-io is already included with it
> - Since some modules use both tika-core and commons-io, it's not clear which 
> code should be used
> - Having the inlined classes causes more maintenance and/or technology debt 
> (which in turn causes more maintenance)
> - Newer commons-io code utilizes newer platform code, e.g. using Charset 
> objects instead of encoding names, being able to use StringBuilder instead of 
> StringBuffer, and so on.
> I'll be happy to provide a patch to replace usages of the inlined classes 
> with commons-io classes if this is accepted.
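
The point about Charset objects versus encoding names can be illustrated with plain JDK calls. This is only a sketch: the class and method names are made up for illustration and are not taken from commons-io or tika-core. Passing an encoding name forces callers to handle a checked exception that cannot occur for UTF-8, while the Charset overload does not.

```java
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Tika or commons-io.
class CharsetVsNameDemo {

    // Older style: the encoding is looked up by name at runtime,
    // so the checked UnsupportedEncodingException must be declared.
    static String decodeByName(byte[] bytes) throws UnsupportedEncodingException {
        return new String(bytes, "UTF-8");
    }

    // Newer style (Java 7+): a Charset constant, no checked exception.
    static String decodeByCharset(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        byte[] bytes = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        System.out.println(decodeByName(bytes).equals(decodeByCharset(bytes)));
    }
}
```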



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


RE: [jira] [Updated] (TIKA-1706) Bring back commons-io to tika-core

2015-11-26 Thread Yaniv Kunda
It’s been almost two months since I provided my patches for this –

Can a committer please review and submit?





*From:* Yaniv Kunda [mailto:yaniv.ku...@answers.com]
*Sent:* Monday, October 12, 2015 23:08
*To:* dev@tika.apache.org
*Subject:* Re: [jira] [Updated] (TIKA-1706) Bring back commons-io to
tika-core



Is this solution applicable?
I have some improvements waiting for this.

On Oct 1, 2015 5:57 PM, "Yaniv Kunda (JIRA)" <j...@apache.org> wrote:


 [
https://issues.apache.org/jira/browse/TIKA-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Yaniv Kunda updated TIKA-1706:
--
Attachment: TIKA-1706-2.patch
TIKA-1706-1.patch

A proposed patch per [~grossws]'s suggestion from the dev mailing list -
The first patch contains the following:
- creation of the secondary jar using maven-shade-plugin:
-- used the *uber* classifier using 
alternatives: shaded, nodep, all, etc.
Which one is best?
-- commons-io shaded under {{shaded.commons-io.$\{commons.io.version\}.
org.apache.commons.io}} to avoid potential conflicts with other
commons-io-shading dependencies e.g. as in
org.ops4j.pax.url:pax-url-aether:2.3.0
-- automatic removal of unused classes using 
- deprecated all classes that were copied from commons-io and modified them
to extend their new counterparts
- deprecated all constructors
- removed all identical or functionally identical methods
- modified all remaining methods to call alternative existing
jdk/commons-io methods, deprecated them and referred to the used alternatives
_*Note: this was done only in IOUtils, where many methods that have the same
signature as the ones in commons-io were modified along the way to use
UTF-8 instead of the platform default._
- all things should remain backward-compatible, except one:
org.apache.tika.io.TaggedIOException(IOException, Object) will now throw a
ClassCastException if the Object is not Serializable

The second patch contains trivial import changes in tika-core from
org.apache.tika.io to org.apache.commons.io

> Bring back commons-io to tika-core
> --
>
> Key: TIKA-1706
> URL: https://issues.apache.org/jira/browse/TIKA-1706
> Project: Tika
> Issue Type: Improvement
> Components: core
> Reporter: Yaniv Kunda
> Priority: Minor
> Fix For: 1.11
>
> Attachments: TIKA-1706-1.patch, TIKA-1706-2.patch
>
>
> TIKA-249 inlined select commons-io classes in order to simplify the
dependency tree and save some space.
> I believe these arguments are weaker nowadays due to the following
concerns:
> - Most of the non-core modules already use commons-io, and since
tika-core is usually not used by itself, commons-io is already included
with it
> - Since some modules use both tika-core and commons-io, it's not clear
which code should be used
> - Having the inlined classes causes more maintenance and/or technology
debt (which in turn causes more maintenance)
> - Newer commons-io code utilizes newer platform code, e.g. using Charset
objects instead of encoding names, being able to use StringBuilder instead
of StringBuffer, and so on.
> I'll be happy to provide a patch to replace usages of the inlined classes
with commons-io classes if this is accepted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-- 


This email communication (including any attachments) contains information 
from Answers Corporation or its affiliates that is confidential and may be 
privileged. The information contained herein is intended only for the use 
of the addressee(s) named above. If you are not the intended recipient (or 
the agent responsible to deliver it to the intended recipient), you are 
hereby notified that any dissemination, distribution, use, or copying of 
this communication is strictly prohibited. If you have received this email 
in error, please immediately reply to sender, delete the message and 
destroy all copies of it. If you have questions, please email 
le...@answers.com. 

If you wish to unsubscribe to commercial emails from Answers and its 
affiliates, please go to the Answers Subscription Center 
http://campaigns.answers.com/subscriptions to opt out.  Thank you.


[jira] [Commented] (TIKA-1672) Integrate tika-java7 component

2015-10-18 Thread Yaniv Kunda (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14962596#comment-14962596
 ] 

Yaniv Kunda commented on TIKA-1672:
---

Here are some names I suggested:
- tika-java7-spi
- tika-java7-filetypedetector
- tika-java7-detector-spi


> Integrate tika-java7 component
> --
>
> Key: TIKA-1672
> URL: https://issues.apache.org/jira/browse/TIKA-1672
> Project: Tika
> Issue Type: Improvement
> Reporter: Tyler Palsulich
> Fix For: 1.12
>
>
> Code requiring Java 7 doesn't need to be in a separate module now that 
> TIKA-1536 (upgrade to Java 7) is done.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


RE: [jira] [Updated] (TIKA-1745) Add methods accepting java.nio.file.Path to org.apache.tika.Tika and org.apache.tika.parser.ParsingReader

2015-10-18 Thread Yaniv Kunda
This (and https://issues.apache.org/jira/browse/TIKA-1746 and
https://issues.apache.org/jira/browse/TIKA-1751) are part of
https://issues.apache.org/jira/browse/TIKA-1726 and already have relatively
simple patches ready to be committed.

I think they'd be better off committed together with their already-committed
siblings, for putting all API additions in 1.11.

(I'd also like to see https://issues.apache.org/jira/browse/TIKA-1706 in
1.11, which I have prepared patches for according to [~grossws]'s
suggestion, but that's another story...)

-Original Message-
From: Chris A. Mattmann (JIRA) [mailto:j...@apache.org]
Sent: Sunday, October 18, 2015 22:44
To: dev@tika.apache.org
Subject: [jira] [Updated] (TIKA-1745) Add methods accepting
java.nio.file.Path to org.apache.tika.Tika and
org.apache.tika.parser.ParsingReader


 [
https://issues.apache.org/jira/browse/TIKA-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Chris A. Mattmann updated TIKA-1745:

Fix Version/s: (was: 1.11)
   1.12

> Add methods accepting java.nio.file.Path to org.apache.tika.Tika and
> org.apache.tika.parser.ParsingReader
> -
>
> Key: TIKA-1745
> URL: https://issues.apache.org/jira/browse/TIKA-1745
> Project: Tika
> Issue Type: Sub-task
> Components: core
> Reporter: Yaniv Kunda
> Priority: Minor
> Labels: java7
> Fix For: 1.12
>
> Attachments: TIKA-1745.patch
>
>
> Add methods accepting java.nio.file.Path to complement those accepting
> java.io.File, using the new methods in TikaInputStream or
> java.nio.file.Files



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)



Re: [jira] [Updated] (TIKA-1706) Bring back commons-io to tika-core

2015-10-12 Thread Yaniv Kunda
Is this solution applicable?
I have some improvements waiting for this.
On Oct 1, 2015 5:57 PM, "Yaniv Kunda (JIRA)" <j...@apache.org> wrote:

>
>  [
> https://issues.apache.org/jira/browse/TIKA-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
> ]
>
> Yaniv Kunda updated TIKA-1706:
> --
> Attachment: TIKA-1706-2.patch
> TIKA-1706-1.patch
>
> A proposed patch per [~grossws]'s suggestion from the dev mailing list -
> The first patch contains the following:
> - creation of the secondary jar using maven-shade-plugin:
> -- used the *uber* classifier using 
> alternatives: shaded, nodep, all, etc.
> Which one is best?
> -- commons-io shaded under {{shaded.commons-io.$\{commons.io.version\}.
> org.apache.commons.io}} to avoid potential conflicts with other
> commons-io-shading dependencies e.g. as in
> org.ops4j.pax.url:pax-url-aether:2.3.0
> -- automatic removal of unused classes using 
> - deprecated all classes that were copied from commons-io and modified
> them to extend their new counterparts
> - deprecated all constructors
> - removed all identical or functionally identical methods
> - modified all remaining methods to call alternative existing
> jdk/commons-io methods, deprecated them and referred to the used alternatives
> _*Note: this was done only in IOUtils, where many methods that have the
> same signature as the ones in commons-io were modified along the way to use
> UTF-8 instead of the platform default._
> - all things should remain backward-compatible, except one:
> org.apache.tika.io.TaggedIOException(IOException, Object) will now throw a
> ClassCastException if the Object is not Serializable
>
> The second patch contains trivial import changes in tika-core from
> org.apache.tika.io to org.apache.commons.io
>
> > Bring back commons-io to tika-core
> > --
> >
> > Key: TIKA-1706
> > URL: https://issues.apache.org/jira/browse/TIKA-1706
> > Project: Tika
> > Issue Type: Improvement
> > Components: core
> > Reporter: Yaniv Kunda
> > Priority: Minor
> > Fix For: 1.11
> >
> > Attachments: TIKA-1706-1.patch, TIKA-1706-2.patch
> >
> >
> > TIKA-249 inlined select commons-io classes in order to simplify the
> dependency tree and save some space.
> > I believe these arguments are weaker nowadays due to the following
> concerns:
> > - Most of the non-core modules already use commons-io, and since
> tika-core is usually not used by itself, commons-io is already included
> with it
> > - Since some modules use both tika-core and commons-io, it's not clear
> which code should be used
> > - Having the inlined classes causes more maintenance and/or technology
> debt (which in turn causes more maintenance)
> > - Newer commons-io code utilizes newer platform code, e.g. using Charset
> objects instead of encoding names, being able to use StringBuilder instead
> of StringBuffer, and so on.
> > I'll be happy to provide a patch to replace usages of the inlined
> classes with commons-io classes if this is accepted.
>
>
>
> --
> This message was sent by Atlassian JIRA
> (v6.3.4#6332)
>



[jira] [Updated] (TIKA-1706) Bring back commons-io to tika-core

2015-10-01 Thread Yaniv Kunda (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaniv Kunda updated TIKA-1706:
--
Attachment: TIKA-1706-2.patch
TIKA-1706-1.patch

A proposed patch per [~grossws]'s suggestion from the dev mailing list -
The first patch contains the following:
- creation of the secondary jar using maven-shade-plugin:
-- used the *uber* classifier using 
alternatives: shaded, nodep, all, etc.
Which one is best?
-- commons-io shaded under 
{{shaded.commons-io.$\{commons.io.version\}.org.apache.commons.io}} to avoid 
potential conflicts with other commons-io-shading dependencies e.g. as in 
org.ops4j.pax.url:pax-url-aether:2.3.0
-- automatic removal of unused classes using 
- deprecated all classes that were copied from commons-io and modified them to 
extend their new counterparts 
- deprecated all constructors
- removed all identical or functionally identical methods
- modified all remaining methods to call alternative existing jdk/commons-io 
methods, deprecated them and referred to the used alternatives
_*Note: this was done only in IOUtils, where many methods that have the same 
signature as the ones in commons-io were modified along the way to use UTF-8 
instead of the platform default._
- all things should remain backward-compatible, except one: 
org.apache.tika.io.TaggedIOException(IOException, Object) will now throw a 
ClassCastException if the Object is not Serializable

The second patch contains trivial import changes in tika-core from 
org.apache.tika.io to org.apache.commons.io

> Bring back commons-io to tika-core
> --
>
> Key: TIKA-1706
> URL: https://issues.apache.org/jira/browse/TIKA-1706
> Project: Tika
> Issue Type: Improvement
> Components: core
> Reporter: Yaniv Kunda
> Priority: Minor
> Fix For: 1.11
>
> Attachments: TIKA-1706-1.patch, TIKA-1706-2.patch
>
>
> TIKA-249 inlined select commons-io classes in order to simplify the 
> dependency tree and save some space.
> I believe these arguments are weaker nowadays due to the following concerns:
> - Most of the non-core modules already use commons-io, and since tika-core is 
> usually not used by itself, commons-io is already included with it
> - Since some modules use both tika-core and commons-io, it's not clear which 
> code should be used
> - Having the inlined classes causes more maintenance and/or technology debt 
> (which in turn causes more maintenance)
> - Newer commons-io code utilizes newer platform code, e.g. using Charset 
> objects instead of encoding names, being able to use StringBuilder instead of 
> StringBuffer, and so on.
> I'll be happy to provide a patch to replace usages of the inlined classes 
> with commons-io classes if this is accepted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TIKA-1744) Use java.nio.file.Path in TikaInputStream

2015-09-30 Thread Yaniv Kunda (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaniv Kunda updated TIKA-1744:
--
Attachment: TIKA-1744-2.patch

Additional minor patch:
- Corrected javadoc links
- Added {{@Deprecated}} annotations to methods where {{@deprecated}} javadoc 
tags were added
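
As a rough illustration of the second point (the class and method names here are hypothetical, not the ones touched by the patch): the {{@deprecated}} javadoc tag documents the replacement, while the {{@Deprecated}} annotation is what is visible to tools via reflection, so the two are normally used together.

```java
// Hypothetical example class; names are assumptions for illustration only.
class DeprecationSketch {

    /**
     * Returns the location as a string.
     *
     * @deprecated use {@link #getPath()} instead
     */
    @Deprecated
    static String getFile() {
        // Delegate to the replacement so behavior stays identical.
        return getPath();
    }

    /** Replacement for {@link #getFile()}. */
    static String getPath() {
        return "example";
    }

    public static void main(String[] args) throws Exception {
        // The annotation has runtime retention, so it can be checked reflectively.
        System.out.println(DeprecationSketch.class
                .getDeclaredMethod("getFile")
                .isAnnotationPresent(Deprecated.class));
    }
}
```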

> Use java.nio.file.Path in TikaInputStream
> -
>
> Key: TIKA-1744
> URL: https://issues.apache.org/jira/browse/TIKA-1744
> Project: Tika
> Issue Type: Sub-task
> Components: core
> Reporter: Yaniv Kunda
> Assignee: Tim Allison
> Priority: Minor
> Labels: java7
> Fix For: 1.11
>
> Attachments: TIKA-1744-2.patch, TIKA-1744.patch
>
>
> This will provide support for the new api for users who need it, and provide 
> better information in I/O operations, e.g. detailed exception if file cannot 
> be read.
> - used Path and methods in java.nio.file.Files internally 
> - add getPath() method as the counterpart to getFile()
> - modified test to use 
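
To illustrate the "better information in I/O operations" point from the description, here is a small JDK-only sketch. The class and method names are hypothetical and this is not the TikaInputStream code: opening a missing file through java.nio.file.Files raises a specific exception subtype rather than a generic failure.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical demo class; not part of Tika.
class PathErrorDemo {

    /**
     * Tries to open the given path and returns the simple class name of the
     * exception that was thrown, or "none" if the file could be opened.
     */
    static String describeOpenFailure(Path path) {
        try (InputStream in = Files.newInputStream(path)) {
            return "none";
        } catch (IOException e) {
            // With java.nio.file the subtype says what went wrong,
            // e.g. NoSuchFileException or AccessDeniedException.
            return e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        System.out.println(describeOpenFailure(Paths.get("does-not-exist.txt")));
    }
}
```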



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


RE: [jira] [Resolved] (TIKA-1747) Change file->path in tika-batch throughout

2015-09-30 Thread Yaniv Kunda
Tim - I actually had a shelved changelist with improvements almost identical
to what you did for FSBatchTestBase!
I also shared the thought that the utility methods - countChildren,
readFileToString, deleteDirectory, listPaths - should be elsewhere.
Ideally in commons-io, but this will have to wait until it requires Java 7.

How about in the meantime I concentrate them in tika-core in a new utility
class such as org.apache.tika.io.FileUtils or org.apache.tika.io.Files?
This will expose these methods to other Java7-transitioning code (of which I
have plenty almost ready to be delivered), reducing redundant boilerplate
code.

In addition, I think some of these methods could be slightly improved along
the way, and if they're going to a first-class utility class (no pun
intended), I suggest the following names for clarity and consistency:
countChildren -> countEntries (Files.walkFileTree and DirectoryStream refer
to these as entries)
listPaths -> listEntries (ditto, or use listChildren and leave countChildren
as is)
deleteDirectory -> deleteRecursively (just because it can be technically
used to delete a non-directory file, which is actually convenient)
readFileToString -> toString (as in Guava's Files.toString(File, Charset))
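
A minimal sketch of what such a utility class might look like, using only java.nio.file. The class name is a placeholder, and I am assuming here that counting and listing cover direct entries only, which may differ from the existing test helpers:

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.DirectoryStream;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.ArrayList;
import java.util.List;

// Placeholder name; the proposal above suggests org.apache.tika.io.FileUtils or Files.
final class PathUtilsSketch {

    private PathUtilsSketch() {
    }

    /** Counts the direct entries of a directory (assumption: not recursive). */
    static int countEntries(Path dir) throws IOException {
        return listEntries(dir).size();
    }

    /** Lists the direct entries of a directory (assumption: not recursive). */
    static List<Path> listEntries(Path dir) throws IOException {
        List<Path> entries = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
            for (Path entry : stream) {
                entries.add(entry);
            }
        }
        return entries;
    }

    /** Deletes a file, or a directory together with everything below it. */
    static void deleteRecursively(Path root) throws IOException {
        Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs)
                    throws IOException {
                Files.delete(file);
                return FileVisitResult.CONTINUE;
            }

            @Override
            public FileVisitResult postVisitDirectory(Path dir, IOException exc)
                    throws IOException {
                if (exc != null) {
                    throw exc;
                }
                // The directory is empty by now, so it can be removed.
                Files.delete(dir);
                return FileVisitResult.CONTINUE;
            }
        });
    }

    /** Reads a whole file into a String using the given charset. */
    static String toString(Path file, Charset charset) throws IOException {
        return new String(Files.readAllBytes(file), charset);
    }
}
```

Since walkFileTree calls visitFile when the starting point is a regular file, deleteRecursively also works on a non-directory, which is the reason for the proposed name.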

-Original Message-
From: Tim Allison (JIRA) [mailto:j...@apache.org]
Sent: Wednesday, September 30, 2015 19:01
To: dev@tika.apache.org
Subject: [jira] [Resolved] (TIKA-1747) Change file->path in tika-batch
throughout


 [
https://issues.apache.org/jira/browse/TIKA-1747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Tim Allison resolved TIKA-1747.
---
Resolution: Fixed

r1706060

> Change file->path in tika-batch throughout
> --
>
> Key: TIKA-1747
> URL: https://issues.apache.org/jira/browse/TIKA-1747
> Project: Tika
> Issue Type: Sub-task
> Components: batch
> Reporter: Tim Allison
> Assignee: Tim Allison
> Priority: Minor
> Fix For: 1.11
>
>
> Add Path equivalents for File and deprecate File usage in tika-batch.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)



[jira] [Commented] (TIKA-1757) tika-batch tests fail on systems with whitespace or special chars in folder name

2015-09-30 Thread Yaniv Kunda (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14938914#comment-14938914
 ] 

Yaniv Kunda commented on TIKA-1757:
---

Also, regarding the badness of {{URL#getFile()}}: on Windows machines it 
returns a String starting with a slash, e.g. {{/C:\File.txt}}.
For some reason, the {{File}} constructor handles this leniently and drops the 
leading slash, whereas {{Paths.get(String)}} fails with an 
{{InvalidPathException}}.


> tika-batch tests fail on systems with whitespace or special chars in folder 
> name
> 
>
> Key: TIKA-1757
> URL: https://issues.apache.org/jira/browse/TIKA-1757
> Project: Tika
> Issue Type: Bug
> Reporter: Uwe Schindler
> Assignee: Tim Allison
> Attachments: TIKA-1757.patch
>
>
> This is one problem that forbiddenapis does not catch, because the method 
> affected has valid use cases: {{URL#getFile()}} and {{URL#getPath()}} both 
> return the URL path, which should never be treated as a file system path (for 
> file: URLs). This breaks as soon as the path contains special characters 
> which may not be part of a URL. getFile() and getPath() return the encoded path.
> The correct way to transform a file URL to a file is: {{new 
> File(url.toURI())}}. See also the list of "bad stuff" as listed by the Maven 
> community for Mojos/Plugins.
> In fact the affected test should not use a file at all. Instead it should use 
> {{Class#getResourceAsStream()}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1757) tika-batch tests fail on systems with whitespace or special chars in folder name

2015-09-30 Thread Yaniv Kunda (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14938908#comment-14938908
 ] 

Yaniv Kunda commented on TIKA-1757:
---

If one needs a java.nio.file.Path, {{Paths.get(url.toURI())}} can be used 
instead.
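
A small JDK-only sketch of the difference, with a hypothetical class name and a temporary directory whose name contains spaces: {{URL#getPath()}} returns the percent-encoded form, while going through the URI yields a usable Path.

```java
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical demo class; not part of Tika.
class UrlToPathDemo {

    /** Converts a file: URL to a Path via its URI, which decodes percent-escapes. */
    static Path toPath(URL url) throws Exception {
        return Paths.get(url.toURI());
    }

    public static void main(String[] args) throws Exception {
        // A directory and file whose names contain whitespace.
        Path dir = Files.createTempDirectory("dir with space");
        Path file = Files.createFile(dir.resolve("test file.txt"));
        URL url = file.toUri().toURL();

        // The URL path is still encoded (spaces appear as %20),
        // so it must not be used as a file system path.
        System.out.println(url.getPath());

        // Going through the URI gives back a Path that exists on disk.
        System.out.println(Files.exists(toPath(url)));
    }
}
```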

> tika-batch tests fail on systems with whitespace or special chars in folder 
> name
> 
>
> Key: TIKA-1757
> URL: https://issues.apache.org/jira/browse/TIKA-1757
> Project: Tika
> Issue Type: Bug
> Reporter: Uwe Schindler
> Assignee: Tim Allison
> Attachments: TIKA-1757.patch
>
>
> This is one problem that forbiddenapis does not catch, because the method 
> affected has valid use cases: {{URL#getFile()}} and {{URL#getPath()}} both 
> return the URL path, which should never be treated as a file system path (for 
> file: URLs). This breaks as soon as the path contains special characters 
> which may not be part of a URL. getFile() and getPath() return the encoded path.
> The correct way to transform a file URL to a file is: {{new 
> File(url.toURI())}}. See also the list of "bad stuff" as listed by the Maven 
> community for Mojos/Plugins.
> In fact the affected test should not use a file at all. Instead it should use 
> {{Class#getResourceAsStream()}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TIKA-1751) Use java.nio.file.Path in TikaConfig

2015-09-30 Thread Yaniv Kunda (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaniv Kunda updated TIKA-1751:
--
Attachment: TIKA-1751.patch

Updated patch to latest changes.

> Use java.nio.file.Path in TikaConfig
> 
>
> Key: TIKA-1751
> URL: https://issues.apache.org/jira/browse/TIKA-1751
> Project: Tika
> Issue Type: Sub-task
> Components: config
> Reporter: Yaniv Kunda
> Priority: Minor
> Labels: java7
> Fix For: 1.11
>
> Attachments: TIKA-1751.patch
>
>
> Provide constructors accepting java.nio.file.Path



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TIKA-1751) Use java.nio.file.Path in TikaConfig

2015-09-30 Thread Yaniv Kunda (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaniv Kunda updated TIKA-1751:
--
Attachment: (was: TIKA-1751.patch)

> Use java.nio.file.Path in TikaConfig
> 
>
> Key: TIKA-1751
> URL: https://issues.apache.org/jira/browse/TIKA-1751
> Project: Tika
> Issue Type: Sub-task
> Components: config
> Reporter: Yaniv Kunda
> Priority: Minor
> Labels: java7
> Fix For: 1.11
>
> Attachments: TIKA-1751.patch
>
>
> Provide constructors accepting java.nio.file.Path



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1758) BatchCommandLineBuilder fails on systems with whitespace in path

2015-09-30 Thread Yaniv Kunda (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14938972#comment-14938972
 ] 

Yaniv Kunda commented on TIKA-1758:
---

Not a hard requirement - can be avoided by converting a Path back to a File (or 
to a String).

> BatchCommandLineBuilder fails on systems with whitespace in path
> 
>
> Key: TIKA-1758
> URL: https://issues.apache.org/jira/browse/TIKA-1758
> Project: Tika
> Issue Type: Bug
> Components: cli
> Reporter: Uwe Schindler
> Attachments: TIKA-1758.patch
>
>
> All tests for CLI module fail with errors like that:
> {noformat}
> Tests run: 6, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.048 sec <<< 
> FAILURE! - in org.apache.tika.cli.TikaCLIBatchCommandLineTest
> testTwoDirsNoFlags(org.apache.tika.cli.TikaCLIBatchCommandLineTest)  Time 
> elapsed: 0.026 sec  <<< ERROR!
> java.nio.file.InvalidPathException: Illegal char <"> at index 0: 
> "C:\Users\Uwe Schindler\Projects\TIKA\svn\tika-app\testInput"
> at sun.nio.fs.WindowsPathParser.normalize(WindowsPathParser.java:182)
> at sun.nio.fs.WindowsPathParser.parse(WindowsPathParser.java:153)
> at sun.nio.fs.WindowsPathParser.parse(WindowsPathParser.java:77)
> at sun.nio.fs.WindowsPath.parse(WindowsPath.java:94)
> at sun.nio.fs.WindowsFileSystem.getPath(WindowsFileSystem.java:255)
> at java.nio.file.Paths.get(Paths.java:84)
> at 
> org.apache.tika.cli.BatchCommandLineBuilder.translateCommandLine(BatchCommandLineBuilder.java:137)
> at 
> org.apache.tika.cli.BatchCommandLineBuilder.build(BatchCommandLineBuilder.java:51)
> at 
> org.apache.tika.cli.TikaCLIBatchCommandLineTest.testTwoDirsNoFlags(TikaCLIBatchCommandLineTest.java:127)
> {noformat}
> The reason is that BatchCommandLineBuilder adds quotes for unknown reasons!? 
> If you use ProcessBuilder you don't need that! Not sure what this should do, 
> but the problem is: The first argument (the executable) contains quotes after 
> the method transformed it and breaks the test.
> I have no idea how to fix this, but the quotes should not be in a String[] 
> command line at all.
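
The point about quotes can be sketched as follows. This is an illustration only: the class name and the "-inputDir" flag are assumptions, not the actual BatchCommandLineBuilder code. Each element of the command list is handed over as one argument, so a path containing whitespace needs no surrounding quotes.

```java
import java.util.Arrays;

// Hypothetical sketch; not the actual BatchCommandLineBuilder.
class CommandLineSketch {

    /**
     * Builds a ProcessBuilder whose arguments are passed as separate list
     * elements. The "-inputDir" flag is an assumed name for illustration.
     */
    static ProcessBuilder build(String executable, String inputDir) {
        // No quoting: each list element is treated as a single argument.
        return new ProcessBuilder(Arrays.asList(executable, "-inputDir", inputDir));
    }

    public static void main(String[] args) {
        ProcessBuilder pb = build("java",
                "C:\\Users\\Uwe Schindler\\Projects\\TIKA\\svn\\tika-app\\testInput");
        // The process is not started here; only the argument list is shown.
        System.out.println(pb.command());
    }
}
```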



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TIKA-1758) BatchCommandLineBuilder fails on systems with whitespace in path

2015-09-30 Thread Yaniv Kunda (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaniv Kunda updated TIKA-1758:
--
Attachment: TIKA-1758.patch

A patch containing a fix (and more File->Path migration), requires TIKA-1751.

> BatchCommandLineBuilder fails on systems with whitespace in path
> 
>
> Key: TIKA-1758
> URL: https://issues.apache.org/jira/browse/TIKA-1758
> Project: Tika
> Issue Type: Bug
> Components: cli
> Reporter: Uwe Schindler
> Attachments: TIKA-1758.patch
>
>
> All tests for CLI module fail with errors like that:
> {noformat}
> Tests run: 6, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.048 sec <<< 
> FAILURE! - in org.apache.tika.cli.TikaCLIBatchCommandLineTest
> testTwoDirsNoFlags(org.apache.tika.cli.TikaCLIBatchCommandLineTest)  Time 
> elapsed: 0.026 sec  <<< ERROR!
> java.nio.file.InvalidPathException: Illegal char <"> at index 0: 
> "C:\Users\Uwe Schindler\Projects\TIKA\svn\tika-app\testInput"
> at sun.nio.fs.WindowsPathParser.normalize(WindowsPathParser.java:182)
> at sun.nio.fs.WindowsPathParser.parse(WindowsPathParser.java:153)
> at sun.nio.fs.WindowsPathParser.parse(WindowsPathParser.java:77)
> at sun.nio.fs.WindowsPath.parse(WindowsPath.java:94)
> at sun.nio.fs.WindowsFileSystem.getPath(WindowsFileSystem.java:255)
> at java.nio.file.Paths.get(Paths.java:84)
> at 
> org.apache.tika.cli.BatchCommandLineBuilder.translateCommandLine(BatchCommandLineBuilder.java:137)
> at 
> org.apache.tika.cli.BatchCommandLineBuilder.build(BatchCommandLineBuilder.java:51)
> at 
> org.apache.tika.cli.TikaCLIBatchCommandLineTest.testTwoDirsNoFlags(TikaCLIBatchCommandLineTest.java:127)
> {noformat}
> The reason is that BatchCommandLineBuilder adds quotes for unknown reasons!? 
> If you use ProcessBuilder you don't need that! Not sure what this should do, 
> but the problem is: The first argument (the executable) contains quotes after 
> the method transformed it and breaks the test.
> I have no idea how to fix this, but the quotes should not be in a String[] 
> command line at all.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


RE: [jira] [Commented] (TIKA-1748) Upgrade to POI 3.13-final when available

2015-09-27 Thread Yaniv Kunda
9/29 is two days away - the latest available build is 20150924.

-Original Message-
From: gil cattaneo (JIRA) [mailto:j...@apache.org]
Sent: Saturday, September 26, 2015 15:47
To: dev@tika.apache.org
Subject: [jira] [Commented] (TIKA-1748) Upgrade to POI 3.13-final when
available


[
https://issues.apache.org/jira/browse/TIKA-1748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14909254#comment-14909254
]

gil cattaneo commented on TIKA-1748:


hi
i used poi-3.13-20150929
and consequently build fails

[ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.2:compile (default-compile) on project tika-parsers: Compilation failure: Compilation failure:
[ERROR] /BUILD/tika-1.10/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/HSLFExtractor.java:[22,27] package org.apache.poi.hslf does not exist
[ERROR] /BUILD/tika-1.10/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/HSLFExtractor.java:[25,33] cannot find symbol
[ERROR] symbol:   class MasterSheet
[ERROR] location: package org.apache.poi.hslf.model
[ERROR] /BUILD/tika-1.10/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/HSLFExtractor.java:[26,33] cannot find symbol
[ERROR] symbol:   class Notes
[ERROR] location: package org.apache.poi.hslf.model
[ERROR] /BUILD/tika-1.10/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/HSLFExtractor.java:[28,33] cannot find symbol
[ERROR] symbol:   class Picture
[ERROR] location: package org.apache.poi.hslf.model
[ERROR] /BUILD/tika-1.10/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/HSLFExtractor.java:[29,33] cannot find symbol
[ERROR] symbol:   class Shape
[ERROR] location: package org.apache.poi.hslf.model
[ERROR] /BUILD/tika-1.10/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/HSLFExtractor.java:[30,33] cannot find symbol
[ERROR] symbol:   class Slide
[ERROR] location: package org.apache.poi.hslf.model
[ERROR] /BUILD/tika-1.10/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/HSLFExtractor.java:[31,33] cannot find symbol
[ERROR] symbol:   class Table
[ERROR] location: package org.apache.poi.hslf.model
[ERROR] /BUILD/tika-1.10/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/HSLFExtractor.java:[32,33] cannot find symbol
[ERROR] symbol:   class TableCell
[ERROR] location: package org.apache.poi.hslf.model
[ERROR] /BUILD/tika-1.10/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/HSLFExtractor.java:[33,33] cannot find symbol
[ERROR] symbol:   class TextRun
[ERROR] location: package org.apache.poi.hslf.model
[ERROR] /BUILD/tika-1.10/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/HSLFExtractor.java:[34,33] cannot find symbol
[ERROR] symbol:   class TextShape
[ERROR] location: package org.apache.poi.hslf.model
[ERROR] /BUILD/tika-1.10/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/HSLFExtractor.java:[35,37] cannot find symbol
[ERROR] symbol:   class ObjectData
[ERROR] location: package org.apache.poi.hslf.usermodel
[ERROR] /BUILD/tika-1.10/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/HSLFExtractor.java:[36,37] cannot find symbol
[ERROR] symbol:   class PictureData
[ERROR] location: package org.apache.poi.hslf.usermodel
[ERROR] /BUILD/tika-1.10/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/HSLFExtractor.java:[37,37] cannot find symbol
[ERROR] symbol:   class SlideShow
[ERROR] location: package org.apache.poi.hslf.usermodel
[ERROR] /BUILD/tika-1.10/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/HSLFExtractor.java:[178,59] cannot find symbol
[ERROR] symbol:   class MasterSheet
[ERROR] location: class org.apache.tika.parser.microsoft.HSLFExtractor
[ERROR] /BUILD/tika-1.10/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/HSLFExtractor.java:[202,62] cannot find symbol
[ERROR] symbol:   class Table
[ERROR] location: class org.apache.tika.parser.microsoft.HSLFExtractor
[ERROR] /BUILD/tika-1.10/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/HSLFExtractor.java:[220,60] cannot find symbol
[ERROR] symbol:   class TextRun
[ERROR] location: class org.apache.tika.parser.microsoft.HSLFExtractor
[ERROR] /BUILD/tika-1.10/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/HSLFExtractor.java:[241,46] cannot find symbol
[ERROR] symbol:   class SlideShow
[ERROR] location: class org.apache.tika.parser.microsoft.HSLFExtractor
[ERROR] /BUILD/tika-1.10/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/HSLFExtractor.java:[270,47] cannot find symbol
[ERROR] symbol:   class Slide
[ERROR] location: class org.apache.tika.parser.microsoft.HSLFExtractor
[ERROR] /BUILD/tika-1.10/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/XSLFPowerPointExtractorDecorator.java:[69,49] incompatible types: java.util.List cannot be converted to org.apache.poi.xslf.usermodel.XSLFSlide[]
[ERROR]

[jira] [Updated] (TIKA-1750) CachedTranslator.isAvailable() throws NPE when underlying translator is null

2015-09-24 Thread Yaniv Kunda (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaniv Kunda updated TIKA-1750:
--
Attachment: TIKA-1750.patch

> CachedTranslator.isAvailable() throws NPE when underlying translator is null
> 
>
> Key: TIKA-1750
> URL: https://issues.apache.org/jira/browse/TIKA-1750
> Project: Tika
>  Issue Type: Bug
>  Components: translation
>    Reporter: Yaniv Kunda
>Priority: Minor
> Fix For: 1.11
>
> Attachments: TIKA-1750.patch
>
>
> When initialized with no underlying translator, CachedTranslator throws an NPE
> when isAvailable() is called. Although a user should initialize the translator
> (as the default constructor's javadoc says), this doesn't always happen: since
> CachedTranslator is defined as a registered service in
> tika-translate\src\main\resources\META-INF\services\org.apache.tika.language.translate.Translator,
> it normally isn't initialized (causing DumpTikaConfigExampleTest to fail).
> Since CachedTranslator returns the source text from
> translate(String, String, String) when the translator is null, it makes sense
> for isAvailable() to return false under the same condition.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (TIKA-1750) CachedTranslator.isAvailable() throws NPE when underlying translator is null

2015-09-24 Thread Yaniv Kunda (JIRA)
Yaniv Kunda created TIKA-1750:
-

 Summary: CachedTranslator.isAvailable() throws NPE when underlying 
translator is null
 Key: TIKA-1750
 URL: https://issues.apache.org/jira/browse/TIKA-1750
 Project: Tika
  Issue Type: Bug
  Components: translation
Reporter: Yaniv Kunda
Priority: Minor
 Fix For: 1.11


When initialized with no underlying translator, CachedTranslator throws an NPE
when isAvailable() is called. Although a user should initialize the translator
(as the default constructor's javadoc says), this doesn't always happen: since
CachedTranslator is defined as a registered service in
tika-translate\src\main\resources\META-INF\services\org.apache.tika.language.translate.Translator,
it normally isn't initialized (causing DumpTikaConfigExampleTest to fail).

Since CachedTranslator returns the source text from
translate(String, String, String) when the translator is null, it makes sense
for isAvailable() to return false under the same condition.
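
The proposed behaviour can be sketched roughly as follows. This is only an
illustration and not the attached patch: SimpleTranslator and
NullSafeCachingTranslator are hypothetical stand-ins for Tika's Translator
interface and CachedTranslator, and the cache key format is an assumption.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a translator interface; not Tika's actual API.
interface SimpleTranslator {
    String translate(String text, String sourceLanguage, String targetLanguage);

    boolean isAvailable();
}

// Hypothetical caching wrapper that tolerates a missing (null) delegate.
class NullSafeCachingTranslator implements SimpleTranslator {
    private final SimpleTranslator delegate; // may be null
    private final Map<String, String> cache = new HashMap<>();

    NullSafeCachingTranslator(SimpleTranslator delegate) {
        this.delegate = delegate;
    }

    @Override
    public String translate(String text, String sourceLanguage, String targetLanguage) {
        if (delegate == null) {
            return text; // no delegate: return the source text unchanged
        }
        String key = sourceLanguage + ':' + targetLanguage + ':' + text;
        String cached = cache.get(key);
        if (cached == null) {
            cached = delegate.translate(text, sourceLanguage, targetLanguage);
            cache.put(key, cached);
        }
        return cached;
    }

    @Override
    public boolean isAvailable() {
        // Same condition as translate(): a missing delegate means "not available",
        // instead of a NullPointerException.
        return delegate != null && delegate.isAvailable();
    }
}
```

With a null delegate, isAvailable() returns false and translate() hands back
the input, so both methods agree on what "no translator" means.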





RE: [jira] [Created] (TIKA-1749) Upgrade, or shade, guava

2015-09-24 Thread Yaniv Kunda
Tika no longer uses Guava - it was removed in r1696860, see
https://issues.apache.org/jira/browse/TIKA-1710?focusedCommentId=14705823=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14705823

We still have some references in tika-bundle's pom, but no dependencies in
any component.

-----Original Message-----
From: Alexander Pogrenbyak (JIRA) [mailto:j...@apache.org]
Sent: Wednesday, September 23, 2015 22:37
To: dev@tika.apache.org
Subject: [jira] [Created] (TIKA-1749) Upgrade, or shade, guava

Alexander Pogrenbyak created TIKA-1749:
--

 Summary: Upgrade, or shade, guava
 Key: TIKA-1749
 URL: https://issues.apache.org/jira/browse/TIKA-1749
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.10
Reporter: Alexander Pogrenbyak


I use managed dependencies and have guava managed to version 18.0.

The tika-parsers project has guava version 11.0.2

I have a concern that managing up guava 18.0 may break something in Tika
code.

Besides the fact that 11.0.2 was deprecated a long time ago, if Tika has a
dependency on a particular version it should shade it for its own use.




-- 


This email communication (including any attachments) contains information 
from Answers Corporation or its affiliates that is confidential and may be 
privileged. The information contained herein is intended only for the use 
of the addressee(s) named above. If you are not the intended recipient (or 
the agent responsible to deliver it to the intended recipient), you are 
hereby notified that any dissemination, distribution, use, or copying of 
this communication is strictly prohibited. If you have received this email 
in error, please immediately reply to sender, delete the message and 
destroy all copies of it. If you have questions, please email 
le...@answers.com. 

If you wish to unsubscribe to commercial emails from Answers and its 
affiliates, please go to the Answers Subscription Center 
http://campaigns.answers.com/subscriptions to opt out.  Thank you.


[jira] [Created] (TIKA-1751) Use java.nio.file.Path in TikaConfig

2015-09-24 Thread Yaniv Kunda (JIRA)
Yaniv Kunda created TIKA-1751:
-

 Summary: Use java.nio.file.Path in TikaConfig
 Key: TIKA-1751
 URL: https://issues.apache.org/jira/browse/TIKA-1751
 Project: Tika
  Issue Type: Sub-task
  Components: config
Reporter: Yaniv Kunda
Priority: Minor
 Fix For: 1.11


Provide constructors accepting java.nio.file.Path





[jira] [Updated] (TIKA-1751) Use java.nio.file.Path in TikaConfig

2015-09-24 Thread Yaniv Kunda (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaniv Kunda updated TIKA-1751:
--
Attachment: TIKA-1751.patch

> Use java.nio.file.Path in TikaConfig
> 
>
> Key: TIKA-1751
> URL: https://issues.apache.org/jira/browse/TIKA-1751
> Project: Tika
>  Issue Type: Sub-task
>  Components: config
>    Reporter: Yaniv Kunda
>Priority: Minor
> Fix For: 1.11
>
> Attachments: TIKA-1751.patch
>
>
> Provide constructors accepting java.nio.file.Path





[jira] [Created] (TIKA-1752) Use java.nio.file.Path in org.apache.tika.detect

2015-09-24 Thread Yaniv Kunda (JIRA)
Yaniv Kunda created TIKA-1752:
-

 Summary: Use java.nio.file.Path in org.apache.tika.detect
 Key: TIKA-1752
 URL: https://issues.apache.org/jira/browse/TIKA-1752
 Project: Tika
  Issue Type: Sub-task
  Components: detector
Reporter: Yaniv Kunda
Priority: Minor
 Fix For: 1.11


Add constructors and methods accepting java.nio.file.Path to 
TrainedModelDetector & Son.





[jira] [Updated] (TIKA-1734) Use java.nio.file.Path in TemporaryResources

2015-09-24 Thread Yaniv Kunda (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaniv Kunda updated TIKA-1734:
--
Labels: java7  (was: )

> Use java.nio.file.Path in TemporaryResources
> 
>
> Key: TIKA-1734
> URL: https://issues.apache.org/jira/browse/TIKA-1734
> Project: Tika
>  Issue Type: Sub-task
>  Components: core
>    Reporter: Yaniv Kunda
>Priority: Minor
>  Labels: java7
> Fix For: 1.11
>
> Attachments: TIKA-1734.patch
>
>
> This will provide support for the new API for users who need it, and provide 
> better information in I/O operations, e.g. detailed exception if temporary 
> file deletion fails.
> - used Path and methods in java.nio.file.Files internally 
> - add setTemporaryFileDirectory(Path) method
> - add createTempFile() method (mimicking Files.createTempFile)
> - add unit test for proper deletion of temp files





[jira] [Updated] (TIKA-1752) Use java.nio.file.Path in org.apache.tika.detect

2015-09-24 Thread Yaniv Kunda (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaniv Kunda updated TIKA-1752:
--
Labels: java7  (was: )

> Use java.nio.file.Path in org.apache.tika.detect
> 
>
> Key: TIKA-1752
> URL: https://issues.apache.org/jira/browse/TIKA-1752
> Project: Tika
>  Issue Type: Sub-task
>  Components: detector
>    Reporter: Yaniv Kunda
>Priority: Minor
>  Labels: java7
> Fix For: 1.11
>
> Attachments: TIKA-1752.patch
>
>
> Add constructors and methods accepting java.nio.file.Path to 
> TrainedModelDetector & Son.





[jira] [Updated] (TIKA-1746) modify TikaFileTypeDetector to use new detect method accepting java.nio.file.Path

2015-09-24 Thread Yaniv Kunda (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaniv Kunda updated TIKA-1746:
--
Labels: java7  (was: )

> modify TikaFileTypeDetector to use new detect method accepting 
> java.nio.file.Path
> -
>
> Key: TIKA-1746
> URL: https://issues.apache.org/jira/browse/TIKA-1746
> Project: Tika
>  Issue Type: Sub-task
>  Components: detector
>    Reporter: Yaniv Kunda
>Priority: Minor
>  Labels: java7
> Fix For: 1.11
>
> Attachments: TIKA-1746.patch
>
>
> Utilize the new org.apache.tika.Tika.detect(Path) method





[jira] [Updated] (TIKA-1744) Use java.nio.file.Path in TikaInputStream

2015-09-22 Thread Yaniv Kunda (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaniv Kunda updated TIKA-1744:
--
Attachment: TIKA-1744.patch

> Use java.nio.file.Path in TikaInputStream
> -
>
> Key: TIKA-1744
> URL: https://issues.apache.org/jira/browse/TIKA-1744
> Project: Tika
>  Issue Type: Sub-task
>  Components: core
>    Reporter: Yaniv Kunda
>Priority: Minor
> Fix For: 1.11
>
> Attachments: TIKA-1744.patch
>
>
> This will provide support for the new API for users who need it, and provide 
> better information in I/O operations, e.g. detailed exception if file cannot 
> be read.
> - used Path and methods in java.nio.file.Files internally 
> - add getPath() method as the counterpart to getFile()
> - modified test to use 





[jira] [Updated] (TIKA-1745) Add methods accepting java.nio.file.Path to org.apache.tika.Tika and org.apache.tika.parser.ParsingReader

2015-09-22 Thread Yaniv Kunda (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaniv Kunda updated TIKA-1745:
--
Attachment: TIKA-1745.patch

> Add methods accepting java.nio.file.Path to org.apache.tika.Tika and 
> org.apache.tika.parser.ParsingReader
> -
>
> Key: TIKA-1745
> URL: https://issues.apache.org/jira/browse/TIKA-1745
> Project: Tika
>  Issue Type: Sub-task
>  Components: core
>    Reporter: Yaniv Kunda
>Priority: Minor
> Fix For: 1.11
>
> Attachments: TIKA-1745.patch
>
>
> Add methods accepting java.nio.file.Path to complement those accepting 
> java.io.File, using the new methods in TikaInputStream or java.nio.file.Files





[jira] [Updated] (TIKA-1746) modify TikaFileTypeDetector to use new detect method accepting java.nio.file.Path

2015-09-22 Thread Yaniv Kunda (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaniv Kunda updated TIKA-1746:
--
Attachment: TIKA-1746.patch

> modify TikaFileTypeDetector to use new detect method accepting 
> java.nio.file.Path
> -
>
> Key: TIKA-1746
> URL: https://issues.apache.org/jira/browse/TIKA-1746
> Project: Tika
>  Issue Type: Sub-task
>  Components: detector
>    Reporter: Yaniv Kunda
>Priority: Minor
> Fix For: 1.11
>
> Attachments: TIKA-1746.patch
>
>
> Utilize the new org.apache.tika.Tika.detect(Path) method





Re: [jira] [Commented] (TIKA-1726) Augment public methods that use a java.io.File with methods that use a java.nio.file.Path

2015-09-22 Thread Yaniv Kunda
Yes, using getPath() for the getFile() counterpart.
I'll prepare patches in a few hours.
On Sep 22, 2015 4:35 PM, "Tim Allison (JIRA)" <j...@apache.org> wrote:

>
> [
> https://issues.apache.org/jira/browse/TIKA-1726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14902613#comment-14902613
> ]
>
> Tim Allison commented on TIKA-1726:
> ---
>
> Thank you, [~kkrugler].  [~kunda], is there enough consensus on this to
> move forward?
>
> > Augment public methods that use a java.io.File with methods that use a
> > java.nio.file.Path
> > -
> >
> > Key: TIKA-1726
> > URL: https://issues.apache.org/jira/browse/TIKA-1726
> > Project: Tika
> >  Issue Type: Improvement
> >  Components: batch, core, gui, parser, translation
> >Reporter: Yaniv Kunda
> >Priority: Minor
> > Fix For: 1.11
> >
> >
> > In light of Java 7 already EOL, it's high time we add support for the
> > new java.nio.file.Path class introduced with it, which, together with
> > support methods in java.nio.file.Files and others, provide a better file
> > I/O framework than java.io.File.
> > In just two cases, we have public methods in tika that only return a
> > File object, and cannot be overloaded, so a different name for the new
> > method must be created:
> > - {{org.apache.tika.io.TemporaryResources#createTemporaryFile()}}
> > _Suggestions:_
> > -- addTemporaryFile
> > -- addTempFile
> > -- createTempFile
> > -- createTemporaryPath
> > - {{org.apache.tika.io.TikaInputStream#getFile()}}
> > _Suggestions:_
> > -- asFile
> > -- toPath
> > -- getPath
> > In other cases, the methods accept a File as an argument, and should
> > remain as tika users might be using them - so an overloaded method that
> > accepts a Path instead should be added, referencing the new method from the
> > old one (using the @see tag) until java.io.File itself is deprecated or
> > otherwise becomes obsolete.
> > Here is the full list of other methods:
> > _tika-app:_
> > - {{org.apache.tika.gui.TikaGUI#openFile(File)}}
> > _tika-batch:_
> > - {{org.apache.tika.batch.fs.FSUtil#getOutputFile(File, String, HANDLE_EXISTING, String)}}
> > - {{org.apache.tika.util.PropsUtil#getFile(String, File)}}
> > - {{org.apache.tika.batch.fs.FSDirectoryCrawler}} constructors
> > - {{org.apache.tika.batch.fs.FSDirectoryCrawler#handleFirstFileInDirectory(File)}}
> > - {{org.apache.tika.batch.fs.FSFileResource}} constructor
> > - {{org.apache.tika.batch.fs.FSListCrawler}} constructor
> > - {{org.apache.tika.batch.fs.FSOutputStreamFactory}} constructor
> > - {{org.apache.tika.batch.fs.FSUtil#checkThisIsAncestorOfOrSameAsThat(File, File)}}
> > - {{org.apache.tika.batch.fs.FSUtil#checkThisIsAncestorOfThat(File, File)}}
> > - {{org.apache.tika.batch.fs.strawman.StrawManTikaAppDriver}} constructor
> > _tika-core:_
> > - {{org.apache.tika.Tika#detect(File)}}
> > - {{org.apache.tika.Tika#parse(File)}}
> > - {{org.apache.tika.Tika#parseToString(File)}}
> > - {{org.apache.tika.config.TikaConfig}} constructors
> > - {{org.apache.tika.detect.NNExampleModelDetector}} constructor
> > - {{org.apache.tika.detect.TrainedModelDetector#loadDefaultModels(File)}}
> > - {{org.apache.tika.io.TemporaryResources#setTemporaryFileDirectory(File)}}
> > - {{org.apache.tika.io.TikaInputStream#get(File)}}
> > - {{org.apache.tika.io.TikaInputStream#get(File, Metadata)}}
> > _tika-parsers:_
> > - {{org.apache.tika.parser.ParsingReader}} constructor
> > - {{org.apache.tika.parser.image.ImageMetadataExtractor#parseJpeg(File)}}
> > - {{org.apache.tika.parser.image.ImageMetadataExtractor#parseWebP(File)}}
> > - {{org.apache.tika.parser.mp4.DirectFileReadDataSource}} constructor
> > _tika-translate:_
> > - {{org.apache.tika.language.translate.ExternalTranslator#runAndGetOutput(String, String[], File)}}
> > Due to lack of evidence, all public methods in public non-test classes
> > (and not in tika-example) are deemed part of a public API - although
> > there's no formal definition of such.
> > If anyone knows of a public method which isn't accessed publicly and can
> > be defined as package-private, or for another reason, please comment.
>
>
>
>


RE: [DISCUSS] Release Tika 1.11?

2015-09-21 Thread Yaniv Kunda
Thanks for the positive spirit!

Regarding FilenameUtils.getName() - I believe that its functionality can be
replaced by Path.getFileName() - and in a platform-aware manner, as each JVM
distribution comes with a specific provider implementation for the OS it's
for.
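
A small sketch of that idea, assuming nothing beyond the JDK (the
FileNameSketch class and nameOf method are made up for illustration):

```java
import java.nio.file.Path;
import java.nio.file.Paths;

class FileNameSketch {
    // Returns the last element of the path as a string,
    // or an empty string when the path has no elements (e.g. a bare root).
    static String nameOf(String rawPath) {
        Path fileName = Paths.get(rawPath).getFileName();
        return fileName == null ? "" : fileName.toString();
    }

    public static void main(String[] args) {
        System.out.println(nameOf("docs/reports/summary.pdf")); // prints "summary.pdf"
    }
}
```

One caveat with this approach: Path parses the string with the rules of the
default file system, so a Windows-style name such as C:\dir\file.txt is not
split on a Unix JVM, whereas FilenameUtils.getName() accepts both separator
styles regardless of platform. Paths.get can also throw
InvalidPathException for strings the platform considers illegal.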

-----Original Message-----
From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Monday, September 21, 2015 14:27
To: dev@tika.apache.org
Subject: RE: [DISCUSS] Release Tika 1.11?

+1, it would be great to move a bit more into EOL'd Java 7 asap.

I'll take TIKA-1734 by tomorrow EDT.

As for the other 2, I'm personally ok waiting for 1.12, but I defer to the
dev community.

Chris, Nick, Ray, Ken, Konstantin, if you have a chance to chime in on
TIKA-1726, that might help move things forward.

On TIKA-1706, I share Nick's and Jukka's caution, and I also share Yaniv's
point about duplication of code, bloat within Tika and missing out on
updates.   Aside from one small bit of code I'd like to keep or perhaps try
to move into commons-io (?)[1], I think I'm now +1 to going forward with
TIKA-1706 in core...unless there is a -1 from the community.

Best,

 Tim


[1] I added some customizations for old Mac OS behavior (treat ":" as file
separator) in FilenameUtils.getName() that I don't want to lose.


-----Original Message-----
From: Yaniv Kunda [mailto:yaniv.ku...@answers.com]
Sent: Sunday, September 20, 2015 7:15 AM
To: dev@tika.apache.org
Subject: RE: [DISCUSS] Release Tika 1.11?

I would really like to push the following:

https://issues.apache.org/jira/browse/TIKA-1706 - Bring back commons-io to
tika-core
This requires a decision to re-include commons-io as a dependency of
tika-core.
All the pros and cons have already been debated, but no decision has been
made.

https://issues.apache.org/jira/browse/TIKA-1726 - Augment public methods
that use a java.io.File with methods that use a java.nio.file.Path
Since this adds new methods to the public API, I asked the group to make
a decision about the new names - but have not received anything definite.
However, I did create a subtask -
https://issues.apache.org/jira/browse/TIKA-1734 Use java.nio.file.Path in
TemporaryResources - using [~tallison]'s suggestion, which has not been
committed yet.

If decisions are made on the above issues, I can quickly create patches for
them.

-----Original Message-----
From: Mattmann, Chris A (3980) [mailto:chris.a.mattm...@jpl.nasa.gov]
Sent: Saturday, September 19, 2015 08:10
To: dev@tika.apache.org
Subject: [DISCUSS] Release Tika 1.11?

Hey Guys and Gals,

I’d like to roll a 1.11 release. There is TIKA-1716 which in particular
allows some neat functionality in tika-python:
https://github.com/chrismattmann/tika-python/pull/67


Anything else to try and get into the release?

If not, I’ll produce an RC #1 by end of weekend.

Cheers,
Chris

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398) NASA Jet
Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department University of
Southern California, Los Angeles, CA 90089 USA
++



RE: [DISCUSS] Release Tika 1.11?

2015-09-20 Thread Yaniv Kunda
I would really like to push the following:

https://issues.apache.org/jira/browse/TIKA-1706 - Bring back commons-io to
tika-core
This requires a decision to re-include commons-io as a dependency of
tika-core.
All the pros and cons have already been debated, but no decision has been
made.

https://issues.apache.org/jira/browse/TIKA-1726 - Augment public methods
that use a java.io.File with methods that use a java.nio.file.Path
Since this adds new methods to the public API, I asked the group to make
a decision about the new names - but have not received anything definite.
However, I did create a subtask -
https://issues.apache.org/jira/browse/TIKA-1734 Use java.nio.file.Path in
TemporaryResources - using [~tallison]'s suggestion, which has not been
committed yet.

If decisions are made on the above issues, I can quickly create patches for
them.

-----Original Message-----
From: Mattmann, Chris A (3980) [mailto:chris.a.mattm...@jpl.nasa.gov]
Sent: Saturday, September 19, 2015 08:10
To: dev@tika.apache.org
Subject: [DISCUSS] Release Tika 1.11?

Hey Guys and Gals,

I’d like to roll a 1.11 release. There is TIKA-1716 which in particular
allows some neat functionality in tika-python:
https://github.com/chrismattmann/tika-python/pull/67


Anything else to try and get into the release?

If not, I’ll produce an RC #1 by end of weekend.

Cheers,
Chris

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398) NASA Jet
Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department University of
Southern California, Los Angeles, CA 90089 USA
++



[jira] [Updated] (TIKA-1738) ForkClient does not always delete temporary bootstrap jar

2015-09-20 Thread Yaniv Kunda (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaniv Kunda updated TIKA-1738:
--
Attachment: TIKA-1738.patch

This patch makes the bootstrap jar creation static, so that it happens only 
once, during class initialization.
Deletion is done using a single shutdown hook, which will *probably* do its 
job if no handle created by a forked process still references the file - i.e. 
if enough time has passed between the destruction of the last forked process 
and JVM shutdown.

It also uses java.nio.file instead of the old java.io package.

Added benefit: performance is better, since forked processes do not need to 
create the bootstrap jar all over again.
Added drawback: if the temp jar is deleted between forks, future forks will fail.
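
For readers without the patch at hand, the general pattern looks roughly like
this. It is a sketch only: BootstrapJarHolder and the file prefix are
hypothetical, the real jar contents are omitted, and the actual ForkClient
code may be organised differently.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical holder illustrating "create once, delete at JVM shutdown".
class BootstrapJarHolder {
    // Created a single time, when the class is initialized.
    private static final Path BOOTSTRAP_JAR = createOnce();

    private static Path createOnce() {
        try {
            final Path jar = Files.createTempFile("bootstrap-sketch-", ".jar");
            // Best-effort cleanup; deletion can still fail if a handle is open.
            Runtime.getRuntime().addShutdownHook(new Thread(new Runnable() {
                @Override
                public void run() {
                    try {
                        Files.deleteIfExists(jar);
                    } catch (IOException e) {
                        // Nothing sensible can be done this late in the JVM's life.
                    }
                }
            }));
            return jar;
        } catch (IOException e) {
            throw new IllegalStateException("Could not create bootstrap jar", e);
        }
    }

    // Every caller shares the same file instead of creating a new one.
    static Path get() {
        return BOOTSTRAP_JAR;
    }
}
```

Because all callers share one file, nothing recreates it if it disappears
while the JVM is still running, which is the drawback mentioned above.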

> ForkClient does not always delete temporary bootstrap jar
> -
>
> Key: TIKA-1738
> URL: https://issues.apache.org/jira/browse/TIKA-1738
> Project: Tika
>  Issue Type: Bug
>  Components: core
> Environment: Windows 10
>    Reporter: Yaniv Kunda
>Priority: Minor
> Fix For: 1.11
>
> Attachments: TIKA-1738.patch
>
>
> ForkClient creates a new temporary bootstrap jar each time it's instantiated, 
> and tries to delete it in the {{close()}} method, after destroying the 
> process.
> Possibly a Windows-specific behavior: the OS seems to still hold a handle to 
> the file for a short while after the process is destroyed, causing the 
> delete() method to do nothing.
> This can be reproduced by simply running ForkParserTest on my machine.
> In a long-running process, this could fill the temp folder with many bootstrap 
> jars that will never be deleted.





[jira] [Created] (TIKA-1734) Use java.nio.file.Path in TemporaryResources

2015-09-16 Thread Yaniv Kunda (JIRA)
Yaniv Kunda created TIKA-1734:
-

 Summary: Use java.nio.file.Path in TemporaryResources
 Key: TIKA-1734
 URL: https://issues.apache.org/jira/browse/TIKA-1734
 Project: Tika
  Issue Type: Sub-task
  Components: core
Reporter: Yaniv Kunda
Priority: Minor
 Fix For: 1.11


This will provide support for the new API for users who need it, and provide 
better information in I/O operations, e.g. detailed exception if temporary file 
deletion fails.

- used Path and methods in java.nio.file.Files internally 
- add setTemporaryFileDirectory(Path) method
- add createTempFile() method (mimicking Files.createTempFile)
- add unit test for proper deletion of temp files
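
To give an idea of what the list above amounts to, here is a simplified
sketch. TempPathTracker is a hypothetical class written for this note, not
the patched TemporaryResources; only the java.nio.file calls are real API.

```java
import java.io.Closeable;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified tracker of temporary files.
class TempPathTracker implements Closeable {
    private final List<Path> tempFiles = new ArrayList<>();
    private Path tempDirectory; // null means "use the system default"

    void setTemporaryFileDirectory(Path directory) {
        this.tempDirectory = directory;
    }

    Path createTempFile() throws IOException {
        Path file = tempDirectory == null
                ? Files.createTempFile("sketch-", ".tmp")
                : Files.createTempFile(tempDirectory, "sketch-", ".tmp");
        tempFiles.add(file);
        return file;
    }

    @Override
    public void close() throws IOException {
        IOException first = null;
        for (Path file : tempFiles) {
            try {
                // Unlike File.delete(), a failure here raises an exception
                // that says why the file could not be removed.
                Files.deleteIfExists(file);
            } catch (IOException e) {
                if (first == null) {
                    first = e;
                } else {
                    first.addSuppressed(e);
                }
            }
        }
        tempFiles.clear();
        if (first != null) {
            throw first;
        }
    }
}
```

The point of going through Files rather than File is the last step: a failed
deletion surfaces as an exception with a cause, instead of a silent false.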





[jira] [Updated] (TIKA-1734) Use java.nio.file.Path in TemporaryResources

2015-09-16 Thread Yaniv Kunda (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaniv Kunda updated TIKA-1734:
--
Attachment: TIKA-1734.patch

> Use java.nio.file.Path in TemporaryResources
> 
>
> Key: TIKA-1734
> URL: https://issues.apache.org/jira/browse/TIKA-1734
> Project: Tika
>  Issue Type: Sub-task
>  Components: core
>    Reporter: Yaniv Kunda
>Priority: Minor
> Fix For: 1.11
>
> Attachments: TIKA-1734.patch
>
>
> This will provide support for the new API for users who need it, and provide 
> better information in I/O operations, e.g. detailed exception if temporary 
> file deletion fails.
> - used Path and methods in java.nio.file.Files internally 
> - add setTemporaryFileDirectory(Path) method
> - add createTempFile() method (mimicking Files.createTempFile)
> - add unit test for proper deletion of temp files





RE: Adding API support for Java 7's java.nio.file.Path

2015-09-08 Thread Yaniv Kunda
Can we move this forward?



Already decided: Methods using java.io.File will be left as is, with a
@see Javadoc tag added to refer to the java.nio.file.Path counterpart.
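
As an illustration of that arrangement (ByteLoader and its load methods are
hypothetical and exist only for this example; they are not Tika methods):

```java
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

class ByteLoader {
    /**
     * Reads the whole file into memory.
     *
     * @see #load(Path)
     */
    static byte[] load(File file) throws IOException {
        // Legacy code path, left untouched for backward compatibility.
        try (InputStream in = new FileInputStream(file)) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            return out.toByteArray();
        }
    }

    /** Reads the whole file into memory using the java.nio.file API. */
    static byte[] load(Path path) throws IOException {
        return Files.readAllBytes(path);
    }
}
```

Existing callers keep compiling against load(File), while the Javadoc points
them to the Path overload.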



Not decided yet:

Names for the methods returning a java.nio.file.Path (especially
org.apache.tika.io.TemporaryResources#createTemporaryFile
and org.apache.tika.io.TikaInputStream#getFile)



We need either more opinions or a decision – this is an addition to the
public API so we need a sustainable decision.



*From:* Yaniv Kunda [mailto:yaniv.ku...@answers.com]
*Sent:* Tuesday, September 1, 2015 15:44
*To:* dev@tika.apache.org
*Subject:* RE: Adding API support for Java 7's java.nio.file.Path



I’ve formalized this issue here:

https://issues.apache.org/jira/browse/TIKA-1726



Please take the time and share your opinion on the new method names, so I
can go ahead and provide some patches.



*From:* Yaniv Kunda [mailto:yaniv.ku...@answers.com]
*Sent:* Monday, August 31, 2015 18:51
*To:* dev@tika.apache.org
*Subject:* Re: Adding API support for Java 7's java.nio.file.Path



I've already done that, I'm just waiting for the group's opinions on names
for the new methods, especially the two that I've added to augment
org.apache.tika.io.TemporaryResources#createTemporaryFile
And org.apache.tika.io.TikaInputStream#getFile
As described below.

On Aug 31, 2015 3:26 PM, "Konstantin Gribov" <gros...@gmail.com> wrote:

My two cents, we can migrate to Files.copy, Files.newBufferedReader etc in
places where it can replace commons-io and Tika's internal copy of it.
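
For reference, this is roughly what such a replacement might look like. The
sketch is not taken from Tika; NioCopySketch and firstLineViaTempFile are
made-up names, and only the java.nio.file calls are real API.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class NioCopySketch {
    // Spools a stream to a temporary file and returns the file's first line.
    static String firstLineViaTempFile(InputStream in) throws IOException {
        Path temp = Files.createTempFile("copy-sketch-", ".txt");
        try {
            // The temp file already exists, so REPLACE_EXISTING is required here.
            Files.copy(in, temp, StandardCopyOption.REPLACE_EXISTING);
            try (BufferedReader reader = Files.newBufferedReader(temp, StandardCharsets.UTF_8)) {
                return reader.readLine();
            }
        } finally {
            Files.deleteIfExists(temp);
        }
    }
}
```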

On Sat, Aug 29, 2015 at 19:48, Ken Krugler <kkrugler_li...@transpac.com> wrote:

>
> > From: Yaniv Kunda
> > Sent: August 29, 2015 2:21:23am PDT
> > To: dev@tika.apache.org
> > Subject: RE: Adding API support for Java 7's java.nio.file.Path
> >
> > In addition to the discussion I've raised about the methods returning a
> > File, I have another problem:
> > Some of the methods that accept a File throw a FileNotFoundException.
> > This exception is thrown by the FileInputStream/FileOutputStream/
> > RandomAccessFile constructors in response to anything - from a file
> > that's actually not there to access denied.
> > The NIO API methods usually declare that they throw an IOException, which
> > can be a subclass representing a more accurate reason -
> > NoSuchFileException or AccessDeniedException.
> >
> > When adding the overloaded methods accepting a Path, I initially thought
> to
> > delegate the old methods to the new ones, but the new ones declare an
> > IOException while the old declare a FileNotFoundException.
> >
> > I have three options:
> > 1) Leave the old methods with their own code -
> > this means essentially duplicate code, but complete backward
> > compatibility.
>
> +1
>
> I don't feel strongly, but I think we get max bang for the development
> buck by doing the simplest thing here.
>
> And it doesn't feel like it'll be that long before Tika 2.0, when the old
> method code can be removed.
>
> -- Ken
>
> > 2) Delegate the old methods to the new ones, but catch the IOException
> > and wrap it in a FileNotFoundException -
> > this will remain backward compatible, unless some code catching a
> > FileNotFoundException does text analysis on the exception message.
> > 3) Delegate the old methods to the new ones, and change the signature
> > accordingly to throw an IOException instead of a FileNotFoundException -
> > this will break backward compatibility, but only in cases where a
> > FileNotFoundException was caught explicitly.
> >
> > What do you think?
> >
> > -Original Message-
> > From: Yaniv Kunda [mailto:yaniv.ku...@answers.com]
> > Sent: Friday, August 28, 2015 03:33
> > To: dev@tika.apache.org
> > Subject: RE: Adding API support for Java 7's java.nio.file.Path
> >
> > Thanks, I just like to move things forward :-)
> >
> > Regarding my proposed API additions -
> > since adding new methods will make them a part of a new API, this is a
> > chance to make their names more meaningful/concise/correct: replacing
> > File with Path in the method name might be awkward.
> >
> > I'd like to gather alternatives for the changes/additions to methods
> > that return a File.
> > I found a total of 4 methods that return a java.io.File and are public,
> > in public non-test classes and not in tika-example (I assume the rest
> > can be changed without breaking anything).
> > For each method I will provide my suggestion/s, which will be either
> > "Add newName", "Replace with newName" or "Keep":
> >
> > tika-batch:
> > - org.apache.tika.batch.fs.FSUtil#getOutputFile
> > + Keep
> > - org.apache.t
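
For readers following the FileNotFoundException question in the quoted thread above, options 1 and 2 could be sketched roughly as follows. This is only an illustration under assumptions: OpenSketch and its three methods are hypothetical names, not Tika API, and the real methods under discussion do more than open a stream.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical class, not Tika code.
final class OpenSketch {

    /** New Path-based method: NIO reports the precise cause as an IOException subclass. */
    static InputStream open(Path path) throws IOException {
        return Files.newInputStream(path);
    }

    /** Option 1: the old method keeps its own java.io code, so behaviour is unchanged. */
    static InputStream openKeepingOldCode(File file) throws FileNotFoundException {
        return new FileInputStream(file);
    }

    /** Option 2: delegate to the new method, translating to keep the old signature. */
    static InputStream openDelegating(File file) throws FileNotFoundException {
        try {
            return open(file.toPath());
        } catch (IOException e) {
            // The message now comes from NIO, so it may differ from the old one.
            FileNotFoundException wrapped = new FileNotFoundException(e.getMessage());
            wrapped.initCause(e);
            throw wrapped;
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("sketch");
        File missing = dir.resolve("missing.txt").toFile();
        try {
            openKeepingOldCode(missing).close();
        } catch (FileNotFoundException e) {
            System.out.println("option 1 message: " + e.getMessage());
        }
        try {
            openDelegating(missing).close();
        } catch (FileNotFoundException e) {
            System.out.println("option 2 message: " + e.getMessage());
            System.out.println("option 2 cause:   " + e.getCause().getClass().getName());
        }
        Files.delete(dir);
    }
}
```

Both variants keep the declared FileNotFoundException, but the message text is not guaranteed to match between them, which is the caveat attached to option 2.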

[jira] [Updated] (TIKA-1726) Augment public methods that use a java.io.File with methods that use a java.nio.file.Path

2015-09-05 Thread Yaniv Kunda (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaniv Kunda updated TIKA-1726:
--
Description: 
In light of Java 7 already EOL, it's high time we add support for the new 
java.nio.file.Path class introduced with it, which, together with support 
methods in java.nio.file.Files and others, provide a better file I/O framework 
than java.io.File.

In just two cases, we have public methods in tika that only return a File 
object, and cannot be overloaded, so a different name for the new method must 
be created:
- {{org.apache.tika.io.TemporaryResources#createTemporaryFile()}}
_Suggestions:_
-- addTemporaryFile
-- addTempFile
-- createTempFile
- {{org.apache.tika.io.TikaInputStream#getFile()}}
_Suggestions:_
-- asFile
-- toPath
-- getPath

In other cases, the methods accept a File as an argument, and should remain as 
tika users might be using them - so an overloaded method that accepts a Path 
instead should be added, referencing the new method from the old one (using the 
@see tag) and deprecating the old method until an unknown tika major release.
Here is the full list of other methods:
_tika-app:_
- {{org.apache.tika.gui.TikaGUI#openFile(File)}}

_tika-batch:_
- {{org.apache.tika.batch.fs.FSUtil#getOutputFile(File, String, 
HANDLE_EXISTING, String)}}
- {{org.apache.tika.util.PropsUtil#getFile(String, File)}}
- {{org.apache.tika.batch.fs.FSDirectoryCrawler}} constructors
- 
{{org.apache.tika.batch.fs.FSDirectoryCrawler#handleFirstFileInDirectory(File)}}
- {{org.apache.tika.batch.fs.FSFileResource}} constructor
- {{org.apache.tika.batch.fs.FSListCrawler}} constructor
- {{org.apache.tika.batch.fs.FSOutputStreamFactory}} constructor
- {{org.apache.tika.batch.fs.FSUtil#checkThisIsAncestorOfOrSameAsThat(File, 
File)}}
- {{org.apache.tika.batch.fs.FSUtil#checkThisIsAncestorOfThat(File, File)}}
- {{org.apache.tika.batch.fs.strawman.StrawManTikaAppDriver}} constructor

_tika-core:_
- {{org.apache.tika.Tika#detect(File)}}
- {{org.apache.tika.Tika#parse(File)}}
- {{org.apache.tika.Tika#parseToString(File)}}
- {{org.apache.tika.config.TikaConfig}} constructors
- {{org.apache.tika.detect.NNExampleModelDetector}} constructor
- {{org.apache.tika.detect.TrainedModelDetector#loadDefaultModels(File)}}
- {{org.apache.tika.io.TemporaryResources#setTemporaryFileDirectory(File)}}
- {{org.apache.tika.io.TikaInputStream#get(File)}}
- {{org.apache.tika.io.TikaInputStream#get(File, Metadata)}}

_tika-parsers:_
- {{org.apache.tika.parser.ParsingReader}} constructor
- {{org.apache.tika.parser.image.ImageMetadataExtractor#parseJpeg(File)}}
- {{org.apache.tika.parser.image.ImageMetadataExtractor#parseWebP(File)}}
- {{org.apache.tika.parser.mp4.DirectFileReadDataSource}} constructor

_tika-translate:_
- 
{{org.apache.tika.language.translate.ExternalTranslator#runAndGetOutput(String, 
String[], File)}}

Due to lack of evidence, all public methods in public non-test classes (and not 
in tika-example) are deemed part of a public API - although there's no formal 
definition of such.
If anyone knows of a public method which isn't accessed publicly and can be 
defined as package-private, or for another reason, please comment.


  was:
In light of Java 7 already EOL, it's high time we add support for the new 
java.nio.file.Path class introduced with it, which, together with support 
methods in java.nio.file.Files and others, provide a better file I/O framework 
than java.io.File.

In just two cases, we have public methods in tika that only return a File 
object, and cannot be overloaded, so a different name for the new method must 
be created:
- {{org.apache.tika.io.TemporaryResources#createTemporaryFile()}}
_Suggestions:_
-- addTemporaryFile
-- addTempFile
-- createTempFile
- {{org.apache.tika.io.TikaInputStream#getFile()}}
_Suggestions:_
-- asFile
-- toPath
-- getPath

In other cases, the methods accept a File as an argument, and should remain as 
tika users might be using them - so an overloaded method that accepts a Path 
instead should be added, deprecating the old method until an unknown tika major 
release.
Here is the full list of other methods:
_tika-app:_
- {{org.apache.tika.gui.TikaGUI#openFile(File)}}

_tika-batch:_
- {{org.apache.tika.batch.fs.FSUtil#getOutputFile(File, String, 
HANDLE_EXISTING, String)}}
- {{org.apache.tika.util.PropsUtil#getFile(String, File)}}
- {{org.apache.tika.batch.fs.FSDirectoryCrawler}} constructors
- 
{{org.apache.tika.batch.fs.FSDirectoryCrawler#handleFirstFileInDirectory(File)}}
- {{org.apache.tika.batch.fs.FSFileResource}} constructor
- {{org.apache.tika.batch.fs.FSListCrawler}} constructor
- {{org.apache.tika.batch.fs.FSOutputStreamFactory}} constructor
- {{org.apache.tika.batch.fs.FSUtil#checkThisIsAncestorOfOrSameAsThat(File, 
File)}}
- {{org.apache.tika.batch.fs.FSUtil#checkThisIsAncestorOfThat(File, File)}}
- {{org.apache.tika.batch.fs.strawman.StrawManTikaAppDriver}} constructor

[jira] [Updated] (TIKA-1726) Augment public methods that use a java.io.File with methods that use a java.nio.file.Path

2015-09-05 Thread Yaniv Kunda (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaniv Kunda updated TIKA-1726:
--
Description: 
In light of Java 7 already EOL, it's high time we add support for the new 
java.nio.file.Path class introduced with it, which, together with support 
methods in java.nio.file.Files and others, provide a better file I/O framework 
than java.io.File.

In just two cases, we have public methods in tika that only return a File 
object, and cannot be overloaded, so a different name for the new method must 
be created:
- {{org.apache.tika.io.TemporaryResources#createTemporaryFile()}}
_Suggestions:_
-- addTemporaryFile
-- addTempFile
-- createTempFile
-- createTemporaryPath
- {{org.apache.tika.io.TikaInputStream#getFile()}}
_Suggestions:_
-- asFile
-- toPath
-- getPath

In other cases, the methods accept a File as an argument, and should remain as 
tika users might be using them - so an overloaded method that accepts a Path 
instead should be added, referencing the new method from the old one (using the 
@see tag) until java.io.File itself is deprecated or otherwise becomes obsolete.
Here is the full list of other methods:
_tika-app:_
- {{org.apache.tika.gui.TikaGUI#openFile(File)}}

_tika-batch:_
- {{org.apache.tika.batch.fs.FSUtil#getOutputFile(File, String, 
HANDLE_EXISTING, String)}}
- {{org.apache.tika.util.PropsUtil#getFile(String, File)}}
- {{org.apache.tika.batch.fs.FSDirectoryCrawler}} constructors
- 
{{org.apache.tika.batch.fs.FSDirectoryCrawler#handleFirstFileInDirectory(File)}}
- {{org.apache.tika.batch.fs.FSFileResource}} constructor
- {{org.apache.tika.batch.fs.FSListCrawler}} constructor
- {{org.apache.tika.batch.fs.FSOutputStreamFactory}} constructor
- {{org.apache.tika.batch.fs.FSUtil#checkThisIsAncestorOfOrSameAsThat(File, 
File)}}
- {{org.apache.tika.batch.fs.FSUtil#checkThisIsAncestorOfThat(File, File)}}
- {{org.apache.tika.batch.fs.strawman.StrawManTikaAppDriver}} constructor

_tika-core:_
- {{org.apache.tika.Tika#detect(File)}}
- {{org.apache.tika.Tika#parse(File)}}
- {{org.apache.tika.Tika#parseToString(File)}}
- {{org.apache.tika.config.TikaConfig}} constructors
- {{org.apache.tika.detect.NNExampleModelDetector}} constructor
- {{org.apache.tika.detect.TrainedModelDetector#loadDefaultModels(File)}}
- {{org.apache.tika.io.TemporaryResources#setTemporaryFileDirectory(File)}}
- {{org.apache.tika.io.TikaInputStream#get(File)}}
- {{org.apache.tika.io.TikaInputStream#get(File, Metadata)}}

_tika-parsers:_
- {{org.apache.tika.parser.ParsingReader}} constructor
- {{org.apache.tika.parser.image.ImageMetadataExtractor#parseJpeg(File)}}
- {{org.apache.tika.parser.image.ImageMetadataExtractor#parseWebP(File)}}
- {{org.apache.tika.parser.mp4.DirectFileReadDataSource}} constructor

_tika-translate:_
- 
{{org.apache.tika.language.translate.ExternalTranslator#runAndGetOutput(String, 
String[], File)}}

Due to lack of evidence, all public methods in public non-test classes (and not 
in tika-example) are deemed part of a public API - although there's no formal 
definition of such.
If anyone knows of a public method which isn't accessed publicly and can be 
defined as package-private, or for another reason, please comment.


  was:
In light of Java 7 already EOL, it's high time we add support for the new 
java.nio.file.Path class introduced with it, which, together with support 
methods in java.nio.file.Files and others, provide a better file I/O framework 
than java.io.File.

In just two cases, we have public methods in tika that only return a File 
object, and cannot be overloaded, so a different name for the new method must 
be created:
- {{org.apache.tika.io.TemporaryResources#createTemporaryFile()}}
_Suggestions:_
-- addTemporaryFile
-- addTempFile
-- createTempFile
- {{org.apache.tika.io.TikaInputStream#getFile()}}
_Suggestions:_
-- asFile
-- toPath
-- getPath

In other cases, the methods accept a File as an argument, and should remain as 
tika users might be using them - so an overloaded method that accepts a Path 
instead should be added, referencing the new method from the old one (using the 
@see tag) and deprecating the old method until an unknown tika major release.
Here is the full list of other methods:
_tika-app:_
- {{org.apache.tika.gui.TikaGUI#openFile(File)}}

_tika-batch:_
- {{org.apache.tika.batch.fs.FSUtil#getOutputFile(File, String, 
HANDLE_EXISTING, String)}}
- {{org.apache.tika.util.PropsUtil#getFile(String, File)}}
- {{org.apache.tika.batch.fs.FSDirectoryCrawler}} constructors
- 
{{org.apache.tika.batch.fs.FSDirectoryCrawler#handleFirstFileInDirectory(File)}}
- {{org.apache.tika.batch.fs.FSFileResource}} constructor
- {{org.apache.tika.batch.fs.FSListCrawler}} constructor
- {{org.apache.tika.batch.fs.FSOutputStreamFactory}} constructor
- {{org.apache.tika.batch.fs.FSUtil#checkThisIsAncestorOfOrSameAsThat(File, 
File)}}
- {{org.apache.tika.batch.fs.FSUtil

RE: OSGi exceptions in trunk w Intellij +Osmorc

2015-09-03 Thread Yaniv Kunda
Probably JDom:
http://www.jdom.org/pipermail/jdom-interest/2008-November/016226.html
https://developer.atlassian.com/docs/faq/plugin-framework-faq/using-jdom-in-osgi
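
If it helps with tracking down the offender, one possible approach is to scan each embedded jar for class files that sit at the archive root, i.e. in the default package. The following is a rough sketch, not something taken from the Tika build; DefaultPackageScan is a made-up name and only standard java.util.jar calls are used.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;

// Hypothetical diagnostic helper, not part of Tika or Osmorc.
final class DefaultPackageScan {

    /** Returns the names of .class entries that have no package directory. */
    static List<String> defaultPackageClasses(String jarPath) throws IOException {
        List<String> hits = new ArrayList<>();
        try (JarFile jar = new JarFile(jarPath)) {
            Enumeration<JarEntry> entries = jar.entries();
            while (entries.hasMoreElements()) {
                String name = entries.nextElement().getName();
                // An entry without '/' lives at the jar root (default package).
                if (name.endsWith(".class") && name.indexOf('/') < 0) {
                    hits.add(name);
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: DefaultPackageScan <jar> [<jar> ...]");
            return;
        }
        for (String jarPath : args) {
            System.out.println(jarPath + " -> " + defaultPackageClasses(jarPath));
        }
    }
}
```

Running it over the jars that the bundle embeds should list any root-level classes, which would be the candidates for exclusion.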
On Sep 3, 2015 9:05 PM, "Allison, Timothy B."  wrote:

> Interesting, thank you.
>
> I wasn't getting any pointers to the offending class before I removed the
> plugin.
>
> Any recommendations on finding the offender?
>
> -Original Message-
> From: Bob Paulin [mailto:b...@bobpaulin.com]
> Sent: Thursday, September 03, 2015 11:21 AM
> To: dev@tika.apache.org
> Subject: Re: OSGi exceptions in trunk w Intellij +Osmorc
>
> It's likely one of the embedded dependencies have class files in the
> default package.  If these classes are not being used they could just be
> removed as suggested here:
>
>
> https://techotom.wordpress.com/2014/10/21/fixing-the-default-package-is-not-permitted-by-the-import-package-syntax-with-maven-bundle-plugin/
>
> Do we know which dependency this might be?  I agree that it would be
> better if this all worked in Intellij with the Osmorc plugin.
>
> - Bob
>
> On Thu, Sep 3, 2015 at 10:10 AM, Allison, Timothy B. 
> wrote:
>
> > All,
> >
> > I'm able to build via Maven without any problem.  However, within
> > Intellij, I'm not able to run any individual unit tests in
> > tika-parsers or tika-xmp because of this error:
> >
> > Error:osgi: [tika-parsers] The default package '.' is not permitted by
> > the Import-Package syntax.
> >  This can be caused by compile errors in Eclipse because Eclipse
> > creates valid class files regardless of compile errors.
> > The following package(s) import from the default package null
> >
> > If I remove Osmorc (the OSGi plugin), all is ok, but that seems like a
> > really bad idea.  Is this something we should fix, or is this
> > something that I should ignore?
> >
> >  Best,
> >
> >   Tim
> >
> >
>

-- 


This email communication (including any attachments) contains information 
from Answers Corporation or its affiliates that is confidential and may be 
privileged. The information contained herein is intended only for the use 
of the addressee(s) named above. If you are not the intended recipient (or 
the agent responsible to deliver it to the intended recipient), you are 
hereby notified that any dissemination, distribution, use, or copying of 
this communication is strictly prohibited. If you have received this email 
in error, please immediately reply to sender, delete the message and 
destroy all copies of it. If you have questions, please email 
le...@answers.com. 

If you wish to unsubscribe to commercial emails from Answers and its 
affiliates, please go to the Answers Subscription Center 
http://campaigns.answers.com/subscriptions to opt out.  Thank you.


[jira] [Commented] (TIKA-1726) Augment public methods that use a java.io.File with methods that use a java.nio.file.Path

2015-09-01 Thread Yaniv Kunda (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14725769#comment-14725769
 ] 

Yaniv Kunda commented on TIKA-1726:
---

Funny you proposed those two alternatives - exactly what I started with... But 
compared to the methods in the Java platform it seems partly incorrect as these 
mostly deal with files, located using paths, e.g. Files.createTempFile.
So for createTemporaryFile I think that createTemporaryPath is problematic in 
the sense that an actual file is created, not just a path.
I suggested the add* variants to hint that the file is added to the list of 
resources to close, as in addResource.
For getFile, getPath is actually pretty ok but I think both are problematic in 
that they look like a getter - I wanted to signify its write-to-file 
functionality.
How about save/store/persist?

Regarding deprecation, not a problem - I'll drop it and add a @see tag from the 
old method to the new one (but not the other way round?).

In both questions, my suggestions are only suggestions, and my reservations are 
only reservations - but if you can take a call and make any decision I'd be 
happy to accept it and move this forward.
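
To illustrate the naming concern above (a real file is created and is also registered for cleanup), here is a minimal sketch. TempResourcesSketch is a hypothetical stand-in, not the actual TemporaryResources class, and createTempFile is just one of the candidate names from the issue, not a decision.

```java
import java.io.Closeable;
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical stand-in for a temporary-resource tracker; not Tika code.
final class TempResourcesSketch implements Closeable {

    private final Deque<Path> tracked = new ArrayDeque<>();

    /** Path-returning variant: creates an actual file and registers it for cleanup. */
    Path createTempFile() throws IOException {
        Path path = Files.createTempFile("sketch-", ".tmp");
        tracked.addFirst(path);
        return path;
    }

    /**
     * Legacy File-returning variant, pointing at the new method.
     *
     * @see #createTempFile()
     */
    File createTemporaryFile() throws IOException {
        return createTempFile().toFile();
    }

    /** Deletes every tracked file, most recently created first. */
    @Override
    public void close() throws IOException {
        while (!tracked.isEmpty()) {
            Files.deleteIfExists(tracked.removeFirst());
        }
    }

    public static void main(String[] args) throws IOException {
        Path created;
        try (TempResourcesSketch resources = new TempResourcesSketch()) {
            created = resources.createTempFile();
            System.out.println("exists while open: " + Files.exists(created));
        }
        System.out.println("exists after close: " + Files.exists(created));
    }
}
```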



> Augment public methods that use a java.io.File with methods that use a 
> java.nio.file.Path
> -
>
> Key: TIKA-1726
> URL: https://issues.apache.org/jira/browse/TIKA-1726
> Project: Tika
>  Issue Type: Improvement
>  Components: batch, core, gui, parser, translation
>    Reporter: Yaniv Kunda
>Priority: Minor
> Fix For: 1.11
>
>
> In light of Java 7 already EOL, it's high time we add support for the new 
> java.nio.file.Path class introduced with it, which, together with support 
> methods in java.nio.file.Files and others, provide a better file I/O 
> framework than java.io.File.
> In just two cases, we have public methods in tika that only return a File 
> object, and cannot be overloaded, so a different name for the new method must 
> be created:
> - {{org.apache.tika.io.TemporaryResources#createTemporaryFile()}}
> _Suggestions:_
> -- addTemporaryFile
> -- addTempFile
> -- createTempFile
> - {{org.apache.tika.io.TikaInputStream#getFile()}}
> _Suggestions:_
> -- asFile
> -- toPath
> -- getPath
> In other cases, the methods accept a File as an argument, and should remain 
> as tika users might be using them - so an overloaded method that accepts a 
> Path instead should be added, deprecating the old method until an unknown 
> tika major release.
> Here is the full list of other methods:
> _tika-app:_
> - {{org.apache.tika.gui.TikaGUI#openFile(File)}}
> _tika-batch:_
> - {{org.apache.tika.batch.fs.FSUtil#getOutputFile(File, String, 
> HANDLE_EXISTING, String)}}
> - {{org.apache.tika.util.PropsUtil#getFile(String, File)}}
> - {{org.apache.tika.batch.fs.FSDirectoryCrawler}} constructors
> - 
> {{org.apache.tika.batch.fs.FSDirectoryCrawler#handleFirstFileInDirectory(File)}}
> - {{org.apache.tika.batch.fs.FSFileResource}} constructor
> - {{org.apache.tika.batch.fs.FSListCrawler}} constructor
> - {{org.apache.tika.batch.fs.FSOutputStreamFactory}} constructor
> - {{org.apache.tika.batch.fs.FSUtil#checkThisIsAncestorOfOrSameAsThat(File, 
> File)}}
> - {{org.apache.tika.batch.fs.FSUtil#checkThisIsAncestorOfThat(File, File)}}
> - {{org.apache.tika.batch.fs.strawman.StrawManTikaAppDriver}} constructor
> _tika-core:_
> - {{org.apache.tika.Tika#detect(File)}}
> - {{org.apache.tika.Tika#parse(File)}}
> - {{org.apache.tika.Tika#parseToString(File)}}
> - {{org.apache.tika.config.TikaConfig}} constructors
> - {{org.apache.tika.detect.NNExampleModelDetector}} constructor
> - {{org.apache.tika.detect.TrainedModelDetector#loadDefaultModels(File)}}
> - {{org.apache.tika.io.TemporaryResources#setTemporaryFileDirectory(File)}}
> - {{org.apache.tika.io.TikaInputStream#get(File)}}
> - {{org.apache.tika.io.TikaInputStream#get(File, Metadata)}}
> _tika-parsers:_
> - {{org.apache.tika.parser.ParsingReader}} constructor
> - {{org.apache.tika.parser.image.ImageMetadataExtractor#parseJpeg(File)}}
> - {{org.apache.tika.parser.image.ImageMetadataExtractor#parseWebP(File)}}
> - {{org.apache.tika.parser.mp4.DirectFileReadDataSource}} constructor
> _tika-translate:_
> - 
> {{org.apache.tika.language.translate.ExternalTranslator#runAndGetOutput(String,
>  String[], File)}}
> Due to lack of evidence, all public methods in public non-test classes (and 
> not in tika-example) are deemed part of a public API - although there's no 
> formal definition of such.
> If anyone knows of a public method which isn't accessed publicly and can be 
> defined as package-private, or for another reason, please comment.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (TIKA-1726) Augment public methods that use a java.io.File with methods that use a java.nio.file.Path

2015-09-01 Thread Yaniv Kunda (JIRA)
Yaniv Kunda created TIKA-1726:
-

 Summary: Augment public methods that use a java.io.File with 
methods that use a java.nio.file.Path
 Key: TIKA-1726
 URL: https://issues.apache.org/jira/browse/TIKA-1726
 Project: Tika
  Issue Type: Improvement
  Components: batch, core, gui, parser, translation
Reporter: Yaniv Kunda
Priority: Minor
 Fix For: 1.11


In light of Java 7 already EOL, it's high time we add support for the new 
java.nio.file.Path class introduced with it, which, together with support 
methods in java.nio.file.Files and others, provide a better file I/O framework 
than java.io.File.

In just two cases, we have public methods in tika that only return a File 
object, and cannot be overloaded, so a different name for the new method must 
be created:
- {{org.apache.tika.io.TemporaryResources#createTemporaryFile()}}
_Suggestions:_
-- addTemporaryFile
-- addTempFile
-- createTempFile
- {{org.apache.tika.io.TikaInputStream#getFile()}}
_Suggestions:_
-- asFile
-- toPath
-- getPath

In other cases, the methods accept a File as an argument, and should remain as 
tika users might be using them - so an overloaded method that accepts a Path 
instead should be added, deprecating the old method until an unknown tika major 
release.
Here is the full list of other methods:
_tika-app:_
- {{org.apache.tika.gui.TikaGUI#openFile(File)}}

_tika-batch:_
- {{org.apache.tika.batch.fs.FSUtil#getOutputFile(File, String, 
HANDLE_EXISTING, String)}}
- {{org.apache.tika.util.PropsUtil#getFile(String, File)}}
- {{org.apache.tika.batch.fs.FSDirectoryCrawler}} constructors
- 
{{org.apache.tika.batch.fs.FSDirectoryCrawler#handleFirstFileInDirectory(File)}}
- {{org.apache.tika.batch.fs.FSFileResource}} constructor
- {{org.apache.tika.batch.fs.FSListCrawler}} constructor
- {{org.apache.tika.batch.fs.FSOutputStreamFactory}} constructor
- {{org.apache.tika.batch.fs.FSUtil#checkThisIsAncestorOfOrSameAsThat(File, 
File)}}
- {{org.apache.tika.batch.fs.FSUtil#checkThisIsAncestorOfThat(File, File)}}
- {{org.apache.tika.batch.fs.strawman.StrawManTikaAppDriver}} constructor

_tika-core:_
- {{org.apache.tika.Tika#detect(File)}}
- {{org.apache.tika.Tika#parse(File)}}
- {{org.apache.tika.Tika#parseToString(File)}}
- {{org.apache.tika.config.TikaConfig}} constructors
- {{org.apache.tika.detect.NNExampleModelDetector}} constructor
- {{org.apache.tika.detect.TrainedModelDetector#loadDefaultModels(File)}}
- {{org.apache.tika.io.TemporaryResources#setTemporaryFileDirectory(File)}}
- {{org.apache.tika.io.TikaInputStream#get(File)}}
- {{org.apache.tika.io.TikaInputStream#get(File, Metadata)}}

_tika-parsers:_
- {{org.apache.tika.parser.ParsingReader}} constructor
- {{org.apache.tika.parser.image.ImageMetadataExtractor#parseJpeg(File)}}
- {{org.apache.tika.parser.image.ImageMetadataExtractor#parseWebP(File)}}
- {{org.apache.tika.parser.mp4.DirectFileReadDataSource}} constructor

_tika-translate:_
- 
{{org.apache.tika.language.translate.ExternalTranslator#runAndGetOutput(String, 
String[], File)}}

Due to lack of evidence, all public methods in public non-test classes (and not 
in tika-example) are deemed part of a public API - although there's no formal 
definition of such.
If anyone knows of a public method which isn't accessed publicly and can be 
defined as package-private, or for another reason, please comment.
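
As a rough illustration of the overload pattern described in this issue, a File-accepting method can stay in place next to a new Path-accepting one of the same name. OverloadSketch and readAll are hypothetical and chosen only for the example; they are not among the methods listed above.

```java
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical class, not Tika code.
final class OverloadSketch {

    /**
     * Existing File-based method, kept so current callers continue to compile.
     *
     * @see #readAll(Path)
     */
    static String readAll(File file) throws IOException {
        try (InputStream in = new FileInputStream(file)) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        }
    }

    /** New overload accepting a Path: same name, different parameter type. */
    static String readAll(Path path) throws IOException {
        return new String(Files.readAllBytes(path), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("sketch-", ".txt");
        try {
            Files.write(tmp, "same content".getBytes(StandardCharsets.UTF_8));
            System.out.println(readAll(tmp));          // Path overload
            System.out.println(readAll(tmp.toFile())); // File overload
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

One side effect worth keeping in mind: once both overloads exist, a call that passes a bare null literal no longer compiles without a cast, because File and Path are unrelated types.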






RE: Adding API support for Java 7's java.nio.file.Path

2015-09-01 Thread Yaniv Kunda
I’ve formalized this issue here:

https://issues.apache.org/jira/browse/TIKA-1726



Please take the time to share your opinion on the new method names, so I
can go ahead and provide some patches.



*From:* Yaniv Kunda [mailto:yaniv.ku...@answers.com]
*Sent:* Monday, August 31, 2015 18:51
*To:* dev@tika.apache.org
*Subject:* Re: Adding API support for Java 7's java.nio.file.Path



I've already done that, I'm just waiting for the group's opinions on names
for the new methods, especially the two that I've added to augment
org.apache.tika.io.TemporaryResources#createTemporaryFile
And org.apache.tika.io.TikaInputStream#getFile
As described below.

On Aug 31, 2015 3:26 PM, "Konstantin Gribov" <gros...@gmail.com> wrote:

My two cents, we can migrate to Files.copy, Files.newBufferedReader etc in
places where it can replace commons-io and Tika's internal copy of it.

сб, 29 авг. 2015 г. в 19:48, Ken Krugler <kkrugler_li...@transpac.com>:

>
> > From: Yaniv Kunda
> > Sent: August 29, 2015 2:21:23am PDT
> > To: dev@tika.apache.org
> > Subject: RE: Adding API support for Java 7's java.nio.file.Path
> >
> > In addition to the discussion I've raised about the methods returning a
> > File, I have another problem:
> > Some of the methods that accept a File throw a FileNotFoundException.
> > This exception is thrown by FIS/FOS/RAF constructors in response to
> > anything - from a file that's actually not there to access denied.
> > The NIO API methods usually declare an IOException, which can be a
> > subclass representing a more accurate reason - NoSuchFileException or
> > AccessDeniedException.
> >
> > When adding the overloaded methods accepting a Path, I initially thought
> > to delegate the old methods to the new ones, but the new ones declare an
> > IOException while the old declare a FileNotFoundException.
> >
> > I have three options:
> > 1) Leave the old methods with their own code -
> > this means essentially duplicate code, but complete backward
> > compatibility.
>
> +1
>
> I don't feel strongly, but I think we get max bang for the development
> buck by doing the simplest thing here.
>
> And it doesn't feel like it'll be that long before Tika 2.0, when the old
> method code can be removed.
>
> -- Ken
>
> > 2) Delegate the old methods to the new ones, but catch the IOException
> > and wrap it in a FileNotFoundException -
> > this will remain backward compatible, unless some code catching a
> > FileNotFoundException does text analysis on the exception message.
> > 3) Delegate the old methods to the new ones, and change the signature
> > accordingly to throw an IOException instead of a FileNotFoundException -
> > this will break backward compatibility, but only in cases where a
> > FileNotFoundException was caught explicitly.
> >
> > What do you think?
> >
> > -Original Message-
> > From: Yaniv Kunda [mailto:yaniv.ku...@answers.com]
> > Sent: Friday, August 28, 2015 03:33
> > To: dev@tika.apache.org
> > Subject: RE: Adding API support for Java 7's java.nio.file.Path
> >
> > Thanks, I just like to move things forward :-)
> >
> > Regarding my proposed API additions -
> > since adding new methods will make them a part of a new API, this is a
> > chance to make their names more meaningful/concise/correct: replacing
> > File with Path in the method name might be awkward.
> >
> > I'd like to gather alternatives for the changes/additions to methods
> > that return a File.
> > I found a total of 4 methods that return a java.io.File and are public,
> > in public non-test classes and not in tika-example (I assume the rest
> > can be changed without breaking anything).
> > For each method I will provide my suggestion/s, which will be either
> > "Add newName", "Replace with newName" or "Keep":
> >
> > tika-batch:
> > - org.apache.tika.batch.fs.FSUtil#getOutputFile
> > + Keep
> > - org.apache.tika.util.PropsUtil#getFile
> > + Keep
> >
> > tika-core:
> > - org.apache.tika.io.TemporaryResources#createTemporaryFile
> > + Add addTemporaryFile
> > Add addTempFile
> > Add createTempFile
> > - org.apache.tika.io.TikaInputStream#getFile
> > + Add asFile
> > Add toPath
> > Add getPath
> >
> > I've added a '+' to the left of my preference - please add yours to your
> > preference or add a new suggestion.
> >
> > Regarding added methods - I really think that the old methods should be
> > deprecated.
> > IMO a typo or a simple name change is a good e

Re: Adding API support for Java 7's java.nio.file.Path

2015-08-31 Thread Yaniv Kunda
I've already done that, I'm just waiting for the group's opinions on names
for the new methods, especially the two that I've added to augment
org.apache.tika.io.TemporaryResources#createTemporaryFile
And org.apache.tika.io.TikaInputStream#getFile
As described below.
On Aug 31, 2015 3:26 PM, "Konstantin Gribov" <gros...@gmail.com> wrote:

> My two cents, we can migrate to Files.copy, Files.newBufferedReader etc in
> places where it can replace commons-io and Tika's internal copy of it.
>
> сб, 29 авг. 2015 г. в 19:48, Ken Krugler <kkrugler_li...@transpac.com>:
>
> >
> > > From: Yaniv Kunda
> > > Sent: August 29, 2015 2:21:23am PDT
> > > To: dev@tika.apache.org
> > > Subject: RE: Adding API support for Java 7's java.nio.file.Path
> > >
> > > In addition to the discussion I've raised about the methods returning a
> > > File, I have another problem:
> > > Some of the methods that accept a File throw a FileNotFoundException.
> > > This exception is thrown by FIS/FOS/RAF constructors in response to
> > > anything - from a file that's actually not there to access denied.
> > > The NIO API methods usually declare an IOException, which can be a
> > > subclass representing a more accurate reason - NoSuchFileException or
> > > AccessDeniedException.
> > >
> > > When adding the overloaded methods accepting a Path, I initially
> > > thought to delegate the old methods to the new ones, but the new ones
> > > declare an IOException while the old declare a FileNotFoundException.
> > >
> > > I have three options:
> > > 1) Leave the old methods with their own code -
> > > this means essentially duplicate code, but complete backward
> > > compatibility.
> >
> > +1
> >
> > I don't feel strongly, but I think we get max bang for the development
> > buck by doing the simplest thing here.
> >
> > And it doesn't feel like it'll be that long before Tika 2.0, when the old
> > method code can be removed.
> >
> > -- Ken
> >
> > > 2) Delegate the old methods to the new ones, but catch the IOException
> > > and wrap it in a FileNotFoundException -
> > > this will remain backward compatible, unless some code catching a
> > > FileNotFoundException does text analysis on the exception message.
> > > 3) Delegate the old methods to the new ones, and change the signature
> > > accordingly to throw an IOException instead of a FileNotFoundException -
> > > this will break backward compatibility, but only in cases where a
> > > FileNotFoundException was caught explicitly.
> > >
> > > What do you think?
> > >
> > > -Original Message-
> > > From: Yaniv Kunda [mailto:yaniv.ku...@answers.com]
> > > Sent: Friday, August 28, 2015 03:33
> > > To: dev@tika.apache.org
> > > Subject: RE: Adding API support for Java 7's java.nio.file.Path
> > >
> > > Thanks, I just like to move things forward :-)
> > >
> > > Regarding my proposed API additions -
> > > since adding new methods will make them a part of a new API, this is a
> > > chance to make their names more meaningful/concise/correct: replacing
> > > File with Path in the method name might be awkward.
> > >
> > > I'd like to gather alternatives for the changes/additions to methods
> > > that return a File.
> > > I found a total of 4 methods that return a java.io.File and are public,
> > > in public non-test classes and not in tika-example (I assume the rest
> > > can be changed without breaking anything).
> > > For each method I will provide my suggestion/s, which will be either
> > > "Add newName", "Replace with newName" or "Keep":
> > >
> > > tika-batch:
> > > - org.apache.tika.batch.fs.FSUtil#getOutputFile
> > > + Keep
> > > - org.apache.tika.util.PropsUtil#getFile
> > > + Keep
> > >
> > > tika-core:
> > > - org.apache.tika.io.TemporaryResources#createTemporaryFile
> > > + Add addTemporaryFile
> > > Add addTempFile
> > > Add createTempFile
> > > - org.apache.tika.io.TikaInputStream#getFile
> > > + Add asFile
> > > Add toPath
> > > Add getPath
> > >
> > > I've added a '+' to the left of my preference - please add yours to
> your
> > > preference or add a new suggestion.
> > >
> > > Regarding added methods - I really think 

Re: [jira] [Commented] (TIKA-1672) Integrate tika-java7 component

2015-08-31 Thread Yaniv Kunda
I believe the tika-java7 component must remain optional, as its sole
purpose is to serve as a concrete SPI implementation of FileTypeDetector,
most commonly used in
https://docs.oracle.com/javase/8/docs/api/java/nio/file/Files.html#probeContentType-java.nio.file.Path-
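
For readers unfamiliar with the SPI, here is a minimal sketch of what a FileTypeDetector implementation looks like. It is not the tika-java7 code: ExtensionDetector is a made-up class that only looks at the file extension, and it is called directly here instead of being registered.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.spi.FileTypeDetector;

// Hypothetical detector used only for illustration; a real one would inspect content.
class ExtensionDetector extends FileTypeDetector {
    @Override
    public String probeContentType(Path path) throws IOException {
        Path name = path.getFileName();
        if (name != null && name.toString().endsWith(".txt")) {
            return "text/plain";
        }
        return null; // null means "unknown", so other detectors get a chance
    }

    public static void main(String[] args) throws IOException {
        FileTypeDetector detector = new ExtensionDetector();
        System.out.println(detector.probeContentType(Paths.get("notes.txt")));
        System.out.println(detector.probeContentType(Paths.get("image.bin")));
    }
}
```

For Files.probeContentType to find such a class, its fully qualified name would be listed in a META-INF/services/java.nio.file.spi.FileTypeDetector file on the classpath, which is why the module only makes sense as an opt-in artifact.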

I do agree that a name change can help - here are a few suggestions:
tika-java7-spi
tika-java7-filetypedetector
tika-java7-detector-spi

On Aug 31, 2015 7:53 AM, "Tyler Palsulich (JIRA)"  wrote:
>
>
> [
https://issues.apache.org/jira/browse/TIKA-1672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14722705#comment-14722705
]
>
> Tyler Palsulich commented on TIKA-1672:
> ---
>
> Hmm. Maybe we should rename the module? Right now, it doesn't make sense
to have a java7 component when the entire project depends on Java 7.
>
> > Integrate tika-java7 component
> > --
> >
> > Key: TIKA-1672
> > URL: https://issues.apache.org/jira/browse/TIKA-1672
> > Project: Tika
> >  Issue Type: Improvement
> >Reporter: Tyler Palsulich
> > Fix For: 1.11
> >
> >
> > Code requiring Java 7 doesn't need to be in a separate module now that
TIKA-1536 (upgrade to Java 7) is done.
>
>
>
> --
> This message was sent by Atlassian JIRA
> (v6.3.4#6332)

-- 


This email communication (including any attachments) contains information 
from Answers Corporation or its affiliates that is confidential and may be 
privileged. The information contained herein is intended only for the use 
of the addressee(s) named above. If you are not the intended recipient (or 
the agent responsible to deliver it to the intended recipient), you are 
hereby notified that any dissemination, distribution, use, or copying of 
this communication is strictly prohibited. If you have received this email 
in error, please immediately reply to sender, delete the message and 
destroy all copies of it. If you have questions, please email 
le...@answers.com. 

If you wish to unsubscribe to commercial emails from Answers and its 
affiliates, please go to the Answers Subscription Center 
http://campaigns.answers.com/subscriptions to opt out.  Thank you.


try-with-resources

2015-08-29 Thread Yaniv Kunda
I’ve opened https://issues.apache.org/jira/browse/TIKA-1719 along with a
patch that converts applicable code to use the try-with-resources statement.

Although the patch is big and covers 105 files, it’s very shallow and
contains only trivial use cases – most of them fixed by IntelliJ’s
quick-fix.



I would appreciate if any committer can review this and push it through –

I already have other changes (using Java 7’s java.nio.file.Path) waiting
for it to avoid conflicts.



If this is too much, I can separate it to different patches, per module or
any other discriminator –

although the absolute majority is in tika-parser’s tests.





*Yaniv Kunda*
Technical Lead
yaniv.ku...@answers.com

*p* +972 (3) 7661819
*m* +972 (54) 4644456

[image: Webcollage by Answers]
www.answers.com/webcollage



[jira] [Commented] (TIKA-1706) Bring back commons-io to tika-core

2015-08-27 Thread Yaniv Kunda (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14717349#comment-14717349
 ] 

Yaniv Kunda commented on TIKA-1706:
---

The fact that o.a.tika.io contains public classes is a problem I didn't think 
about -
these files are strictly meant as internal utility/support classes and 
shouldn't really be used by users.
In fact, I would say although these are public classes, they should not be 
considered a part of the public API of tika-core.
And since we don't know what commons-io-cloned classes users use (probably by 
accident), it is indeed a problem letting these go.

I also think that the no-dependencies principle is more romantic than it is 
useful, as these days a lot of the Java ecosystem is built on using external 
libraries, unless space is critical such as in mobile applications (and even 
these are getting bigger and bigger).
As the vast majority of tika-core usages comes transitively from tika-parsers, 
I think this is not the case.
I haven't crawled the Maven repo (deep enough) to find how many tika-core exclusive 
usages have few or no other dependencies, but I suspect that number is not 
very high.
So the absolute worst case here - and remember that this is the extreme case of 
a library that uses tika-core and no other library - is a 30% footprint 
increase!

o.a.tika.io is a mess - it contains:
- classes from commons-io-1.4
- partial classes from commons-io-1.4
- modified classes from commons-io-1.4
- classes from commons-io-2.0 (or later unknown version/s)
- tika original classes

It's really hard going over all changes - and I've shown just a few examples - 
but just doing the switch is simply easier, not so costly even in the worst 
case, and would bring upstream progress to our doorstep (today and in future 
changes) faster than maintaining copied code would.

My suggestion is:
- bring commons-io back to tika-core
- change all usages of the copied classes to commons-io
- deprecate (do not delete) the copied classes, probably until tika-2.0




 Bring back commons-io to tika-core
 --

 Key: TIKA-1706
 URL: https://issues.apache.org/jira/browse/TIKA-1706
 Project: Tika
  Issue Type: Improvement
  Components: core
Reporter: Yaniv Kunda
Priority: Minor
 Fix For: 1.11


 TIKA-249 inlined select commons-io classes in order to simplify the 
 dependency tree and save some space.
 I believe these arguments are weaker nowadays due to the following concerns:
 - Most of the non-core modules already use commons-io, and since tika-core is 
 usually not used by itself, commons-io is already included with it
 - Since some modules use both tika-core and commons-io, it's not clear which 
 code should be used
 - Having the inlined classes causes more maintenance and/or technology debt 
 (which in turn causes more maintenance)
 - Newer commons-io code utilizes newer platform code, e.g. using Charset 
 objects instead of encoding names, being able to use StringBuilder instead of 
 StringBuffer, and so on.
 I'll be happy to provide a patch to replace usages of the inlined classes 
 with commons-io classes if this is accepted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1706) Bring back commons-io to tika-core

2015-08-27 Thread Yaniv Kunda (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14717519#comment-14717519
 ] 

Yaniv Kunda commented on TIKA-1706:
---

That's why I suggested to just add commons-io to tika-core, use it internally, 
and just deprecate the copied classes.
Is that ok for 1.x?



RE: Adding API support for Java 7's java.nio.file.Path

2015-08-27 Thread Yaniv Kunda
Thanks, I just like to move things forward :-)

Regarding my proposed API additions -
since adding new methods will make them a part of a new API, this is a
chance to make their names more meaningful/concise/correct: replacing File
with Path in the method name might be awkward.

I'd like to gather alternatives for the changes/additions to methods that
return a File.
I found a total of 4 methods that return a java.io.File and are public, in
public non-test classes and not in tika-example (I assume the rest can be
changed without breaking anything).
For each method I will provide my suggestion/s, which will be either "Add
newName", "Replace with newName" or "Keep":

tika-batch:
- org.apache.tika.batch.fs.FSUtil#getOutputFile
  + Keep
- org.apache.tika.util.PropsUtil#getFile
  + Keep

tika-core:
- org.apache.tika.io.TemporaryResources#createTemporaryFile
  + Add addTemporaryFile
    Add addTempFile
    Add createTempFile
- org.apache.tika.io.TikaInputStream#getFile
  + Add asFile
    Add toPath
    Add getPath

I've added a '+' to the left of my preference -
please add yours to your preference or add a new suggestion.

Regarding added methods - I really think that the old methods should be
deprecated.
IMO a typo or a simple name change is a good enough reason for deprecating a
method - so returning a legacy class makes it even more welcome.

-Original Message-
From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Thursday, August 27, 2015 17:36
To: dev@tika.apache.org
Subject: RE: Adding API support for Java 7's java.nio.file.Path

+1

Thank you, Yaniv, for leading this effort.

I have a small preference for getting rid of File entirely eventually (2.0?)
as Lucene and Hadoop seem to have done (?).

-Original Message-
From: Yaniv Kunda [mailto:yaniv.ku...@answers.com]
Sent: Wednesday, August 26, 2015 5:31 PM
To: dev@tika.apache.org
Subject: RE: Adding API support for Java 7's java.nio.file.Path

I can point out several benefits of supporting the new API, in no particular
order:
- Exception handling: operations like File.delete return a boolean which
provides less useful information if the operation failed than the exception
thrown by Files.delete() (or a Minion...)
- Performance: The new API delegates more parts of I/O operations to the OS,
resulting in better usage of resources.
In independent testing I've done (considering big files, cache warmup and
randomized order) I've achieved 30% faster reads when using Files.copy() or
FileChannel.transferTo()
- Adoption: Java 7, in which the new API appeared, is already EOL.
Supporting this API, given that java.io is considered legacy, is good
for keeping up with the times, and even better for our users as it offers them
an incentive to move forward as well.

More can be found here:
http://docs.oracle.com/javase/tutorial/essential/io/legacy.html

I believe that the library - user relationship must have a balance between
compatibility and progress, as if libraries are stuck at compatibility - the
users are sometimes stuck without progress...
If we can have progress without breaking compatibility - we have a winner.

I propose to add support for and make the most of the new (4 y/o) API
without breaking compatibility, which means:
- Public methods accepting a File will not be changed; overloaded versions
will be added.
- Public methods returning a File will not be changed; methods with
different names will be added.
- Non-public methods accepting or returning a File will be changed
- Internal uses of the legacy I/O will be updated to use the new API where
easy

Regarding deprecation, I suggest that:
1) Methods accepting a File will not be deprecated - they will probably be
used as long as File itself is not deprecated (forever?)
2) Methods returning a File will be deprecated - progressive users can use
the new methods easily, less progressive can use the new methods adding
.toFile() to the result, and the rest can still use the deprecated methods
(which will most likely call the new methods internally anyway).
To summarize: overloading = convenience, methods with the same operation but
different name and return value = confusing.

If this seems like a decent proposal, I can separate this work into several
JIRA issues and patches, so that reviewing the changes is easier.

-Original Message-
From: Nick Burch [mailto:apa...@gagravarr.org]
Sent: Wednesday, August 26, 2015 13:27
To: dev@tika.apache.org
Subject: Re: Adding API support for Java 7's java.nio.file.Path

On Wed, 26 Aug 2015, Yaniv Kunda wrote:
> I would like to propose adding support for Java 7’s java.nio.file.Path
> as an alternative to those methods in the API that deal with a
> java.io.File.

Any chance you could briefly summarise what advantages this would give to us
and/or our users?

> 1)  What can we do with methods returning a File? e.g.
> TemporaryResources.createTemporaryFile, TikaInputStream.getFile.
> Should we break compatibility and encourage (=force) users to change
> their code (Note that since

RE: Adding API support for Java 7's java.nio.file.Path

2015-08-26 Thread Yaniv Kunda
I can point out several benefits of supporting the new API, in no particular
order:
- Exception handling: operations like File.delete return a boolean which
provides less useful information if the operation failed than the exception
thrown by Files.delete() (or a Minion...)
- Performance: The new API delegates more parts of I/O operations to the OS,
resulting in better usage of resources.
In independent testing I've done (considering big files, cache warmup and
randomized order) I've achieved 30% faster reads when using Files.copy() or
FileChannel.transferTo()
- Adoption: Java 7, in which the new API appeared, is already EOL.
Supporting this API, given that java.io is considered legacy, is good
for keeping up with the times, and even better for our users as it offers them
an incentive to move forward as well.
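
To make the exception handling point concrete, here is a small sketch comparing the two delete calls on a file that does not exist. DeleteComparison and its method names are made up for this illustration; only File.delete() and Files.delete() are real API.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;

class DeleteComparison {
    // Legacy API: failure is reported only as "false", with no reason attached.
    public static boolean legacyDelete(Path target) {
        return target.toFile().delete();
    }

    // NIO.2 API: failure is reported as a specific exception type.
    public static String nioDelete(Path target) {
        try {
            Files.delete(target);
            return "deleted";
        } catch (NoSuchFileException e) {
            return "NoSuchFileException";
        } catch (IOException e) {
            return e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("delete-demo");
        Path missing = dir.resolve("missing.txt"); // never created
        System.out.println(legacyDelete(missing));
        System.out.println(nioDelete(missing));
        Files.delete(dir); // clean up the (empty) temporary directory
    }
}
```

With the legacy call the caller cannot tell a missing file from a permission problem or a non-empty directory; with Files.delete() each of those surfaces as a different exception.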

More can be found here:
http://docs.oracle.com/javase/tutorial/essential/io/legacy.html

I believe that the library - user relationship must have a balance between
compatibility and progress, as if libraries are stuck at compatibility - the
users are sometimes stuck without progress...
If we can have progress without breaking compatibility - we have a winner.

I propose to add support for and make the most of the new (4 y/o) API
without breaking compatibility, which means:
- Public methods accepting a File will not be changed; overloaded versions
will be added.
- Public methods returning a File will not be changed; methods with
different names will be added.
- Non-public methods accepting or returning a File will be changed
- Internal uses of the legacy I/O will be updated to use the new API where
easy

Regarding deprecation, I suggest that:
1) Methods accepting a File will not be deprecated - they will probably be
used as long as File itself is not deprecated (forever?)
2) Methods returning a File will be deprecated - progressive users can use
the new methods easily, less progressive can use the new methods adding
.toFile() to the result, and the rest can still use the deprecated methods
(which will most likely call the new methods internally anyway).
To summarize: overloading = convenience, methods with the same operation but
different name and return value = confusing.
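
As a rough sketch of the shape this would take (ResourceHolder is a hypothetical class, not an existing Tika type, and the method names only mirror the getFile/getPath idea under discussion):

```java
import java.io.File;
import java.nio.file.Path;

// Hypothetical holder class used only to show the overload/deprecation pattern.
class ResourceHolder {
    private final Path path;

    public ResourceHolder(Path path) {
        this.path = path;
    }

    // Overload kept for convenience: callers that still hold a File pass it as-is.
    public ResourceHolder(File file) {
        this(file.toPath());
    }

    // New accessor returning the NIO.2 type.
    public Path getPath() {
        return path;
    }

    /** @deprecated use {@link #getPath()}; append .toFile() if a File is still required. */
    @Deprecated
    public File getFile() {
        return getPath().toFile();
    }

    public static void main(String[] args) {
        ResourceHolder holder = new ResourceHolder(new File("report.pdf"));
        System.out.println(holder.getPath());
        System.out.println(holder.getFile());
    }
}
```

The File-accepting constructor stays undeprecated, while the File-returning accessor is deprecated and simply delegates to the new one, so existing callers keep compiling.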

If this seems like a decent proposal, I can separate this work into several
JIRA issues and patches, so that reviewing the changes is easier.

-Original Message-
From: Nick Burch [mailto:apa...@gagravarr.org]
Sent: Wednesday, August 26, 2015 13:27
To: dev@tika.apache.org
Subject: Re: Adding API support for Java 7's java.nio.file.Path

On Wed, 26 Aug 2015, Yaniv Kunda wrote:
> I would like to propose adding support for Java 7’s java.nio.file.Path
> as an alternative to those methods in the API that deal with a
> java.io.File.

Any chance you could briefly summarise what advantages this would give to us
and/or our users?

> 1)  What can we do with methods returning a File? e.g.
> TemporaryResources.createTemporaryFile, TikaInputStream.getFile.
> Should we break compatibility and encourage (=force) users to change
> their code (Note that since they all use Java 7 now, the change is
> minimal by adding .toFile() to the result), or create new methods with
> different names (confusing)?

Breaking compatibility outside of a 2.0 release is a big no-no, sorry.

TemporaryResources.createTemporaryPath and TikaInputStream.getPath could
work as naming

> 2)  Should we deprecate the old methods accepting a File, or delete
> them?

Deleting would break compatibility, so shouldn't be done. Deprecating could
be done, if there's a strong reason to encourage people off them


https://wiki.apache.org/tika/Tika2_0RoadMap is where we're tracking
proposed API-breaking changes for 2.0

Nick



[jira] [Created] (TIKA-1720) Collect multiple exceptions in TemporaryResources.close() using Throwable.addSuppressed()

2015-08-26 Thread Yaniv Kunda (JIRA)
Yaniv Kunda created TIKA-1720:
-

 Summary: Collect multiple exceptions in TemporaryResources.close() 
using Throwable.addSuppressed()
 Key: TIKA-1720
 URL: https://issues.apache.org/jira/browse/TIKA-1720
 Project: Tika
  Issue Type: Improvement
  Components: core
Reporter: Yaniv Kunda
Priority: Minor
 Fix For: 1.11


TemporaryResources.close() currently collects the exceptions thrown while trying to 
close its resources in a list.
When the time to propagate an exception comes, information is lost - the thrown 
exception contains a message with the string descriptions of all exceptions, 
and the first exception as the cause - there is no stack trace describing what 
went wrong closing a resource.
In addition, the thrown exception is IOExceptionWithCause, copied from 
commons-io, which is redundant since Java 6.
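
A minimal sketch of the addSuppressed() approach follows. It is not the attached patch: SuppressedClose, closeAll and failing are names invented for this example, and the real TemporaryResources code may be structured differently.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

class SuppressedClose {
    // Closes every resource; the first failure is thrown, later ones ride along as suppressed.
    public static void closeAll(List<? extends Closeable> resources) throws IOException {
        IOException first = null;
        for (Closeable resource : resources) {
            try {
                resource.close();
            } catch (IOException e) {
                if (first == null) {
                    first = e;
                } else {
                    first.addSuppressed(e);
                }
            }
        }
        if (first != null) {
            throw first;
        }
    }

    // Helper for the demo: a resource whose close() always fails.
    public static Closeable failing(final String name) {
        return new Closeable() {
            @Override
            public void close() throws IOException {
                throw new IOException("cannot close " + name);
            }
        };
    }

    public static void main(String[] args) {
        try {
            closeAll(Arrays.asList(failing("a"), failing("b")));
        } catch (IOException e) {
            System.out.println(e.getMessage());
            for (Throwable s : e.getSuppressed()) {
                System.out.println("suppressed: " + s.getMessage());
            }
        }
    }
}
```

Every failure keeps its own stack trace this way, and printStackTrace() on the thrown exception lists the suppressed ones as well.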






[jira] [Created] (TIKA-1721) Replace IOExceptionWithCause in ForkClient

2015-08-26 Thread Yaniv Kunda (JIRA)
Yaniv Kunda created TIKA-1721:
-

 Summary: Replace IOExceptionWithCause in ForkClient
 Key: TIKA-1721
 URL: https://issues.apache.org/jira/browse/TIKA-1721
 Project: Tika
  Issue Type: Improvement
  Components: core
Reporter: Yaniv Kunda
Priority: Minor
 Fix For: 1.11


IOExceptionWithCause (copied from commons-io) is redundant since Java 6.
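
For reference, a tiny sketch of the replacement. CauseDemo and wrap are illustrative names only; the point is the two-argument IOException constructor that has existed since Java 6.

```java
import java.io.IOException;

class CauseDemo {
    // IOException itself accepts a cause, so no helper subclass is needed.
    public static IOException wrap(String message, Throwable cause) {
        return new IOException(message, cause);
    }

    public static void main(String[] args) {
        IOException e = wrap("fork failed", new IllegalStateException("boom"));
        System.out.println(e.getCause().getMessage());
    }
}
```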





[jira] [Updated] (TIKA-1720) Collect multiple exceptions in TemporaryResources.close() using Throwable.addSuppressed()

2015-08-26 Thread Yaniv Kunda (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaniv Kunda updated TIKA-1720:
--
Attachment: TIKA-1720.patch

 Collect multiple exceptions in TemporaryResources.close() using 
 Throwable.addSuppressed()
 -

 Key: TIKA-1720
 URL: https://issues.apache.org/jira/browse/TIKA-1720
 Project: Tika
  Issue Type: Improvement
  Components: core
Reporter: Yaniv Kunda
Priority: Minor
 Fix For: 1.11

 Attachments: TIKA-1720.patch


 TemporaryResources.close() currently collects the exceptions thrown while trying to 
 close its resources in a list.
 When the time to propagate an exception comes, information is lost - the 
 thrown exception contains a message with the string descriptions of all 
 exceptions, and the first exception as the cause - there is no stack trace 
 describing what went wrong closing a resource.
 In addition, the thrown exception is IOExceptionWithCause, copied from 
 commons-io, which is redundant since Java 6.





Adding API support for Java 7's java.nio.file.Path

2015-08-26 Thread Yaniv Kunda
I would like to propose adding support for Java 7’s java.nio.file.Path as
an alternative to those methods in the API that deal with a java.io.File.

This is pretty trivial for File as a param, as new overloaded
methods/constructors can be added that accept a Path.



A few questions arise:

1)  What can we do with methods returning a File? e.g.
TemporaryResources.createTemporaryFile, TikaInputStream.getFile.
Should we break compatibility and encourage (=force) users to change their
code (Note that since they all use Java 7 now, the change is minimal by
adding .toFile() to the result),
or create new methods with different names (confusing)?

2)  Should we deprecate the old methods accepting a File, or delete
them?



I’m ready to open an issue and provide patches.





[jira] [Updated] (TIKA-1722) Tika methods that accept a File needlessly convert it to a URL

2015-08-26 Thread Yaniv Kunda (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaniv Kunda updated TIKA-1722:
--
Attachment: TIKA-1722.patch

 Tika methods that accept a File needlessly convert it to a URL
 --

 Key: TIKA-1722
 URL: https://issues.apache.org/jira/browse/TIKA-1722
 Project: Tika
  Issue Type: Improvement
  Components: core
Reporter: Yaniv Kunda
Priority: Minor
 Fix For: 1.11

 Attachments: TIKA-1722.patch


 The following methods:
 - Tika.detect(File)
 - Tika.parse(File)
 - Tika.parseToString(File)
 Convert the given File to a URL and use the corresponding overloaded method 
 that accepts a URL.
 This seems like a shortcut, but essentially does the following:
 # Converts the file to a URI
 # Converts the URI to a URL
 # Calls TikaInputStream.get(URL, Metadata), which then performs the following 
 special handling:
 # Checks if the protocol is file
 # Tries to convert the URL (back) to a URI
 # Creates a File around the URI
 # Checks if file.isFile() 
 # Calls TikaInputStream.get(File, Metadata)
 The special handling in TikaInputStream.get(URL/URI) is a good optimization 
 for in-the-wild file resources, but for internal uses it can be skipped - 
 making Tika call TikaInputStream.get(File, Metadata) directly.





[jira] [Created] (TIKA-1722) Tika methods that accept a File needlessly convert it to a URL

2015-08-26 Thread Yaniv Kunda (JIRA)
Yaniv Kunda created TIKA-1722:
-

 Summary: Tika methods that accept a File needlessly convert it to 
a URL
 Key: TIKA-1722
 URL: https://issues.apache.org/jira/browse/TIKA-1722
 Project: Tika
  Issue Type: Improvement
  Components: core
Reporter: Yaniv Kunda
Priority: Minor
 Fix For: 1.11


The following methods:
- Tika.detect(File)
- Tika.parse(File)
- Tika.parseToString(File)

Convert the given File to a URL and use the corresponding overloaded method 
that accepts a URL.
This seems like a shortcut, but essentially does the following:
# Converts the file to a URI
# Converts the URI to a URL
# Calls TikaInputStream.get(URL, Metadata), which then performs the following 
special handling:
# Checks if the protocol is file
# Tries to convert the URL (back) to a URI
# Creates a File around the URI
# Checks if file.isFile() 
# Calls TikaInputStream.get(File, Metadata)

The special handling in TikaInputStream.get(URL/URI) is a good optimization for 
in-the-wild file resources, but for internal uses it can be skipped - making 
Tika call TikaInputStream.get(File, Metadata) directly.
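
The detour can be reproduced with plain JDK calls. The sketch below leaves Tika out entirely (FileUrlRoundTrip is a made-up class) and only shows that the File -> URI -> URL -> URI -> File chain ends up where it started:

```java
import java.io.File;
import java.net.MalformedURLException;
import java.net.URISyntaxException;
import java.net.URL;

class FileUrlRoundTrip {
    // Mirrors the steps listed above: File -> URI -> URL, then back URL -> URI -> File.
    public static File roundTrip(File file) throws MalformedURLException, URISyntaxException {
        URL url = file.toURI().toURL();
        if (!"file".equals(url.getProtocol())) {
            throw new IllegalStateException("unexpected protocol: " + url.getProtocol());
        }
        return new File(url.toURI());
    }

    public static void main(String[] args) throws Exception {
        File original = new File(System.getProperty("java.io.tmpdir"), "sample.txt");
        File back = roundTrip(original);
        // Prints whether the detour produced the same absolute file it started from.
        System.out.println(back.equals(original.getAbsoluteFile()));
    }
}
```

Since the result is just the original file again, passing the File straight through skips two conversions and a protocol check per call.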





[jira] [Updated] (TIKA-1711) Remove java6-activated profile from tika-bundle and move its plugins to default build

2015-08-20 Thread Yaniv Kunda (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaniv Kunda updated TIKA-1711:
--
Attachment: (was: TIKA-1711.patch)

 Remove java6-activated profile from tika-bundle and move its plugins to 
 default build
 -

 Key: TIKA-1711
 URL: https://issues.apache.org/jira/browse/TIKA-1711
 Project: Tika
  Issue Type: Bug
  Components: general
Reporter: Yaniv Kunda
Priority: Minor
 Fix For: 1.11


 Since the project now requires Java 7, there's no point in allowing Java 6+ 
 since the build would fail.





[jira] [Updated] (TIKA-1711) Remove java6-activated profile from tika-bundle and move its plugins to default build

2015-08-20 Thread Yaniv Kunda (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaniv Kunda updated TIKA-1711:
--
Summary: Remove java6-activated profile from tika-bundle and move its 
plugins to default build  (was: Modify tika-bundle profile activation to 
require Java 7)



[jira] [Updated] (TIKA-1711) Remove java6-activated profile from tika-bundle and move its plugins to default build

2015-08-20 Thread Yaniv Kunda (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaniv Kunda updated TIKA-1711:
--
Attachment: TIKA-1711.patch

Revised patch for the revised purpose

 Remove java6-activated profile from tika-bundle and move its plugins to 
 default build
 -

 Key: TIKA-1711
 URL: https://issues.apache.org/jira/browse/TIKA-1711
 Project: Tika
  Issue Type: Bug
  Components: general
Reporter: Yaniv Kunda
Priority: Minor
 Fix For: 1.11

 Attachments: TIKA-1711.patch


 Since the project now requires Java 7, there's no point in allowing Java 6+ 
 since the build would fail.





[jira] [Created] (TIKA-1719) Utilize try-with-resources where it is trivial

2015-08-20 Thread Yaniv Kunda (JIRA)
Yaniv Kunda created TIKA-1719:
-

 Summary: Utilize try-with-resources where it is trivial
 Key: TIKA-1719
 URL: https://issues.apache.org/jira/browse/TIKA-1719
 Project: Tika
  Issue Type: Improvement
  Components: cli, core, example, gui, packaging, parser, server
Reporter: Yaniv Kunda
Priority: Minor
 Fix For: 1.11


The following types of resource usage:
{code}
AutoCloseable resource = ...;
try {
    // do something with resource
} finally {
    resource.close();
}
{code}
{code}
AutoCloseable resource = null;
try {
    resource = ...;
    // do something with resource
} finally {
    if (resource != null) {
        resource.close();
    }
}
{code}

and similar constructs can be trivially replaced with Java 7's 
try-with-resources statement:
{code}
try (AutoCloseable resource = ...) {
    // do something with resource
}
{code}

This brings more concise code with less chance of causing resource leaks.
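
A concrete, runnable instance of the pattern, for illustration only (FirstLine is a made-up class and is not part of the patch):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

class FirstLine {
    // The reader is closed automatically, whether readLine() returns or throws.
    public static String firstLine(String text) throws IOException {
        try (BufferedReader reader = new BufferedReader(new StringReader(text))) {
            return reader.readLine();
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(firstLine("hello\nworld"));
    }
}
```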





[jira] [Commented] (TIKA-1710) Replace usages of classes in org.apache.tika.io with current alternatives

2015-08-20 Thread Yaniv Kunda (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14705085#comment-14705085
 ] 

Yaniv Kunda commented on TIKA-1710:
---

As much as I like Guava (the library, not the fruit), its only use was its 
com.google.common.base.Charsets class, containing constants for the Charset 
instances of the standard charsets - same as in Java's StandardCharsets.
When I replaced this with the static imports of StandardCharsets, there was no 
use left.
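
To show what the replacement buys in practice, here is a small sketch using only the JDK (CharsetDemo and its method names are invented for this example):

```java
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class CharsetDemo {
    // Name-based lookup: the compiler forces handling of a checked exception.
    public static byte[] encodeByName(String text) throws UnsupportedEncodingException {
        return text.getBytes("UTF-8");
    }

    // Constant-based lookup: no string to mistype and no checked exception.
    public static byte[] encodeByConstant(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // Prints whether both variants produce identical bytes.
        System.out.println(Arrays.equals(encodeByName("tika"), encodeByConstant("tika")));
    }
}
```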

Regarding TaggedInputStream, I wasn't sure what to do - this wrap/cast method 
was a modification of the original commons-io code, and it was used only once - 
in RFC822Parser.
I think it's a nice-to-have optimization helper method but nothing more - as it 
only saves the cost of a new TaggedInputStream when the source InputStream is 
already a TaggedInputStream: the checked tag will behave the same way in the 
same wrap-try-catch flow.
The only other usage of TaggedInputStream in tika (besides by TikaInputStream) 
is in RTFParser, which uses the constructor directly, and it is actually an 
empty usage - the TaggedInputStream is constructed and checked in the catch 
clause, but it is not used in the try block at all: the underlying stream is!

Since almost all of tika uses TikaInputStream (which has an advanced version of 
this helper, ensuring buffering), my opinion is to refrain from adding a helper 
method and simply use the constructor directly, for simplicity. 

 Replace usages of classes in org.apache.tika.io with current alternatives
 -

 Key: TIKA-1710
 URL: https://issues.apache.org/jira/browse/TIKA-1710
 Project: Tika
  Issue Type: Improvement
  Components: batch, cli, core, example, gui, parser, server, 
 translation
Reporter: Yaniv Kunda
Priority: Minor
 Fix For: 1.11

 Attachments: TIKA-1710.patch


 Many of the classes in org.apache.tika.io were inlined from commons-io in 
 TIKA-249, but these days most components use commons-io anyway, so in order 
 to clean up the dependencies on org.apache.tika.io in preparation for adding 
 commons-io to tika-core, the following can be done:
 - Replace usages of classes in org.apache.tika.io within non-core components 
 with the corresponding classes in commons-io
 - Replace usages of org.apache.tika.io.IOUtils.UTF_8 with 
 java.nio.charset.StandardCharsets.UTF_8 (in all components, including 
 tika-core)
 - Replace other uses of String encoding names of standard charsets with their 
 corresponding Charset instances from StandardCharsets (this is logically 
 related to IOUtils, since before Java 7 these constants would have belonged 
 there, as UTF_8 did)





[jira] [Updated] (TIKA-1719) Utilize try-with-resources where it is trivial

2015-08-20 Thread Yaniv Kunda (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaniv Kunda updated TIKA-1719:
--
Attachment: TIKA-1719.patch

 Utilize try-with-resources where it is trivial
 --

 Key: TIKA-1719
 URL: https://issues.apache.org/jira/browse/TIKA-1719
 Project: Tika
  Issue Type: Improvement
  Components: cli, core, example, gui, packaging, parser, server
Reporter: Yaniv Kunda
Priority: Minor
  Labels: easyfix
 Fix For: 1.11

 Attachments: TIKA-1719.patch


 The following types of resource usage:
 {code}
 AutoCloseable resource = ...;
 try {
     // do something with resource
 } finally {
     resource.close();
 }
 {code}
 {code}
 AutoCloseable resource = null;
 try {
     resource = ...;
     // do something with resource
 } finally {
     if (resource != null) {
         resource.close();
     }
 }
 {code}
 and similar constructs can be trivially replaced with Java 7's 
 try-with-resources statement:
 {code}
 try (AutoCloseable resource = ...) {
     // do something with resource
 }
 {code}
 This makes the code more concise and reduces the chance of resource leaks.
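
As a runnable illustration of the pattern above, the following sketch uses a hypothetical TrackedResource class (not part of Tika) that records whether close() was called. It is only meant to show that try-with-resources closes the resource on both the normal and the exceptional path, and that the close happens before any catch clause runs.

```java
final class TryWithResourcesSketch {

    /** Hypothetical resource that remembers whether it has been closed. */
    static final class TrackedResource implements AutoCloseable {
        private boolean closed;

        String read() {
            if (closed) {
                throw new IllegalStateException("already closed");
            }
            return "data";
        }

        @Override
        public void close() {
            closed = true;
        }

        boolean isClosed() {
            return closed;
        }
    }

    /** The resource is closed when the block exits normally. */
    static boolean closedAfterNormalExit() {
        TrackedResource resource = new TrackedResource();
        try (TrackedResource r = resource) {
            r.read();
        }
        return resource.isClosed();
    }

    /** The resource is already closed by the time the catch clause runs. */
    static boolean closedAfterException() {
        TrackedResource resource = new TrackedResource();
        try (TrackedResource r = resource) {
            throw new IllegalStateException("simulated failure");
        } catch (IllegalStateException e) {
            return resource.isClosed();
        }
    }

    public static void main(String[] args) {
        System.out.println(closedAfterNormalExit());
        System.out.println(closedAfterException());
    }
}
```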





[jira] [Updated] (TIKA-1710) Replace usages of classes in org.apache.tika.io with current alternatives

2015-08-17 Thread Yaniv Kunda (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaniv Kunda updated TIKA-1710:
--
Attachment: (was: TIKA-1710.patch)

 Replace usages of classes in org.apache.tika.io with current alternatives
 -

 Key: TIKA-1710
 URL: https://issues.apache.org/jira/browse/TIKA-1710
 Project: Tika
  Issue Type: Improvement
  Components: batch, cli, core, example, gui, parser, server, 
 translation
Reporter: Yaniv Kunda
Priority: Minor
 Fix For: 1.11

 Attachments: TIKA-1710.patch


 Many of the classes in org.apache.tika.io were inlined from commons-io in 
 TIKA-249, but these days most components use commons-io anyway, so in order 
 to clean up the dependencies on org.apache.tika.io in preparation for adding 
 commons-io to tika-core, the following can be done:
 - Replace usages of classes in org.apache.tika.io within non-core components 
 with the corresponding classes in commons-io
 - Replace usages of org.apache.tika.io.IOUtils.UTF_8 with 
 java.nio.charset.StandardCharsets.UTF_8 (in all components, including 
 tika-core)
 - Replace other uses of String encoding names of standard charsets with their 
 corresponding Charset instances from StandardCharsets (this is logically 
 related to IOUtils, since before Java 7 these constants would have belonged 
 there, as UTF_8 did)





[jira] [Updated] (TIKA-1710) Replace usages of classes in org.apache.tika.io with current alternatives

2015-08-17 Thread Yaniv Kunda (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaniv Kunda updated TIKA-1710:
--
Attachment: TIKA-1710.patch

 Replace usages of classes in org.apache.tika.io with current alternatives
 -

 Key: TIKA-1710
 URL: https://issues.apache.org/jira/browse/TIKA-1710
 Project: Tika
  Issue Type: Improvement
  Components: batch, cli, core, example, gui, parser, server, 
 translation
Reporter: Yaniv Kunda
Priority: Minor
 Fix For: 1.11

 Attachments: TIKA-1710.patch


 Many of the classes in org.apache.tika.io were inlined from commons-io in 
 TIKA-249, but these days most components use commons-io anyway, so in order 
 to clean up the dependencies on org.apache.tika.io in preparation for adding 
 commons-io to tika-core, the following can be done:
 - Replace usages of classes in org.apache.tika.io within non-core components 
 with the corresponding classes in commons-io
 - Replace usages of org.apache.tika.io.IOUtils.UTF_8 with 
 java.nio.charset.StandardCharsets.UTF_8 (in all components, including 
 tika-core)
 - Replace other uses of String encoding names of standard charsets with their 
 corresponding Charset instances from StandardCharsets (this is logically 
 related to IOUtils, since before Java 7 these constants would have belonged 
 there, as UTF_8 did)





[jira] [Created] (TIKA-1711) Modify tika-bundle profile activation to require Java 7

2015-08-16 Thread Yaniv Kunda (JIRA)
Yaniv Kunda created TIKA-1711:
-

 Summary: Modify tika-bundle profile activation to require Java 7
 Key: TIKA-1711
 URL: https://issues.apache.org/jira/browse/TIKA-1711
 Project: Tika
  Issue Type: Bug
  Components: general
Reporter: Yaniv Kunda
Priority: Minor
 Fix For: 1.11


Since the project now requires Java 7, there's no point in letting the profile 
activate on Java 6+, since the build would fail on Java 6.





[jira] [Updated] (TIKA-1710) Replace usages of classes in org.apache.tika.io with current alternatives

2015-08-16 Thread Yaniv Kunda (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaniv Kunda updated TIKA-1710:
--
Attachment: (was: TIKA-1710.patch)

 Replace usages of classes in org.apache.tika.io with current alternatives
 -

 Key: TIKA-1710
 URL: https://issues.apache.org/jira/browse/TIKA-1710
 Project: Tika
  Issue Type: Improvement
  Components: batch, cli, core, example, gui, parser, server, 
 translation
Reporter: Yaniv Kunda
Priority: Minor
 Fix For: 1.11


 Many of the classes in org.apache.tika.io were inlined from commons-io in 
 TIKA-249, but these days most components use commons-io anyway, so in order 
 to clean up the dependencies on org.apache.tika.io in preparation for adding 
 commons-io to tika-core, the following can be done:
 - Replace usages of classes in org.apache.tika.io within non-core components 
 with the corresponding classes in commons-io
 - Replace usages of org.apache.tika.io.IOUtils.UTF_8 with 
 java.nio.charset.StandardCharsets.UTF_8 (in all components, including 
 tika-core)
 - Replace other uses of String encoding names of standard charsets with their 
 corresponding Charset instances from StandardCharsets (this is logically 
 related to IOUtils, since before Java 7 these constants would have belonged 
 there, as UTF_8 did)





[jira] [Updated] (TIKA-1710) Replace usages of classes in org.apache.tika.io with current alternatives

2015-08-16 Thread Yaniv Kunda (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaniv Kunda updated TIKA-1710:
--
Attachment: TIKA-1710.patch

Revised patch without StandardCharsets wildcard static imports

 Replace usages of classes in org.apache.tika.io with current alternatives
 -

 Key: TIKA-1710
 URL: https://issues.apache.org/jira/browse/TIKA-1710
 Project: Tika
  Issue Type: Improvement
  Components: batch, cli, core, example, gui, parser, server, 
 translation
Reporter: Yaniv Kunda
Priority: Minor
 Fix For: 1.11

 Attachments: TIKA-1710.patch


 Many of the classes in org.apache.tika.io were inlined from commons-io in 
 TIKA-249, but these days most components use commons-io anyway, so in order 
 to clean up the dependencies on org.apache.tika.io in preparation for adding 
 commons-io to tika-core, the following can be done:
 - Replace usages of classes in org.apache.tika.io within non-core components 
 with the corresponding classes in commons-io
 - Replace usages of org.apache.tika.io.IOUtils.UTF_8 with 
 java.nio.charset.StandardCharsets.UTF_8 (in all components, including 
 tika-core)
 - Replace other uses of String encoding names of standard charsets with their 
 corresponding Charset instances from StandardCharsets (this is logically 
 related to IOUtils, since before Java 7 these constants would have belonged 
 there, as UTF_8 did)





[jira] [Issue Comment Deleted] (TIKA-1706) Bring back commons-io to tika-core

2015-08-15 Thread Yaniv Kunda (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaniv Kunda updated TIKA-1706:
--
Comment: was deleted

(was: A patch to bring back commons-io to tika-core and replace all formerly 
inlined classes.)

 Bring back commons-io to tika-core
 --

 Key: TIKA-1706
 URL: https://issues.apache.org/jira/browse/TIKA-1706
 Project: Tika
  Issue Type: Improvement
  Components: core
Reporter: Yaniv Kunda
Priority: Minor
 Fix For: 1.11


 TIKA-249 inlined select commons-io classes in order to simplify the 
 dependency tree and save some space.
 I believe these arguments are weaker nowadays due to the following concerns:
 - Most of the non-core modules already use commons-io, and since tika-core is 
 usually not used by itself, commons-io is already included with it
 - Since some modules use both tika-core and commons-io, it's not clear which 
 code should be used
 - Having the inlined classes causes more maintenance and/or technical debt 
 (which in turn causes more maintenance)
 - Newer commons-io code utilizes newer platform code, e.g. using Charset 
 objects instead of encoding names, being able to use StringBuilder instead of 
 StringBuffer, and so on.
 I'll be happy to provide a patch to replace usages of the inlined classes 
 with commons-io classes if this is accepted.





[jira] [Commented] (TIKA-1706) Bring back commons-io to tika-core

2015-08-15 Thread Yaniv Kunda (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14698477#comment-14698477
 ] 

Yaniv Kunda commented on TIKA-1706:
---

I've separated out all the related changes, apart from adding commons-io to 
tika-core itself, and opened them under TIKA-1710.
In addition, the recently added commons-io-unsafe check has now found a couple 
more default encoding usages:
tika-core:   src\main\java\org\apache\tika\Tika.java
tika-server: src\test\java\org\apache\tika\server\CXFTestBase.java
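
For context, a "default encoding usage" is a call whose result depends on the JVM's platform charset. The exact calls flagged in the two files above are not shown in this thread; the sketch below uses String.getBytes() only as a typical example, and the class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class ExplicitCharsetSketch {

    // Depends on the JVM default charset, so the bytes may differ per machine.
    static byte[] encodeWithDefault(String text) {
        return text.getBytes();
    }

    // Always UTF-8, regardless of the platform configuration.
    static byte[] encodeUtf8(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // Decoding with the same explicit charset makes the round trip reliable.
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String text = "caf\u00e9"; // "cafe" with an acute accent on the e
        System.out.println(Arrays.toString(encodeUtf8(text)));
        System.out.println(decodeUtf8(encodeUtf8(text)).equals(text));
    }
}
```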


 Bring back commons-io to tika-core
 --

 Key: TIKA-1706
 URL: https://issues.apache.org/jira/browse/TIKA-1706
 Project: Tika
  Issue Type: Improvement
  Components: core
Reporter: Yaniv Kunda
Priority: Minor
 Fix For: 1.11


 TIKA-249 inlined select commons-io classes in order to simplify the 
 dependency tree and save some space.
 I believe these arguments are weaker nowadays due to the following concerns:
 - Most of the non-core modules already use commons-io, and since tika-core is 
 usually not used by itself, commons-io is already included with it
 - Since some modules use both tika-core and commons-io, it's not clear which 
 code should be used
 - Having the inlined classes causes more maintenance and/or technical debt 
 (which in turn causes more maintenance)
 - Newer commons-io code utilizes newer platform code, e.g. using Charset 
 objects instead of encoding names, being able to use StringBuilder instead of 
 StringBuffer, and so on.
 I'll be happy to provide a patch to replace usages of the inlined classes 
 with commons-io classes if this is accepted.





[jira] [Created] (TIKA-1710) Replace usages of classes in org.apache.tika.io with current alternatives

2015-08-15 Thread Yaniv Kunda (JIRA)
Yaniv Kunda created TIKA-1710:
-

 Summary: Replace usages of classes in org.apache.tika.io with 
current alternatives
 Key: TIKA-1710
 URL: https://issues.apache.org/jira/browse/TIKA-1710
 Project: Tika
  Issue Type: Improvement
  Components: batch, cli, core, example, gui, parser, server, 
translation
Reporter: Yaniv Kunda
Priority: Minor
 Fix For: 1.11


Many of the classes in org.apache.tika.io were inlined from commons-io in 
TIKA-249, but these days most components use commons-io anyway, so in order to 
clean up the dependencies on org.apache.tika.io in preparation for adding 
commons-io to tika-core, the following can be done:
- Replace usages of classes in org.apache.tika.io within non-core components 
with the corresponding classes in commons-io
- Replace usages of org.apache.tika.io.IOUtils.UTF_8 with 
java.nio.charset.StandardCharsets.UTF_8 (in all components, including tika-core)
- Replace other uses of String encoding names of standard charsets with their 
corresponding Charset instances from StandardCharsets (this is logically 
related to IOUtils, since before Java 7 these constants would have belonged 
there, as UTF_8 did)
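
The last bullet can be sketched roughly as follows. The class and method names are hypothetical and not taken from the patch; the point is only that naming the charset by string forces callers to handle a checked exception that cannot occur for a standard charset, while the StandardCharsets constant does not.

```java
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;

final class StandardCharsetsSketch {

    // Before: the encoding is named by a string, so the compiler demands
    // handling of UnsupportedEncodingException even though UTF-8 always exists.
    static String decodeByName(byte[] bytes) {
        try {
            return new String(bytes, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is required on every JVM", e);
        }
    }

    // After: the Charset constant is checked at compile time, no try/catch needed.
    static String decodeByCharset(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] bytes = {0x54, 0x69, 0x6b, 0x61}; // ASCII for "Tika"
        System.out.println(decodeByName(bytes));
        System.out.println(decodeByCharset(bytes));
    }
}
```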





[jira] [Updated] (TIKA-1706) Bring back commons-io to tika-core

2015-08-14 Thread Yaniv Kunda (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaniv Kunda updated TIKA-1706:
--
Attachment: TIKA-1706.patch

A patch to bring back commons-io to tika-core and replace all formerly inlined 
classes.

 Bring back commons-io to tika-core
 --

 Key: TIKA-1706
 URL: https://issues.apache.org/jira/browse/TIKA-1706
 Project: Tika
  Issue Type: Improvement
  Components: core
Reporter: Yaniv Kunda
Priority: Minor
 Fix For: 1.11

 Attachments: TIKA-1706.patch


 TIKA-249 inlined select commons-io classes in order to simplify the 
 dependency tree and save some space.
 I believe these arguments are weaker nowadays due to the following concerns:
 - Most of the non-core modules already use commons-io, and since tika-core is 
 usually not used by itself, commons-io is already included with it
 - Since some modules use both tika-core and commons-io, it's not clear which 
 code should be used
 - Having the inlined classes causes more maintenance and/or technical debt 
 (which in turn causes more maintenance)
 - Newer commons-io code utilizes newer platform code, e.g. using Charset 
 objects instead of encoding names, being able to use StringBuilder instead of 
 StringBuffer, and so on.
 I'll be happy to provide a patch to replace usages of the inlined classes 
 with commons-io classes if this is accepted.





[jira] [Commented] (TIKA-1706) Bring back commons-io to tika-core

2015-08-13 Thread Yaniv Kunda (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696025#comment-14696025
 ] 

Yaniv Kunda commented on TIKA-1706:
---

I agree that generally adding an external dependency to a core module might 
have an impact,
but consider that unlike tika-core, commons-io is a true low-level library:
it has no compile-time dependencies and is used by 2500 projects in maven 
central alone.

I believe that copying the code of another library, frozen in time (in this 
case since 2008), hinders innovation and reduces the chance that anyone will 
utilize new improvements and fixes in newer commons-io since:
# it is disconnected from tika and requires manual discovery and research (if 
commons-io is used as an external dependency it's easy to find deprecated 
methods and their replacements using static analysis)
# it requires manual maintenance of copying select classes/code

It's not easy summing up more than 7 years of changes in commons-io, but here 
are some beneficial changes I found along the way:
- Use org.apache.commons.io.output.ByteArrayOutputStream instead of 
java.io.ByteArrayOutputStream (this class is actually not that new, but can 
benefit many uses and save a lot of byte-copying) - this has been further 
improved by providing an optimized InputStream from an 
org.apache.commons.io.output.ByteArrayOutputStream (IO-137)
- Allow using Charset instead of String encoding (IO-318)
- Use StringBuilderWriter instead of StringWriter to avoid unnecessary 
synchronization (IO-140)
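
To illustrate the last point without pulling in commons-io, here is a minimal sketch of the idea. BuilderWriter is a hypothetical stand-in, not the commons-io StringBuilderWriter: it simply backs a Writer with a StringBuilder, whereas java.io.StringWriter is backed by a StringBuffer, whose methods are synchronized.

```java
import java.io.Writer;

final class BuilderWriterSketch {

    /** Hypothetical stand-in illustrating the idea; not the commons-io class. */
    static final class BuilderWriter extends Writer {
        private final StringBuilder builder = new StringBuilder();

        @Override
        public void write(char[] cbuf, int off, int len) {
            // Appends directly to the unsynchronized builder.
            builder.append(cbuf, off, len);
        }

        @Override
        public void write(String str) {
            builder.append(str);
        }

        @Override
        public void flush() {
            // Nothing to flush: everything is already in the builder.
        }

        @Override
        public void close() {
            // Nothing to release.
        }

        @Override
        public String toString() {
            return builder.toString();
        }
    }

    static String demo() {
        BuilderWriter writer = new BuilderWriter();
        writer.write("tika-");
        writer.write(new char[] {'c', 'o', 'r', 'e'}, 0, 4);
        return writer.toString();
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Only the overridden methods skip locking here; write methods inherited from java.io.Writer still synchronize on the writer's lock object.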

Obviously, I did not propose this change just for the sake of disturbing the 
peace. I have planned, and already started, a series of patches to utilize 
newer commons-io, which will follow - each in its own issue - if and when 
commons-io is added as a dependency to tika-core.


 Bring back commons-io to tika-core
 --

 Key: TIKA-1706
 URL: https://issues.apache.org/jira/browse/TIKA-1706
 Project: Tika
  Issue Type: Improvement
  Components: core
Reporter: Yaniv Kunda
Priority: Minor
 Fix For: 1.11


 TIKA-249 inlined select commons-io classes in order to simplify the 
 dependency tree and save some space.
 I believe these arguments are weaker nowadays due to the following concerns:
 - Most of the non-core modules already use commons-io, and since tika-core is 
 usually not used by itself, commons-io is already included with it
 - Since some modules use both tika-core and commons-io, it's not clear which 
 code should be used
 - Having the inlined classes causes more maintenance and/or technical debt 
 (which in turn causes more maintenance)
 - Newer commons-io code utilizes newer platform code, e.g. using Charset 
 objects instead of encoding names, being able to use StringBuilder instead of 
 StringBuffer, and so on.
 I'll be happy to provide a patch to replace usages of the inlined classes 
 with commons-io classes if this is accepted.





[jira] [Created] (TIKA-1706) Bring back commons-io to tika-core

2015-08-12 Thread Yaniv Kunda (JIRA)
Yaniv Kunda created TIKA-1706:
-

 Summary: Bring back commons-io to tika-core
 Key: TIKA-1706
 URL: https://issues.apache.org/jira/browse/TIKA-1706
 Project: Tika
  Issue Type: Improvement
  Components: core
Reporter: Yaniv Kunda
Priority: Minor
 Fix For: 1.11


TIKA-249 inlined select commons-io classes in order to simplify the dependency 
tree and save some space.
I believe these arguments are weaker nowadays due to the following concerns:
- Most of the non-core modules already use commons-io, and since tika-core is 
usually not used by itself, commons-io is already included with it
- Since some modules use both tika-core and commons-io, it's not clear which 
code should be used
- Having the inlined classes causes more maintenance and/or technical debt 
(which in turn causes more maintenance)
- Newer commons-io code utilizes newer platform code, e.g. using Charset 
objects instead of encoding names, being able to use StringBuilder instead of 
StringBuffer, and so on.

I'll be happy to provide a patch to replace usages of the inlined classes with 
commons-io classes if this is accepted.


