[jira] [Updated] (TIKA-1436) improvement to PDFParser

2016-03-28 Thread Stefano Fornari (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stefano Fornari updated TIKA-1436:
--
Attachment: 0001-Improvment-as-described-in-https-issues.apache.org-j.patch

see comment on 20160328

> improvement to PDFParser
> 
>
> Key: TIKA-1436
> URL: https://issues.apache.org/jira/browse/TIKA-1436
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.6
>    Reporter: Stefano Fornari
>  Labels: parser, pdf
> Fix For: 1.13
>
> Attachments: 
> 0001-Improvment-as-described-in-https-issues.apache.org-j.patch, 
> ste-20140927.patch
>
>
> With regard to the thread "[PDFParser] - read limited number of characters" 
> on Mar 29, I would like to propose the attached patch. I noticed that in Tika 
> 1.6 there has been some work on better handling of the 
> WriteLimitReachedException condition, but I believe it could be 
> improved further. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (TIKA-1436) improvement to PDFParser

2016-03-28 Thread Stefano Fornari (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15214083#comment-15214083
 ] 

Stefano Fornari edited comment on TIKA-1436 at 3/28/16 10:58 AM:
-

patch  0001-Improvment-as-described-in-https-issues.apache.org-j.patch


was (Author: stefanofornari):
see comment on 20160328



[jira] [Commented] (TIKA-1436) improvement to PDFParser

2016-03-28 Thread Stefano Fornari (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15214082#comment-15214082
 ] 

Stefano Fornari commented on TIKA-1436:
---

Sorry, it took much longer than I expected... however, here is a new patch. I 
realized I had not created the previous one correctly, and I have added some 
more test code.
It is based on the master HEAD from git.

Regarding your concern here: 
"I'm looking at the raw patch now (not applied), and I'm a bit concerned that 
there is special handling for catching and swallowing a WriteLimitReached 
within the PDFParser. I may be misunderstanding your proposal, but the nice 
thing about the exception was that it put the burden/opportunity on the client 
to handle it, and we didn't have to add catch blocks to every parser (this 
point was already made by Jukka)."

There are two main reasons:

1. the limit lives in the ContentHandler, and the Parser is the client of that 
functionality, so it is the Parser that should handle the condition
2. the condition is handled because it is expected: we want parsing to succeed 
when the limit is reached, so that the content read so far can still be used; 
but I am open to exploring a different approach if anyone sees a better way.




[jira] [Commented] (TIKA-1436) improvement to PDFParser

2016-01-13 Thread Stefano Fornari (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15097245#comment-15097245
 ] 

Stefano Fornari commented on TIKA-1436:
---

Thanks for the feedback Tim.
I'll work on the trunk code and produce a new patch in the next few days. I 
will address your question too.



[jira] [Comment Edited] (TIKA-1436) improvement to PDFParser

2015-12-29 Thread Stefano Fornari (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15074740#comment-15074740
 ] 

Stefano Fornari edited comment on TIKA-1436 at 12/30/15 7:44 AM:
-

Hi,
the conversation is the one reported in the bug; in particular Jukka says "Yes, 
the pattern is a bit awkward and generally shouldn't be recommended as it uses 
an exception to control the flow of the program". Now, Jukka did not reply 
further, but I provided a patch that addresses all of his concerns... and no 
one objected. I do not know how to interpret that, but if it is not an 
agreement, I will ask the other developers on the list what they think. In the 
end, I can even give up. The current code does not make much sense, frankly. If 
the dev team is fine with it, then so be it, but given that my patch is not 
that intrusive and results in better code, I do not really see the reason for 
all this resistance.
Let me know if you want me to provide a new patch or ask again on the list.


was (Author: stefanofornari):
Hi,
the conversation is what reported in the bug, in particular Jukka says "Yes, 
the pattern is a bit awkward and generally shouldn't be
recommended as it uses an exception to control the flow of the program". Now, 
Yukka did not reply but I provided a patch that addresses all Yukka concerns... 
and no one objected. I do not know how to interpret it, but if this is not an 
agreement, I would ask other developers on the list what they think. At the 
end, I can even give up. The current code does not make much sense, frankly. If 
the dev team is fine with it, then be it, but given that my patch is no such 
intrusive and results in better code, I do not really see why all this 
resistance.
Let me know if you want me to provide a new patch or ask again on the list.



[jira] [Commented] (TIKA-1436) improvement to PDFParser

2015-12-29 Thread Stefano Fornari (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15074740#comment-15074740
 ] 

Stefano Fornari commented on TIKA-1436:
---

Hi,
the conversation is the one reported in the bug; in particular Jukka says "Yes, 
the pattern is a bit awkward and generally shouldn't be recommended as it uses 
an exception to control the flow of the program". Now, Jukka did not reply 
further, but I provided a patch that addresses all of his concerns... and no 
one objected. I do not know how to interpret that, but if it is not an 
agreement, I will ask the other developers on the list what they think. In the 
end, I can even give up. The current code does not make much sense, frankly. If 
the dev team is fine with it, then so be it, but given that my patch is not 
that intrusive and results in better code, I do not really see the reason for 
all this resistance.
Let me know if you want me to provide a new patch or ask again on the list.



[jira] [Commented] (TIKA-1436) improvement to PDFParser

2015-09-02 Thread Stefano Fornari (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14727654#comment-14727654
 ] 

Stefano Fornari commented on TIKA-1436:
---

any news?



Re: [ANNOUNCE] Apache Tika 1.8 Released

2015-04-22 Thread Stefano Fornari
congratulations and thanks!
Ste


On Tue, Apr 21, 2015 at 4:29 PM, Mattmann, Chris A (3980)
 wrote:
> Yay thanks Tyler!
>
> ++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: chris.a.mattm...@nasa.gov
> WWW:  http://sunset.usc.edu/~mattmann/
> ++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++
>
>
>
>
>
>
> -Original Message-
> From: , "Timothy B." 
> Reply-To: "dev@tika.apache.org" 
> Date: Tuesday, April 21, 2015 at 8:34 AM
> To: "dev@tika.apache.org" 
> Subject: RE: [ANNOUNCE] Apache Tika 1.8 Released
>
>>Thank you, Tyler!
>>
>>-Original Message-
>>From: Tyler Palsulich [mailto:tpalsul...@apache.org]
>>Sent: Monday, April 20, 2015 5:09 PM
>>To: dev@tika.apache.org; u...@tika.apache.org; annou...@apache.org
>>Subject: [ANNOUNCE] Apache Tika 1.8 Released
>>
>>The Apache Tika project is pleased to announce the release of Apache Tika
>>1.8. The release
>>contents have been pushed out to the main Apache release site and to the
>>Maven Central sync, so the releases should be available as soon as the
>>mirrors get the syncs.
>>
>>Apache Tika is a toolkit for detecting and extracting metadata and
>>structured text content
>>from various documents using existing parser libraries.
>>
>>Apache Tika 1.8 contains a number of improvements and bug fixes. Details
>>can be found in the changes file:
>>http://www.apache.org/dist/tika/CHANGES-1.8.txt
>>
>>Apache Tika is available in source form from the following download page:
>>http://www.apache.org/dyn/closer.cgi/tika/apache-tika-1.8-src.zip
>>
>>Apache Tika is also available in binary form or for use using Maven 2 from
>>the Central Repository: http://repo1.maven.org/maven2/org/apache/tika/
>>
>>In the initial 48 hours, the release may not be available on all mirrors.
>>When downloading from a mirror site, please remember to verify the
>>downloads using signatures found on the Apache site:
>>https://people.apache.org/keys/group/tika.asc
>>
>>For more information on Apache Tika, visit the project home page:
>>http://tika.apache.org/
>>
>>-- Tyler Palsulich, on behalf of the Apache Tika community
>


[jira] [Commented] (TIKA-1436) improvement to PDFParser

2015-02-07 Thread Stefano Fornari (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14310638#comment-14310638
 ] 

Stefano Fornari commented on TIKA-1436:
---

Oops, I did not notice this needed some background. As per the thread 
mentioned on the mailing list, which I am quoting below for your convenience, 
I believe there was consensus that the current pattern is not the best and is 
difficult to understand. I am not sure, however, about the many unrelated 
changes to methods/variables that you mention. I had a quick look at the patch 
and could not find any; can you please point them out?

thanks in advance,

> On #2, I expected the code you presented would not work. And in fact the
> pattern is quite odd, isn't it? What is the reason of throwing the
> exception if limiting the text read is a legal use case? (I am asking just
> to understand the background).

Yes, the pattern is a bit awkward and generally shouldn't be
recommended as it uses an exception to control the flow of the
program. However, in this case we considered it worth doing as the
alternative would have been far more complicated.

Basically we wanted to avoid having to modify each parser
implementation (even those implemented outside Tika...) to keep track
of how much content has already been extracted and instead do that
just once in the WriteOutContentHandler class. However, the only way
for the WriteOutContentHandler to signal that parsing should be
stopped is by throwing a SAXException, which is what we're doing here.
By catching the exception and inspecting it with isWriteLimitReached()
the client can determine whether this is what happened.

BR,

Jukka Zitting
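The catch-and-inspect pattern described above can be sketched without any Tika dependency. `LimitedHandler`, `parseWithLimit`, and the exception class below are simplified, hypothetical stand-ins for Tika's `WriteOutContentHandler` and its `WriteLimitReachedException`, not the real API; the point is only the control flow: the handler signals the limit with an exception, the client catches it, asks `isWriteLimitReached`, and keeps the text gathered so far.

```java
public class WriteLimitDemo {

    // Stand-in for Tika's internal limit-signal exception.
    static class WriteLimitReachedException extends RuntimeException {
        WriteLimitReachedException(String msg) { super(msg); }
    }

    // Stand-in for WriteOutContentHandler: buffers text up to a limit,
    // then throws to stop the (possibly third-party) parser.
    static class LimitedHandler {
        private final StringBuilder out = new StringBuilder();
        private final int writeLimit;

        LimitedHandler(int writeLimit) { this.writeLimit = writeLimit; }

        void characters(String chunk) {
            if (writeLimit < 0 || out.length() + chunk.length() <= writeLimit) {
                out.append(chunk);
            } else {
                // Keep the text up to the limit, then signal via exception.
                out.append(chunk, 0, writeLimit - out.length());
                throw new WriteLimitReachedException("write limit reached");
            }
        }

        // Walks the cause chain, mirroring the isWriteLimitReached(Throwable)
        // idiom discussed in the thread.
        boolean isWriteLimitReached(Throwable t) {
            if (t instanceof WriteLimitReachedException) return true;
            return t.getCause() != null && isWriteLimitReached(t.getCause());
        }

        public String toString() { return out.toString(); }
    }

    // The client side: catch the exception, check whether it was only the
    // limit signal, and use the content extracted so far.
    static String parseWithLimit(String document, int limit) {
        LimitedHandler handler = new LimitedHandler(limit);
        try {
            for (String token : document.split("(?<=\\s)")) {
                handler.characters(token);
            }
        } catch (RuntimeException e) {
            if (!handler.isWriteLimitReached(e)) {
                throw e; // a real error, not the limit signal
            }
            // Limit reached: this is an expected, valid condition.
        }
        return handler.toString();
    }

    public static void main(String[] args) {
        String text = parseWithLimit("one two three four five", 9);
        System.out.println(text); // prints: one two t
    }
}
```

The design trade-off Jukka describes is visible here: the counting happens once in the handler, so the parsing loop needs no knowledge of the limit at all.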



Re: [PDFParser] - patch proposal

2014-10-05 Thread Stefano Fornari
done, thanks!

https://issues.apache.org/jira/browse/TIKA-1436

Ste

On Sun, Oct 5, 2014 at 6:40 PM, Tyler Palsulich 
wrote:

> Hi Stefano,
>
> Thank you for the patch and the reminder! Could you please create an issue
> on the TIKA JIRA [0]? Or, if this patch corresponds to a particular issue,
> attach your patch to that issue?
>
> Thank you!
> Tyler
>
> [0] https://issues.apache.org/jira/browse/TIKA
> On Oct 5, 2014 5:57 AM, "Stefano Fornari" 
> wrote:
>
> > hi, a friendly reminder to get a feedback on this.
> >
> >
> >
> > Ste
> >
> > On Sat, Sep 27, 2014 at 3:08 PM, Stefano Fornari <
> > stefano.forn...@gmail.com>
> > wrote:
> >
> > > Hi All,
> > > With regard to the thread "[PDFParser] - read limited number of
> > > characters" on Mar 29, I would like to propose the attached patch. I
> > > noticed that in Tika 1.6 there has been some work on better handling
> > > of the WriteLimitReachedException condition, but I believe it could be
> > > improved further.
> > >
> > > What do you think?
> > > Ste
> > >
> >
>


[jira] [Updated] (TIKA-1436) improvement to PDFParser

2014-10-05 Thread Stefano Fornari (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stefano Fornari updated TIKA-1436:
--
Attachment: ste-20140927.patch



[jira] [Created] (TIKA-1436) improvement to PDFParser

2014-10-05 Thread Stefano Fornari (JIRA)
Stefano Fornari created TIKA-1436:
-

 Summary: improvement to PDFParser
 Key: TIKA-1436
 URL: https://issues.apache.org/jira/browse/TIKA-1436
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.6
Reporter: Stefano Fornari


With regard to the thread "[PDFParser] - read limited number of characters" on 
Mar 29, I would like to propose the attached patch. I noticed that in Tika 1.6 
there has been some work on better handling of the 
WriteLimitReachedException condition, but I believe it could be improved further. 





Re: [PDFParser] - patch proposal

2014-10-05 Thread Stefano Fornari
hi, a friendly reminder to get a feedback on this.



Ste

On Sat, Sep 27, 2014 at 3:08 PM, Stefano Fornari 
wrote:

> Hi All,
> With regard to the thread "[PDFParser] - read limited number of
> characters" on Mar 29, I would like to propose the attached patch. I
> noticed that in Tika 1.6 there has been some work on better handling
> of the WriteLimitReachedException condition, but I believe it could be
> improved further.
>
> What do you think?
> Ste
>


[PDFParser] - patch proposal

2014-09-27 Thread Stefano Fornari
Hi All,
With regard to the thread "[PDFParser] - read limited number of
characters" on Mar 29, I would like to propose the attached patch. I
noticed that in Tika 1.6 there has been some work on better handling
of the WriteLimitReachedException condition, but I believe it could be
improved further.

What do you think?
Ste
Index: tika-core/src/main/java/org/apache/tika/sax/WriteOutContentHandler.java
===
--- tika-core/src/main/java/org/apache/tika/sax/WriteOutContentHandler.java	(revision 1627940)
+++ tika-core/src/main/java/org/apache/tika/sax/WriteOutContentHandler.java	(working copy)
@@ -50,6 +50,11 @@
 private int writeCount = 0;
 
 /**
+ * Flag to mark if the limit has been reached
+ */
+private boolean writeLimitReached = false;
+
+/**
  * Creates a content handler that writes content up to the given
  * write limit to the given content handler.
  *
@@ -138,6 +143,7 @@
 } else {
 super.characters(ch, start, writeLimit - writeCount);
 writeCount = writeLimit;
+writeLimitReached = true;
 throw new WriteLimitReachedException(
 "Your document contained more than " + writeLimit
 + " characters, and so your requested limit has been"
@@ -156,6 +162,7 @@
 } else {
 super.ignorableWhitespace(ch, start, writeLimit - writeCount);
 writeCount = writeLimit;
+writeLimitReached = true;
 throw new WriteLimitReachedException(
 "Your document contained more than " + writeLimit
 + " characters, and so your requested limit has been"
@@ -173,31 +180,26 @@
  * @param t throwable
  * @return true if the write limit was reached,
  * false otherwise
+ * 
+ * Deprecated in Tika 1.6, use isWriteLimitReached(); the current 
+ * implementation ignores the given Throwable and is equivalent to 
+ * isWriteLimitReached()
+ * 
  */
+@Deprecated
 public boolean isWriteLimitReached(Throwable t) {
-if (t instanceof WriteLimitReachedException) {
-return tag.equals(((WriteLimitReachedException) t).tag);
-} else {
-return t.getCause() != null && isWriteLimitReached(t.getCause());
-}
+return isWriteLimitReached();
 }
-
+
 /**
- * The exception used as a signal when the write limit has been reached.
+ * Returns true if the limit has been reached, false otherwise.
+ *
+ * @since Apache Tika 1.6
+ * @return true if the write limit was reached,
+ * false otherwise
  */
-private static class WriteLimitReachedException extends SAXException {
-
-/** Serial version UID */
-private static final long serialVersionUID = -1850581945459429943L;
-
-/** Serializable tag of the handler that caused this exception */
-private final Serializable tag;
-
-public WriteLimitReachedException(String message, Serializable tag) {
-   super(message);
-   this.tag = tag;
-}
-
+public boolean isWriteLimitReached() {
+return writeLimitReached;
 }
 
 }
Index: tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParser.java
===
--- tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParser.java	(revision 1627940)
+++ tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParser.java	(working copy)
@@ -52,6 +52,7 @@
 import org.apache.tika.parser.AbstractParser;
 import org.apache.tika.parser.ParseContext;
 import org.apache.tika.parser.PasswordProvider;
+import org.apache.tika.sax.WriteLimitReachedException;
 import org.xml.sax.ContentHandler;
 import org.xml.sax.SAXException;
 
@@ -157,7 +158,13 @@
 metadata.set(Metadata.CONTENT_TYPE, "application/pdf");
 extractMetadata(pdfDocument, metadata);
 if (handler != null) {
-PDF2XHTML.process(pdfDocument, handler, context, metadata, localConfig);
+try {
+PDF2Text.process(pdfDocument, handler, context, metadata, localConfig);
+} catch (WriteLimitReachedException x) {
+//
+// This is a valid condition; just ignoring the exception
+//
+}
 }
 
 } finally {
Index: tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParserConfig.java
===
--- tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParserConfig.java	(revision 1627940)
+++ tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParserConfig.java	(working copy)
@@ -144,21 +144,21 @@
  * 
  * @param pdf2XHTML
  */
-public void configure(PDF2XHTML pdf2XHTML) {
-pdf2XHTML.s

Re: [PDFParser] - read limited number of characters

2014-04-02 Thread Stefano Fornari
Hi Jukka,
any feedbacks?

Ste


On Sat, Mar 29, 2014 at 3:31 PM, Stefano Fornari
wrote:

> Hi Jukka, given we agree the pattern is not very nice, would you be OK with
> hiding it from client classes? I dug a bit more into the code and found that
> everything we need is already there. This is what I would propose:
>
> 1 promote WriteLimitReachedException to a public class
> 2 move the awkward trick into PDFParser as follows:
>
> metadata.set(Metadata.CONTENT_TYPE, "application/pdf");
> extractMetadata(pdfDocument, metadata);
> try {
>  PDF2XText.process(pdfDocument, handler, context, metadata,
> localConfig);
> } catch (WriteLimitReachedException x) {
>   //
>   // This is a valid condition; just ignoring the exception
>   //
> }
>
> In this way, the only thing client classes need to do is use a limiting
> BodyContentHandler:
>
> @Test
> public void testLimitTextToParse() throws Exception {
> ContentHandler handler = new BodyContentHandler();
>
> new PDFParser().parse(
> getResourceAsStream("/test-documents/testPDF.pdf"),
> handler,
> new Metadata(),
> new ParseContext()
> );
>
> assertEquals(1067, handler.toString().length());
>
> handler = new BodyContentHandler(500);
>
> new PDFParser().parse(
> getResourceAsStream("/test-documents/testPDF.pdf"),
> handler,
> new Metadata(),
> new ParseContext()
> );
>
> assertEquals(500, handler.toString().length());
> }
>
>
> One additional thing I would do is to change WriteOutContentHandler as per
> the below:
>
> /**
>  * Writes the given characters to the given character stream.
>  */
> @Override
> public void characters(char[] ch, int start, int length)
> throws SAXException {
> if (writeLimit == -1 || writeCount + length <= writeLimit) {
> super.characters(ch, start, length);
> writeCount += length;
> } else {
> super.characters(ch, start, writeLimit - writeCount);
> writeCount = writeLimit;
> writeLimitReached = true;
> throw new WriteLimitReachedException(
> "Your document contained more than " + writeLimit
> + " characters, and so your requested limit has been"
> + " reached. To receive the full text of the document,"
> + " increase your limit. (Text up to the limit is"
> + " however available).", tag);
> }
> }
>
> /**
>  * Checks whether the given exception (or any of it's root causes) was
>  * thrown by this handler as a signal of reaching the write limit.
>  *
>  * @since Apache Tika 0.7
>  * @param t throwable
>  * @return true if the write limit was reached,
>  * false otherwise
>  *
>  * Deprecated in Tika 1.6, use isWriteLimitReached(); the current
>  * implementation ignores the given Throwable and is equivalent to
>  * isWriteLimitReached()
>  *
>  */
> @Deprecated
> public boolean isWriteLimitReached(Throwable t) {
> return isWriteLimitReached();
> }
>
> /**
>  * Returns true if the limit has been reached, false otherwise.
>  *
>  * @since Apache Tika 1.6
>  * @return true if the write limit was reached,
>  * false otherwise
>  */
> public boolean isWriteLimitReached() {
> return writeLimitReached;
> }
>
>
> If you are ok with the changes for #1 and #2 I will be happy to provide a
> patch.
>
> Ste
>
>
>
>> > On #2, I expected the code you presented would not work. And in fact the
>> > pattern is quite odd, isn't it? What is the reason of throwing the
>> > exception if limiting the text read is a legal use case? (I am asking
>> just
>> > to understand the background).
>>
>> Yes, the pattern is a bit awkward and generally shouldn't be
>> recommended as it uses an exception to control the flow of the
>> program. However, in this case we considered it worth doing as the
>> alternative would have been far more complicated.
>>
>> Basically we wanted to avoid having to modify each parser
>> implementation (even those implemented outside Tika...) to keep track
>> of how much content has already been extracted and instead do that
>> just once in the WriteOutContentHandler class. However, the only way
>> for the WriteOutContentHandler to signal that parsing should be
>> stopped is by throwing a SAXException, which is what we're doing here.
>> By catching the exception and inspecting it with isWriteLimitReached()
>> the client can determine whether this is what happened.
>>
>> BR,
>>
>> Jukka Zitting
>>
>
>


[PDFParser] - read limited number of characters

2014-03-29 Thread Stefano Fornari
Hi Jukka, given we agree the pattern is not very nice, would you be OK with
hiding it from client classes? I dug a bit more into the code and found that
everything we need is already there. This is what I would propose:

1. promote WriteLimitReachedException to a public class
2. move the awkward trick into PDFParser as follows:

metadata.set(Metadata.CONTENT_TYPE, "application/pdf");
extractMetadata(pdfDocument, metadata);
try {
 PDF2XText.process(pdfDocument, handler, context, metadata,
localConfig);
} catch (WriteLimitReachedException x) {
  //
  // This is a valid condition; just ignoring the exception
  //
}

In this way, the only thing client classes need to do is use a limiting
BodyContentHandler:

@Test
public void testLimitTextToParse() throws Exception {
ContentHandler handler = new BodyContentHandler();

new PDFParser().parse(
getResourceAsStream("/test-documents/testPDF.pdf"),
handler,
new Metadata(),
new ParseContext()
);

assertEquals(1067, handler.toString().length());

handler = new BodyContentHandler(500);

new PDFParser().parse(
getResourceAsStream("/test-documents/testPDF.pdf"),
handler,
new Metadata(),
new ParseContext()
);

assertEquals(500, handler.toString().length());
}


One additional thing I would do is to change WriteOutContentHandler as per
the below:

/**
 * Writes the given characters to the given character stream.
 */
@Override
public void characters(char[] ch, int start, int length)
throws SAXException {
if (writeLimit == -1 || writeCount + length <= writeLimit) {
super.characters(ch, start, length);
writeCount += length;
} else {
super.characters(ch, start, writeLimit - writeCount);
writeCount = writeLimit;
writeLimitReached = true;
throw new WriteLimitReachedException(
"Your document contained more than " + writeLimit
+ " characters, and so your requested limit has been"
+ " reached. To receive the full text of the document,"
+ " increase your limit. (Text up to the limit is"
+ " however available).", tag);
}
}

/**
 * Checks whether the given exception (or any of it's root causes) was
 * thrown by this handler as a signal of reaching the write limit.
 *
 * @since Apache Tika 0.7
 * @param t throwable
 * @return true if the write limit was reached,
 * false otherwise
 *
 * Deprecated in Tika 1.6, use isWriteLimitReached(); the current
 * implementation ignores the given Throwable and is equivalent to
 * isWriteLimitReached()
 *
 */
@Deprecated
public boolean isWriteLimitReached(Throwable t) {
return isWriteLimitReached();
}

/**
 * Returns true if the limit has been reached, false otherwise.
 *
 * @since Apache Tika 1.6
 * @return true if the write limit was reached,
 * false otherwise
 */
public boolean isWriteLimitReached() {
return writeLimitReached;
}


If you are ok with the changes for #1 and #2 I will be happy to provide a
patch.

Ste



> > On #2, I expected the code you presented would not work. And in fact the
> > pattern is quite odd, isn't it? What is the reason of throwing the
> > exception if limiting the text read is a legal use case? (I am asking
> just
> > to understand the background).
>
> Yes, the pattern is a bit awkward and generally shouldn't be
> recommended as it uses an exception to control the flow of the
> program. However, in this case we considered it worth doing as the
> alternative would have been far more complicated.
>
> Basically we wanted to avoid having to modify each parser
> implementation (even those implemented outside Tika...) to keep track
> of how much content has already been extracted and instead do that
> just once in the WriteOutContentHandler class. However, the only way
> for the WriteOutContentHandler to signal that parsing should be
> stopped is by throwing a SAXException, which is what we're doing here.
> By catching the exception and inspecting it with isWriteLimitReached()
> the client can determine whether this is what happened.
>
> BR,
>
> Jukka Zitting
>


[PDFParser] XHTML vs plain text (was Re: PDF parser (two more questions))

2014-03-29 Thread Stefano Fornari
Hi Jukka,
I am splitting the thread.

Thanks to your explanation and to playing with the code, I now understand
better how it works: basically it uses a SAX handler, and it is up to that
handler whether to add the XHTML markup or not. BodyContentHandler does not
add the markup -> plain text; ToXMLContentHandler adds the markup -> XHTML.


That being the case, the name PDF2XHTML is misleading, isn't it? Would you
be OK with changing it to PDF2Text (as in text/plain or text/html)? It's a
package-private class, so changing the name should not be an issue.


Ste


On Fri, Mar 28, 2014 at 3:42 PM, Jukka Zitting wrote:

> Hi,
>
> On Fri, Mar 28, 2014 at 5:32 AM, Stefano Fornari
>  wrote:
> > On #1 I am still wondering why for indexing we need structure
> information.
> > Is there any particular reason? Wouldn't it make more sense to get just the
> > text by default and only optionally get the structure?
>
> The trouble is that then each parser would need to have code for
> producing both text and XHTML. Since the overhead of producing XHTML
> instead of just text is pretty low, and since it's very easy for
> clients that only care about the text output to just strip out the
> markup, it made more sense to design the system to always produce
> XHTML.
>
> The same applies for document metadata. All parsers produce as much
> metadata as they can, but most clients will just ignore most or all of
> the returned metadata fields. However, since the overhead of producing
> all the information is lower than that of adding explicit options to
> control which metadata needs to be extracted and returned, it makes
> sense to just let clients filter out those bits that they don't
> care about.
>
>


Re: PDF parser (two more questions)

2014-03-28 Thread Stefano Fornari
well, I should look at the code, I can't do it now, but I guess my point is
that BodyContentHandler should not throw the exception (and most probably
not a SAXException in any case) when the limit is reached. This means that
the limit should not be put on WriteOutContentHandler, but on
BodyContentHandler.

Ste


On Fri, Mar 28, 2014 at 11:52 AM, Konstantin Gribov wrote:

> SAXException is checked, so you have to catch it or add to method throws
> list (or javac wouldn't compile it). Tika usually rethrows exceptions
> enveloping them into TikaException. In case of code above method throws
> SAXException.
>
> Suppressing the exception is done to avoid failing the parse after a
> valuable amount of data has already been extracted.
>
> --
> Best regards,
> Konstantin Gribov.
> On 28.03.2014 14:27, "Stefano Fornari"  >
> wrote:
>
> > On Fri, Mar 28, 2014 at 11:26 AM, Stefano Fornari <
> > stefano.forn...@gmail.com
> > > wrote:
> >
> > > I understood the trick, but I am trying to understand why this is done
> > > in this way (which at first glance does not seem clean).
> > >
> > > ... trying to understand why this is done in this way...
> >
>


Re: PDF parser (two more questions)

2014-03-28 Thread Stefano Fornari
On Fri, Mar 28, 2014 at 11:26 AM, Stefano Fornari  wrote:

> I understood the trick, but I am trying to understand why this is done in
> this way (which at first glance does not seem clean).
>
> ... trying to understand why this is done in this way...


Re: PDF parser (two more questions)

2014-03-28 Thread Stefano Fornari
Yes, got it. Which is a strange use case: if I set the limit, first I would
not expect an exception (which represents an unexpected error condition);
secondly, I would not expect it to be rethrown only under certain conditions.
I understood the trick, but I am trying to understand why this is done this
way (which at first glance does not seem clean).


Re: PDF parser (two more questions)

2014-03-28 Thread Stefano Fornari
Hi Jukka,
thanks a lot for your reply.

On #1 I am still wondering why we need structure information for indexing.
Is there any particular reason? Wouldn't it make more sense to get just the
text by default and only optionally get the structure?

On #2, I expected the code you presented would not work. And in fact the
pattern is quite odd, isn't it? What is the reason for throwing the
exception if limiting the text read is a legal use case? (I am asking just
to understand the background.)

Ste



On Thu, Mar 27, 2014 at 11:55 PM, Jukka Zitting wrote:

> Hi,
>
> On Thu, Mar 27, 2014 at 6:21 PM, Stefano Fornari
>  wrote:
> > 1. is the use of PDF2XHTML necessary? why is the pdf turned into an
> XHTML?
> > for the purpose of indexing, wouldn't just the text be enough?
>
> The XHTML output allows us to annotate the extracted text with
> structural information (like "this is a heading", "here's a
> hyperlink", etc.) that would be difficult to express with text-only
> output. A client that needs just the text content can get it easily
> with the BodyContentHandler class.
>
> > 2. I need to limit indexing of content to files whose size is below
> > a certain threshold; I was wondering if this could be a parser
> > configuration option and thus if you would accept this change.
>
> Do you want to entirely exclude too large files, or just index the
> first few pages of such files (which is more common in many indexing
> use cases)?
>
> The latter use case can be implemented with the writeLimit parameter of
> the WriteOutContentHandler class, like this:
>
> // Extract up to 100k characters from a given document
> WriteOutContentHandler out = new WriteOutContentHandler(100_000);
> try {
>     parser.parse(..., new BodyContentHandler(out), ...);
> } catch (SAXException e) {
>     if (!out.isWriteLimitReached(e)) {
>         throw e;
>     }
> }
> String content = out.toString();
>
> BR,
>
> Jukka Zitting
>


Re: Parser.parse with file instead of stream

2014-03-27 Thread Stefano Fornari
that worked! thanks.

Ste


On Thu, Mar 27, 2014 at 11:24 PM, Jukka Zitting wrote:

> Hi,
>
> On Thu, Mar 27, 2014 at 6:07 PM, Stefano Fornari
>  wrote:
> > I am not sure tstream.hasFile() can ever be true, from my understanding
> of
> > the code it can be only false.
>
> It's true if you call the parser like this:
>
> InputStream stream = TikaInputStream.get(file);
> try {
>     parser.parse(stream, ...);
> } finally {
>     stream.close();
> }
>
> > What do you think about extending the Parser interface accordingly?
>
> See https://issues.apache.org/jira/browse/TIKA-153 (and the
> TikaInputStream javadocs) for details on how we already achieve this
> functionality.
>
> BR,
>
> Jukka Zitting
>


PDF parser (two more questions)

2014-03-27 Thread Stefano Fornari
Hi,
I have two more questions on PDFParser:

1. is the use of PDF2XHTML necessary? why is the pdf turned into an XHTML?
for the purpose of indexing, wouldn't just the text be enough?
2. I need to limit indexing of content to files whose size is below
a certain threshold; I was wondering if this could be a parser
configuration option and thus if you would accept this change.

Thanks in advance,
Ste


Parser.parse with file instead of stream

2014-03-27 Thread Stefano Fornari
Hi All,
I am using Lucene in an embedded environment and I need to keep memory
usage under control. While investigating a problem with big PDF files (a few
MB), I noticed that Parser.parse takes an InputStream as a parameter, but
PDFParser then has the following code:

TikaInputStream tstream = TikaInputStream.cast(stream);
if (tstream != null && tstream.hasFile()) {
    // File based, take that as a cue to use a temporary file
    RandomAccess scratchFile =
            new RandomAccessFile(tmp.createTemporaryFile(), "rw");
    if (localConfig.getUseNonSequentialParser() == true) {
        pdfDocument = PDDocument.loadNonSeq(
                new CloseShieldInputStream(stream), scratchFile);
    } else {
        pdfDocument = PDDocument.load(
                new CloseShieldInputStream(stream), scratchFile, true);
    }
} else {
    // Go for the normal, stream based in-memory parsing
    if (localConfig.getUseNonSequentialParser() == true) {
        pdfDocument = PDDocument.loadNonSeq(
                new CloseShieldInputStream(stream), new RandomAccessBuffer());
    } else {
        pdfDocument = PDDocument.load(
                new CloseShieldInputStream(stream), true);
    }
}

I am not sure tstream.hasFile() can ever be true; from my understanding of
the code it can only be false. Therefore the "else" branch triggers and the
stream is managed in memory. I suspect this means the stream (or a good part
of it) is read into memory somewhere, potentially using a lot of memory.

I have then tried a different approach, adding a version of parse() that
accepts a file instead of a stream. The code above will then become:

TikaInputStream tstream = TikaInputStream.get(file);
if (tstream != null && tstream.hasFile()) {
    // File based, take that as a cue to use a temporary file
    RandomAccess scratchFile =
            new RandomAccessFile(tmp.createTemporaryFile(), "rw");
    if (localConfig.getUseNonSequentialParser() == true) {
        pdfDocument = PDDocument.loadNonSeq(
                new CloseShieldInputStream(tstream), scratchFile);
    } else {
        pdfDocument = PDDocument.load(
                new CloseShieldInputStream(tstream), scratchFile, true);
    }
} else {
    // Go for the normal, stream based in-memory parsing
    if (localConfig.getUseNonSequentialParser() == true) {
        pdfDocument = PDDocument.loadNonSeq(
                new CloseShieldInputStream(tstream), new RandomAccessBuffer());
    } else {
        pdfDocument = PDDocument.load(
                new CloseShieldInputStream(tstream), true);
    }
}

(but do we really need the && in the if?)

This is much friendlier to memory usage: with the first version of the
method I could not parse a 4.3 MB file running the JVM with 16 MB of heap,
while I parsed it successfully with the second approach.
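The saving comes from giving the parser a real file to work with, so it can use random access instead of buffering the stream on the heap. A stdlib-only sketch of this spool-to-temporary-file idea (roughly what a file-backed TikaInputStream enables; the helper name here is made up):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class SpoolDemo {
    // Copy the stream to a temporary file so downstream code can use
    // random access instead of buffering the whole stream on the heap.
    static Path spoolToTempFile(InputStream in) throws IOException {
        Path tmp = Files.createTempFile("spool", ".bin");
        tmp.toFile().deleteOnExit();
        Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
        return tmp;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "not really a PDF".getBytes("UTF-8");
        Path tmp = spoolToTempFile(new ByteArrayInputStream(data));
        // Random access over the spooled file: jump to the last byte only,
        // without ever holding the whole content in memory at once.
        try (RandomAccessFile raf = new RandomAccessFile(tmp.toFile(), "r")) {
            raf.seek(data.length - 1);
            System.out.println((char) raf.read()); // prints "F"
        }
    }
}
```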

What do you think about extending the Parser interface accordingly? Would
you be interested in a patch that does it?

Ste


Re: [ANNOUNCE] Apache Tika 1.5 Released

2014-02-21 Thread Stefano Fornari
Congratulations Dave and team!

Ste


On Wed, Feb 19, 2014 at 11:18 PM, David Meikle  wrote:

> The Apache Tika project is pleased to announce the release of Apache Tika
> 1.5. The release contents have been pushed out to the main Apache release
> site and to the Maven Central sync, so the releases should be available as
> soon as the mirrors get the syncs.
>
> Apache Tika is a toolkit for detecting and extracting metadata and
> structured text content from various documents using existing parser
> libraries.
>
> Apache Tika 1.5 contains a number of improvements and bug fixes. Details
> can be found in the changes file:
> http://www.apache.org/dist/tika/CHANGES-1.5.txt
>
> Apache Tika is available in source form from the following download page:
> http://www.apache.org/dyn/closer.cgi/tika/apache-tika-1.5-src.zip
>
> Apache Tika is also available in binary form or for use using Maven 2 from
> the Central Repository:
> http://repo1.maven.org/maven2/org/apache/tika/
>
> In the initial 48 hours, the release may not be available on all mirrors.
> When downloading from a mirror site, please remember to verify the
> downloads using signatures found on the Apache site:
> https://people.apache.org/keys/group/tika.asc
>
> For more information on Apache Tika, visit the project home page:
> http://tika.apache.org/
>
> -- Dave Meikle, on behalf of the Apache Tika community
>
>


Re: Passing to FEST for JUnit tests ?

2014-01-20 Thread Stefano Fornari
Actually, I understand a more open fork is available; it seems to be more
active than FEST: AssertJ, https://github.com/joel-costigliola/assertj-core

HTH
Ste



On Mon, Jan 20, 2014 at 10:10 AM, Hong-Thai Nguyen <
hong-thai.ngu...@polyspot.com> wrote:

> It's just that the syntax is much more fluent; nothing changes with your IDE.
> More about Fest vs Junit:
> http://maciejwalkowiak.pl/blog/2012/03/23/better-unit-tests-with-fest-assert/
>
>
> Hong-Thai
>
>
> -Message d'origine-
> De : Konstantin Gribov [mailto:gros...@gmail.com]
> Envoyé : samedi 18 janvier 2014 09:01
> À : dev@tika.apache.org
> Objet : Re: Passing to FEST for JUnit tests ?
>
> Does it give something more than just fluent interface? Does it integrate
> to IDEs as good as JUnit?
>
> --
> Best regards,
> Konstantin Gribov.
>
>
> 2014/1/17 Hong-Thai Nguyen 
>
> > Dear all,
> >
> > Fest (https://code.google.com/p/fest/ ) syntax is much more intuitive than
> > JUnit :
> > assertEquals("result", gettingResult());
> > by
> > assertThat(gettingResult(), is("result"));
> >
> > We may replace progressively in our tests.
> >
> > Hong-Thai
> >
> >
>


Re: Passing to FEST for JUnit tests ?

2014-01-18 Thread Stefano Fornari
I did not know about it and I gave it a try just last night. It integrates
perfectly in NetBeans (same as JUnit) and its syntax is very convenient,
much more so than JUnit's. Replacing JUnit takes no time (I believe it could
even be done with a script).

my 2 cents
ste



On Sat, Jan 18, 2014 at 9:00 AM, Konstantin Gribov wrote:

> Does it give something more than just fluent interface? Does it integrate
> to IDEs as good as JUnit?
>
> --
> Best regards,
> Konstantin Gribov.
>
>
> 2014/1/17 Hong-Thai Nguyen 
>
> > Dear all,
> >
> > Fest (https://code.google.com/p/fest/ ) syntax is much more intuitive than
> > JUnit :
> > assertEquals("result", gettingResult());
> > by
> > assertThat(gettingResult(), is("result"));
> >
> > We may replace progressively in our tests.
> >
> > Hong-Thai
> >
> >
>


Re: [DISCUSS] Prepare Release 1.5?

2014-01-14 Thread Stefano Fornari
Hi Dave,
I am fairly new to the community, but I'll provide my feedback anyway :)
Currently, Tika 1.4 has a serious bug that makes it hang on partial mp3
files, so it can be quite bad in production. Tika 1.5 fixes it, but I do
understand TIKA-1198 is a bad regression, therefore it is a blocker for me
too. I am not familiar with the WS code so I do not know how much work it
would be to fix it. However, if no one commits to fixing it, is a rollback
an option? We could roll back the CXF change and then be ready to release.

Thoughts?

Ste


On Thu, Jan 9, 2014 at 12:45 PM, Chris Mattmann  wrote:

> Hey Dave,
>
> I kind of got bogged down and haven't had time to release. If someone
> else does have time and wants to pick this up, +1 for it!
>
> Cheers,
> Chris
>
>
>
>
> -Original Message-
> From: David Meikle 
> Reply-To: "dev@tika.apache.org" 
> Date: Thursday, January 9, 2014 3:46 AM
> To: "dev@tika.apache.org" 
> Subject: Re: [DISCUSS] Prepare Release 1.5?
>
> >Hi,
> >
> >On 29 Dec 2013, at 11:41, David Meikle  wrote:
> >
> >> Hi Guys,
> >>
> >> There have been some questions pop up around when a new 1.5 release
> >>will be available.
> >>
> >> I have some free cycles over the next couple of weeks to prepare one
> >>and I believe Chris has some too, so in preparation for that what do we
> >>need to do to make the current trunk releasable as version 1.5?
> >>
> >> For me the following issue need to be fixed before release:
> >> TIKA-1198 - the change to using multi-parts appears to have broken our
> >>current guidance on usage significantly.
> >>
> >> Is there anything else others think is a must before rolling a release?
> >>
> >> I was also thinking we could do some quick work to include the
> >>following issues:
> >> TIKA-1059
> >> TIKA-985, TIKA-980
> >>
> >> I don't want to hold things up, so if we sort out people's mandatories I
> >> think we should roll a release.
> >>
> >> @Chris - I know you had free cycles and volunteered so will defer to
> >>you on the release management side of things.  That said happy to take
> >>it on if that helps.
> >>
> >> Cheers,
> >> Dave
> >
> >Conscious it was the festive period of late, so wondering if anyone has
> >had further thoughts on this?
> >
> >Cheers,
> >Dave
>
>
>


[jira] [Commented] (TIKA-1078) TikaCLI: invalid characters in embedded document name causes FNFE when trying to save

2014-01-13 Thread Stefano Fornari (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13870125#comment-13870125
 ] 

Stefano Fornari commented on TIKA-1078:
---

Hi Michael,
thanks for the review. I took all your comments into account. About the 
directory structure, I reverted my change now that I understand the original 
behaviour better. I think the original behaviour is cleaner and nicer.

attaching the new patch.


> TikaCLI: invalid characters in embedded document name causes FNFE when trying 
> to save
> -
>
> Key: TIKA-1078
> URL: https://issues.apache.org/jira/browse/TIKA-1078
> Project: Tika
>  Issue Type: Bug
>  Components: cli, parser
>Reporter: Michael McCandless
> Fix For: 1.5
>
> Attachments: T-DS_Excel2003-PPT2003_1.xls, tika-1078-2.patch, 
> tika-1078.patch
>
>
> Attached document hits this on Windows:
> {noformat}
> C:\>java.exe -jar tika-app-1.3.jar -z -x 
> c:\data\idit\T-DS_Excel2003-PPT2003_1.xls
> Extracting 'file0.png' (image/png) to .\file0.png
> Extracting 'file1.emf' (application/x-emf) to .\file1.emf
> Extracting 'file2.jpg' (image/jpeg) to .\file2.jpg
> Extracting 'file3.emf' (application/x-emf) to .\file3.emf
> Extracting 'file4.wmf' (application/x-msmetafile) to .\file4.wmf
> Extracting 'MBD0016BDE4/?£☺.bin' (application/octet-stream) to 
> .\MBD0016BDE4\?£☺.bin
> Exception in thread "main" org.apache.tika.exception.TikaException: TIKA-198: 
> Illegal IOException from 
> org.apache.tika.parser.microsoft.OfficeParser@75f875f8
> at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:248)
> at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
> at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
> at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:139)
> at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:415)
> at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:109)
> Caused by: java.io.FileNotFoundException: .\MBD0016BDE4\?£☺.bin (The 
> filename, directory name, or volume label syntax is incorrect.)
> at java.io.FileOutputStream.(FileOutputStream.java:205)
> at java.io.FileOutputStream.(FileOutputStream.java:156)
> at 
> org.apache.tika.cli.TikaCLI$FileEmbeddedDocumentExtractor.parseEmbedded(TikaCLI.java:722)
> at 
> org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedOfficeDoc(AbstractPOIFSExtractor.java:201)
> at 
> org.apache.tika.parser.microsoft.ExcelExtractor.parse(ExcelExtractor.java:158)
> at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:194)
> at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:161)
> at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
> ... 5 more
> {noformat}
> TikaCLI manages to create the sub-directory, but because the embedded 
> fileName has invalid (for Windows) characters, it fails.
> On Linux it runs fine.
> I think somehow ... we have to sanitize the embedded file name ...



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (TIKA-1078) TikaCLI: invalid characters in embedded document name causes FNFE when trying to save

2014-01-13 Thread Stefano Fornari (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stefano Fornari updated TIKA-1078:
--

Attachment: tika-1078-2.patch

> TikaCLI: invalid characters in embedded document name causes FNFE when trying 
> to save
> -
>
> Key: TIKA-1078
> URL: https://issues.apache.org/jira/browse/TIKA-1078
> Project: Tika
>  Issue Type: Bug
>  Components: cli, parser
>Reporter: Michael McCandless
> Fix For: 1.5
>
> Attachments: T-DS_Excel2003-PPT2003_1.xls, tika-1078-2.patch, 
> tika-1078.patch
>
>
> Attached document hits this on Windows:
> {noformat}
> C:\>java.exe -jar tika-app-1.3.jar -z -x 
> c:\data\idit\T-DS_Excel2003-PPT2003_1.xls
> Extracting 'file0.png' (image/png) to .\file0.png
> Extracting 'file1.emf' (application/x-emf) to .\file1.emf
> Extracting 'file2.jpg' (image/jpeg) to .\file2.jpg
> Extracting 'file3.emf' (application/x-emf) to .\file3.emf
> Extracting 'file4.wmf' (application/x-msmetafile) to .\file4.wmf
> Extracting 'MBD0016BDE4/?£☺.bin' (application/octet-stream) to 
> .\MBD0016BDE4\?£☺.bin
> Exception in thread "main" org.apache.tika.exception.TikaException: TIKA-198: 
> Illegal IOException from 
> org.apache.tika.parser.microsoft.OfficeParser@75f875f8
> at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:248)
> at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
> at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
> at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:139)
> at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:415)
> at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:109)
> Caused by: java.io.FileNotFoundException: .\MBD0016BDE4\?£☺.bin (The 
> filename, directory name, or volume label syntax is incorrect.)
> at java.io.FileOutputStream.(FileOutputStream.java:205)
> at java.io.FileOutputStream.(FileOutputStream.java:156)
> at 
> org.apache.tika.cli.TikaCLI$FileEmbeddedDocumentExtractor.parseEmbedded(TikaCLI.java:722)
> at 
> org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedOfficeDoc(AbstractPOIFSExtractor.java:201)
> at 
> org.apache.tika.parser.microsoft.ExcelExtractor.parse(ExcelExtractor.java:158)
> at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:194)
> at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:161)
> at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
> ... 5 more
> {noformat}
> TikaCLI manages to create the sub-directory, but because the embedded 
> fileName has invalid (for Windows) characters, it fails.
> On Linux it runs fine.
> I think somehow ... we have to sanitize the embedded file name ...



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


Re: TIKA-1078

2014-01-12 Thread Stefano Fornari
On Sun, Jan 12, 2014 at 3:14 PM, Stefano Fornari
wrote:

> Hi All,
> attached the patch. See https://issues.apache.org/jira/browse/TIKA-1078 for
> some more details.
> Indeed with this I intend to release the right to use the code for any
> purpose.
>
> Let me know if it is ok, or anything can be improved.
> Regards,
>
> Ste
>
>
> On Sun, Jan 12, 2014 at 11:07 AM, Stefano Fornari <
> stefano.forn...@gmail.com> wrote:
>
>> Hi All,
>>
>> I'd like to fix this one as a way to get familiar with tika.
>> I have a couple of questions:
>>
>> 1. As far as I understand it (and based on the tests I have done) the
>> problem here is with special characters not allowed in file names by the
>> different file systems, not to special (i.e. not ASCII or UTF8) characters.
>> can anyone confirm?
>> 2. Is there any general policy in tika development I should follow wrt
>> java version? shall I stick to a particular version of java, or can I go
>> with Java 7?
>>
>>
>> --
>> Ste
>>
>
>
>


-- 
Ste
Index: tika-app/src/main/java/org/apache/tika/cli/TikaCLI.java
===
--- tika-app/src/main/java/org/apache/tika/cli/TikaCLI.java	(revision 1557531)
+++ tika-app/src/main/java/org/apache/tika/cli/TikaCLI.java	(working copy)
@@ -91,6 +91,7 @@
 import org.xml.sax.SAXException;
 import org.xml.sax.helpers.DefaultHandler;
 import com.google.gson.Gson;
+import org.apache.tika.io.FilenameUtils;
 
 /**
  * Simple command line interface for Apache Tika.
@@ -712,11 +713,10 @@
             name = relID + "_" + name;
         }
 
-        File outputFile = new File(extractDir, name);
-        File parent = outputFile.getParentFile();
-        if (!parent.exists()) {
-            if (!parent.mkdirs()) {
-                throw new IOException("unable to create directory \"" + parent + "\"");
+        File outputFile = new File(extractDir, FilenameUtils.normalize(name));
+        if (!extractDir.exists()) {
+            if (!extractDir.mkdirs()) {
+                throw new IOException("unable to create directory \"" + extractDir + "\"");
             }
         }
         System.out.println("Extracting '"+name+"' ("+contentType+") to " + outputFile);
@@ -740,7 +740,16 @@
                 IOUtils.copy(inputStream, os);
             }
         } catch (Exception e) {
-            logger.warn("Ignoring unexpected exception trying to save embedded file " + name, e);
+            //
+            // being a CLI program, messages should go to stderr too
+            //
+            String msg = String.format(
+                    "Ignoring unexpected exception trying to save embedded file %s (%s)",
+                    name,
+                    e.getMessage()
+            );
+            System.err.println(msg);
+            logger.warn(msg, e);
         } finally {
             if (os != null) {
                 os.close();
Index: tika-core/src/main/java/org/apache/tika/io/FilenameUtils.java
===
--- tika-core/src/main/java/org/apache/tika/io/FilenameUtils.java	(revision 0)
+++ tika-core/src/main/java/org/apache/tika/io/FilenameUtils.java	(working copy)
@@ -0,0 +1,68 @@
+/*
+ * Copyright 2014 The Apache Software Foundation.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.tika.io;
+
+/**
+ *
+ * @author ste
+ */
+public class FilenameUtils {
+
+    /**
+     * Reserved characters
+     */
+    public final static char[] RESERVED_FILENAME_CHARACTERS = {
+        0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07,
+        0x08, 0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0x0E, 0x0F,
+        0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17,
+        0x18, 0x19, 0x1A, 0x1B, 0x1C, 0x1D, 0x1E, 0x1F,
+        '?', '/', '\\', ':', '*', '<', '>', '|'
+    };
+
+    private final static String RESERVED = new String(RESERVED_FILENAME_CHARACTERS, 32, 8);
+
+    /**
+     * Scans the given

Re: TIKA-1078

2014-01-12 Thread Stefano Fornari
Hi All,
attached the patch. See https://issues.apache.org/jira/browse/TIKA-1078 for
some more details.
Indeed with this I intend to release the right to use the code for any
purpose.

Let me know if it is ok, or anything can be improved.
Regards,

Ste

On Sun, Jan 12, 2014 at 11:07 AM, Stefano Fornari  wrote:

> Hi All,
>
> I'd like to fix this one as a way to get familiar with tika.
> I have a couple of questions:
>
> 1. As far as I understand it (and based on the tests I have done) the
> problem here is with special characters not allowed in file names by the
> different file systems, not to special (i.e. not ASCII or UTF8) characters.
> can anyone confirm?
> 2. Is there any general policy in tika development I should follow wrt
> java version? shall I stick to a particular version of java, or can I go
> with Java 7?
>
>
> --
> Ste
>


[jira] [Commented] (TIKA-1078) TikaCLI: invalid characters in embedded document name causes FNFE when trying to save

2014-01-12 Thread Stefano Fornari (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13869053#comment-13869053
 ] 

Stefano Fornari commented on TIKA-1078:
---

I have the patch ready. I cannot find a way to attach it here, so I am posting 
it to the dev list. I followed a more conservative approach: characters that 
may be reserved by an operating system or file system are turned into a hex 
code. This is transparent to all platforms and the behaviour will be the same 
everywhere.
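The hex-escaping approach described here can be sketched in plain Java. The class and method names below are illustrative, not the ones in the attached patch; the reserved set mirrors the one the patch declares:

```java
public class FilenameSanitizer {
    // Characters reserved by common OS/file systems; control characters
    // 0x00-0x1F are caught by the range check below.
    private static final String RESERVED = "?/\\:*<>|";

    // Replace every reserved or control character with %XX so the result
    // is a legal file name on every platform and behaves the same everywhere.
    static String sanitize(String name) {
        StringBuilder out = new StringBuilder(name.length());
        for (char c : name.toCharArray()) {
            if (c < 0x20 || RESERVED.indexOf(c) >= 0) {
                out.append(String.format("%%%02X", (int) c));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Only file-system-reserved characters are escaped; non-ASCII
        // characters like £ and ☺ pass through untouched.
        System.out.println(sanitize("MBD0016BDE4/?£☺.bin"));
        // prints MBD0016BDE4%2F%3F£☺.bin
    }
}
```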

> TikaCLI: invalid characters in embedded document name causes FNFE when trying 
> to save
> -
>
> Key: TIKA-1078
> URL: https://issues.apache.org/jira/browse/TIKA-1078
> Project: Tika
>  Issue Type: Bug
>  Components: cli, parser
>Reporter: Michael McCandless
> Fix For: 1.5
>
> Attachments: T-DS_Excel2003-PPT2003_1.xls
>
>
> Attached document hits this on Windows:
> {noformat}
> C:\>java.exe -jar tika-app-1.3.jar -z -x 
> c:\data\idit\T-DS_Excel2003-PPT2003_1.xls
> Extracting 'file0.png' (image/png) to .\file0.png
> Extracting 'file1.emf' (application/x-emf) to .\file1.emf
> Extracting 'file2.jpg' (image/jpeg) to .\file2.jpg
> Extracting 'file3.emf' (application/x-emf) to .\file3.emf
> Extracting 'file4.wmf' (application/x-msmetafile) to .\file4.wmf
> Extracting 'MBD0016BDE4/?£☺.bin' (application/octet-stream) to 
> .\MBD0016BDE4\?£☺.bin
> Exception in thread "main" org.apache.tika.exception.TikaException: TIKA-198: 
> Illegal IOException from 
> org.apache.tika.parser.microsoft.OfficeParser@75f875f8
> at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:248)
> at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
> at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
> at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:139)
> at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:415)
> at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:109)
> Caused by: java.io.FileNotFoundException: .\MBD0016BDE4\?£☺.bin (The 
> filename, directory name, or volume label syntax is incorrect.)
> at java.io.FileOutputStream.(FileOutputStream.java:205)
> at java.io.FileOutputStream.(FileOutputStream.java:156)
> at 
> org.apache.tika.cli.TikaCLI$FileEmbeddedDocumentExtractor.parseEmbedded(TikaCLI.java:722)
> at 
> org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedOfficeDoc(AbstractPOIFSExtractor.java:201)
> at 
> org.apache.tika.parser.microsoft.ExcelExtractor.parse(ExcelExtractor.java:158)
> at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:194)
> at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:161)
> at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
> ... 5 more
> {noformat}
> TikaCLI manages to create the sub-directory, but because the embedded 
> fileName has invalid (for Windows) characters, it fails.
> On Linux it runs fine.
> I think somehow ... we have to sanitize the embedded file name ...



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


TIKA-1078

2014-01-12 Thread Stefano Fornari
Hi All,

I'd like to fix this one as a way to get familiar with tika.
I have a couple of questions:

1. As far as I understand it (and based on the tests I have done), the
problem here is with special characters not allowed in file names by the
different file systems, not with non-ASCII/non-UTF-8 characters.
Can anyone confirm?
2. Is there any general policy in Tika development I should follow wrt Java
version? Shall I stick to a particular version of Java, or can I go with
Java 7?


-- 
Ste


[jira] [Commented] (TIKA-1078) TikaCLI: invalid characters in embedded document name causes FNFE when trying to save

2014-01-12 Thread Stefano Fornari (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13868987#comment-13868987
 ] 

Stefano Fornari commented on TIKA-1078:
---

I'd like to fix this one as a way to get familiar with tika.
I have a couple of questions:

1. As far as I understand it (and based on the tests I have done) the problem 
here is with special characters not allowed in file names by the different file 
systems, not to special (i.e. not ASCII or UTF8) characters. can anyone confirm?
2. Is there any general policy in tika development I should follow wrt java 
version? shall I stick to a particular version of java, or can I go with Java 7?



> TikaCLI: invalid characters in embedded document name causes FNFE when trying 
> to save
> -
>
> Key: TIKA-1078
> URL: https://issues.apache.org/jira/browse/TIKA-1078
> Project: Tika
>  Issue Type: Bug
>  Components: cli, parser
>Reporter: Michael McCandless
> Fix For: 1.5
>
> Attachments: T-DS_Excel2003-PPT2003_1.xls
>
>
> Attached document hits this on Windows:
> {noformat}
> C:\>java.exe -jar tika-app-1.3.jar -z -x 
> c:\data\idit\T-DS_Excel2003-PPT2003_1.xls
> Extracting 'file0.png' (image/png) to .\file0.png
> Extracting 'file1.emf' (application/x-emf) to .\file1.emf
> Extracting 'file2.jpg' (image/jpeg) to .\file2.jpg
> Extracting 'file3.emf' (application/x-emf) to .\file3.emf
> Extracting 'file4.wmf' (application/x-msmetafile) to .\file4.wmf
> Extracting 'MBD0016BDE4/?£☺.bin' (application/octet-stream) to 
> .\MBD0016BDE4\?£☺.bin
> Exception in thread "main" org.apache.tika.exception.TikaException: TIKA-198: 
> Illegal IOException from 
> org.apache.tika.parser.microsoft.OfficeParser@75f875f8
> at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:248)
> at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
> at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
> at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:139)
> at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:415)
> at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:109)
> Caused by: java.io.FileNotFoundException: .\MBD0016BDE4\?£☺.bin (The 
> filename, directory name, or volume label syntax is incorrect.)
> at java.io.FileOutputStream.(FileOutputStream.java:205)
> at java.io.FileOutputStream.(FileOutputStream.java:156)
> at 
> org.apache.tika.cli.TikaCLI$FileEmbeddedDocumentExtractor.parseEmbedded(TikaCLI.java:722)
> at 
> org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedOfficeDoc(AbstractPOIFSExtractor.java:201)
> at 
> org.apache.tika.parser.microsoft.ExcelExtractor.parse(ExcelExtractor.java:158)
> at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:194)
> at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:161)
> at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
> ... 5 more
> {noformat}
> TikaCLI manages to create the sub-directory, but because the embedded 
> fileName has invalid (for Windows) characters, it fails.
> On Linux it runs fine.
> I think somehow ... we have to sanitize the embedded file name ...
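A minimal sketch of the kind of sanitization suggested above, assuming a simple replace-illegal-characters approach (the class name, the replacement character `_`, and the exact character set are illustrative assumptions, not the fix that was actually committed to Tika):

```java
import java.util.regex.Pattern;

public class FileNameSanitizer {

    // Characters rejected in Windows file names, plus ASCII control characters.
    // On Linux only '/' and NUL are illegal, which is why the bug shows up
    // on Windows but not on Linux.
    private static final Pattern ILLEGAL =
            Pattern.compile("[\\\\/:*?\"<>|\\x00-\\x1f]");

    /** Replaces characters that are invalid in Windows file names with '_'. */
    public static String sanitize(String name) {
        return ILLEGAL.matcher(name).replaceAll("_");
    }

    public static void main(String[] args) {
        // '?' is one of the characters Windows rejects in the report above
        System.out.println(sanitize("?£☺.bin")); // prints "_£☺.bin"
    }
}
```

With this in place, the extractor would write to `.\MBD0016BDE4\_£☺.bin` instead of failing with a FileNotFoundException.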



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


Re: Help on 1.4/1.5

2013-12-29 Thread Stefano Fornari
Hi Dave,
thanks for your reply, please see my comments inline.

On Sun, Dec 29, 2013 at 10:48 AM, David Meikle  wrote:

>
> In terms of the 1.5 release, it is down to the community in that we need
> to take a wee vote on whether we are ready for one and agree if there is
> anything else that needs to be fixed or included in it.  There are a lot of issues
> marked as resolved but also 22 open[3], so there may be something you think
> you can contribute to in that list by means of a patch.
>

Chris was talking about spinning one up once he had a few free cycles but
> to kick the ball rolling I will start by putting out an email on what to
> include.
>
Ok, sounds good. I may take TIKA-1078; maybe it is not the most
interesting one, but since I am not familiar with Tika hacking, it could be
a good starting point.

> Alternatively, I would backport the fix to 1.4 so that we could release a
> > 1.4.1 quickly. What do you think?
>
> With a release for 1.5 potentially just around the corner, my opinion
> would be that I think it would be better to focus on addressing anything
> that blocks releasing that instead of creating a back-port and then going
> through the release process for 1.4.1.
>
I tend to agree, but IMHO this really depends on when 1.5 is foreseeable.
If it will still take some while, or the date is still undefined, it may make sense
to release an update to 1.4. As it stands, 1.4 is out with a quite remarkable bug,
and the only fix at the moment is to build a new 1.5-SNAPSHOT. What do the
others think?

Thanks again,
-- 
Ste


[jira] [Comment Edited] (TIKA-1214) Infinity Loop in Mpeg Stream

2013-12-29 Thread Stefano Fornari (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13858299#comment-13858299
 ] 

Stefano Fornari edited comment on TIKA-1214 at 12/29/13 10:18 AM:
--

This looks like issue TIKA-1179 doesn't it? If so, it is fixed in 1.5-SNAPSHOT


was (Author: stefanofornari):
This looks like issue #1179 doesn't it? If so, it is fixed in 1.5-SNAPSHOT

> Infinity Loop in Mpeg Stream
> 
>
> Key: TIKA-1214
> URL: https://issues.apache.org/jira/browse/TIKA-1214
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.4
> Environment: local system
>Reporter: Georg Hartmann
> Fix For: 1.5
>
>
> Scanning MP3 files encounters an infinite loop in the MpegStream method 
> skipStream.
> The call to in.skip returns zero, so the loop never ends.
> A simple fix with a zero count is below:
> private static void skipStream(InputStream in, long count) throws IOException {
>     long size = count;
>     long skipped = 0;
>     // five zero-length skips in a row means an error; break the loop
>     int zeroCount = 5;
>     while (size > 0 && skipped >= 0) {
>         skipped = in.skip(size);
>         if (skipped != -1) {
>             size -= skipped;
>         }
>         // checking for zero to break the infinite loop
>         if (skipped == 0) {
>             zeroCount--;
>         }
>         if (zeroCount < 0) {
>             break;
>         }
>     }
> }
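The guard above can be exercised in a self-contained way. The sketch below is illustrative only: the helper name skipFully, its return value, and the reset-of-the-counter-on-progress behavior are assumptions, not the patch that went into Tika. It shows why the guard is needed: InputStream.skip may legitimately return 0 (for example at end of stream), so a loop that only checks for -1 never terminates.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class SkipGuardDemo {

    /**
     * Skips up to count bytes, giving up after maxZeroSkips consecutive
     * skip() calls that make no progress. Returns the bytes actually skipped.
     */
    static long skipFully(InputStream in, long count, int maxZeroSkips)
            throws IOException {
        long remaining = count;
        int zeroCount = 0;
        while (remaining > 0) {
            long skipped = in.skip(remaining);
            if (skipped > 0) {
                remaining -= skipped;
                zeroCount = 0;          // progress was made, reset the counter
            } else if (++zeroCount >= maxZeroSkips) {
                break;                  // give up instead of looping forever
            }
        }
        return count - remaining;
    }

    public static void main(String[] args) throws IOException {
        // A 10-byte stream: asking to skip 20 bytes must still terminate,
        // because skip() returns 0 once the stream is exhausted.
        InputStream in = new ByteArrayInputStream(new byte[10]);
        System.out.println(skipFully(in, 20, 5)); // prints 10
    }
}
```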



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (TIKA-1214) Infinity Loop in Mpeg Stream

2013-12-29 Thread Stefano Fornari (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13858299#comment-13858299
 ] 

Stefano Fornari commented on TIKA-1214:
---

This looks like issue #1179 doesn't it? If so, it is fixed in 1.5-SNAPSHOT

> Infinity Loop in Mpeg Stream
> 
>
> Key: TIKA-1214
> URL: https://issues.apache.org/jira/browse/TIKA-1214
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.4
> Environment: local system
>Reporter: Georg Hartmann
> Fix For: 1.5
>
>
> Scanning MP3 files encounters an infinite loop in the MpegStream method 
> skipStream.
> The call to in.skip returns zero, so the loop never ends.
> A simple fix with a zero count is below:
> private static void skipStream(InputStream in, long count) throws IOException {
>     long size = count;
>     long skipped = 0;
>     // five zero-length skips in a row means an error; break the loop
>     int zeroCount = 5;
>     while (size > 0 && skipped >= 0) {
>         skipped = in.skip(size);
>         if (skipped != -1) {
>             size -= skipped;
>         }
>         // checking for zero to break the infinite loop
>         if (skipped == 0) {
>             zeroCount--;
>         }
>         if (zeroCount < 0) {
>             break;
>         }
>     }
> }



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


Help on 1.4/1.5

2013-12-28 Thread Stefano Fornari
Dear dev,
in issue TIKA-1179 I was advised to contribute to Tika 1.5 to speed up
the release of the fix. I plan to use Tika and I'll be happy to contribute
something back. Is there anything simple I can start with? What does the
contribution process look like?

Alternatively, I would backport the fix to 1.4 so that we could release a
1.4.1 quickly. What do you think?

Thanks in advance,

-- 
Ste


[jira] [Commented] (TIKA-1179) A corrupt mp3 file can cause an infinite loop in Mp3Parser

2013-11-24 Thread Stefano Fornari (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13831063#comment-13831063
 ] 

Stefano Fornari commented on TIKA-1179:
---

I have run into this issue too with valid music MP3 files. I second Marius's question 
about the availability of 1.5 or a quick fix for 1.4.

> A corrupt mp3 file can cause an infinite loop in Mp3Parser
> --
>
> Key: TIKA-1179
> URL: https://issues.apache.org/jira/browse/TIKA-1179
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.4
>Reporter: Marius Dumitru Florea
>Assignee: Ray Gauss II
> Fix For: 1.5
>
> Attachments: corrupt.mp3
>
>
> I have a thread that indexes (among other things) files using Apache Sorl. 
> This thread hangs (still running but with no progress) when trying to extract 
> meta data from the mp3 file attached to this issue. Here are a couple of 
> thread dumps taken at various moments:
> {noformat}
> "XWiki Solr index thread" daemon prio=10 tid=0x03b72800 nid=0x64b5 
> runnable [0x7f46f4617000]
>java.lang.Thread.State: RUNNABLE
>   at 
> org.apache.commons.io.input.AutoCloseInputStream.close(AutoCloseInputStream.java:63)
>   at 
> org.apache.commons.io.input.AutoCloseInputStream.afterRead(AutoCloseInputStream.java:77)
>   at 
> org.apache.commons.io.input.ProxyInputStream.read(ProxyInputStream.java:99)
>   at java.io.BufferedInputStream.fill(Unknown Source)
>   at java.io.BufferedInputStream.read1(Unknown Source)
>   at java.io.BufferedInputStream.read(Unknown Source)
>   - locked <0xcb7094e8> (a java.io.BufferedInputStream)
>   at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
>   at java.io.FilterInputStream.read(Unknown Source)
>   at org.apache.tika.io.TailStream.read(TailStream.java:117)
>   at org.apache.tika.io.TailStream.skip(TailStream.java:140)
>   at org.apache.tika.parser.mp3.MpegStream.skipStream(MpegStream.java:283)
>   at org.apache.tika.parser.mp3.MpegStream.skipFrame(MpegStream.java:160)
>   at 
> org.apache.tika.parser.mp3.Mp3Parser.getAllTagHandlers(Mp3Parser.java:193)
>   at org.apache.tika.parser.mp3.Mp3Parser.parse(Mp3Parser.java:71)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>   at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>   at org.apache.tika.Tika.parseToString(Tika.java:380)
>   ...
> {noformat}
> {noformat}
> "XWiki Solr index thread" daemon prio=10 tid=0x03b72800 nid=0x64b5 
> runnable [0x7f46f4618000]
>java.lang.Thread.State: RUNNABLE
>   at org.apache.tika.io.TailStream.skip(TailStream.java:133)
>   at org.apache.tika.parser.mp3.MpegStream.skipStream(MpegStream.java:283)
>   at org.apache.tika.parser.mp3.MpegStream.skipFrame(MpegStream.java:160)
>   at 
> org.apache.tika.parser.mp3.Mp3Parser.getAllTagHandlers(Mp3Parser.java:193)
>   at org.apache.tika.parser.mp3.Mp3Parser.parse(Mp3Parser.java:71)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>   at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>   at org.apache.tika.Tika.parseToString(Tika.java:380)
>   ...
> {noformat}
> {noformat}
> "XWiki Solr index thread" daemon prio=10 tid=0x03b72800 nid=0x64b5 
> runnable [0x7f46f4617000]
>java.lang.Thread.State: RUNNABLE
>   at java.io.BufferedInputStream.read1(Unknown Source)
>   at java.io.BufferedInputStream.read(Unknown Source)
>   - locked <0xcb1be170> (a java.io.BufferedInputStream)
>   at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
>   at java.io.FilterInputStream.read(Unknown Source)
>   at org.apache.tika.io.TailStream.read(TailStream.java:117)
>   at org.apache.tika.io.TailStream.skip(TailStream.java:140)
>   at org.apache.tika.parser.mp3.MpegStream.skipStream(MpegStream.java:283)
>   at org.apache.tika.parser.mp3.MpegStream.skipFrame(MpegStream.java:160)
>   at 
> org.apache.tika.parser.mp3.Mp3Parser.getAllTagHandlers(Mp3Parser.java:193)
>   at org.apache.tika.parser.mp3.Mp3Parser.parse(Mp3Parser.java:71)
>   at 
> org.apache.tika.parser.CompositeParse