[jira] [Updated] (TIKA-1215) Regression: Unable to parse a mp3 file on 1.5 which parsed successfully on 1.4

2014-01-13 Thread Hong-Thai Nguyen (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen updated TIKA-1215:
---

Attachment: tika-1215-without-wildcard.patch

[~gagravarr], my code style differs from the Apache convention; apologies 
for that.
I have attached a new patch file containing only the changes.

Thanks


 Regression: Unable to parse a mp3 file on 1.5 which parsed successfully on 1.4
 --

 Key: TIKA-1215
 URL: https://issues.apache.org/jira/browse/TIKA-1215
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.5
Reporter: Hong-Thai Nguyen
Priority: Critical
 Attachments: Centres 080805@0650 RTBF Matin Première - A propos des 
 rues de Dublin et Dubreucq.mp3, TIKA-1215-fix-prefix-namespaces.patch, 
 tika-1215-without-wildcard.patch


 With the attached file, 1.5 raises this exception on parsing. The file parses 
 without problems on 1.4.
 {code}
 ...
 Caused by: org.xml.sax.SAXException: Namespace http://www.w3.org/1999/xhtml not declared
   at org.apache.tika.sax.ToXMLContentHandler$ElementInfo.getPrefix(ToXMLContentHandler.java:62)
   at org.apache.tika.sax.ToXMLContentHandler$ElementInfo.getQName(ToXMLContentHandler.java:68)
   at org.apache.tika.sax.ToXMLContentHandler.startElement(ToXMLContentHandler.java:148)
   at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
   at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
   at org.apache.tika.sax.xpath.MatchingContentHandler.startElement(MatchingContentHandler.java:60)
   at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
   at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
   at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
   at org.apache.tika.sax.SafeContentHandler.startElement(SafeContentHandler.java:264)
   at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:254)
   at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:284)
   at org.apache.tika.sax.XHTMLContentHandler.element(XHTMLContentHandler.java:323)
   at org.apache.tika.parser.mp3.Mp3Parser.parse(Mp3Parser.java:107)
   at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
   at com.polyspot.document.converter.DocumentConverter.realizeTikaConversion(DocumentConverter.java:221)
   ... 15 more
 {code}



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (TIKA-1215) Regression: Unable to parse a mp3 file on 1.5 which parsed successfully on 1.4

2014-01-13 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13869528#comment-13869528
 ] 

Tim Allison commented on TIKA-1215:
---

[~thaichat04] thank you for sending a clean patch. I am not very familiar with 
this area of the code base, but if I understand Tika's history and your code 
correctly, your if statement wasn't necessary in 1.4, and (based on a very 
quick look) nothing else in the relevant lines of the MP3 parser seems to have 
changed between 1.4 and trunk.  Are you able to determine what changed between 
1.4 and trunk that led to this regression?  Thank you!



[jira] [Closed] (TIKA-1216) parse method of Mp3Parser doesn't work for few mp3 files

2014-01-13 Thread Sumeet Gorab (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sumeet Gorab closed TIKA-1216.
--

Resolution: Fixed

 parse method of Mp3Parser doesn't work for few mp3 files
 

 Key: TIKA-1216
 URL: https://issues.apache.org/jira/browse/TIKA-1216
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.4
 Environment: Windows 7 ultimate 32-bit OS, Java 1.7
Reporter: Sumeet Gorab
Priority: Blocker
  Labels: patch
 Fix For: 1.5

 Attachments: 05 - Dharti - Sarkaaran [www.DJMaza.Com].mp3


 When trying to parse an MP3 file, the parse method of the Mp3Parser class is 
 not able to parse it. The parse method is unable to complete its execution; 
 there is some issue in that method.





[jira] [Commented] (TIKA-1215) Regression: Unable to parse a mp3 file on 1.5 which parsed successfully on 1.4

2014-01-13 Thread Hong-Thai Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13869590#comment-13869590
 ] 

Hong-Thai Nguyen commented on TIKA-1215:


[~talli...@apache.org], here's XML of input to parse:
{noformat}
<h1 xmlns="http://www.w3.org/1999/xhtml">Matin Première - Tour des régions 080806</h1>
<p>RTBF - La Première</p>
<p>Speech</p>
<p>101698.914</p>
<p>XXX - 
A propos du contrat de quartier rues Dublin/Dubreucq</p>
{noformat}

I think this regression came from TIKA-1070
{code}
currentElement = currentElement.parent;
{code}

The parent element of p is null, so getPrefix() raises the exception; that 
differs from 1.4.
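
The lookup described above can be pictured with a minimal sketch. It is not Tika's source: the class name ElementScope and its fields are hypothetical, and it only assumes that a prefix is resolved by walking up a chain of parent elements until one of them declares the namespace.

```java
import java.util.HashMap;
import java.util.Map;
import org.xml.sax.SAXException;

/** Hypothetical stand-in for an element on a handler's stack; not Tika's code. */
final class ElementScope {
    final ElementScope parent;                                // enclosing element, or null at the top
    final Map<String, String> namespaces = new HashMap<>();   // namespace URI -> prefix declared here

    ElementScope(ElementScope parent) {
        this.parent = parent;
    }

    /** Walks up the parent chain until some scope declares the namespace URI. */
    String getPrefix(String uri) throws SAXException {
        for (ElementScope scope = this; scope != null; scope = scope.parent) {
            String prefix = scope.namespaces.get(uri);
            if (prefix != null) {
                return prefix;
            }
        }
        // Nothing on the chain declared the namespace.
        throw new SAXException("Namespace " + uri + " not declared");
    }
}
```

In this sketch an element that declares the XHTML namespace itself resolves the prefix, while a later top-level element with a null parent and no declaration of its own ends in the SAXException, which matches the symptom reported here.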



[jira] [Updated] (TIKA-1218) Unable to parse a mp3 file on 1.5 getting a exception

2014-01-13 Thread Sumeet Gorab (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sumeet Gorab updated TIKA-1218:
---

Attachment: Save-the-World-Knife-Party-Remix.mp3

The attached file triggers the exception.

 Unable to parse a mp3 file on 1.5 getting a exception
 -

 Key: TIKA-1218
 URL: https://issues.apache.org/jira/browse/TIKA-1218
 Project: Tika
  Issue Type: Bug
Affects Versions: 1.5
 Environment: Win 7, Java 1.7
Reporter: Sumeet Gorab
Priority: Blocker
 Attachments: Save-the-World-Knife-Party-Remix.mp3


 Unable to parse an mp3 file on 1.5; getting the following exception:
 Exception in thread "main" java.lang.NegativeArraySizeException
   at org.apache.tika.parser.mp3.ID3v2Frame$RawTag.<init>(ID3v2Frame.java:417)
   at org.apache.tika.parser.mp3.ID3v2Frame$RawTag.<init>(ID3v2Frame.java:382)
   at org.apache.tika.parser.mp3.ID3v2Frame$RawTagIterator.next(ID3v2Frame.java:371)
   at org.apache.tika.parser.mp3.ID3v24Handler.<init>(ID3v24Handler.java:49)
   at org.apache.tika.parser.mp3.Mp3Parser.getAllTagHandlers(Mp3Parser.java:174)
   at org.apache.tika.parser.mp3.Mp3Parser.parse(Mp3Parser.java:71)





[jira] [Created] (TIKA-1218) Unable to parse a mp3 file on 1.5 getting a exception

2014-01-13 Thread Sumeet Gorab (JIRA)
Sumeet Gorab created TIKA-1218:
--

 Summary: Unable to parse a mp3 file on 1.5 getting a exception
 Key: TIKA-1218
 URL: https://issues.apache.org/jira/browse/TIKA-1218
 Project: Tika
  Issue Type: Bug
Affects Versions: 1.5
 Environment: Win 7, Java 1.7
Reporter: Sumeet Gorab
Priority: Blocker


Unable to parse an mp3 file on 1.5; getting the following exception:

Exception in thread "main" java.lang.NegativeArraySizeException
  at org.apache.tika.parser.mp3.ID3v2Frame$RawTag.<init>(ID3v2Frame.java:417)
  at org.apache.tika.parser.mp3.ID3v2Frame$RawTag.<init>(ID3v2Frame.java:382)
  at org.apache.tika.parser.mp3.ID3v2Frame$RawTagIterator.next(ID3v2Frame.java:371)
  at org.apache.tika.parser.mp3.ID3v24Handler.<init>(ID3v24Handler.java:49)
  at org.apache.tika.parser.mp3.Mp3Parser.getAllTagHandlers(Mp3Parser.java:174)
  at org.apache.tika.parser.mp3.Mp3Parser.parse(Mp3Parser.java:71)





[jira] [Commented] (TIKA-1218) Unable to parse a mp3 file on 1.5 getting a exception

2014-01-13 Thread Jukka Zitting (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13869662#comment-13869662
 ] 

Jukka Zitting commented on TIKA-1218:
-

Reproduced. It looks like the last frame that Tika can interpret is a TRAKTOR4 
PRIV frame at offset 114 with size of 335387 bytes. It could be that this frame 
is malformed (wrong size, etc.), or there might be a bug in the way Tika 
handles the frame. The tooling at http://dope.cz/code/ might be helpful in 
debugging this case.
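
Whichever of the two explanations holds, a reader can fail cleanly by validating a declared frame size before allocating a buffer for it. The sketch below is not Tika's code, and the thread does not establish that a mis-decoded size field is the actual cause. The class Id3FrameSize and its methods are hypothetical. It relies only on the ID3v2 convention that v2.4 frame sizes are 4-byte synchsafe integers (7 significant bits per byte), whereas v2.3 uses plain big-endian integers.

```java
/** Hypothetical helper for checking ID3v2 frame sizes; not part of Tika. */
final class Id3FrameSize {

    /** Decodes a 4-byte synchsafe integer (7 significant bits per byte), as used by ID3v2.4. */
    static int decodeSynchsafe(byte[] b, int offset) {
        return ((b[offset] & 0x7F) << 21)
             | ((b[offset + 1] & 0x7F) << 14)
             | ((b[offset + 2] & 0x7F) << 7)
             |  (b[offset + 3] & 0x7F);
    }

    /** Decodes a plain 4-byte big-endian integer, as used for frame sizes in ID3v2.3. */
    static int decodeBigEndian(byte[] b, int offset) {
        return ((b[offset] & 0xFF) << 24)
             | ((b[offset + 1] & 0xFF) << 16)
             | ((b[offset + 2] & 0xFF) << 8)
             |  (b[offset + 3] & 0xFF);
    }

    /** Returns true if a frame of the declared size fits in what is left of the tag. */
    static boolean fitsInTag(int declaredSize, int remainingTagBytes) {
        return declaredSize >= 0 && declaredSize <= remainingTagBytes;
    }
}
```

A caller would check fitsInTag before executing new byte[declaredSize]. A size field whose top bit is set decodes to a negative int under the big-endian reading, which is one way a NegativeArraySizeException can arise.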



[jira] [Resolved] (TIKA-1215) Regression: Unable to parse a mp3 file on 1.5 which parsed successfully on 1.4

2014-01-13 Thread Jukka Zitting (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jukka Zitting resolved TIKA-1215.
-

Resolution: Not A Problem

You're misusing the {{ToHTMLContentHandler}} class:

{code}
ToHTMLContentHandler toHtmlContentHandler =
    new ToHTMLContentHandler(outputStream, "UTF-8");
WriteOutContentHandler handler =
    new WriteOutContentHandler(toHtmlContentHandler, (int) 400);
ContentHandler bodyHandler = new BodyContentHandler(handler);
{code}

The {{ToHTMLContentHandler}} javadoc says:

bq. The incoming SAX events are expected to be well-formed (properly nested, 
etc.) and valid HTML.

This is not true since you're using the {{BodyContentHandler}} to strip out 
anything outside the {{body}} element.

Thus resolving as Not A Problem. If you want to format the parse output as 
HTML, you should pass the {{ToHTMLContentHandler}} directly to the parser, 
without the {{BodyContentHandler}} wrapper.
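
To illustrate why a body-only filter leaves the downstream handler without a complete document, here is a small sketch that uses only the JDK's SAX API. It is not how BodyContentHandler is implemented. The class BodyOnlyDemo and its method are hypothetical, and the filter simply drops every event that is not strictly inside the body element.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical demo of a body-only event filter; not Tika's implementation. */
final class BodyOnlyDemo {

    /** Returns the local names of the elements a downstream handler would still see. */
    static List<String> elementsInsideBody(String xhtml) throws Exception {
        final List<String> forwarded = new ArrayList<>();
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.newSAXParser().parse(new InputSource(new StringReader(xhtml)), new DefaultHandler() {
            private int bodyDepth = 0; // 0 = outside body, 1 = body itself, >1 = inside it

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                if (bodyDepth > 0) {
                    bodyDepth++;
                    forwarded.add(local);   // only these events reach the downstream handler
                } else if ("body".equals(local)) {
                    bodyDepth = 1;          // body itself is swallowed
                }
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                if (bodyDepth > 0) {
                    bodyDepth--;
                }
            }
        });
        return forwarded;
    }
}
```

For a document of the form html/head/body with an h1 and a p in the body, only h1 and p are forwarded. The html and body start events, where the root element and its namespace declaration live, never arrive, so a serializer that expects well-formed, valid HTML receives a fragment instead.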



[jira] [Resolved] (TIKA-1214) Infinity Loop in Mpeg Stream

2014-01-13 Thread Jukka Zitting (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jukka Zitting resolved TIKA-1214.
-

   Resolution: Duplicate
Fix Version/s: (was: 1.5)

Resolving as duplicate of TIKA-1179. Please reopen, preferably with a test 
case/document, if this problem still occurs.

 Infinity Loop in Mpeg Stream
 

 Key: TIKA-1214
 URL: https://issues.apache.org/jira/browse/TIKA-1214
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.4
 Environment: local system
Reporter: Georg Hartmann

 Scanning MP3 files encounters an infinite loop in the MpegStream method 
 skipStream.
 The call to in.skip returns zero, so the loop never ends.
 A simple fix with a zero count is below:
 private static void skipStream(InputStream in, long count) throws IOException {
     long size = count;
     long skipped = 0;
     // Repeated zero-length skips are treated as an error and break the loop
     int zeroCount = 5;
     while (size > 0 && skipped >= 0) {
         skipped = in.skip(size);
         if (skipped != -1) {
             size -= skipped;
         }

         // Check for zero to break the infinite loop
         if (skipped == 0) {
             zeroCount--;
         }
         if (zeroCount < 0) {
             break;
         }
     }
 }
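
 An alternative to counting zero-length skips is to fall back to read() whenever skip() makes no progress, since read() either consumes a byte or reports the end of the stream. The following is a sketch of that general idiom under those assumptions; it is not the fix applied in TIKA-1179, and the class name StreamSkipper is hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical helper showing the skip-then-read idiom; not Tika's code. */
final class StreamSkipper {

    /** Skips up to count bytes; returns the number actually skipped (fewer only at end of stream). */
    static long skipFully(InputStream in, long count) throws IOException {
        long remaining = count;
        while (remaining > 0) {
            long skipped = in.skip(remaining);
            if (skipped > 0) {
                remaining -= skipped;
            } else if (in.read() == -1) {
                break;          // end of stream: nothing more to skip
            } else {
                remaining--;    // skip() made no progress, but one byte was consumed
            }
        }
        return count - remaining;
    }
}
```

 Every iteration either consumes at least one byte or terminates, so the loop cannot spin on a stream whose skip() keeps returning zero.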





[jira] [Commented] (TIKA-1217) Integrate with Java-7 FileTypeDetector API

2014-01-13 Thread Jukka Zitting (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13869780#comment-13869780
 ] 

Jukka Zitting commented on TIKA-1217:
-

Thanks! I committed the patch in revision 1557795.

Before we resolve this as fixed, I think it would be better to use 
{{detect(File)}} directly, without first trying type detection based on just 
the file name. Name-based type lookup is less accurate than file-based, and 
AFAICT there are few applications where file type lookup is 
performance-critical (and if it is, like in a directory browser, the results 
are often cached). And assuming the file-based lookup was conditional on the 
existence of the file, a client could still do name-only lookups by prepending 
a fictional directory name to the path passed to the {{probeContentType()}} 
call. WDYT?
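
A minimal sketch of the behaviour proposed above, written against the JDK's java.nio.file.spi.FileTypeDetector only. It is not the committed patch: the class ContentFirstDetector, its two magic-number checks and its extension table are hypothetical placeholders for Tika's real detection. Content is inspected whenever the path is an existing regular file, and the name is used otherwise.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.spi.FileTypeDetector;
import java.util.Locale;

/**
 * Hypothetical detector: content first when the file exists, name-only otherwise.
 * To be registered via META-INF/services the class would have to be public.
 */
final class ContentFirstDetector extends FileTypeDetector {

    @Override
    public String probeContentType(Path path) throws IOException {
        if (Files.isRegularFile(path)) {
            String byContent = detectByContent(path);
            if (byContent != null) {
                return byContent;
            }
        }
        return detectByName(path); // null means "not recognized"
    }

    /** Placeholder content check based on two well-known magic numbers. */
    private static String detectByContent(Path path) throws IOException {
        byte[] head = new byte[4];
        int n = 0;
        try (InputStream in = Files.newInputStream(path)) {
            int r;
            while (n < head.length && (r = in.read(head, n, head.length - n)) > 0) {
                n += r;
            }
        }
        if (n >= 4 && head[0] == '%' && head[1] == 'P' && head[2] == 'D' && head[3] == 'F') {
            return "application/pdf";
        }
        if (n >= 3 && head[0] == 'I' && head[1] == 'D' && head[2] == '3') {
            return "audio/mpeg";
        }
        return null;
    }

    /** Placeholder name check based on the file extension. */
    private static String detectByName(Path path) {
        Path fileName = path.getFileName();
        String name = fileName == null ? "" : fileName.toString().toLowerCase(Locale.ROOT);
        if (name.endsWith(".mp3")) {
            return "audio/mpeg";
        }
        if (name.endsWith(".pdf")) {
            return "application/pdf";
        }
        return null;
    }
}
```

With this shape, a client that wants a name-only lookup can pass a path under a directory that does not exist, as suggested above, because the content check is skipped when the file is absent.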

 Integrate with Java-7 FileTypeDetector API
 --

 Key: TIKA-1217
 URL: https://issues.apache.org/jira/browse/TIKA-1217
 Project: Tika
  Issue Type: New Feature
  Components: detector, mime
Reporter: Peter Ansell
 Attachments: TIKA-1217-v2.patch, TIKA-1217.patch


 It would be useful if Tika natively provided Java-7 FileTypeDetector [1] 
 implementations. Adding the corresponding 
 META-INF/services/java.nio.file.spi.FileTypeDetector files would allow the 
 use of Files.probeContentType [2] without any specific links to Tika for this 
 functionality.
 If you do not want to rely on Java-7 for the core, then this could be added 
 as an extension module.
 [1] 
 http://docs.oracle.com/javase/7/docs/api/java/nio/file/spi/FileTypeDetector.html
 [2] 
 http://docs.oracle.com/javase/7/docs/api/java/nio/file/Files.html#probeContentType(java.nio.file.Path)





[jira] [Commented] (TIKA-1217) Integrate with Java-7 FileTypeDetector API

2014-01-13 Thread Peter Ansell (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13870087#comment-13870087
 ] 

Peter Ansell commented on TIKA-1217:


[~jukkaz] The rationale for checking first on filename, in a Java-7 context, 
was that Path objects do not hold File Descriptors. Hence, a content type 
detection method taking a Path object may also be able to avoid getting a File 
Descriptor.

However, if there is an unacceptable loss in fidelity by checking first on the 
filename then feel free to remove that clause, as it isn't critical to the 
functionality for me.

There cannot, however, easily be two different implementations in the same 
module, as java.util.ServiceLoader isn't ordered, so it cannot prefer one over 
the other. In addition, there are no OpenOptions or LinkOptions attached to 
Files.probeContentType as there are with other methods such as 
Files.isRegularFile. That makes it difficult for users to pass in their 
preferences about how Files.probeContentType should operate (i.e., whether it 
should try to avoid getting a file descriptor if possible, or not follow 
symbolic links).

If we wanted to do a second implementation that always used File it would be 
perfectly possible, but it would need to go in a separate module to distinguish 
between the META-INF/services files based on which module is loaded. We would 
also have to rename the current module from tika-java7 to something more 
specific.

As you say, in a performance critical application, the results will be cached 
to avoid duplication, so it isn't a big deal in the greater scheme of things.

[~lewismc] You can find the patch that Jukka committed in the Tika trunk if you 
want to test it, but it isn't necessary to do it now if you have other things 
to do. 
https://github.com/apache/tika/commit/39370848b8bd9214dc4b7720539edc0eb595300c



[jira] [Updated] (TIKA-1078) TikaCLI: invalid characters in embedded document name causes FNFE when trying to save

2014-01-13 Thread Stefano Fornari (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stefano Fornari updated TIKA-1078:
--

Attachment: tika-1078-2.patch

 TikaCLI: invalid characters in embedded document name causes FNFE when trying 
 to save
 -

 Key: TIKA-1078
 URL: https://issues.apache.org/jira/browse/TIKA-1078
 Project: Tika
  Issue Type: Bug
  Components: cli, parser
Reporter: Michael McCandless
 Fix For: 1.5

 Attachments: T-DS_Excel2003-PPT2003_1.xls, tika-1078-2.patch, 
 tika-1078.patch


 Attached document hits this on Windows:
 {noformat}
 C:\java.exe -jar tika-app-1.3.jar -z -x c:\data\idit\T-DS_Excel2003-PPT2003_1.xls
 Extracting 'file0.png' (image/png) to .\file0.png
 Extracting 'file1.emf' (application/x-emf) to .\file1.emf
 Extracting 'file2.jpg' (image/jpeg) to .\file2.jpg
 Extracting 'file3.emf' (application/x-emf) to .\file3.emf
 Extracting 'file4.wmf' (application/x-msmetafile) to .\file4.wmf
 Extracting 'MBD0016BDE4/?£☺.bin' (application/octet-stream) to .\MBD0016BDE4\?£☺.bin
 Exception in thread "main" org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.microsoft.OfficeParser@75f875f8
     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:248)
     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
     at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:139)
     at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:415)
     at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:109)
 Caused by: java.io.FileNotFoundException: .\MBD0016BDE4\?£☺.bin (The filename, directory name, or volume label syntax is incorrect.)
     at java.io.FileOutputStream.<init>(FileOutputStream.java:205)
     at java.io.FileOutputStream.<init>(FileOutputStream.java:156)
     at org.apache.tika.cli.TikaCLI$FileEmbeddedDocumentExtractor.parseEmbedded(TikaCLI.java:722)
     at org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedOfficeDoc(AbstractPOIFSExtractor.java:201)
     at org.apache.tika.parser.microsoft.ExcelExtractor.parse(ExcelExtractor.java:158)
     at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:194)
     at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:161)
     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
     ... 5 more
 {noformat}
 TikaCLI manages to create the sub-directory, but because the embedded 
 fileName has invalid (for Windows) characters, it fails.
 On Linux it runs fine.
 I think somehow ... we have to sanitize the embedded file name ...
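
 One possible shape for such sanitizing is sketched below. It is not the attached patch: the class FileNameSanitizer and the choice of '_' as the replacement are assumptions made for illustration. It replaces the characters Windows rejects in a file name (the reserved set \ / : * ? " < > | and control characters below 0x20) within a single path segment.

```java
/** Hypothetical helper for cleaning one path segment; not the patch attached to this issue. */
final class FileNameSanitizer {

    // Characters Windows does not allow in a file name.
    private static final String RESERVED = "\\/:*?\"<>|";

    /** Replaces reserved and control characters with '_'; an empty name becomes "_". */
    static String sanitize(String name) {
        StringBuilder out = new StringBuilder(name.length());
        for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i);
            boolean illegal = c < 0x20 || RESERVED.indexOf(c) >= 0;
            out.append(illegal ? '_' : c);
        }
        return out.length() == 0 ? "_" : out.toString();
    }
}
```

 Because the separator characters are replaced as well, a caller that wants to keep a sub-directory such as MBD0016BDE4 would split the embedded name on '/' first and sanitize each segment separately. Legitimate non-ASCII characters such as £ are left untouched.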





[jira] [Commented] (TIKA-1078) TikaCLI: invalid characters in embedded document name causes FNFE when trying to save

2014-01-13 Thread Stefano Fornari (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13870125#comment-13870125
 ] 

Stefano Fornari commented on TIKA-1078:
---

Hi Michael,
thanks for the review. I took all your comments into account. As for the 
directory structure, I reverted my change now that I better understand the 
original behaviour. I think the original behaviour is cleaner and nicer.

Attaching the new patch.

