
[jira] [Commented] (TIKA-1600) Unable to parse ODT files because of failed to close temporary resources

2015-04-13 Thread Hong-Thai Nguyen (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14492084#comment-14492084
 ] 

Hong-Thai Nguyen commented on TIKA-1600:


The root exception is a NullPointerException (NPE) thrown when parsing ODT files that contain elements inside a footnote:
{code}
java.lang.NullPointerException
    at org.apache.tika.parser.odf.OpenDocumentContentParser$OpenDocumentElementMappingContentHandler.startSpan(OpenDocumentContentParser.java:174)
    at org.apache.tika.parser.odf.OpenDocumentContentParser$OpenDocumentElementMappingContentHandler.startElement(OpenDocumentContentParser.java:287)
    at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
    at org.apache.tika.parser.odf.NSNormalizerContentHandler.startElement(NSNormalizerContentHandler.java:69)
    at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
    at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.startElement(AbstractSAXParser.java:501)
    at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.scanStartElement(XMLNSDocumentScannerImpl.java:400)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2756)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:647)
    at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:140)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:511)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:808)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:737)
    at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:119)
    at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1205)
    at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:522)
    at javax.xml.parsers.SAXParser.parse(SAXParser.java:395)
    at javax.xml.parsers.SAXParser.parse(SAXParser.java:198)
    at org.apache.tika.parser.odf.OpenDocumentContentParser.parseInternal(OpenDocumentContentParser.java:503)
    at org.apache.tika.parser.odf.OpenDocumentParser.handleZipEntry(OpenDocumentParser.java:187)
    at org.apache.tika.parser.odf.OpenDocumentParser.parse(OpenDocumentParser.java:164)
    at org.apache.tika.parser.odf.OpenDocumentParserTest.can_parse_odt_file(OpenDocumentParserTest.java:41)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
    at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
    at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
    at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
    at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271)
    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70)
    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
    at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
    at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
    at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
    at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
    at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
    at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
    at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:50)
    at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:467)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:683)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:390)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:197)
{code}

It seems that style support for ODF was added recently, in 1.8:
{noformat}
Revision: 107
Author: tpalsulich
Date: Saturday, 14 March 2015 00:25:53

[jira] [Commented] (TIKA-1581) jhighlight license concerns

2015-03-30 Thread Hong-Thai Nguyen (JIRA)

[
https://issues.apache.org/jira/browse/TIKA-1581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14386900#comment-14386900
]

Hong-Thai Nguyen commented on TIKA-1581:

And many thanks to [~kkrugler] for all the investigation and effort that went
into pushing out the release of jhighlight 1.0.2.

jhighlight license concerns
---

Key: TIKA-1581
URL: https://issues.apache.org/jira/browse/TIKA-1581
Project: Tika
Issue Type: Bug
Affects Versions: 1.7
Reporter: Karl Wright
Fix For: 1.8

jhighlight jar is a Tika dependency. The Lucene team discovered that, while
it claims to be a CDDL/LGPL dual-license, some of its functionality is LGPL
only:
{code}
Solr's contrib/extraction contains jhighlight-1.0.jar which declares itself
as dual CDDL or LGPL license. However, some of its classes are distributed
only under LGPL, e.g.
com.uwyn.jhighlight.highlighter.
CppHighlighter.java
GroovyHighlighter.java
JavaHighlighter.java
XmlHighlighter.java
I downloaded the sources from Maven
(http://search.maven.org/remotecontent?filepath=com/uwyn/jhighlight/1.0/jhighlight-1.0-sources.jar)
to confirm that, and also found this SVN repo:
http://svn.rifers.org/jhighlight/tags/release-1.0, though the project's
website seems to not exist anymore (https://jhighlight.dev.java.net/).
I didn't find any direct usage of it in our code, so I guess it's probably
needed by a 3rd party dependency, such as Tika. Therefore if we e.g. omit it,
things will compile, but may fail at runtime.
{code}
Is it possible to remove this dependency for future releases, or allow only
optional inclusion of this package? It is of concern to the ManifoldCF
project because we distribute a binary package that includes Tika and its
required dependencies, which currently includes jHighlight.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Updated] (TIKA-1581) jhighlight license concerns

2015-03-27 Thread Hong-Thai Nguyen (JIRA)

[
https://issues.apache.org/jira/browse/TIKA-1581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Hong-Thai Nguyen updated TIKA-1581:
---
Fix Version/s: 1.8

jhighlight license concerns
---

Key: TIKA-1581
URL: https://issues.apache.org/jira/browse/TIKA-1581
Project: Tika
Issue Type: Bug
Affects Versions: 1.7
Reporter: Karl Wright
Fix For: 1.8


[jira] [Resolved] (TIKA-1581) jhighlight license concerns

2015-03-27 Thread Hong-Thai Nguyen (JIRA)

[
https://issues.apache.org/jira/browse/TIKA-1581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Hong-Thai Nguyen resolved TIKA-1581.

Resolution: Fixed

jhighlight license concerns
---

Key: TIKA-1581
URL: https://issues.apache.org/jira/browse/TIKA-1581
Project: Tika
Issue Type: Bug
Affects Versions: 1.7
Reporter: Karl Wright
Fix For: 1.8


[jira] [Commented] (TIKA-1581) jhighlight license concerns

2015-03-20 Thread Hong-Thai Nguyen (JIRA)

[
https://issues.apache.org/jira/browse/TIKA-1581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14371432#comment-14371432
]

Hong-Thai Nguyen commented on TIKA-1581:

I've also contacted 'gbe...@uwyn.com', which seems to be his email. Let's wait
a few days for his feedback.
Otherwise, we can create an 'unshipped' module to group all parsers and their
dependencies that are not under the Apache license.

jhighlight license concerns
---

Key: TIKA-1581
URL: https://issues.apache.org/jira/browse/TIKA-1581
Project: Tika
Issue Type: Bug
Affects Versions: 1.7
Reporter: Karl Wright


[jira] [Comment Edited] (TIKA-1581) jhighlight license concerns

2015-03-20 Thread Hong-Thai Nguyen (JIRA)

[
https://issues.apache.org/jira/browse/TIKA-1581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14371432#comment-14371432
]

Hong-Thai Nguyen edited comment on TIKA-1581 at 3/20/15 3:10 PM:
-

[~steve_rowe], the forked version you mentioned doesn't change anything about
the original license terms of JHighlight.

was (Author: thaichat04):
I've also contacted 'gbe...@uwyn.com', which seems to be his email. Let's wait
a few days for his feedback.
Otherwise, we can create an 'unshipped' module to group all parsers and their
dependencies that are not under the Apache license.

jhighlight license concerns
---

Key: TIKA-1581
URL: https://issues.apache.org/jira/browse/TIKA-1581
Project: Tika
Issue Type: Bug
Affects Versions: 1.7
Reporter: Karl Wright


[jira] [Comment Edited] (TIKA-1581) jhighlight license concerns

2015-03-20 Thread Hong-Thai Nguyen (JIRA)

[
https://issues.apache.org/jira/browse/TIKA-1581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14371432#comment-14371432
]

Hong-Thai Nguyen edited comment on TIKA-1581 at 3/20/15 3:36 PM:
-

[~steve_rowe], the forked version you mentioned doesn't change anything about
the original license terms of JHighlight.

jhighlight license concerns
---

Key: TIKA-1581
URL: https://issues.apache.org/jira/browse/TIKA-1581
Project: Tika
Issue Type: Bug
Affects Versions: 1.7
Reporter: Karl Wright


[jira] [Commented] (TIKA-1505) chmparser breaks down when extracting from file of CHM format v3

2015-01-05 Thread Hong-Thai Nguyen (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14264786#comment-14264786
 ] 

Hong-Thai Nguyen commented on TIKA-1505:


Can you also provide the problem files and tests?
Also, 1.7 is being released now; this issue is not really blocking, so we can
postpone it to 1.8.

 chmparser breaks down when extracting from file of CHM format v3
 

 Key: TIKA-1505
 URL: https://issues.apache.org/jira/browse/TIKA-1505
 Project: Tika
  Issue Type: Bug
Reporter: Bin Hawking
 Fix For: 1.7


 chmparser throws exception or returns faulty text when:
 1. extracting from file of CHM format version 3
 2. chm file with lzx reset interval  2
 3. chm file with 5000 objects
 I am making the fix now.




[jira] [Resolved] (TIKA-1447) CHM parser: wrong directory list

2014-11-24 Thread Hong-Thai Nguyen (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen resolved TIKA-1447.

Resolution: Fixed

 CHM parser: wrong directory list
 

 Key: TIKA-1447
 URL: https://issues.apache.org/jira/browse/TIKA-1447
 Project: Tika
  Issue Type: Bug
Affects Versions: 1.7
Reporter: Bin Hawking
Priority: Critical

 The CHM parser produces a wrong directory list for a CHM file (e.g. testChm2.chm in
 tika-parsers' test resources):
 1. Duplicate entries (mostly from PMGI chunks, which should have been
 ignored).
 2. Invalid entries (usually with an unreadable entry name).
 3. Missing entries (sometimes similar to TIKA-1176).
 I have fixed it (to some degree) by using the PMGL header to find the directory chunks
 and their respective meaningful parts.
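
The PMGL-based filtering described above could be sketched roughly as follows. This is only an illustration, not the actual Tika patch: the class and method names are hypothetical, and the sketch assumes nothing more than that each directory chunk begins with a four-byte ASCII signature, PMGL for listing chunks and PMGI for index chunks.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Tika: selects listing chunks by signature.
class DirectoryChunkFilter {

    /** Returns true if the chunk starts with the 4-byte ASCII signature "PMGL". */
    static boolean isListingChunk(byte[] chunk) {
        if (chunk == null || chunk.length < 4) {
            return false;
        }
        String signature = new String(chunk, 0, 4, StandardCharsets.US_ASCII);
        return "PMGL".equals(signature);
    }

    /** Keeps only PMGL chunks; PMGI (index) and unrecognized chunks are skipped. */
    static List<byte[]> listingChunks(List<byte[]> chunks) {
        List<byte[]> result = new ArrayList<>();
        for (byte[] chunk : chunks) {
            if (isListingChunk(chunk)) {
                result.add(chunk);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<byte[]> chunks = new ArrayList<>();
        chunks.add("PMGLentries".getBytes(StandardCharsets.US_ASCII));
        chunks.add("PMGIindex".getBytes(StandardCharsets.US_ASCII));
        // Only the PMGL chunk is kept, so entries are not read twice.
        System.out.println(listingChunks(chunks).size());
    }
}
```

Skipping the index chunks is what would avoid the duplicate entries from item 1; the invalid and missing entries from items 2 and 3 depend on how each listing chunk is parsed, which this sketch does not cover.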




[jira] [Resolved] (TIKA-1446) CHM parser : wrong decompression of aligned blocks

2014-11-24 Thread Hong-Thai Nguyen (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen resolved TIKA-1446.

Resolution: Fixed

 CHM parser : wrong decompression of aligned blocks
 --

 Key: TIKA-1446
 URL: https://issues.apache.org/jira/browse/TIKA-1446
 Project: Tika
  Issue Type: Bug
Affects Versions: 1.7
Reporter: Bin Hawking
Priority: Critical
 Attachments: chm.zip


 If an embedded file contains aligned blocks, the parser outputs garbled text
 or empty text for that file.
 I have fixed it myself by correcting decompressAlignedBlock() and its
 preparation methods. The bug is mostly due to misuse of the main tree, align
 tree, and length tree. Some of the trees were also built incorrectly.




[jira] [Updated] (TIKA-1447) CHM parser: wrong directory list

2014-11-24 Thread Hong-Thai Nguyen (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen updated TIKA-1447:
---
Fix Version/s: 1.7

 CHM parser: wrong directory list
 

 Key: TIKA-1447
 URL: https://issues.apache.org/jira/browse/TIKA-1447
 Project: Tika
  Issue Type: Bug
Affects Versions: 1.7
Reporter: Bin Hawking
Priority: Critical
 Fix For: 1.7






[jira] [Resolved] (TIKA-1448) CHM parser : defect in file extraction

2014-11-24 Thread Hong-Thai Nguyen (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen resolved TIKA-1448.

Resolution: Fixed

 CHM parser : defect in file extraction
 --

 Key: TIKA-1448
 URL: https://issues.apache.org/jira/browse/TIKA-1448
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.7
Reporter: Bin Hawking
 Fix For: 1.7


 In the ChmBlockInfo class:
 chmBlockInfo.setIniBlock((chmBlockInfo.startBlock - chmBlockInfo.startBlock)
         % (int) clcd.getResetInterval());
 always sets 0.
 According to the LZX algorithm, it should be:
 chmBlockInfo.setIniBlock(chmBlockInfo.startBlock - chmBlockInfo.startBlock
         % (int) clcd.getResetInterval());
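
The difference between the two expressions comes down to operator precedence, which a small standalone sketch can show. Plain int parameters stand in for chmBlockInfo.startBlock and clcd.getResetInterval(); the class and method names are hypothetical and exist only for this illustration.

```java
// Illustration only: plain ints stand in for chmBlockInfo.startBlock and
// clcd.getResetInterval(); the class name is hypothetical.
class IniBlockSketch {

    /** The reported expression: (startBlock - startBlock) is always 0, so the result is 0. */
    static int buggyIniBlock(int startBlock, int resetInterval) {
        return (startBlock - startBlock) % resetInterval;
    }

    /**
     * The proposed expression: % binds tighter than -, so for a non-negative
     * startBlock this rounds it down to a multiple of resetInterval.
     */
    static int fixedIniBlock(int startBlock, int resetInterval) {
        return startBlock - startBlock % resetInterval;
    }

    public static void main(String[] args) {
        System.out.println(buggyIniBlock(7, 2)); // prints 0
        System.out.println(fixedIniBlock(7, 2)); // prints 6
    }
}
```

With a reset interval of 2, block 7 maps to block 6, the nearest preceding block index that is a multiple of the interval, whereas the original expression yields 0 for every block.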




[jira] [Resolved] (TIKA-1430) CHM parser gets faulty text (fix found)

2014-11-24 Thread Hong-Thai Nguyen (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen resolved TIKA-1430.

Resolution: Fixed

 CHM parser gets faulty text (fix found)
 ---

 Key: TIKA-1430
 URL: https://issues.apache.org/jira/browse/TIKA-1430
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.5, 1.6
 Environment: Windows 7; JDK 7 or 8
Reporter: Bin Hawking
Priority: Critical
 Fix For: 1.7


 I get partially wrong text out of CHM files, including the CHM files in
 tika-parsers/src/test/resources/test-documents/testChm*.chm
 I tried 1.6 and 1.5; both are equally bad. I wonder why no one complained before.
 I checked the source code, and the cause is obvious:
 When Tika decompresses the LZX data, the first block is handled correctly, but for the
 2nd block and later ones, Tika uses the previous block's content as the compressed data.
 See org.apache.tika.parser.chm.lzx.ChmLzxBlock:
 
 if (prevBlock != null
         && prevBlock.getState().getBlockLength() > prevBlock
                 .getState().getBlockRemaining())
     setChmSection(new ChmSection(prevBlock.getContent()));
     // NOTE: the dataSegment to be decompressed is not kept
 else
     setChmSection(new ChmSection(dataSegment));
 
 My fix:
 1. Add a prevcontent member variable to the ChmSection class, so that
 dataSegment and prevBlock.getContent() are both kept in it.
 2. In ChmLzxBlock.extractContent(), when invoking decompressBlock(),
 pass ChmSection.prevcontent if it exists, instead of ChmSection.data.
 With this change, I tried some CHM files and got correct-looking text.
 BTW, the unit test should be tougher, since in this case a small text (the
 first block) is decompressed correctly.
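
Step 1 of the fix above, keeping both byte arrays instead of replacing one with the other, might look roughly like the following. SectionSketch is a hypothetical stand-in for ChmSection and is not the real Tika class; only the field name prevContent follows the description, everything else is assumed for illustration.

```java
import java.util.Arrays;

// Hypothetical stand-in for ChmSection; not the actual Tika class.
class SectionSketch {
    private final byte[] data;        // compressed bytes that still have to be decoded
    private final byte[] prevContent; // decoded content of the previous block, or null

    /** First block: there is no previous content. */
    SectionSketch(byte[] data) {
        this(data, null);
    }

    /** Later blocks: keep the compressed segment AND the previous block's content. */
    SectionSketch(byte[] data, byte[] prevContent) {
        this.data = Arrays.copyOf(data, data.length);
        this.prevContent =
                prevContent == null ? null : Arrays.copyOf(prevContent, prevContent.length);
    }

    byte[] getData() {
        return data;
    }

    boolean hasPrevContent() {
        return prevContent != null;
    }

    byte[] getPrevContent() {
        return prevContent;
    }

    public static void main(String[] args) {
        byte[] dataSegment = {10, 20, 30};
        byte[] previous = {1, 2};
        SectionSketch section = new SectionSketch(dataSegment, previous);
        // The compressed segment is no longer discarded when previous content exists.
        System.out.println(section.getData().length + " " + section.hasPrevContent());
    }
}
```

A decoder built on such a class could then read compressed input from getData() and consult getPrevContent() only where the description says the previous content is needed.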




[jira] [Updated] (TIKA-1430) CHM parser gets faulty text (fix found)

2014-11-24 Thread Hong-Thai Nguyen (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen updated TIKA-1430:
---
Fix Version/s: 1.7

 CHM parser gets faulty text (fix found)
 ---

 Key: TIKA-1430
 URL: https://issues.apache.org/jira/browse/TIKA-1430
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.5, 1.6
 Environment: Windows 7; JDK 7 or 8
Reporter: Bin Hawking
Priority: Critical
 Fix For: 1.7






[jira] [Updated] (TIKA-1446) CHM parser : wrong decompression of aligned blocks

2014-11-24 Thread Hong-Thai Nguyen (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen updated TIKA-1446:
---
Fix Version/s: 1.7

 CHM parser : wrong decompression of aligned blocks
 --

 Key: TIKA-1446
 URL: https://issues.apache.org/jira/browse/TIKA-1446
 Project: Tika
  Issue Type: Bug
Affects Versions: 1.7
Reporter: Bin Hawking
Priority: Critical
 Fix For: 1.7

 Attachments: chm.zip






[jira] [Updated] (TIKA-1448) CHM parser : defect in file extraction

2014-11-24 Thread Hong-Thai Nguyen (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen updated TIKA-1448:
---
Fix Version/s: 1.7

 CHM parser : defect in file extraction
 --

 Key: TIKA-1448
 URL: https://issues.apache.org/jira/browse/TIKA-1448
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.7
Reporter: Bin Hawking
 Fix For: 1.7






[jira] [Updated] (TIKA-672) Proper error handling in the CHM parser

2014-11-24 Thread Hong-Thai Nguyen (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen updated TIKA-672:
--
Fix Version/s: 1.7

 Proper error handling in the CHM parser
 ---

 Key: TIKA-672
 URL: https://issues.apache.org/jira/browse/TIKA-672
 Project: Tika
  Issue Type: Bug
  Components: parser
Reporter: Jukka Zitting
Priority: Minor
 Fix For: 1.7


 The new CHM parser (TIKA-245) swallows exceptions and uses System.err and 
 System.out prints to report problems in many places. We should change that to 
 properly throw exceptions as follows:
 - IOExceptions when the document stream can not be read
 - TikaExceptions when the stream can not be parsed
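
The requested behaviour could be sketched as follows. So that the example compiles without Tika on the classpath, a local ParseFailure class stands in for TikaException; that class, the outer class, and the method name are all hypothetical. The only format detail assumed is that a CHM file begins with the four-byte signature ITSF.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Sketch only. ParseFailure is a local stand-in for Tika's TikaException.
class ChmErrorHandlingSketch {

    static class ParseFailure extends Exception {
        ParseFailure(String message) {
            super(message);
        }
    }

    /**
     * Checks the 4-byte file signature. Read problems propagate as IOException;
     * unparseable content is reported as ParseFailure instead of being printed
     * to System.err and swallowed.
     */
    static void checkSignature(InputStream stream) throws IOException, ParseFailure {
        byte[] header = new byte[4];
        int total = 0;
        while (total < header.length) {
            int n = stream.read(header, total, header.length - total);
            if (n < 0) {
                throw new ParseFailure("Truncated CHM header");
            }
            total += n;
        }
        if (header[0] != 'I' || header[1] != 'T' || header[2] != 'S' || header[3] != 'F') {
            throw new ParseFailure("Not a CHM file: missing ITSF signature");
        }
    }

    public static void main(String[] args) throws IOException {
        try {
            checkSignature(new ByteArrayInputStream(new byte[] {'X', 'Y', 'Z', 'W'}));
        } catch (ParseFailure e) {
            // The caller decides how to report the problem.
            System.out.println("parse error: " + e.getMessage());
        }
    }
}
```

Declaring both exception types on the method lets callers distinguish an unreadable stream from an unparseable document, which is the split the issue asks for.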




[jira] [Resolved] (TIKA-672) Proper error handling in the CHM parser

2014-11-24 Thread Hong-Thai Nguyen (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen resolved TIKA-672.
---
Resolution: Fixed

Checked: there are no more System.err/System.out calls inside the CHM parser.

 Proper error handling in the CHM parser
 ---

 Key: TIKA-672
 URL: https://issues.apache.org/jira/browse/TIKA-672
 Project: Tika
  Issue Type: Bug
  Components: parser
Reporter: Jukka Zitting
Priority: Minor
 Fix For: 1.7






[jira] [Commented] (TIKA-1447) CHM parser: wrong directory list

2014-11-17 Thread Hong-Thai Nguyen (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214535#comment-14214535
 ] 

Hong-Thai Nguyen commented on TIKA-1447:


[~binhawking], did the work on TIKA-1446 fix this issue? Any chance you could
double-check?

Thanks,

 CHM parser: wrong directory list
 

 Key: TIKA-1447
 URL: https://issues.apache.org/jira/browse/TIKA-1447
 Project: Tika
  Issue Type: Bug
Affects Versions: 1.7
Reporter: Bin Hawking
Priority: Critical





[jira] [Commented] (TIKA-1446) CHM parser : wrong decompression of aligned blocks

2014-11-12 Thread Hong-Thai Nguyen (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14208079#comment-14208079
 ] 

Hong-Thai Nguyen commented on TIKA-1446:


Hi [~binhawking], I've merged your pull request and run a before/after title
comparison on a local corpus of CHM files.
Before the merge, only one file failed; after the merge, 10 files fail. I've
pushed the failing CHM files under _test-documents/chm_ and a test case that
checks them to: https://github.com/thaichat04/tika
I also did some clean-up.

Any chance you could have another look?

 CHM parser : wrong decompression of aligned blocks
 --

 Key: TIKA-1446
 URL: https://issues.apache.org/jira/browse/TIKA-1446
 Project: Tika
  Issue Type: Bug
Affects Versions: 1.7
Reporter: Bin Hawking
Priority: Critical
 Attachments: chm.zip






[jira] [Comment Edited] (TIKA-1446) CHM parser : wrong decompression of aligned blocks

2014-11-12 Thread Hong-Thai Nguyen (JIRA)

[
https://issues.apache.org/jira/browse/TIKA-1446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14208079#comment-14208079
]

Hong-Thai Nguyen edited comment on TIKA-1446 at 11/12/14 2:38 PM:
--

Hi [~binhawking], I've merged your contribution and run a before/after title
comparison on a local corpus of CHM files.
Before the merge, only one file failed; after the merge, 10 files fail. I've
pushed the failing CHM files under _test-documents/chm_ and a test case that
checks them to: https://github.com/thaichat04/tika
I also did some clean-up.

Any chance you could have another look?

was (Author: thaichat04):
Hi [~binhawking], I've merged your pull request and run a before/after title
comparison on a local corpus of CHM files.
Before the merge, only one file failed; after the merge, 10 files fail. I've
pushed the failing CHM files under _test-documents/chm_ and a test case that
checks them to: https://github.com/thaichat04/tika
I also did some clean-up.

Any chance you could have another look?

CHM parser : wrong decompression of aligned blocks
--

Key: TIKA-1446
URL: https://issues.apache.org/jira/browse/TIKA-1446
Project: Tika
Issue Type: Bug
Affects Versions: 1.7
Reporter: Bin Hawking
Priority: Critical
Attachments: chm.zip



[jira] [Commented] (TIKA-1463) TesseractOCRParser does not work in Windows

2014-11-04 Thread Hong-Thai Nguyen (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14196343#comment-14196343
 ] 

Hong-Thai Nguyen commented on TIKA-1463:


Thanks [~lfcnassif], it does indeed also work without .exe. BTW, a path
containing a space is buggy.
I'm leaving this fix in, because adding .exe on Windows only doesn't hurt anything.

 TesseractOCRParser does not work in Windows
 ---

 Key: TIKA-1463
 URL: https://issues.apache.org/jira/browse/TIKA-1463
 Project: Tika
  Issue Type: Bug
Reporter: Hong-Thai Nguyen

 STR:
 * Case 1:
 ** Set tesseractPath to a common installation path of Tesseract:
 C:\Program Files (x86)\Tesseract-OCR
 ** The check for an available Tesseract command always returns false
 * Case 2:
 ** Even with tesseractPath set to a value without spaces, say C:\Tesseract-OCR
 ** the command used for checking and running Tesseract on Windows is not correct:
 C:\Tesseract-OCR\tesseract; it must be C:\Tesseract-OCR\tesseract.exe
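
Both cases above could be handled along the following lines. This is a sketch under stated assumptions, not the change actually committed to TesseractOCRParser: the class and method names are hypothetical, the OS name is passed in explicitly (in real code it would typically come from System.getProperty("os.name")), and the --version argument is used only as an example of an availability check.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not Tika's actual code.
class TesseractCommandSketch {

    /** Builds the executable path, appending ".exe" only when osName is a Windows name. */
    static String executable(String tesseractPath, String osName) {
        boolean windows = osName.toLowerCase(Locale.ROOT).startsWith("windows");
        String separator = windows ? "\\" : "/";
        String dir = tesseractPath;
        if (!dir.isEmpty() && !dir.endsWith(separator)) {
            dir += separator;
        }
        return dir + "tesseract" + (windows ? ".exe" : "");
    }

    /**
     * Returns the command as a list of separate arguments. Each element is
     * passed as-is, so a path containing spaces is not split on whitespace.
     */
    static List<String> versionCommand(String tesseractPath, String osName) {
        List<String> command = new ArrayList<>();
        command.add(executable(tesseractPath, osName));
        command.add("--version");
        return command;
    }

    public static void main(String[] args) {
        // The list could be handed to new ProcessBuilder(command); it is only printed here.
        System.out.println(versionCommand("C:\\Program Files (x86)\\Tesseract-OCR", "Windows 7"));
    }
}
```

Keeping the executable path as a single list element is the design choice that addresses Case 1, since a command built as one concatenated string may be split at the space in "Program Files".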




[jira] [Created] (TIKA-1463) TesseractOCRParser does work in Windows

2014-11-03 Thread Hong-Thai Nguyen (JIRA)

Hong-Thai Nguyen created TIKA-1463:
--

 Summary: TesseractOCRParser does work in Windows
 Key: TIKA-1463
 URL: https://issues.apache.org/jira/browse/TIKA-1463
 Project: Tika
  Issue Type: Bug
Reporter: Hong-Thai Nguyen


STR:
* Case 1:
** Set tesseractPath to C:\Program Files (x86)\Tesseract-OCR
** The check for an available Tesseract command always returns false

* Case 2:
** Even with no wildcard in tesseractPath, say C:\Tesseract-OCR
** the command used for checking and running Tesseract on Windows is not correct:
C:\Tesseract-OCR\tesseract; it must be C:\Tesseract-OCR\tesseract.exe




[jira] [Commented] (TIKA-1463) TesseractOCRParser does work in Windows

2014-11-03 Thread Hong-Thai Nguyen (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14194694#comment-14194694
 ] 

Hong-Thai Nguyen commented on TIKA-1463:


Fixed in r1636382

 TesseractOCRParser does work in Windows
 ---

 Key: TIKA-1463
 URL: https://issues.apache.org/jira/browse/TIKA-1463
 Project: Tika
  Issue Type: Bug
Reporter: Hong-Thai Nguyen





[jira] [Updated] (TIKA-1463) TesseractOCRParser does not work in Windows

2014-11-03 Thread Hong-Thai Nguyen (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen updated TIKA-1463:
---
Summary: TesseractOCRParser does not work in Windows  (was: 
TesseractOCRParser does work in Windows)

 TesseractOCRParser does not work in Windows
 ---

 Key: TIKA-1463
 URL: https://issues.apache.org/jira/browse/TIKA-1463
 Project: Tika
  Issue Type: Bug
Reporter: Hong-Thai Nguyen





[jira] [Updated] (TIKA-1463) TesseractOCRParser does not work in Windows

2014-11-03 Thread Hong-Thai Nguyen (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen updated TIKA-1463:
---
Description: 
STR:
* Case 1:
** Setting tesseractPath to a common installation path of Tesseract:  
C:\Program Files (x86)\Tesseract-OCR
** the checking available Tesseract command returns always false

* Case 2:
** Even setting to no space value in tesseractPath, says C:\Tesseract-OCR
** the checking  running command of tesseract on Windows is not correct: 
C:\Tesseract-OCR\tesseract, it must be C:\Tesseract-OCR\tesseract.exe

  was:
STR:
* Case 1:
** Setting tesseractPath to C:\Program Files (x86)\Tesseract-OCR
** the checking available Tesseract command returns always false

* Case 2:
** Even setting to no wildcard in tesseractPath, say C:\Tesseract-OCR
** the checking  running command of tesseract on Windows is not correct: 
C:\Tesseract-OCR\tesseract, it must be C:\Tesseract-OCR\tesseract.exe


 TesseractOCRParser does not work in Windows
 ---

 Key: TIKA-1463
 URL: https://issues.apache.org/jira/browse/TIKA-1463
 Project: Tika
  Issue Type: Bug
Reporter: Hong-Thai Nguyen





[jira] [Closed] (TIKA-1463) TesseractOCRParser does not work in Windows

2014-11-03 Thread Hong-Thai Nguyen (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen closed TIKA-1463.
--
Resolution: Fixed

 TesseractOCRParser does not work in Windows
 ---

 Key: TIKA-1463
 URL: https://issues.apache.org/jira/browse/TIKA-1463
 Project: Tika
  Issue Type: Bug
Reporter: Hong-Thai Nguyen

 STR:
 * Case 1:
 ** Setting tesseractPath to a common installation path of Tesseract:  
 C:\Program Files (x86)\Tesseract-OCR
 ** the check for an available Tesseract command always returns false
 * Case 2:
 ** Even when tesseractPath is set to a value without spaces, e.g. C:\Tesseract-OCR
 ** the command run to check for Tesseract on Windows is not correct: 
 C:\Tesseract-OCR\tesseract; it must be C:\Tesseract-OCR\tesseract.exe




[jira] [Commented] (TIKA-1446) CHM parser : wrong decompression of aligned blocks

2014-10-23 Thread Hong-Thai Nguyen (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14181530#comment-14181530
 ] 

Hong-Thai Nguyen commented on TIKA-1446:


Thanks a lot [~binhawking]. I've had a quick look at your fix. Indeed, there are 
quite a lot of changes. After a cleanup and some minor fixes, I broke the CHM tests.

We really appreciate your contribution and we should continue and finalize it. I've 
created a new pull request based on a branch with your fix + my cleanup:
https://github.com/apache/tika/pull/21
https://github.com/thaichat04/tika.git, branch TIKA-1446

 CHM parser : wrong decompression of aligned blocks
 --

 Key: TIKA-1446
 URL: https://issues.apache.org/jira/browse/TIKA-1446
 Project: Tika
  Issue Type: Bug
Affects Versions: 1.7
Reporter: Bin Hawking
Priority: Critical
 Attachments: chm.zip


 If an embedded file contains aligned blocks, the parser outputs chaotic text 
 or empty text for this file.
 I have fixed it myself by correcting decompressAlignedBlock() and its 
 preparation methods. The bug is mostly due to misuse of the main tree/align 
 tree/length tree, and some tree is built incorrectly.




[jira] [Commented] (TIKA-1422) org.apache.tika.parser.mail.RFC822ParserTest fails

2014-10-21 Thread Hong-Thai Nguyen (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14178186#comment-14178186
 ] 

Hong-Thai Nguyen commented on TIKA-1422:


Applied the latest fix in r1633325 with some formatting. Thanks.

 org.apache.tika.parser.mail.RFC822ParserTest fails
 --

 Key: TIKA-1422
 URL: https://issues.apache.org/jira/browse/TIKA-1422
 Project: Tika
  Issue Type: Bug
  Components: parser
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
 Fix For: 1.7

 Attachments: TIKA-1422.Mattmann.100114.patch.txt, 
 TIKA-1422.Mattmann.100414.patch.txt, TIKA-1422.oleg.20141021.patch, 
 TIKA-1422.palsulich.100414.patch, TIKA-1422.palsulich.100714.patch


 I'm seeing test failures from:
 {noformat}
 Results :
 Failed tests:   testMultipart(org.apache.tika.parser.mail.RFC822ParserTest): 
 (..)
 Tests run: 538, Failures: 1, Errors: 0, Skipped: 1
 {noformat}
 CentOS6 VM image, running:
 {noformat}
 [mattmann@memex tika]$ java -version
 java version 1.7.0_67
 Java(TM) SE Runtime Environment (build 1.7.0_67-b01)
 Java HotSpot(TM) 64-Bit Server VM (build 24.65-b04, mixed mode)
 [mattmann@memex tika]$ mvn -version
 Apache Maven 3.2.1 (ea8b2b07643dbb1b84b6d16e1f08391b666bc1e9; 
 2014-02-14T09:37:52-08:00)
 Maven home: /usr/share/apache-maven
 Java version: 1.7.0_65, vendor: Oracle Corporation
 Java home: /data/home/mattmann/dist/jdk1.7.0_65/jre
 Default locale: en_US, platform encoding: UTF-8
 OS name: linux, version: 2.6.32-431.23.3.el6.centos.plus.x86_64, arch: 
 amd64, family: unix
 [mattmann@memex tika]$ 
 {noformat}
 Here are the surefire reports - no clue what's up here:
 {noformat}
 [mattmann@memex tika]$ more 
 tika-parsers/target/surefire-reports/org.apache.tika.parser.mail.RFC822ParserTest.txt
  
 ---
 Test set: org.apache.tika.parser.mail.RFC822ParserTest
 ---
 Tests run: 8, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.699 sec  
 FAILURE!
 testMultipart(org.apache.tika.parser.mail.RFC822ParserTest)  Time elapsed: 
 0.152 sec   FAILURE!
 org.mockito.exceptions.verification.TooManyActualInvocations: 
 xHTMLContentHandler.startElement(
 "http://www.w3.org/1999/xhtml",
 "div",
 "div",
 isA(org.xml.sax.Attributes)
 );
 Wanted 4 times but was 5
   at 
 org.apache.tika.parser.mail.RFC822ParserTest.testMultipart(RFC822ParserTest.java:87)
 Caused by: org.mockito.exceptions.cause.UndesiredInvocation: 
 Undesired invocation:
   at 
 org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
   at 
 org.apache.tika.sax.SafeContentHandler.startElement(SafeContentHandler.java:264)
   at 
 org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:254)
   at 
 org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
   at 
 org.apache.tika.sax.xpath.MatchingContentHandler.startElement(MatchingContentHandler.java:60)
   at 
 org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
   at 
 org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
   at 
 org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
   at 
 org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
   at 
 org.apache.tika.sax.SafeContentHandler.startElement(SafeContentHandler.java:264)
   at 
 org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:254)
   at 
 org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:284)
   at 
 org.apache.tika.parser.ocr.TesseractOCRParser.extractOutput(TesseractOCRParser.java:243)
   at 
 org.apache.tika.parser.ocr.TesseractOCRParser.parse(TesseractOCRParser.java:155)
   at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:247)
   at 
 org.apache.tika.parser.mail.MailContentHandler.body(MailContentHandler.java:102)
   at 
 org.apache.james.mime4j.parser.MimeStreamParser.parse(MimeStreamParser.java:133)
   at org.apache.tika.parser.mail.RFC822Parser.parse(RFC822Parser.java:76)
   at 
 org.apache.tika.parser.mail.RFC822ParserTest.testMultipart(RFC822ParserTest.java:84)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:606)
   at

[jira] [Comment Edited] (TIKA-1422) org.apache.tika.parser.mail.RFC822ParserTest fails

2014-10-21 Thread Hong-Thai Nguyen (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14178186#comment-14178186
 ] 

Hong-Thai Nguyen edited comment on TIKA-1422 at 10/21/14 9:48 AM:
--

Applied the latest fix in r1633325 and r161 with some formatting. Thanks.


was (Author: thaichat04):
Applied latest fix on r1633325 with some formatting. Thank

 org.apache.tika.parser.mail.RFC822ParserTest fails
 --

 Key: TIKA-1422
 URL: https://issues.apache.org/jira/browse/TIKA-1422
 Project: Tika
  Issue Type: Bug
  Components: parser
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
 Fix For: 1.7

 Attachments: TIKA-1422.Mattmann.100114.patch.txt, 
 TIKA-1422.Mattmann.100414.patch.txt, TIKA-1422.oleg.20141021.patch, 
 TIKA-1422.palsulich.100414.patch, TIKA-1422.palsulich.100714.patch


 I'm seeing test failures from:
 {noformat}
 Results :
 Failed tests:   testMultipart(org.apache.tika.parser.mail.RFC822ParserTest): 
 (..)
 Tests run: 538, Failures: 1, Errors: 0, Skipped: 1
 {noformat}
 CentOS6 VM image, running:
 {noformat}
 [mattmann@memex tika]$ java -version
 java version 1.7.0_67
 Java(TM) SE Runtime Environment (build 1.7.0_67-b01)
 Java HotSpot(TM) 64-Bit Server VM (build 24.65-b04, mixed mode)
 [mattmann@memex tika]$ mvn -version
 Apache Maven 3.2.1 (ea8b2b07643dbb1b84b6d16e1f08391b666bc1e9; 
 2014-02-14T09:37:52-08:00)
 Maven home: /usr/share/apache-maven
 Java version: 1.7.0_65, vendor: Oracle Corporation
 Java home: /data/home/mattmann/dist/jdk1.7.0_65/jre
 Default locale: en_US, platform encoding: UTF-8
 OS name: linux, version: 2.6.32-431.23.3.el6.centos.plus.x86_64, arch: 
 amd64, family: unix
 [mattmann@memex tika]$ 
 {noformat}
 Here are the surefire reports - no clue what's up here:
 {noformat}
 [mattmann@memex tika]$ more 
 tika-parsers/target/surefire-reports/org.apache.tika.parser.mail.RFC822ParserTest.txt
  
 ---
 Test set: org.apache.tika.parser.mail.RFC822ParserTest
 ---
 Tests run: 8, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.699 sec  
 FAILURE!
 testMultipart(org.apache.tika.parser.mail.RFC822ParserTest)  Time elapsed: 
 0.152 sec   FAILURE!
 org.mockito.exceptions.verification.TooManyActualInvocations: 
 xHTMLContentHandler.startElement(
 "http://www.w3.org/1999/xhtml",
 "div",
 "div",
 isA(org.xml.sax.Attributes)
 );
 Wanted 4 times but was 5
   at 
 org.apache.tika.parser.mail.RFC822ParserTest.testMultipart(RFC822ParserTest.java:87)
 Caused by: org.mockito.exceptions.cause.UndesiredInvocation: 
 Undesired invocation:
   at 
 org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
   at 
 org.apache.tika.sax.SafeContentHandler.startElement(SafeContentHandler.java:264)
   at 
 org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:254)
   at 
 org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
   at 
 org.apache.tika.sax.xpath.MatchingContentHandler.startElement(MatchingContentHandler.java:60)
   at 
 org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
   at 
 org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
   at 
 org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
   at 
 org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
   at 
 org.apache.tika.sax.SafeContentHandler.startElement(SafeContentHandler.java:264)
   at 
 org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:254)
   at 
 org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:284)
   at 
 org.apache.tika.parser.ocr.TesseractOCRParser.extractOutput(TesseractOCRParser.java:243)
   at 
 org.apache.tika.parser.ocr.TesseractOCRParser.parse(TesseractOCRParser.java:155)
   at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:247)
   at 
 org.apache.tika.parser.mail.MailContentHandler.body(MailContentHandler.java:102)
   at 
 org.apache.james.mime4j.parser.MimeStreamParser.parse(MimeStreamParser.java:133)
   at org.apache.tika.parser.mail.RFC822Parser.parse(RFC822Parser.java:76)
   at 
 org.apache.tika.parser.mail.RFC822ParserTest.testMultipart(RFC822ParserTest.java:84)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at

[jira] [Commented] (TIKA-1422) org.apache.tika.parser.mail.RFC822ParserTest fails

2014-10-16 Thread Hong-Thai Nguyen (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14173537#comment-14173537
 ] 

Hong-Thai Nguyen commented on TIKA-1422:


I'm not using Tesseract

 org.apache.tika.parser.mail.RFC822ParserTest fails
 --

 Key: TIKA-1422
 URL: https://issues.apache.org/jira/browse/TIKA-1422
 Project: Tika
  Issue Type: Bug
  Components: parser
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
 Fix For: 1.7

 Attachments: TIKA-1422.Mattmann.100114.patch.txt, 
 TIKA-1422.Mattmann.100414.patch.txt, TIKA-1422.palsulich.100414.patch, 
 TIKA-1422.palsulich.100714.patch


 I'm seeing test failures from:
 {noformat}
 Results :
 Failed tests:   testMultipart(org.apache.tika.parser.mail.RFC822ParserTest): 
 (..)
 Tests run: 538, Failures: 1, Errors: 0, Skipped: 1
 {noformat}
 CentOS6 VM image, running:
 {noformat}
 [mattmann@memex tika]$ java -version
 java version 1.7.0_67
 Java(TM) SE Runtime Environment (build 1.7.0_67-b01)
 Java HotSpot(TM) 64-Bit Server VM (build 24.65-b04, mixed mode)
 [mattmann@memex tika]$ mvn -version
 Apache Maven 3.2.1 (ea8b2b07643dbb1b84b6d16e1f08391b666bc1e9; 
 2014-02-14T09:37:52-08:00)
 Maven home: /usr/share/apache-maven
 Java version: 1.7.0_65, vendor: Oracle Corporation
 Java home: /data/home/mattmann/dist/jdk1.7.0_65/jre
 Default locale: en_US, platform encoding: UTF-8
 OS name: linux, version: 2.6.32-431.23.3.el6.centos.plus.x86_64, arch: 
 amd64, family: unix
 [mattmann@memex tika]$ 
 {noformat}
 Here are the surefire reports - no clue what's up here:
 {noformat}
 [mattmann@memex tika]$ more 
 tika-parsers/target/surefire-reports/org.apache.tika.parser.mail.RFC822ParserTest.txt
  
 ---
 Test set: org.apache.tika.parser.mail.RFC822ParserTest
 ---
 Tests run: 8, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.699 sec  
 FAILURE!
 testMultipart(org.apache.tika.parser.mail.RFC822ParserTest)  Time elapsed: 
 0.152 sec   FAILURE!
 org.mockito.exceptions.verification.TooManyActualInvocations: 
 xHTMLContentHandler.startElement(
 "http://www.w3.org/1999/xhtml",
 "div",
 "div",
 isA(org.xml.sax.Attributes)
 );
 Wanted 4 times but was 5
   at 
 org.apache.tika.parser.mail.RFC822ParserTest.testMultipart(RFC822ParserTest.java:87)
 Caused by: org.mockito.exceptions.cause.UndesiredInvocation: 
 Undesired invocation:
   at 
 org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
   at 
 org.apache.tika.sax.SafeContentHandler.startElement(SafeContentHandler.java:264)
   at 
 org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:254)
   at 
 org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
   at 
 org.apache.tika.sax.xpath.MatchingContentHandler.startElement(MatchingContentHandler.java:60)
   at 
 org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
   at 
 org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
   at 
 org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
   at 
 org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
   at 
 org.apache.tika.sax.SafeContentHandler.startElement(SafeContentHandler.java:264)
   at 
 org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:254)
   at 
 org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:284)
   at 
 org.apache.tika.parser.ocr.TesseractOCRParser.extractOutput(TesseractOCRParser.java:243)
   at 
 org.apache.tika.parser.ocr.TesseractOCRParser.parse(TesseractOCRParser.java:155)
   at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:247)
   at 
 org.apache.tika.parser.mail.MailContentHandler.body(MailContentHandler.java:102)
   at 
 org.apache.james.mime4j.parser.MimeStreamParser.parse(MimeStreamParser.java:133)
   at org.apache.tika.parser.mail.RFC822Parser.parse(RFC822Parser.java:76)
   at 
 org.apache.tika.parser.mail.RFC822ParserTest.testMultipart(RFC822ParserTest.java:84)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:606)
   at 
 org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
   at

[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser

2014-10-13 Thread Hong-Thai Nguyen (JIRA)

[
https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14169090#comment-14169090
]

Hong-Thai Nguyen commented on TIKA-1445:

Interesting question!
For me, parser selection and parser priority should be decided at runtime by
configuration, not inside a parser.
The image parser is an interesting case of competing parsers (Tesseract vs. the
classical image parsers). We have two problems here:
1. When several parsers can handle the same MIME type, which one is selected?
2. When several parsers match, can we apply all of them and merge the results
(metadata handler)?

* For case 1, if we use an override config of parsers at runtime, we can declare
several parsers with a matching MIME type, and the later one in the list will be
selected. We may extend the CLI/web service to inject this kind of configuration.
* For case 2, we don't have a solution for now. We may extend CompositeParser
to accept a 'many' parsers mode and call the matching parsers in a chain. Merging
the results is another problem: we can accept that a metadata name is overridden
by another parser. The perfect solution is (again) to use a nested structure in our
metadata, which makes it possible to store each parser's result.
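
A minimal sketch of the two cases, under stated assumptions: NamedParser is a hypothetical stand-in for a parser (it is not Tika's Parser interface), the parser names and metadata values are made up, and no real CompositeParser API is used. It only shows "the later declaration wins" for case 1, and a flat versus a nested merge for case 2.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ParserSelectionSketch {

    /** Hypothetical stand-in for a parser: a name, one MIME type, and the metadata it emits. */
    static final class NamedParser {
        final String name;
        final String mimeType;
        final Map<String, String> metadata = new LinkedHashMap<>();

        NamedParser(String name, String mimeType, String... keyValues) {
            this.name = name;
            this.mimeType = mimeType;
            for (int i = 0; i + 1 < keyValues.length; i += 2) {
                metadata.put(keyValues[i], keyValues[i + 1]);
            }
        }
    }

    /** Case 1: among the parsers declared for a MIME type, the later declaration wins. */
    static NamedParser select(List<NamedParser> declared, String mimeType) {
        NamedParser selected = null;
        for (NamedParser parser : declared) {
            if (parser.mimeType.equals(mimeType)) {
                selected = parser; // overwrite: a later match replaces an earlier one
            }
        }
        return selected;
    }

    /** Case 2, flat merge: all matching parsers run in order; a later value overrides the same name. */
    static Map<String, String> mergeFlat(List<NamedParser> declared, String mimeType) {
        Map<String, String> merged = new LinkedHashMap<>();
        for (NamedParser parser : declared) {
            if (parser.mimeType.equals(mimeType)) {
                merged.putAll(parser.metadata);
            }
        }
        return merged;
    }

    /** Case 2, nested merge: each matching parser's result is kept under the parser's own name. */
    static Map<String, Map<String, String>> mergeNested(List<NamedParser> declared, String mimeType) {
        Map<String, Map<String, String>> merged = new LinkedHashMap<>();
        for (NamedParser parser : declared) {
            if (parser.mimeType.equals(mimeType)) {
                merged.put(parser.name, new LinkedHashMap<>(parser.metadata));
            }
        }
        return merged;
    }

    /** Illustrative declarations only. */
    static List<NamedParser> demoParsers() {
        return Arrays.asList(
                new NamedParser("image", "image/png", "width", "640", "text", ""),
                new NamedParser("ocr", "image/png", "text", "hello"),
                new NamedParser("plain", "text/plain", "text", "ignored"));
    }

    public static void main(String[] args) {
        List<NamedParser> declared = demoParsers();
        System.out.println(select(declared, "image/png").name);
        System.out.println(mergeFlat(declared, "image/png"));
        System.out.println(mergeNested(declared, "image/png"));
    }
}
```

In the flat merge the empty "text" value of the first parser is lost once the second parser writes the same name; the nested variant keeps both, which is the trade-off described in case 2.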

Figure out how to add Image metadata extraction to Tesseract parser
---

Key: TIKA-1445
URL: https://issues.apache.org/jira/browse/TIKA-1445
Project: Tika
Issue Type: Bug
Components: parser
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
Fix For: 1.7

Attachments: TIKA-1445.Mattmann.101214.patch.txt

Now that Tesseract is the default image parser in Tika for many image types,
consider how to add back in the metadata extraction capabilities by the other
Image parsers.


[jira] [Commented] (TIKA-1176) ChmDirectoryListingSet does not correctly enumerate directory entries

2014-10-13 Thread Hong-Thai Nguyen (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14169146#comment-14169146
 ] 

Hong-Thai Nguyen commented on TIKA-1176:


Hi [~mdgeek], thanks for offering the code and test file. Unfortunately, this 
check raises another exception on this file:
{code}
The full exception stack trace is included below:

org.apache.tika.exception.TikaException
at 
org.apache.tika.parser.chm.core.ChmExtractor.extractChmEntry(ChmExtractor.java:355)
at org.apache.tika.parser.chm.ChmParser.parse(ChmParser.java:70)
at 
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:247)
at 
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:247)
at 
org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
at org.apache.tika.gui.TikaGUI.handleStream(TikaGUI.java:326)
at org.apache.tika.gui.TikaGUI.openFile(TikaGUI.java:285)
at 
org.apache.tika.gui.ParsingTransferHandler.importFiles(ParsingTransferHandler.java:94)
at 
org.apache.tika.gui.ParsingTransferHandler.importData(ParsingTransferHandler.java:77)
at javax.swing.TransferHandler.importData(TransferHandler.java:755)
at 
javax.swing.TransferHandler$DropHandler.drop(TransferHandler.java:1478)
at java.awt.dnd.DropTarget.drop(DropTarget.java:434)
at 
javax.swing.TransferHandler$SwingDropTarget.drop(TransferHandler.java:1203)
at 
sun.awt.dnd.SunDropTargetContextPeer.processDropMessage(SunDropTargetContextPeer.java:519)
at 
sun.awt.dnd.SunDropTargetContextPeer$EventDispatcher.dispatchDropEvent(SunDropTargetContextPeer.java:832)
at 
sun.awt.dnd.SunDropTargetContextPeer$EventDispatcher.dispatchEvent(SunDropTargetContextPeer.java:756)
at sun.awt.dnd.SunDropTargetEvent.dispatch(SunDropTargetEvent.java:30)
at java.awt.Component.dispatchEventImpl(Component.java:4517)
at java.awt.Container.dispatchEventImpl(Container.java:2097)
at java.awt.Component.dispatchEvent(Component.java:4488)
at 
java.awt.LightweightDispatcher.retargetMouseEvent(Container.java:4575)
at 
java.awt.LightweightDispatcher.processDropTargetEvent(Container.java:4310)
at java.awt.LightweightDispatcher.dispatchEvent(Container.java:4161)
at java.awt.Container.dispatchEventImpl(Container.java:2083)
at java.awt.Window.dispatchEventImpl(Window.java:2489)
at java.awt.Component.dispatchEvent(Component.java:4488)
at java.awt.EventQueue.dispatchEventImpl(EventQueue.java:674)
at java.awt.EventQueue.access$400(EventQueue.java:81)
at java.awt.EventQueue$2.run(EventQueue.java:633)
at java.awt.EventQueue$2.run(EventQueue.java:631)
at java.security.AccessController.doPrivileged(Native Method)
at 
java.security.AccessControlContext$1.doIntersectionPrivilege(AccessControlContext.java:87)
at 
java.security.AccessControlContext$1.doIntersectionPrivilege(AccessControlContext.java:98)
at java.awt.EventQueue$3.run(EventQueue.java:647)
at java.awt.EventQueue$3.run(EventQueue.java:645)
at java.security.AccessController.doPrivileged(Native Method)
at 
java.security.AccessControlContext$1.doIntersectionPrivilege(AccessControlContext.java:87)
at java.awt.EventQueue.dispatchEvent(EventQueue.java:644)
at 
java.awt.EventDispatchThread.pumpOneEventForFilters(EventDispatchThread.java:269)
at 
java.awt.EventDispatchThread.pumpEventsForFilter(EventDispatchThread.java:184)
at 
java.awt.EventDispatchThread.pumpEventsForHierarchy(EventDispatchThread.java:174)
at java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:169)
at java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:161)
at java.awt.EventDispatchThread.run(EventDispatchThread.java:122)
Caused by: java.lang.ArrayIndexOutOfBoundsException
at java.lang.System.arraycopy(Native Method)
at 
org.apache.tika.parser.chm.core.ChmCommons.copyOfRange(ChmCommons.java:342)
at 
org.apache.tika.parser.chm.core.ChmCommons.getChmBlockSegment(ChmCommons.java:108)
at 
org.apache.tika.parser.chm.core.ChmExtractor.extractChmEntry(ChmExtractor.java:337)
... 43 more
{code} 

Our CHM parser is quite complex. Can you provide a full fix and a test with 
the expected output content for your file?

Thanks,
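
For reference, the trace above ends in System.arraycopy called from ChmCommons.copyOfRange. A minimal sketch, assuming the exception comes from a requested range that exceeds the source array; ChmCopySketch is a hypothetical helper and not the actual ChmCommons code. It only shows a bounds check that reports the offending range instead of failing inside arraycopy.

```java
final class ChmCopySketch {

    /**
     * Hypothetical bounds-checked copy of source[from, to).
     * Rejects ranges that System.arraycopy would fail on and names the range.
     */
    static byte[] copyOfRange(byte[] source, int from, int to) {
        if (source == null) {
            throw new IllegalArgumentException("source array is null");
        }
        if (from < 0 || to < from || to > source.length) {
            throw new IllegalArgumentException(
                    "invalid range [" + from + ", " + to + ") for array of length " + source.length);
        }
        byte[] copy = new byte[to - from];
        System.arraycopy(source, from, copy, 0, copy.length);
        return copy;
    }

    public static void main(String[] args) {
        byte[] data = {1, 2, 3, 4};
        System.out.println(copyOfRange(data, 1, 3).length);
        try {
            copyOfRange(data, 2, 10); // range runs past the end of the array
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```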

 ChmDirectoryListingSet does not correctly enumerate directory entries
 -

 Key: TIKA-1176
 URL: https://issues.apache.org/jira/browse/TIKA-1176
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.4
Reporter: Doug Martin
 Attachments: HelpStudioSample.chm

[jira] [Commented] (TIKA-1428) Microsoft Word 97 - 2003 (.doc) footnote references are Unicode Replacement Character

2014-09-25 Thread Hong-Thai Nguyen (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14147880#comment-14147880
 ] 

Hong-Thai Nguyen commented on TIKA-1428:


Thanks [~theoettheo], any chance of a patch with a test case for this 
problem?

 Microsoft Word 97 - 2003 (.doc) footnote references are Unicode Replacement 
 Character
 -

 Key: TIKA-1428
 URL: https://issues.apache.org/jira/browse/TIKA-1428
 Project: Tika
  Issue Type: Bug
Affects Versions: 1.4, 1.6
Reporter: Theodor Sjöstedt
Priority: Minor
 Attachments: TIKA-doc-footnotes-issue.png


 Footnotes from {{.doc}} documents are extracted, but the references to the 
 footnotes are replaced by the Unicode Replacement Character (�).
 I have tried this in 1.4 and 1.6.
 In 1.4, both reference in text and reference at footnote have been replaced.
 In 1.6, reference in text has disappeared completely.
 See attached image for original document, 1.4 Formatted text, and 1.6 
 Formatted text.




[jira] [Commented] (TIKA-1421) Tika-Parsers tests fail on CentOS6 if tesseract isn't installed

2014-09-22 Thread Hong-Thai Nguyen (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14143041#comment-14143041
 ] 

Hong-Thai Nguyen commented on TIKA-1421:


Not only on CentOS: this test also fails on my Windows machine without Tesseract 
installed.

 Tika-Parsers tests fail on CentOS6 if tesseract isn't installed
 ---

 Key: TIKA-1421
 URL: https://issues.apache.org/jira/browse/TIKA-1421
 Project: Tika
  Issue Type: Bug
  Components: parser
 Environment: CentOS6 AWS VM for DARPA Memex
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
 Fix For: 1.7


 While testing TIKA-93 on CentOS6, I ran into some test failing issues on a 
 1.7-trunk fresh install of tika in tika-parsers:
 {noformat}
 Running org.apache.tika.parser.chm.TestChmLzxcControlData
 Tests run: 10, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.008 sec
 Running org.apache.tika.parser.chm.TestChmBlockInfo
 Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.003 sec
 Running org.apache.tika.parser.chm.TestChmItsfHeader
 Tests run: 12, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.005 sec
 Running org.apache.tika.parser.txt.TXTParserTest
 Tests run: 11, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.016 sec
 Running org.apache.tika.parser.txt.CharsetDetectorTest
 Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.02 sec
 Running org.apache.tika.parser.image.xmp.JempboxExtractorTest
 Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.014 sec
 Running org.apache.tika.parser.image.PSDParserTest
 Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.002 sec
 Running org.apache.tika.parser.image.ImageParserTest
 Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.034 sec
 Running org.apache.tika.parser.image.ImageMetadataExtractorTest
 Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.241 sec
 Running org.apache.tika.parser.image.MetadataFieldsTest
 Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0 sec
 Running org.apache.tika.parser.image.TiffParserTest
 Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.02 sec
 Running org.apache.tika.parser.font.FontParsersTest
 Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.192 sec
 Running org.apache.tika.parser.mp4.MP4ParserTest
 Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.07 sec
 Running org.apache.tika.parser.mp3.Mp3ParserTest
 Tests run: 10, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.046 sec
 Running org.apache.tika.parser.mp3.MpegStreamTest
 Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.003 sec
 Running org.apache.tika.parser.dwg.DWGParserTest
 Tests run: 7, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.02 sec
 Running org.apache.tika.parser.pkg.GzipParserTest
 Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.252 sec
 Running org.apache.tika.parser.pkg.Seven7ParserTest
 Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.37 sec
 Running org.apache.tika.parser.pkg.TarParserTest
 Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.118 sec
 Running org.apache.tika.parser.pkg.Bzip2ParserTest
 Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.233 sec
 Running org.apache.tika.parser.pkg.ArParserTest
 Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.017 sec
 Running org.apache.tika.parser.pkg.ZipParserTest
 Tests run: 5, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.302 sec
 Running org.apache.tika.parser.video.FLVParserTest
 Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.026 sec
 Running org.apache.tika.parser.solidworks.SolidworksParserTest
 Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.019 sec
 Running org.apache.tika.parser.ibooks.iBooksParserTest
 Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.019 sec
 Running org.apache.tika.parser.ParsingReaderTest
 Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.018 sec
 Running org.apache.tika.parser.mail.RFC822ParserTest
 Tests run: 8, Failures: 1, Errors: 1, Skipped: 0, Time elapsed: 0.31 sec  
 FAILURE!
 Running org.apache.tika.parser.mbox.MboxParserTest
 Tests run: 5, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.026 sec
 Running org.apache.tika.parser.mbox.OutlookPSTParserTest
 Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.094 sec
 Running org.apache.tika.parser.jpeg.JpegParserTest
 Tests run: 8, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.153 sec
 Running org.apache.tika.parser.executable.ExecutableParserTest
 Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.003 sec
 Running

[jira] [Updated] (TIKA-1421) Tika-Parsers tests fail on CentOS6 if tesseract isn't installed

2014-09-22 Thread Hong-Thai Nguyen (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen updated TIKA-1421:
---
Priority: Blocker  (was: Major)

 Tika-Parsers tests fail on CentOS6 if tesseract isn't installed
 ---

 Key: TIKA-1421
 URL: https://issues.apache.org/jira/browse/TIKA-1421
 Project: Tika
  Issue Type: Bug
  Components: parser
 Environment: CentOS6 AWS VM for DARPA Memex
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
Priority: Blocker
 Fix For: 1.7


 While testing TIKA-93 on CentOS6, I ran into some test failing issues on a 
 1.7-trunk fresh install of tika in tika-parsers:
 {noformat}
 Running org.apache.tika.parser.chm.TestChmLzxcControlData
 Tests run: 10, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.008 sec
 Running org.apache.tika.parser.chm.TestChmBlockInfo
 Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.003 sec
 Running org.apache.tika.parser.chm.TestChmItsfHeader
 Tests run: 12, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.005 sec
 Running org.apache.tika.parser.txt.TXTParserTest
 Tests run: 11, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.016 sec
 Running org.apache.tika.parser.txt.CharsetDetectorTest
 Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.02 sec
 Running org.apache.tika.parser.image.xmp.JempboxExtractorTest
 Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.014 sec
 Running org.apache.tika.parser.image.PSDParserTest
 Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.002 sec
 Running org.apache.tika.parser.image.ImageParserTest
 Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.034 sec
 Running org.apache.tika.parser.image.ImageMetadataExtractorTest
 Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.241 sec
 Running org.apache.tika.parser.image.MetadataFieldsTest
 Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0 sec
 Running org.apache.tika.parser.image.TiffParserTest
 Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.02 sec
 Running org.apache.tika.parser.font.FontParsersTest
 Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.192 sec
 Running org.apache.tika.parser.mp4.MP4ParserTest
 Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.07 sec
 Running org.apache.tika.parser.mp3.Mp3ParserTest
 Tests run: 10, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.046 sec
 Running org.apache.tika.parser.mp3.MpegStreamTest
 Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.003 sec
 Running org.apache.tika.parser.dwg.DWGParserTest
 Tests run: 7, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.02 sec
 Running org.apache.tika.parser.pkg.GzipParserTest
 Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.252 sec
 Running org.apache.tika.parser.pkg.Seven7ParserTest
 Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.37 sec
 Running org.apache.tika.parser.pkg.TarParserTest
 Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.118 sec
 Running org.apache.tika.parser.pkg.Bzip2ParserTest
 Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.233 sec
 Running org.apache.tika.parser.pkg.ArParserTest
 Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.017 sec
 Running org.apache.tika.parser.pkg.ZipParserTest
 Tests run: 5, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.302 sec
 Running org.apache.tika.parser.video.FLVParserTest
 Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.026 sec
 Running org.apache.tika.parser.solidworks.SolidworksParserTest
 Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.019 sec
 Running org.apache.tika.parser.ibooks.iBooksParserTest
 Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.019 sec
 Running org.apache.tika.parser.ParsingReaderTest
 Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.018 sec
 Running org.apache.tika.parser.mail.RFC822ParserTest
 Tests run: 8, Failures: 1, Errors: 1, Skipped: 0, Time elapsed: 0.31 sec <<< FAILURE!
 Running org.apache.tika.parser.mbox.MboxParserTest
 Tests run: 5, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.026 sec
 Running org.apache.tika.parser.mbox.OutlookPSTParserTest
 Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.094 sec
 Running org.apache.tika.parser.jpeg.JpegParserTest
 Tests run: 8, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.153 sec
 Running org.apache.tika.parser.executable.ExecutableParserTest
 Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.003 sec
 Running org.apache.tika.parser.rtf.RTFParserTest
 Tests run: 31, Failures: 0, Errors: 0, Skipped: 0, Time

[jira] [Commented] (TIKA-1412) NPE in OpenDocumentParser

2014-09-22 Thread Hong-Thai Nguyen (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143043#comment-14143043
 ] 

Hong-Thai Nguyen commented on TIKA-1412:


Added a test in r1626706.

 NPE in OpenDocumentParser
 -

 Key: TIKA-1412
 URL: https://issues.apache.org/jira/browse/TIKA-1412
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.6
Reporter: Andrzej Bialecki 
 Fix For: 1.7

 Attachments: TIKA-1412.diff


 There's a missing else in OpenDocumentParser when it constructs a 
 ZipInputStream from the InputStream, which results in NPE when the 
 InputStream is an instance of TikaInputStream but has neither openContainer 
 nor file:
 {code}
 ...
 Caused by: java.lang.NullPointerException
 at 
 org.apache.tika.parser.odf.OpenDocumentParser.parse(OpenDocumentParser.java:161)
  ~[tika-parsers-1.6.jar:1.6]
 at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244) 
 ~[tika-core-1.6.jar:1.6]
 {code}
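
The "missing else" described above can be illustrated with a minimal, self-contained sketch. It does not reproduce the actual OpenDocumentParser source: `WrappedStream`, `chooseSourceBuggy` and `chooseSourceFixed` are hypothetical names that only mimic the shape of the branch, assuming a stream wrapper that may carry an open container or a backing file.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;

class MissingElseSketch {

    /** Hypothetical stand-in for a wrapper stream that may know its container or file. */
    static class WrappedStream extends ByteArrayInputStream {
        final Object openContainer; // an already opened container, or null
        final String file;          // path of a backing file, or null

        WrappedStream(byte[] data, Object openContainer, String file) {
            super(data);
            this.openContainer = openContainer;
            this.file = file;
        }
    }

    /** Buggy shape: a wrapper with neither container nor file selects nothing. */
    static String chooseSourceBuggy(InputStream in) {
        String source = null;
        if (in instanceof WrappedStream) {
            WrappedStream w = (WrappedStream) in;
            if (w.openContainer != null) {
                source = "container";
            } else if (w.file != null) {
                source = "file";
            } // missing else: source stays null, a later dereference throws NPE
        } else {
            source = "stream";
        }
        return source;
    }

    /** Fixed shape: fall back to reading the stream itself. */
    static String chooseSourceFixed(InputStream in) {
        if (in instanceof WrappedStream) {
            WrappedStream w = (WrappedStream) in;
            if (w.openContainer != null) {
                return "container";
            } else if (w.file != null) {
                return "file";
            }
        }
        return "stream";
    }

    public static void main(String[] args) {
        InputStream bare = new WrappedStream(new byte[0], null, null);
        System.out.println("buggy: " + chooseSourceBuggy(bare));
        System.out.println("fixed: " + chooseSourceFixed(bare));
    }
}
```

In the buggy variant the wrapper case with no container and no file leaves the source unset, which matches the situation the report describes; the fixed variant treats that case like a plain stream.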



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Resolved] (TIKA-1413) OOXML thumbnail name added to body

2014-09-09 Thread Hong-Thai Nguyen (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen resolved TIKA-1413.

Resolution: Fixed

 OOXML thumbnail name added to body
 --

 Key: TIKA-1413
 URL: https://issues.apache.org/jira/browse/TIKA-1413
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.6
Reporter: Andrzej Bialecki 

 AbstractOOXMLExtractor.handleThumbnail processes thumbnails using 
 EmbeddedDocumentExtractor, but with the outputHtml flag set to true (unlike 
 other embedded parts in handleEmbeddedParts(...)).
 This results in adding the thumbnail name to the main body of the document 
 (as a package-entry), which in my opinion is wrong.
 Example:
 {code}
 <?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
 <head>
 <meta name="meta:slide-count" content="1"/>
 <meta name="cp:revision" content="5"/>
 <meta name="meta:last-author" content="Nick Burch"/>
 <meta name="Slide-Count" content="1"/>
 <meta name="Last-Author" content="Nick Burch"/>
 <meta name="meta:save-date" content="2010-09-08T16:15:14Z"/>
 <meta name="Content-Length" content="202969"/>
 <meta name="subject" content="Gym class featuring a brown fox and lazy dog"/>
 <meta name="Application-Name" content="Microsoft Office PowerPoint"/>
 <meta name="Author" content="Nevin Nollop"/>
 <meta name="dcterms:created" content="1601-01-01T00:00:00Z"/>
 <meta name="Application-Version" content="12."/>
 <meta name="date" content="2010-09-08T16:15:14Z"/>
 <meta name="Total-Time" content="2"/>
 <meta name="extended-properties:Template" content=""/>
 <meta name="publisher" content=""/>
 <meta name="creator" content="Nevin Nollop"/>
 <meta name="Word-Count" content="9"/>
 <meta name="meta:paragraph-count" content="1"/>
 <meta name="extended-properties:AppVersion" content="12."/>
 <meta name="Creation-Date" content="1601-01-01T00:00:00Z"/>
 <meta name="meta:author" content="Nevin Nollop"/>
 <meta name="cp:subject" content="Gym class featuring a brown fox and lazy dog"/>
 <meta name="extended-properties:Application" content="Microsoft Office PowerPoint"/>
 <meta name="resourceName" content="testPPT_embeded.pptx"/>
 <meta name="Paragraph-Count" content="1"/>
 <meta name="dc:title" content="The quick brown fox jumps over the lazy dog"/>
 <meta name="Last-Save-Date" content="2010-09-08T16:15:14Z"/>
 <meta name="custom:Version" content="1"/>
 <meta name="Revision-Number" content="5"/>
 <meta name="Last-Printed" content="1601-01-01T00:00:00Z"/>
 <meta name="meta:print-date" content="1601-01-01T00:00:00Z"/>
 <meta name="meta:creation-date" content="1601-01-01T00:00:00Z"/>
 <meta name="dcterms:modified" content="2010-09-08T16:15:14Z"/>
 <meta name="Template" content=""/>
 <meta name="dc:creator" content="Nevin Nollop"/>
 <meta name="meta:word-count" content="9"/>
 <meta name="extended-properties:Company" content=""/>
 <meta name="Last-Modified" content="2010-09-08T16:15:14Z"/>
 <meta name="extended-properties:PresentationFormat" content="On-screen Show (4:3)"/>
 <meta name="X-Parsed-By" content="org.apache.tika.parser.DefaultParser"/>
 <meta name="X-Parsed-By" content="org.apache.tika.parser.microsoft.ooxml.OOXMLParser"/>
 <meta name="modified" content="2010-09-08T16:15:14Z"/>
 <meta name="xmpTPg:NPages" content="1"/>
 <meta name="extended-properties:TotalTime" content="2"/>
 <meta name="dc:publisher" content=""/>
 <meta name="Content-Type" content="application/vnd.openxmlformats-officedocument.presentationml.presentation"/>
 <meta name="Presentation-Format" content="On-screen Show (4:3)"/>
 <title>The quick brown fox jumps over the lazy dog</title>
 </head>
 <body><p>The quick brown fox jumps over the lazy dog</p>
 <div class="embedded" id="slide1_rId4"/>
 <div class="embedded" id="slide1_rId5"/>
 <div class="embedded" id="slide1_rId6"/>
 <div class="embedded" id="slide1_rId7"/>
 <div class="embedded" id="slide1_rId8"/>
 <div class="embedded" id="slide1_rId9"/>
 <div class="embedded" id="thumbnail_0.jpeg"/><div class="package-entry"><h1>thumbnail_0.jpeg</h1></div></body></html>
 {code}
 The extracted plain text looks like this (using tika-app):
 {code}
 The quick brown fox jumps over the lazy dog
 thumbnail_0.jpeg
 {code}
 The fix is trivial - change the flag in AbstractOOXMLExtractor:158 to false.
 I think also that the id attribute should be set to the real thumbnail path 
 within the package (i.e. tPart.getPartName().getName()) instead of the 
 artificially created sequential name.
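
To make the effect of the flag concrete, here is a small sketch under stated assumptions. It is not the AbstractOOXMLExtractor code: `embed` is a hypothetical helper that only models the behaviour described in this report, namely that an embedded part always gets an `embedded` div, and that a `package-entry` heading with the part name is additionally written to the body when the outputHtml flag is true.

```java
class ThumbnailFlagSketch {

    /**
     * Hypothetical model of emitting an embedded part into the XHTML body.
     * The package-entry heading is only written when outputHtml is true.
     */
    static String embed(String id, String name, boolean outputHtml) {
        StringBuilder xhtml = new StringBuilder();
        xhtml.append("<div class=\"embedded\" id=\"").append(id).append("\"/>");
        if (outputHtml) {
            xhtml.append("<div class=\"package-entry\"><h1>")
                 .append(name)
                 .append("</h1></div>");
        }
        return xhtml.toString();
    }

    public static void main(String[] args) {
        // With the flag set, the thumbnail name leaks into the body text.
        System.out.println(embed("thumbnail_0.jpeg", "thumbnail_0.jpeg", true));
        // With the flag cleared, only the placeholder div remains.
        System.out.println(embed("thumbnail_0.jpeg", "thumbnail_0.jpeg", false));
    }
}
```

With the flag cleared, a plain-text extraction of the body would no longer contain the thumbnail name, which is the outcome the proposed one-line fix aims for.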




[jira] [Commented] (TIKA-1413) OOXML thumbnail name added to body

2014-09-09 Thread Hong-Thai Nguyen (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14126949#comment-14126949
 ] 

Hong-Thai Nguyen commented on TIKA-1413:


I agree. Fixed in r1623819; the _id_ now comes from partName().


[jira] [Commented] (TIKA-1373) AutoDetectParser extracts no text when SourceCodeParser is selected

2014-07-29 Thread Hong-Thai Nguyen (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14077885#comment-14077885
 ] 

Hong-Thai Nguyen commented on TIKA-1373:


Normally the fix will be in the next official 1.6 release, but you can try this 
release candidate: http://people.apache.org/~mattmann/apache-tika-1.6/rc1/

 AutoDetectParser extracts no text when SourceCodeParser is selected
 ---

 Key: TIKA-1373
 URL: https://issues.apache.org/jira/browse/TIKA-1373
 Project: Tika
  Issue Type: Bug
Affects Versions: 1.5
Reporter: Andrés Aguilar-Umaña

 When using the AutoDetectParser in Java code, and the SourceCodeParser is 
 selected (e.g. for Java files), the handler gets no text.
 I have this test program:
 {code}
 String data = "public class HelloWorld {}";
 ByteArrayInputStream bais = new ByteArrayInputStream(data.getBytes());
 Parser autoDetectParser = new AutoDetectParser();
 BodyContentHandler bch = new BodyContentHandler(50);
 ParseContext parseContext = new ParseContext();
 Metadata metadata = new Metadata();
 metadata.set(Metadata.CONTENT_TYPE, "text/x-java-source");
 try {
     autoDetectParser.parse(bais, bch, metadata, parseContext);
 } catch (Exception e) {
     e.printStackTrace();
 }
 System.out.println("Text extracted: " + bch.toString());
 {code}
 It returns (using the SourceCodeParser): 
 {code}  Text extracted: {code}
 But when I use this code:
 {code}
 String data = "public class HelloWorld {}";
 ByteArrayInputStream bais = new ByteArrayInputStream(data.getBytes());
 Parser autoDetectParser = new AutoDetectParser();
 BodyContentHandler bch = new BodyContentHandler(50);
 ParseContext parseContext = new ParseContext();
 Metadata metadata = new Metadata();
 metadata.set(Metadata.CONTENT_TYPE, "text/plain");
 try {
     autoDetectParser.parse(bais, bch, metadata, parseContext);
 } catch (Exception e) {
     e.printStackTrace();
 }
 System.out.println("Text extracted: " + bch.toString());
 {code}
 The Text Parser is used and I get:
 {code}  Text extracted: public class HelloWorld {} {code}
 I have also tested this command: 
 {code}
  java -jar tika-app-1.5.jar -t D:\text.java
    (no text)
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Commented] (TIKA-1373) AutoDetectParser extracts no text when SourceCodeParser is selected

2014-07-24 Thread Hong-Thai Nguyen (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14073042#comment-14073042
 ] 

Hong-Thai Nguyen commented on TIKA-1373:


HtmlParser skips the tags generated by JHighlight. I found a solution by using 
the TagSoup parser directly. Committed in r1613051.
As I mentioned in TIKA-1224, this parser is a quick & dirty approach to parsing 
source code files. Again, the _right_ approach is to have a dedicated parser per 
language that parses elements deeply and builds events on the fly.


[jira] [Resolved] (TIKA-1373) AutoDetectParser extracts no text when SourceCodeParser is selected

2014-07-24 Thread Hong-Thai Nguyen (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen resolved TIKA-1373.


Resolution: Fixed


[jira] [Commented] (TIKA-1373) AutoDetectParser extracts no text when SourceCodeParser is selected

2014-07-23 Thread Hong-Thai Nguyen (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071643#comment-14071643
 ] 

Hong-Thai Nguyen commented on TIKA-1373:


Can you format your description with the {code} annotation? And if I understand 
correctly, the output of the first section is empty?

 AutoDetectParser extracts no text when SourceCodeParser is selected
 ---

 Key: TIKA-1373
 URL: https://issues.apache.org/jira/browse/TIKA-1373
 Project: Tika
  Issue Type: Bug
Affects Versions: 1.5
Reporter: Andrés Aguilar-Umaña

 When using the AutoDetectParser in Java code, and the SourceCodeParser is 
 selected (e.g. for Java files), the handler gets no text.
 I have this test program:
 String data = "public class HelloWorld {}";
 ByteArrayInputStream bais = new ByteArrayInputStream(data.getBytes());
 Parser autoDetectParser = new AutoDetectParser();
 autoDetectParser = new SourceCodeParser();
 BodyContentHandler bch = new BodyContentHandler(50);
 ParseContext parseContext = new ParseContext();
 Metadata metadata = new Metadata();
 metadata.set(Metadata.CONTENT_TYPE, "text/x-java-source");
 try {
     autoDetectParser.parse(bais, bch, metadata, parseContext);
 } catch (Exception e) {
     e.printStackTrace();
 }
 System.out.println("Text extracted: " + bch.toString());
 It returns (using the SourceCodeParser): 
  Text extracted: 
 But when I use this code:
 String data = "public class HelloWorld {}";
 ByteArrayInputStream bais = new ByteArrayInputStream(data.getBytes());
 Parser autoDetectParser = new AutoDetectParser();
 autoDetectParser = new SourceCodeParser();
 BodyContentHandler bch = new BodyContentHandler(50);
 ParseContext parseContext = new ParseContext();
 Metadata metadata = new Metadata();
 metadata.set(Metadata.CONTENT_TYPE, "text/plain");
 try {  autoDetectParser.parse(bais, bch, metadata, parseContext);  } 
 catch (Exception e) {  e.printStackTrace();  }
 System.out.println("Text extracted: " + bch.toString());
 The Text Parser is used and I get:
  Text extracted: public class HelloWorld {}
 I have also tested this command: 
  java -jar tika-app-1.5.jar -t D:\text.java
    (no text)
  




[jira] [Commented] (TIKA-1373) AutoDetectParser extracts no text when SourceCodeParser is selected

2014-07-23 Thread Hong-Thai Nguyen (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071713#comment-14071713
 ] 

Hong-Thai Nguyen commented on TIKA-1373:


Yes, I saw this problem when implementing this parser. How can we detect that 
the caller is asking for text instead of HTML? Is checking that the handler is 
an instance of BodyContentHandler enough?


[jira] [Comment Edited] (TIKA-1373) AutoDetectParser extracts no text when SourceCodeParser is selected

2014-07-23 Thread Hong-Thai Nguyen (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071643#comment-14071643
 ] 

Hong-Thai Nguyen edited comment on TIKA-1373 at 7/23/14 1:42 PM:
-

Can you format your description with the {noformat}{code}{noformat} annotation? 
And if I understand correctly, the output of the first section is empty?


was (Author: thaichat04):
Can you format your description with the {code} annotation? And if I understand 
correctly, the output of the first section is empty?


[jira] [Commented] (TIKA-1095) Only gibberish extracted from this PDF

2014-07-15 Thread Hong-Thai Nguyen (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14061867#comment-14061867
 ] 

Hong-Thai Nguyen commented on TIKA-1095:


Even the latest Tika can't convert this file. It seems to be a font problem in 
this PDF file. Can you report this to the PDFBox tracker: 
https://issues.apache.org/jira/browse/PDFBOX/ ?

 Only gibberish extracted from this PDF
 --

 Key: TIKA-1095
 URL: https://issues.apache.org/jira/browse/TIKA-1095
 Project: Tika
  Issue Type: Bug
  Components: general
Affects Versions: 1.3
 Environment: Probably any
Reporter: Bas van Meurs
  Labels: patch
 Attachments: ALG 2010-05-19 03 bijlage 1 -  besluitenlijst dagelijks 
 bestuur d d  10 februari 2010.pdf, test.txt


 java -jar /usr/share/tika/tika-app-1.3.jar -t 
 "/home/adrupal/www/sites/stadsregio.nl/files/files/Agendastukken/ALG 
 2010-05-19 03 bijlage 1 -  besluitenlijst dagelijks bestuur d d  10 februari 
 2010.pdf" > /tmp/test.txt
 This produces only gibberish.




[jira] [Updated] (TIKA-1095) Only gibberish extracted from this PDF

2014-07-15 Thread Hong-Thai Nguyen (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen updated TIKA-1095:
---

Component/s: (was: general)
 parser

 Only gibberish extracted from this PDF
 --

 Key: TIKA-1095
 URL: https://issues.apache.org/jira/browse/TIKA-1095
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.3
 Environment: Probably any
Reporter: Bas van Meurs
  Labels: pdfbox
 Attachments: ALG 2010-05-19 03 bijlage 1 -  besluitenlijst dagelijks 
 bestuur d d  10 februari 2010.pdf, test.txt


 java -jar /usr/share/tika/tika-app-1.3.jar -t 
 "/home/adrupal/www/sites/stadsregio.nl/files/files/Agendastukken/ALG 
 2010-05-19 03 bijlage 1 -  besluitenlijst dagelijks bestuur d d  10 februari 
 2010.pdf" > /tmp/test.txt
 This produces only gibberish.




[jira] [Commented] (TIKA-1350) OutlookPSTParser: Unknown message type: IPM.Note

2014-06-23 Thread Hong-Thai Nguyen (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14040519#comment-14040519
 ] 

Hong-Thai Nguyen commented on TIKA-1350:


Richard Johnson (author of java-libpst) is trying to deploy the new version 0.8.1 to 
Maven Central (ref. 
https://issues.sonatype.org/browse/OSSRH-8965?focusedCommentId=260254&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-260254).

Once that is done, we can upgrade the Tika dependency to 0.8.1 to get the fix.

 OutlookPSTParser: Unknown message type: IPM.Note
 

 Key: TIKA-1350
 URL: https://issues.apache.org/jira/browse/TIKA-1350
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.7
Reporter: Jonathan Evans
  Labels: libpst, parser, pst
 Fix For: 1.7

   Original Estimate: 0.2h
  Remaining Estimate: 0.2h

 When parsing some emails in a PST file I get the error "Unknown message type: 
 IPM.Note", which prevents them from being parsed. This is because of an extra null 
 byte at the end of the message class string.
 This has been fixed in version 0.8.1 of java-libpst, so a version bump is all 
 that is required. 
 https://github.com/rjohnsondev/java-libpst/issues/14
 I would attempt to do this myself, but I am unsure how to open a pull request 
 with SVN.
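
The trailing null byte explains why a string that prints as IPM.Note still fails an equality check. The following sketch only illustrates that effect; `stripTrailingNulls` is a hypothetical helper and is not claimed to be how java-libpst 0.8.1 fixes the problem.

```java
class MessageClassSketch {

    /** Hypothetical helper: drop any NUL characters at the end of the string. */
    static String stripTrailingNulls(String s) {
        int end = s.length();
        while (end > 0 && s.charAt(end - 1) == '\0') {
            end--;
        }
        return s.substring(0, end);
    }

    public static void main(String[] args) {
        // A message class as it might be read from the file, with one extra NUL.
        String raw = "IPM.Note\0";
        System.out.println("raw matches:      " + raw.equals("IPM.Note"));
        System.out.println("stripped matches: " + stripTrailingNulls(raw).equals("IPM.Note"));
    }
}
```

Because the extra character is invisible in most log output, the error message shows a message class that looks perfectly valid.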




[jira] [Commented] (TIKA-1308) Support in memory parse mode(don't create temp file): to support run Tika in GAE

2014-05-26 Thread Hong-Thai Nguyen (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14008704#comment-14008704
 ] 

Hong-Thai Nguyen commented on TIKA-1308:


A virtual file system may be a solution if you're on Java 7. The NIO APIs with 
FileSystemProvider [1] allow you to define or inject a virtual file system (e.g. 
Commons VFS [2]).

[1] 
http://docs.oracle.com/javase/7/docs/api/java/nio/file/spi/FileSystemProvider.html
[2] http://commons.apache.org/proper/commons-vfs/filesystems.html
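
As a rough illustration of the provider mechanism in [1], the sketch below lists the URI schemes of the file system providers installed in the running JVM. It uses only the standard `java.nio.file.spi.FileSystemProvider` API; the class and method names are made up for the example. Whether GAE permits installing a custom provider has not been checked here, and the stack trace in this issue goes through `java.io.File.createTempFile`, which is not routed through NIO providers, so Tika itself would presumably need changes as well.

```java
import java.nio.file.spi.FileSystemProvider;
import java.util.ArrayList;
import java.util.List;

class InstalledProvidersSketch {

    /** Returns the URI scheme of every installed file system provider. */
    static List<String> installedSchemes() {
        List<String> schemes = new ArrayList<String>();
        for (FileSystemProvider provider : FileSystemProvider.installedProviders()) {
            schemes.add(provider.getScheme());
        }
        return schemes;
    }

    public static void main(String[] args) {
        // The default provider uses the "file" scheme; others depend on the JVM and classpath.
        System.out.println(installedSchemes());
    }
}
```

A virtual file system would show up in this list under its own scheme once its provider is on the classpath and registered.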






 Support in memory parse mode(don't create temp file): to support run Tika in 
 GAE
 

 Key: TIKA-1308
 URL: https://issues.apache.org/jira/browse/TIKA-1308
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.5
Reporter: yuanyun.cn
  Labels: gae
 Fix For: 1.6


 I am trying to use Tika in GAE and write a simple servlet to extract meta 
 data info from jpeg:
 String urlStr = req.getParameter(imageUrl);
 byte[] oldImageData = IOUtils.toByteArray(new URL(urlStr));
 ByteArrayInputStream bais = new ByteArrayInputStream(oldImageData);
 Metadata metadata = new Metadata();
 BodyContentHandler ch = new BodyContentHandler();
 AutoDetectParser parser = new AutoDetectParser();
 parser.parse(bais, ch, metadata, new ParseContext());
 bais.close();
 This fails with the exception:
 Caused by: java.lang.SecurityException: Unable to create temporary file
   at java.io.File.createTempFile(File.java:1986)
   at org.apache.tika.io.TemporaryResources.createTemporaryFile(TemporaryResources.java:66)
   at org.apache.tika.io.TikaInputStream.getFile(TikaInputStream.java:533)
   at org.apache.tika.parser.jpeg.JpegParser.parse(JpegParser.java:56)
   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
 I checked the code: 
 org.apache.tika.parser.jpeg.JpegParser.parse(InputStream, ContentHandler, 
 Metadata, ParseContext) creates a temp file from the input stream.
 I can understand why Tika creates a temp file from the stream: so that Tika can 
 parse it multiple times.
 But as GAE and other cloud platforms are getting more popular, is it possible 
 to avoid creating the temp file? Instead, we could copy the original stream 
 into a byte array stream, so Tika can still parse it multiple times.
 -- This will put a limit on the file size, as Tika keeps the whole file in 
 memory, but it would make Tika work in GAE and maybe on other cloud servers.
 We could add a parameter to parser.parse to indicate whether to parse in memory 
 only.
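
The byte-array idea described above can be sketched with plain JDK classes. This is only an illustration under assumptions: `InMemoryReplay` is a hypothetical helper name, it is not part of Tika, and it does not show how Tika itself would be wired to use it.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical helper (not a Tika class): buffers a stream once so it can be read repeatedly. */
final class InMemoryReplay {
    private final byte[] data;

    /** Reads the whole source stream into memory; the caller still owns and closes the source. */
    InMemoryReplay(InputStream source) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int read;
        while ((read = source.read(chunk)) != -1) {
            buffer.write(chunk, 0, read);
        }
        this.data = buffer.toByteArray();
    }

    /** Returns a fresh stream over the buffered bytes; every call starts at offset 0. */
    InputStream newStream() {
        return new ByteArrayInputStream(data);
    }

    /** Number of bytes held in memory. */
    int size() {
        return data.length;
    }
}
```

Each call to `newStream()` gives an independent pass over the same bytes without touching the file system, at the cost of holding the whole document in memory.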
  



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Updated] (TIKA-1290) Upgrade to PDFBOX 1.8.5

2014-05-06 Thread Hong-Thai Nguyen (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen updated TIKA-1290:
---

Labels: trivial  (was: )

 Upgrade to PDFBOX 1.8.5
 ---

 Key: TIKA-1290
 URL: https://issues.apache.org/jira/browse/TIKA-1290
 Project: Tika
  Issue Type: Improvement
Reporter: Hong-Thai Nguyen
  Labels: trivial

 PDFBOX 1.8.5 has been released: http://pdfbox.apache.org/downloads.html#recent
 We can update to this version, and possibly also test and fix TIKA-1231




[jira] [Resolved] (TIKA-1290) Upgrade to PDFBOX 1.8.5

2014-05-06 Thread Hong-Thai Nguyen (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen resolved TIKA-1290.


Resolution: Fixed

r1592780

 Upgrade to PDFBOX 1.8.5
 ---

 Key: TIKA-1290
 URL: https://issues.apache.org/jira/browse/TIKA-1290
 Project: Tika
  Issue Type: Improvement
Reporter: Hong-Thai Nguyen
  Labels: trivial

 PDFBOX 1.8.5 has been released: http://pdfbox.apache.org/downloads.html#recent
 We can update to this version, and possibly also test and fix TIKA-1231




[jira] [Commented] (TIKA-1287) Update NetCDF .jar file on Maven Central

2014-05-02 Thread Hong-Thai Nguyen (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13987521#comment-13987521
 ] 

Hong-Thai Nguyen commented on TIKA-1287:


Technically, it is not difficult to upload a new jar to Maven Central: just 
follow the steps mentioned by [~gagravarr]. I did this recently for java-libpst.
BTW, we must take care with the library's license if you are not its author. 
See 
http://www.unidata.ucar.edu/software/thredds/current/netcdf-java/documentation.htm : 
netCDF's license is not the Apache license. You should contact them first to ask 
for authorization if you want to upload this library yourself.

 Update NetCDF .jar file on Maven Central
 

 Key: TIKA-1287
 URL: https://issues.apache.org/jira/browse/TIKA-1287
 Project: Tika
  Issue Type: Bug
Affects Versions: 1.5
Reporter: Ann Burgess
  Labels: jar, maven, netcdf, tika, unit-test, update

 I am working to update the NetCDFParser file.  When using the most recent 
 .jar file available from http://www.unidata.ucar.edu/ at the command line, I 
 receive a note about a deprecated API: 
 javac -classpath 
 ../../../../tika-core/target/tika-core-1.6-SNAPSHOT.jar:../../../../toolsUI-4.3.jar
  org/apache/tika/parser/netcdf/NetCDFParser.java
 Note: org/apache/tika/parser/netcdf/NetCDFParser.java uses or overrides a 
 deprecated API.
 Note: Recompile with -Xlint:deprecation for details.
 After updating the NetCDFParser file with non-deprecated methods (e.g. 
 changing dimension.getName() to dimension.getFullName()), however, I get 
 failed unit tests in maven, which I assume is because the Maven Central Repo 
 has the lapsed version of the .jar file needed for NetCDF files (
 http://search.maven.org/#search%7Cgav%7C1%7Cg%3A%22edu.ucar%22%20AND%20a%3A%22netcdf%22)
  .
 Can anyone provide insight into how I get the updated .jar file into the 
 Maven Central Repository? Is there an alternative method to update Tika so I 
 can run my unit tests in Maven?




[jira] [Created] (TIKA-1290) Upgrade to PDFBOX 1.8.5

2014-05-02 Thread Hong-Thai Nguyen (JIRA)

Hong-Thai Nguyen created TIKA-1290:
--

 Summary: Upgrade to PDFBOX 1.8.5
 Key: TIKA-1290
 URL: https://issues.apache.org/jira/browse/TIKA-1290
 Project: Tika
  Issue Type: Improvement
Reporter: Hong-Thai Nguyen


PDFBOX 1.8.5 has been released: http://pdfbox.apache.org/downloads.html#recent

We can update to this version, and possibly also test and fix TIKA-1231




[jira] [Commented] (TIKA-1283) Add thumbnail as possible metadata item to TikaCoreProperties

2014-04-28 Thread Hong-Thai Nguyen (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13983434#comment-13983434
 ] 

Hong-Thai Nguyen commented on TIKA-1283:


+1 from me for creating a thumbnail field in the metadata set.
- For OOXML, the thumbnail is an item inside the archive (see TIKA-1223). 
PowerPoint always embeds a thumbnail in JPEG, but it is optional for docx and 
xlsx (available only when the user checks the 'save preview' option when saving 
the document).
- For OLE documents, see: http://poi.apache.org/hpsf/thumbnails.html. You can 
get the thumbnail content from the POI API:
{code}
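import java.io.File;
import org.apache.poi.hpsf.SummaryInformation;
import org.apache.poi.hpsf.Thumbnail;
import org.apache.poi.hwpf.HWPFDocumentCore;
import org.apache.poi.hwpf.converter.AbstractWordUtils;

// Prints some summary information of a Word document and returns its thumbnail as WMF bytes.
static byte[] process(File docFile) throws Exception {
    final HWPFDocumentCore wordDocument = AbstractWordUtils.loadDoc(docFile);
    SummaryInformation summaryInformation = wordDocument.getSummaryInformation();
    System.out.println(summaryInformation.getAuthor());
    System.out.println(summaryInformation.getApplicationName() + ":" + summaryInformation.getTitle());
    // Wrap the raw thumbnail bytes so the clipboard format can be inspected.
    Thumbnail thumbnail = new Thumbnail(summaryInformation.getThumbnail());
    System.out.println(thumbnail.getClipboardFormat());
    System.out.println(thumbnail.getClipboardFormatTag());
    return thumbnail.getThumbnailAsWMF();
}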
{code}
Unfortunately, there's an open bug in POI about getting the thumbnail content 
properly: https://issues.apache.org/bugzilla/show_bug.cgi?id=56194
For docx, xlsx and OLE formats, the thumbnails are in WMF and EMF formats, which 
are quite difficult to handle. But this is out of our scope.


 Add thumbnail as possible metadata item to TikaCoreProperties
 ---

 Key: TIKA-1283
 URL: https://issues.apache.org/jira/browse/TIKA-1283
 Project: Tika
  Issue Type: Improvement
  Components: metadata
Reporter: Tim Allison
Priority: Minor

 TIKA-90 originally requested to add thumbnails to a document's metadata.
 I'd like to have a unified way of determining whether an embedded 
 document/resource is a thumbnail or a regular attachment.
 With the changes in TIKA-1223 (ooxml) and TIKA-1010 (rtf), we are now pulling 
 out more thumbnails than before.
 I propose adding tika:thumbnail to the metadata of each thumbnail image.  
 The consumer can then determine what to do with the embedded resource based 
 on the metadata.
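
As a rough illustration of how a consumer could act on such a flag, here is a sketch that uses a plain `Map` in place of Tika's `Metadata` class. The key name `tika:thumbnail` comes from the proposal above; the value convention (`"true"`) and the `ThumbnailFilter` class are assumptions made for this example only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical consumer-side helper; a Map stands in for the metadata of one embedded resource. */
final class ThumbnailFilter {
    // Key name from the proposal; the "true" value convention is an assumption.
    static final String THUMBNAIL_KEY = "tika:thumbnail";

    /** True when the metadata carries the thumbnail flag. */
    static boolean isThumbnail(Map<String, String> metadata) {
        return "true".equalsIgnoreCase(metadata.get(THUMBNAIL_KEY));
    }

    /** Keeps only the embedded resources that are not flagged as thumbnails. */
    static List<Map<String, String>> regularAttachments(List<Map<String, String>> embedded) {
        List<Map<String, String>> result = new ArrayList<>();
        for (Map<String, String> metadata : embedded) {
            if (!isThumbnail(metadata)) {
                result.add(metadata);
            }
        }
        return result;
    }
}
```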




[jira] [Created] (TIKA-1279) Missing return lines at output of SourceCodeParser

2014-04-24 Thread Hong-Thai Nguyen (JIRA)

Hong-Thai Nguyen created TIKA-1279:
--

 Summary: Missing return lines at output of SourceCodeParser
 Key: TIKA-1279
 URL: https://issues.apache.org/jira/browse/TIKA-1279
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.5
Reporter: Hong-Thai Nguyen
Priority: Trivial
 Fix For: 1.6


xhtml output is on a single line.




[jira] [Commented] (TIKA-1224) Adding Source code (Java, Groovy, C) parser

2014-04-24 Thread Hong-Thai Nguyen (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13979614#comment-13979614
 ] 

Hong-Thai Nguyen commented on TIKA-1224:


Thanks [~ben.12] for the feedback.
For the line-return problem in the output, I created a new issue: TIKA-1279.
For the -t option in TikaCLI, the mimetype of a Java file is ambiguous. It could 
be text/plain (in which case TxtParser is used and returns the original text as 
is) or text/x-java-source (in which case SourceCodeParser is used).

For the -h option, the output normally looks something like:
{code}
Author: Hong-Thai.Nguyen
Content-Encoding: windows-1252
Content-Length: 4899
Content-Type: text/x-java-source
LoC: 133
creator: Hong-Thai.Nguyen
dc:creator: Hong-Thai.Nguyen
meta:author: Hong-Thai.Nguyen
resourceName: SourceCodeParser.java
{code}
The creator comes from the 'author' annotation in the Javadoc.
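
To make the two derived fields concrete (a line count and an author taken from a Javadoc tag), here is a small stand-alone sketch. It is not the SourceCodeParser implementation; `SourceStats` and its counting rules are assumptions chosen for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: derives a line count and the first @author value from source text. */
final class SourceStats {
    // Matches a Javadoc-style "@author Name" tag and captures the rest of that line.
    private static final Pattern AUTHOR = Pattern.compile("@author\\s+(.+)");

    /** Counts lines; trailing blank lines are not counted. */
    static int countLines(String source) {
        if (source.isEmpty()) {
            return 0;
        }
        return source.split("\r\n|\r|\n").length;
    }

    /** Returns the first author found, or null when there is no @author tag. */
    static String firstAuthor(String source) {
        Matcher m = AUTHOR.matcher(source);
        return m.find() ? m.group(1).trim() : null;
    }
}
```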

This parser is quite generic (quick and dirty, as mentioned by [~kkrugler]) and 
simplistic. We could make a more dedicated Java source parser and extract more 
metadata (members, attributes...). If you are interested in this kind of parser, 
please create a new issue; an investigation into this work would be warmly welcome.

Regards,

 Adding Source code (Java, Groovy, C) parser
 ---

 Key: TIKA-1224
 URL: https://issues.apache.org/jira/browse/TIKA-1224
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.5
Reporter: Hong-Thai Nguyen
Priority: Minor

 We can parse some source code file formats:
 text/x-java-source
 text/x-groovy
 text/x-c
 For HTML rendering of code, we can use jhighlight: 
 http://www.ohloh.net/p/jhighlight




[jira] [Resolved] (TIKA-1279) Missing return lines at output of SourceCodeParser

2014-04-24 Thread Hong-Thai Nguyen (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen resolved TIKA-1279.


Resolution: Fixed

Fixed at r1589687

 Missing return lines at output of SourceCodeParser
 --

 Key: TIKA-1279
 URL: https://issues.apache.org/jira/browse/TIKA-1279
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.5
Reporter: Hong-Thai Nguyen
Priority: Trivial
 Fix For: 1.6


 xhtml output is on a single line.




[jira] [Updated] (TIKA-1276) Missing embedded dependencies in tika-bundle

2014-04-24 Thread Hong-Thai Nguyen (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen updated TIKA-1276:
---

Fix Version/s: 1.6

 Missing embedded dependencies in tika-bundle
 

 Key: TIKA-1276
 URL: https://issues.apache.org/jira/browse/TIKA-1276
 Project: Tika
  Issue Type: Bug
  Components: packaging
Affects Versions: 1.5
 Environment: OSGI, Apache Felix via Apache Sling Launcher
Reporter: Rupert Westenthaler
 Fix For: 1.6

 Attachments: TIKA-1276_20140423_rwesten.diff


 While updating from Tika 1.2 to 1.5 I noticed that the 
 `org.apache.tika:tika-bundle:1.5` module has some missing dependencies.
 1. `com.uwyn:jhighlight:1.0` is not embedded
 Because of that, installing the bundle results in the following exception
 {code}
 org.osgi.framework.BundleException: Unresolved constraint in bundle 
 org.apache.tika.bundle [103]: Unable to resolve 103.0: missing requirement 
 [103.0] osgi.wiring.package; 
 (osgi.wiring.package=com.uwyn.jhighlight.renderer))
 org.osgi.framework.BundleException: Unresolved constraint in bundle 
 org.apache.tika.bundle [103]: Unable to resolve 103.0: missing requirement 
 [103.0] osgi.wiring.package; 
 (osgi.wiring.package=com.uwyn.jhighlight.renderer)
   at 
 org.apache.felix.framework.Felix.resolveBundleRevision(Felix.java:3962)
   at org.apache.felix.framework.Felix.startBundle(Felix.java:2025)
   at org.apache.felix.framework.Felix.setActiveStartLevel(Felix.java:1279)
   at 
 org.apache.felix.framework.FrameworkStartLevelImpl.run(FrameworkStartLevelImpl.java:304)
   at java.lang.Thread.run(Thread.java:744)
 {code}
 2. `org.ow2.asm:asm:4.1` is not embedded because 
 `org.apache.tika:tika-core:1.5` uses `org.ow2.asm-debug-all:asm:4.1` and 
 therefore the `Embed-Dependency` directive `asm` does not match any 
 dependency. 
 Because of that, one gets the following exception (after fixing (1))
 {code}
 org.osgi.framework.BundleException: Unresolved constraint in bundle 
 org.apache.tika.bundle [96]: Unable to resolve 96.0: missing requirement 
 [96.0] osgi.wiring.package; 
 ((osgi.wiring.package=org.objectweb.asm)(version=4.1.0)(!(version=5.0.0
 org.osgi.framework.BundleException: Unresolved constraint in bundle 
 org.apache.tika.bundle [96]: Unable to resolve 96.0: missing requirement 
 [96.0] osgi.wiring.package; 
 ((osgi.wiring.package=org.objectweb.asm)(version=4.1.0)(!(version=5.0.0)))
   at 
 org.apache.felix.framework.Felix.resolveBundleRevision(Felix.java:3962)
   at org.apache.felix.framework.Felix.startBundle(Felix.java:2025)
   at org.apache.felix.framework.Felix.setActiveStartLevel(Felix.java:1279)
   at 
 org.apache.felix.framework.FrameworkStartLevelImpl.run(FrameworkStartLevelImpl.java:304)
   at java.lang.Thread.run(Thread.java:744)
 {code}
 There are two possibilities to fix this: (a) change the `Embed-Dependency` to 
 `asm-debug-all` or (b) add a dependency on `org.ow2.asm:asm:4.1` to the 
 tika-bundle pom file.
 3. `edu.ucar:netcdf:4.2-min` is not embedded
 Because of that, one gets the following exception (after fixing (1) and 
 (2))
 {code}
 org.osgi.framework.BundleException: Unresolved constraint in bundle 
 org.apache.tika.bundle [96]: Unable to resolve 96.0: missing requirement 
 [96.0] osgi.wiring.package; (osgi.wiring.package=ucar.ma2))
 org.osgi.framework.BundleException: Unresolved constraint in bundle 
 org.apache.tika.bundle [96]: Unable to resolve 96.0: missing requirement 
 [96.0] osgi.wiring.package; (osgi.wiring.package=ucar.ma2)
   at 
 org.apache.felix.framework.Felix.resolveBundleRevision(Felix.java:3962)
   at org.apache.felix.framework.Felix.startBundle(Felix.java:2025)
   at org.apache.felix.framework.Felix.setActiveStartLevel(Felix.java:1279)
   at 
 org.apache.felix.framework.FrameworkStartLevelImpl.run(FrameworkStartLevelImpl.java:304)
   at java.lang.Thread.run(Thread.java:744)
 {code}
 4. The `com.adobe.xmp:xmpcore:5.1.2` dependency is required at runtime
 After fixing the above issues, the tika-bundle started successfully. 
 However, when extracting EXIF metadata from a JPEG image I got the following 
 exception.
 {code}
 java.lang.NoClassDefFoundError: com/adobe/xmp/XMPException
   at 
 com.drew.imaging.jpeg.JpegMetadataReader.extractMetadataFromJpegSegmentReader(JpegMetadataReader.java:112)
   at 
 com.drew.imaging.jpeg.JpegMetadataReader.readMetadata(JpegMetadataReader.java:71)
   at 
 org.apache.tika.parser.image.ImageMetadataExtractor.parseJpeg(ImageMetadataExtractor.java:91)
   at org.apache.tika.parser.jpeg.JpegParser.parse(JpegParser.java:56)
   [..]
 {code}
 Embedding xmpcore in the tika-bundle solved this issue.
 NOTES:
 * The Apache Stanbol integration tests only covers PDF, JPEG, DOCX. So there 
 might be

[jira] [Resolved] (TIKA-1276) Missing embedded dependencies in tika-bundle

2014-04-24 Thread Hong-Thai Nguyen (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen resolved TIKA-1276.


Resolution: Fixed

Thanks [~rwesten], I applied your patch in r1589717

 Missing embedded dependencies in tika-bundle
 

 Key: TIKA-1276
 URL: https://issues.apache.org/jira/browse/TIKA-1276
 Project: Tika
  Issue Type: Bug
  Components: packaging
Affects Versions: 1.5
 Environment: OSGI, Apache Felix via Apache Sling Launcher
Reporter: Rupert Westenthaler
 Fix For: 1.6

 Attachments: TIKA-1276_20140423_rwesten.diff


 While updating from Tika 1.2 to 1.5 I noticed that the 
 `org.apache.tika:tika-bundle:1.5` module has some missing dependencies.
 1. `com.uwyn:jhighlight:1.0` is not embedded
 Because of that, installing the bundle results in the following exception
 {code}
 org.osgi.framework.BundleException: Unresolved constraint in bundle 
 org.apache.tika.bundle [103]: Unable to resolve 103.0: missing requirement 
 [103.0] osgi.wiring.package; 
 (osgi.wiring.package=com.uwyn.jhighlight.renderer))
 org.osgi.framework.BundleException: Unresolved constraint in bundle 
 org.apache.tika.bundle [103]: Unable to resolve 103.0: missing requirement 
 [103.0] osgi.wiring.package; 
 (osgi.wiring.package=com.uwyn.jhighlight.renderer)
   at 
 org.apache.felix.framework.Felix.resolveBundleRevision(Felix.java:3962)
   at org.apache.felix.framework.Felix.startBundle(Felix.java:2025)
   at org.apache.felix.framework.Felix.setActiveStartLevel(Felix.java:1279)
   at 
 org.apache.felix.framework.FrameworkStartLevelImpl.run(FrameworkStartLevelImpl.java:304)
   at java.lang.Thread.run(Thread.java:744)
 {code}
 2. `org.ow2.asm:asm:4.1` is not embedded because 
 `org.apache.tika:tika-core:1.5` uses `org.ow2.asm-debug-all:asm:4.1` and 
 therefore the `Embed-Dependency` directive `asm` does not match any 
 dependency. 
 Because of that, one gets the following exception (after fixing (1))
 {code}
 org.osgi.framework.BundleException: Unresolved constraint in bundle 
 org.apache.tika.bundle [96]: Unable to resolve 96.0: missing requirement 
 [96.0] osgi.wiring.package; 
 ((osgi.wiring.package=org.objectweb.asm)(version=4.1.0)(!(version=5.0.0
 org.osgi.framework.BundleException: Unresolved constraint in bundle 
 org.apache.tika.bundle [96]: Unable to resolve 96.0: missing requirement 
 [96.0] osgi.wiring.package; 
 ((osgi.wiring.package=org.objectweb.asm)(version=4.1.0)(!(version=5.0.0)))
   at 
 org.apache.felix.framework.Felix.resolveBundleRevision(Felix.java:3962)
   at org.apache.felix.framework.Felix.startBundle(Felix.java:2025)
   at org.apache.felix.framework.Felix.setActiveStartLevel(Felix.java:1279)
   at 
 org.apache.felix.framework.FrameworkStartLevelImpl.run(FrameworkStartLevelImpl.java:304)
   at java.lang.Thread.run(Thread.java:744)
 {code}
 There are two possibilities to fix this: (a) change the `Embed-Dependency` to 
 `asm-debug-all` or (b) add a dependency on `org.ow2.asm:asm:4.1` to the 
 tika-bundle pom file.
 3. `edu.ucar:netcdf:4.2-min` is not embedded
 Because of that, one gets the following exception (after fixing (1) and 
 (2))
 {code}
 org.osgi.framework.BundleException: Unresolved constraint in bundle 
 org.apache.tika.bundle [96]: Unable to resolve 96.0: missing requirement 
 [96.0] osgi.wiring.package; (osgi.wiring.package=ucar.ma2))
 org.osgi.framework.BundleException: Unresolved constraint in bundle 
 org.apache.tika.bundle [96]: Unable to resolve 96.0: missing requirement 
 [96.0] osgi.wiring.package; (osgi.wiring.package=ucar.ma2)
   at 
 org.apache.felix.framework.Felix.resolveBundleRevision(Felix.java:3962)
   at org.apache.felix.framework.Felix.startBundle(Felix.java:2025)
   at org.apache.felix.framework.Felix.setActiveStartLevel(Felix.java:1279)
   at 
 org.apache.felix.framework.FrameworkStartLevelImpl.run(FrameworkStartLevelImpl.java:304)
   at java.lang.Thread.run(Thread.java:744)
 {code}
 4. The `com.adobe.xmp:xmpcore:5.1.2` dependency is required at runtime
 After fixing the above issues, the tika-bundle started successfully. 
 However, when extracting EXIF metadata from a JPEG image I got the following 
 exception.
 {code}
 java.lang.NoClassDefFoundError: com/adobe/xmp/XMPException
   at 
 com.drew.imaging.jpeg.JpegMetadataReader.extractMetadataFromJpegSegmentReader(JpegMetadataReader.java:112)
   at 
 com.drew.imaging.jpeg.JpegMetadataReader.readMetadata(JpegMetadataReader.java:71)
   at 
 org.apache.tika.parser.image.ImageMetadataExtractor.parseJpeg(ImageMetadataExtractor.java:91)
   at org.apache.tika.parser.jpeg.JpegParser.parse(JpegParser.java:56)
   [..]
 {code}
 Embedding xmpcore in the tika-bundle solved this issue.
 NOTES:
 * The Apache Stanbol integration tests

[jira] [Resolved] (TIKA-1279) Missing return lines at output of SourceCodeParser

2014-04-24 Thread Hong-Thai Nguyen (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen resolved TIKA-1279.


Resolution: Fixed

Thanks [~rgauss] for the good catch. I fixed it, with more tests, in r1589742.
Hoping that we can move away from Java 6 soon :)

 Missing return lines at output of SourceCodeParser
 --

 Key: TIKA-1279
 URL: https://issues.apache.org/jira/browse/TIKA-1279
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.5
Reporter: Hong-Thai Nguyen
Assignee: Hong-Thai Nguyen
Priority: Trivial
 Fix For: 1.6


 xhtml output is on a single line.




[jira] [Updated] (TIKA-623) Add support for Outlook PST

2014-04-04 Thread Hong-Thai Nguyen (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen updated TIKA-623:
--

Assignee: (was: Hong-Thai Nguyen)

 Add support for Outlook PST
 ---

 Key: TIKA-623
 URL: https://issues.apache.org/jira/browse/TIKA-623
 Project: Tika
  Issue Type: New Feature
  Components: parser
Reporter: Tran Nam Quang
 Fix For: 1.6

 Attachments: OutlookPSTParser.java


 Hello everyone,
 As you might know, Outlook stores its mails and other stuff in a single PST 
 file. There's a relatively new Java library called java-libpst for reading 
 Outlook PST files. It is licensed under the LGPL and available over here: 
 http://code.google.com/p/java-libpst/
 I have tested the library on Outlook 2000 and Outlook 2003, with good 
 results. It would be great if the library could be integrated into Tika.
 Best regards
 Tran Nam Quang




[jira] [Resolved] (TIKA-1244) Better parsing of Mbox files

2014-03-31 Thread Hong-Thai Nguyen (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen resolved TIKA-1244.


   Resolution: Fixed
Fix Version/s: 1.6

Committed in r1583305, thanks [~lfcnassif]
I preserved the metadata extraction from the current MboxParser because the 
message/rfc822 parser does not seem able to extract all header fields.

 Better parsing of Mbox files
 

 Key: TIKA-1244
 URL: https://issues.apache.org/jira/browse/TIKA-1244
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.5
Reporter: Luis Filipe Nassif
Assignee: Hong-Thai Nguyen
 Fix For: 1.6

 Attachments: MboxParser.java.patch


 MboxParser currently loses the metadata of all emails except the first. It does not 
 extract/parse emails, nor decode parts. It should handle embedded emails like 
 other container parsers do, so emails will be automatically parsed by 
 RFC822Parser. I will try to add a patch for this.
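
The container-style handling described above starts with cutting the mbox into individual messages, which can then each be handed to a mail parser. The sketch below shows only that splitting step and is deliberately simplified: it treats every line starting with "From " as a separator and ignores mbox variants and ">From " quoting. `MboxSplitter` is a hypothetical name, not the attached patch.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: splits mbox text into the raw text of each message. */
final class MboxSplitter {
    /** Splits at lines starting with "From "; the separator line itself is dropped. */
    static List<String> split(String mbox) throws IOException {
        List<String> messages = new ArrayList<>();
        StringBuilder current = null;
        BufferedReader reader = new BufferedReader(new StringReader(mbox));
        String line;
        while ((line = reader.readLine()) != null) {
            if (line.startsWith("From ")) {
                // A new message begins: flush the previous one, if any.
                if (current != null) {
                    messages.add(current.toString());
                }
                current = new StringBuilder();
            } else if (current != null) {
                current.append(line).append('\n');
            }
        }
        if (current != null) {
            messages.add(current.toString());
        }
        return messages;
    }
}
```

Each returned string holds the headers and body of one message, so per-message metadata no longer gets lost after the first entry.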




[jira] [Assigned] (TIKA-1244) Better parsing of Mbox files

2014-03-28 Thread Hong-Thai Nguyen (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen reassigned TIKA-1244:
--

Assignee: Hong-Thai Nguyen

 Better parsing of Mbox files
 

 Key: TIKA-1244
 URL: https://issues.apache.org/jira/browse/TIKA-1244
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.5
Reporter: Luis Filipe Nassif
Assignee: Hong-Thai Nguyen
 Attachments: MboxParser.java.patch


 MboxParser currently loses the metadata of all emails except the first. It does not 
 extract/parse emails, nor decode parts. It should handle embedded emails like 
 other container parsers do, so emails will be automatically parsed by 
 RFC822Parser. I will try to add a patch for this.




[jira] [Commented] (TIKA-1244) Better parsing of Mbox files

2014-03-21 Thread Hong-Thai Nguyen (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13942965#comment-13942965
 ] 

Hong-Thai Nguyen commented on TIKA-1244:


+1 from me too; I had the same intention to redo this parser when working on PST. 
I'll have some time next week and hope to have a look at your patch. Thanks

 Better parsing of Mbox files
 

 Key: TIKA-1244
 URL: https://issues.apache.org/jira/browse/TIKA-1244
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.5
Reporter: Luis Filipe Nassif
 Attachments: MboxParser.java.patch


 MboxParser currently loses the metadata of all emails except the first. It does not 
 extract/parse emails, nor decode parts. It should handle embedded emails like 
 other container parsers do, so emails will be automatically parsed by 
 RFC822Parser. I will try to add a patch for this.




[jira] [Commented] (TIKA-623) Add support for Outlook PST

2014-03-07 Thread Hong-Thai Nguyen (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13923703#comment-13923703
 ] 

Hong-Thai Nguyen commented on TIKA-623:
---

[~lfcnassif], binary attachments are handled with embeddedExtractor. BTW, I agree 
that we can split each mail into a separate unit.
[~talli...@apache.org], we couldn't fix .pst and .msg (msg is already handled 
as part of OfficeParser), so feel free to finish this issue properly as you 
can :)

 Add support for Outlook PST
 ---

 Key: TIKA-623
 URL: https://issues.apache.org/jira/browse/TIKA-623
 Project: Tika
  Issue Type: New Feature
  Components: parser
Reporter: Tran Nam Quang
Assignee: Hong-Thai Nguyen
 Fix For: 1.6

 Attachments: OutlookPSTParser.java


 Hello everyone,
 As you might know, Outlook stores its mails and other stuff in a single PST 
 file. There's a relatively new Java library called java-libpst for reading 
 Outlook PST files. It is licensed under the LGPL and available over here: 
 http://code.google.com/p/java-libpst/
 I have tested the library on Outlook 2000 and Outlook 2003, with good 
 results. It would be great if the library could be integrated into Tika.
 Best regards
 Tran Nam Quang




[jira] [Created] (TIKA-1257) MS Word Filter out control characters on ouput

2014-03-06 Thread Hong-Thai Nguyen (JIRA)

Hong-Thai Nguyen created TIKA-1257:
--

 Summary: MS Word Filter out control characters on ouput
 Key: TIKA-1257
 URL: https://issues.apache.org/jira/browse/TIKA-1257
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.4
Reporter: Hong-Thai Nguyen
 Fix For: 1.6


Control characters are present mostly in the table of index and cannot be 
displayed. We should filter them out.
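
One possible shape for such a filter is sketched below. It is not the change that was committed for this issue; `ControlCharFilter` is a hypothetical name, and keeping tab, line feed and carriage return is an assumption about what should survive.

```java
/** Hypothetical helper: drops control characters from extracted text. */
final class ControlCharFilter {
    /** Removes ISO control characters except tab, line feed and carriage return. */
    static String strip(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '\t' || c == '\n' || c == '\r' || !Character.isISOControl(c)) {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```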




[jira] [Updated] (TIKA-1257) MS Word Filter out control characters on ouput

2014-03-06 Thread Hong-Thai Nguyen (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen updated TIKA-1257:
---

Attachment: tika-doc-control-char.png
5f01ae23-9e6e-4faa-808a-f78dbb20cc71.doc

 MS Word Filter out control characters on ouput
 --

 Key: TIKA-1257
 URL: https://issues.apache.org/jira/browse/TIKA-1257
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.4
Reporter: Hong-Thai Nguyen
 Fix For: 1.6

 Attachments: 5f01ae23-9e6e-4faa-808a-f78dbb20cc71.doc, 
 tika-doc-control-char.png


 Control characters are present mostly in the table of index and cannot be 
 displayed. We should filter them out.




[jira] [Resolved] (TIKA-1257) MS Word Filter out control characters on ouput

2014-03-06 Thread Hong-Thai Nguyen (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen resolved TIKA-1257.


Resolution: Fixed

Fixed on r1574874

 MS Word Filter out control characters on ouput
 --

 Key: TIKA-1257
 URL: https://issues.apache.org/jira/browse/TIKA-1257
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.4
Reporter: Hong-Thai Nguyen
 Fix For: 1.6

 Attachments: 5f01ae23-9e6e-4faa-808a-f78dbb20cc71.doc, 
 tika-doc-control-char.png


 Control characters are present mostly in the table of index and cannot be 
 displayed. We should filter them out.




[jira] [Comment Edited] (TIKA-1257) MS Word Filter out control characters on ouput

2014-03-06 Thread Hong-Thai Nguyen (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13922490#comment-13922490
 ] 

Hong-Thai Nguyen edited comment on TIKA-1257 at 3/6/14 1:50 PM:


Fixed in r1574874 and r1574877


was (Author: thaichat04):
Fixed on r1574874

 MS Word Filter out control characters on ouput
 --

 Key: TIKA-1257
 URL: https://issues.apache.org/jira/browse/TIKA-1257
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.4
Reporter: Hong-Thai Nguyen
 Fix For: 1.6

 Attachments: 5f01ae23-9e6e-4faa-808a-f78dbb20cc71.doc, 
 tika-doc-control-char.png


 Control characters are present mostly in the table of index and cannot be 
 displayed. We should filter them out.




[jira] [Updated] (TIKA-1257) MS Word Filter out control characters on ouput

2014-03-06 Thread Hong-Thai Nguyen (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen updated TIKA-1257:
---

Attachment: (was: 5f01ae23-9e6e-4faa-808a-f78dbb20cc71.doc)

 MS Word Filter out control characters on ouput
 --

 Key: TIKA-1257
 URL: https://issues.apache.org/jira/browse/TIKA-1257
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.4
Reporter: Hong-Thai Nguyen
 Fix For: 1.6

 Attachments: tika-doc-control-char.png


 Control characters are present mostly in the table of index and cannot be 
 displayed. We should filter them out.




[jira] [Updated] (TIKA-1257) MS Word Filter out control characters on ouput

2014-03-06 Thread Hong-Thai Nguyen (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen updated TIKA-1257:
---

Attachment: testControlCharacters.doc

 MS Word Filter out control characters on ouput
 --

 Key: TIKA-1257
 URL: https://issues.apache.org/jira/browse/TIKA-1257
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.4
Reporter: Hong-Thai Nguyen
 Fix For: 1.6

 Attachments: testControlCharacters.doc, tika-doc-control-char.png


 Control characters are present mostly in the table of index and cannot be 
 displayed. We should filter them out.




[jira] [Updated] (TIKA-623) Add support for Outlook PST

2014-03-05 Thread Hong-Thai Nguyen (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen updated TIKA-623:
--

Fix Version/s: 1.6

 Add support for Outlook PST
 ---

 Key: TIKA-623
 URL: https://issues.apache.org/jira/browse/TIKA-623
 Project: Tika
  Issue Type: New Feature
  Components: parser
Reporter: Tran Nam Quang
 Fix For: 1.6

 Attachments: OutlookPSTParser.java


 Hello everyone,
 As you might know, Outlook stores its mails and other stuff in a single PST 
 file. There's a relatively new Java library called java-libpst for reading 
 Outlook PST files. It is licensed under the LGPL and available over here: 
 http://code.google.com/p/java-libpst/
 I have tested the library on Outlook 2000 and Outlook 2003, with good 
 results. It would be great if the library could be integrated into Tika.
 Best regards
 Tran Nam Quang




[jira] [Comment Edited] (TIKA-623) Add support for Outlook PST

2014-03-05 Thread Hong-Thai Nguyen (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13920692#comment-13920692
 ] 

Hong-Thai Nguyen edited comment on TIKA-623 at 3/5/14 9:30 AM:
---

java-libpst-0.7 has been uploaded to the OSS Sonatype Nexus: 
https://issues.sonatype.org/browse/OSSRH-8965
If there's no objection, I'll refactor the attached parser and provide output like:
{code}
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
  <meta name="Content-Length" content="271360" />
  <meta name="isValid" content="true" />
  <meta name="Content-Type" content="application/vnd.ms-outlook" />
  <title></title>
</head>
<body>
  <div class="email-folder">
    <h1>Début du fichier de données Outlook</h1>
    <div class="email-entry">
      <h1>&lt;530d9cac.5080...@gmail.com&gt;</h1>
      <meta subject="Re: Feature Generators" />
      <meta internetMessageId="&lt;530d9cac.5080...@gmail.com&gt;" />
      <meta descriptorNodeId="2097188" />
      <meta lastModificationTime="1393418263291" />
      <meta senderName="Jörn Kottmann" />
      <meta senderEmailAddress="kottm...@gmail.com" />
      <meta recipients="No recipients table!" />
      <p>mail content</p>
    </div>
    <div class="email-folder">
      <h1>Éléments supprimés</h1>
    </div>
  </div>
  <div class="email-folder">
    <h1>Racine (pour la recherche)</h1>
  </div>
  <div class="email-folder">
    <h1>SPAM Search Folder 2</h1>
  </div>
</body>
</html>
{code}



[jira] [Assigned] (TIKA-623) Add support for Outlook PST

2014-03-05 Thread Hong-Thai Nguyen (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen reassigned TIKA-623:
-

Assignee: Hong-Thai Nguyen


[jira] [Resolved] (TIKA-623) Add support for Outlook PST

2014-03-05 Thread Hong-Thai Nguyen (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen resolved TIKA-623.
---

Resolution: Fixed

Committed in r1574411.


[jira] [Resolved] (TIKA-1089) Tika conversion failed on following documents

2014-02-17 Thread Hong-Thai Nguyen (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen resolved TIKA-1089.


   Resolution: Invalid
Fix Version/s: 1.5
 Assignee: Hong-Thai Nguyen

We should create one issue per file, then investigate and resolve them one by one.

 Tika conversion failed on following documents
 -

 Key: TIKA-1089
 URL: https://issues.apache.org/jira/browse/TIKA-1089
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.3
 Environment: windows, api
Reporter: Hong-Thai Nguyen
Assignee: Hong-Thai Nguyen
  Labels: test
 Fix For: 1.5

 Attachments: crawler.log


 We are using Tika as our main converter from diverse file formats to text and 
 HTML in a search engine.
 We've collected 46 documents that Tika cannot convert: 
 http://www.mediafire.com/?60clr812lerx3gy




[jira] [Assigned] (TIKA-1223) Extract thumbnail of OOXML Office files

2014-02-17 Thread Hong-Thai Nguyen (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen reassigned TIKA-1223:
--

Assignee: Hong-Thai Nguyen

 Extract thumbnail of OOXML Office files
 ---

 Key: TIKA-1223
 URL: https://issues.apache.org/jira/browse/TIKA-1223
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.4
Reporter: Hong-Thai Nguyen
Assignee: Hong-Thai Nguyen
Priority: Minor
 Fix For: 1.6

 Attachments: TIKA-1223.patch


 Starting with the Microsoft Office 2007 file formats, a thumbnail can be 
 included in the package. We can extract this embedded thumbnail for OOXML files.
 As discussed on the mailing list, we should extract the thumbnail as an 
 attachment, not as metadata (TIKA-90).
 {noformat}
 embeddedRelationId format is thumbnail_{i}.{extension}.
 {noformat}
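
 The naming scheme above can be illustrated with a small sketch. It is not the code from TIKA-1223.patch: the class and method names are hypothetical, recognizing thumbnail parts by a "docProps/thumbnail." prefix is an assumption (a real parser would follow the package's thumbnail relationship), and starting the index at 0 is also an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ThumbnailNames {

    /**
     * Maps OOXML package part names to attachment names of the form
     * thumbnail_{i}.{extension}, numbering the thumbnails in encounter order.
     */
    static List<String> thumbnailNames(List<String> partNames) {
        List<String> names = new ArrayList<>();
        for (String part : partNames) {
            // Assumption: thumbnail parts are stored as docProps/thumbnail.<extension>.
            if (part.startsWith("docProps/thumbnail.")) {
                String extension = part.substring(part.lastIndexOf('.') + 1);
                names.add("thumbnail_" + names.size() + "." + extension);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        List<String> parts = Arrays.asList(
                "[Content_Types].xml", "word/document.xml", "docProps/thumbnail.jpeg");
        System.out.println(thumbnailNames(parts));
    }
}
```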




[jira] [Resolved] (TIKA-1223) Extract thumbnail of OOXML Office files

2014-02-17 Thread Hong-Thai Nguyen (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen resolved TIKA-1223.


Resolution: Fixed

r1568954


[jira] [Assigned] (TIKA-1223) Extract thumbnail of OOXML Office files

2014-02-17 Thread Hong-Thai Nguyen (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen reassigned TIKA-1223:
--

Assignee: (was: Hong-Thai Nguyen)


[jira] [Resolved] (TIKA-1224) Adding Source code (Java, Groovy, C) parser

2014-02-03 Thread Hong-Thai Nguyen (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen resolved TIKA-1224.


Resolution: Fixed

 Adding Source code (Java, Groovy, C) parser
 ---

 Key: TIKA-1224
 URL: https://issues.apache.org/jira/browse/TIKA-1224
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.5
Reporter: Hong-Thai Nguyen
Priority: Minor

 We can parse some source code file formats:
 text/x-java-source
 text/x-groovy
 text/x-c
 For HTML rendering of the code, we can use jhighlight: 
 http://www.ohloh.net/p/jhighlight




[jira] [Commented] (TIKA-1224) Adding Source code (Java, Groovy, C) parser

2014-02-03 Thread Hong-Thai Nguyen (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13889491#comment-13889491
 ] 

Hong-Thai Nguyen commented on TIKA-1224:


Committed in revision 1563902.


[jira] [Commented] (TIKA-1224) Adding Source code (Java, Groovy, C) parser

2014-01-21 Thread Hong-Thai Nguyen (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13877343#comment-13877343
 ] 

Hong-Thai Nguyen commented on TIKA-1224:


I agree that deeply parsing each language is not simple. This work (already 
done) only provides an HTML rendering of the source and extracts some metadata 
where possible (such as author and version) from Javadoc comments, plus 
possibly other interesting values such as LoC. When we need more detailed 
results for a language, we must implement a dedicated parser.
This parser is useful in search applications.
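
The kind of lightweight metadata extraction described above might look like the following sketch. It is not the committed parser: the class name, the regular expression, and the definition of LoC as the number of non-blank lines (comments included) are assumptions made for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SourceCodeStats {

    // Matches a Javadoc-style "@author Name" tag up to the end of the line.
    private static final Pattern AUTHOR = Pattern.compile("@author\\s+([^\\r\\n*]+)");

    /** Returns the first @author value, or null when no tag is present. */
    static String author(String source) {
        Matcher m = AUTHOR.matcher(source);
        return m.find() ? m.group(1).trim() : null;
    }

    /** Counts non-blank lines (comments included) as a rough LoC figure. */
    static int linesOfCode(String source) {
        int count = 0;
        for (String line : source.split("\\r?\\n")) {
            if (!line.trim().isEmpty()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String source = "/**\n * @author Jane Doe\n */\npublic class A {\n\n}\n";
        System.out.println(author(source) + ", " + linesOfCode(source) + " lines");
    }
}
```

Values like these can be stored as metadata fields, while the highlighted HTML carries the source text itself.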


[jira] [Commented] (TIKA-1215) Regression: Unable to parse a mp3 file on 1.5 which parsed successfully on 1.4

2014-01-14 Thread Hong-Thai Nguyen (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13870573#comment-13870573
 ] 

Hong-Thai Nguyen commented on TIKA-1215:


Great catch. Thanks, [~jukkaz]

 Regression: Unable to parse a mp3 file on 1.5 which parsed successfully on 1.4
 --

 Key: TIKA-1215
 URL: https://issues.apache.org/jira/browse/TIKA-1215
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.5
Reporter: Hong-Thai Nguyen
Priority: Critical
 Attachments: Centres 080805@0650 RTBF Matin Première - A propos des 
 rues de Dublin et Dubreucq.mp3, TIKA-1215-fix-prefix-namespaces.patch, 
 tika-1215-without-wildcard.patch


 With the attached file, 1.5 raises this exception when parsing. The file 
 parses without problems on 1.4.
 {code}
 ...
 Caused by: org.xml.sax.SAXException: Namespace http://www.w3.org/1999/xhtml not declared
   at org.apache.tika.sax.ToXMLContentHandler$ElementInfo.getPrefix(ToXMLContentHandler.java:62)
   at org.apache.tika.sax.ToXMLContentHandler$ElementInfo.getQName(ToXMLContentHandler.java:68)
   at org.apache.tika.sax.ToXMLContentHandler.startElement(ToXMLContentHandler.java:148)
   at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
   at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
   at org.apache.tika.sax.xpath.MatchingContentHandler.startElement(MatchingContentHandler.java:60)
   at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
   at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
   at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
   at org.apache.tika.sax.SafeContentHandler.startElement(SafeContentHandler.java:264)
   at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:254)
   at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:284)
   at org.apache.tika.sax.XHTMLContentHandler.element(XHTMLContentHandler.java:323)
   at org.apache.tika.parser.mp3.Mp3Parser.parse(Mp3Parser.java:107)
   at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
   at com.polyspot.document.converter.DocumentConverter.realizeTikaConversion(DocumentConverter.java:221)
   ... 15 more
 {code}




[jira] [Updated] (TIKA-1215) Regression: Unable to parse a mp3 file on 1.5 which parsed successfully on 1.4

2014-01-13 Thread Hong-Thai Nguyen (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen updated TIKA-1215:
---

Attachment: tika-1215-without-wildcard.patch

[~gagravarr], my code style differs from the Apache convention. I apologize 
for that.
I have attached a new patch file containing only the changes.

Thanks



[jira] [Commented] (TIKA-1215) Regression: Unable to parse a mp3 file on 1.5 which parsed successfully on 1.4

2014-01-13 Thread Hong-Thai Nguyen (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13869590#comment-13869590
 ] 

Hong-Thai Nguyen commented on TIKA-1215:


[~talli...@apache.org], here's the XML input to parse:
{noformat}
<h1 xmlns="http://www.w3.org/1999/xhtml">Matin Première - Tour des régions 080806</h1>
<p>RTBF - La Première</p>
<p>Speech</p>
<p>101698.914</p>
<p>XXX - A propos du contrat de quartier rues Dublin/Dubreucq</p>
{noformat}

I think this regression came from TIKA-1070:
{code}
currentElement = currentElement.parent;
{code}

The parent element of p is null, so getPrefix() raises an exception. This 
differs from the behavior in 1.4.
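
To illustrate the failure mode, here is a minimal sketch of a prefix lookup that walks up the parent chain and tolerates a null parent. It is neither Tika's ToXMLContentHandler code nor the attached patch: the ElementInfo class below is a hypothetical stand-in, and returning null for an undeclared namespace is an assumed design choice.

```java
import java.util.HashMap;
import java.util.Map;

public class NamespaceLookup {

    /** Hypothetical stand-in for a handler's per-element namespace record. */
    static class ElementInfo {
        final ElementInfo parent;
        final Map<String, String> namespaces = new HashMap<>(); // URI -> prefix

        ElementInfo(ElementInfo parent) {
            this.parent = parent;
        }

        /** Walks up the parent chain; returns null when the URI is never declared. */
        String getPrefix(String uri) {
            for (ElementInfo e = this; e != null; e = e.parent) {
                String prefix = e.namespaces.get(uri);
                if (prefix != null) {
                    return prefix;
                }
            }
            return null;
        }
    }

    public static void main(String[] args) {
        String xhtml = "http://www.w3.org/1999/xhtml";
        ElementInfo h1 = new ElementInfo(null);
        h1.namespaces.put(xhtml, "");
        ElementInfo child = new ElementInfo(h1);
        ElementInfo orphan = new ElementInfo(null); // like the p element whose parent is null
        System.out.println("child prefix found: " + (child.getPrefix(xhtml) != null));
        System.out.println("orphan prefix found: " + (orphan.getPrefix(xhtml) != null));
    }
}
```

The caller can then decide whether an undeclared namespace should be an error or should fall back to an unprefixed name.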


[jira] [Commented] (TIKA-90) Allow thumbnails as document metadata

2014-01-09 Thread Hong-Thai Nguyen (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13866498#comment-13866498
 ] 

Hong-Thai Nguyen commented on TIKA-90:
--

Useful for Office Open XML and OpenOffice files, and for some others with 
embedded thumbnails.

 Allow thumbnails as document metadata
 -

 Key: TIKA-90
 URL: https://issues.apache.org/jira/browse/TIKA-90
 Project: Tika
  Issue Type: New Feature
  Components: general
Reporter: Jukka Zitting

 It would be nice if parser components could produce thumbnail images and 
 other non-string metadata when parsing documents.
 To do this, we could either generalize the current Metadata methods, or 
 introduce new methods for handling such non-string metadata.




[jira] [Commented] (TIKA-1216) parse method of Mp3Parser doesn't work for few mp3 files

2014-01-07 Thread Hong-Thai Nguyen (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13864202#comment-13864202
 ] 

Hong-Thai Nguyen commented on TIKA-1216:


I've tested this file with a simple test case. It seems this problem is 
identical to TIKA-1215.

 parse method of Mp3Parser doesn't work for few mp3 files
 

 Key: TIKA-1216
 URL: https://issues.apache.org/jira/browse/TIKA-1216
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.4
 Environment: Windows 7 ultimate 32-bit OS, Java 1.7
Reporter: Sumeet Gorab
Priority: Blocker
  Labels: patch
 Attachments: 05 - Dharti - Sarkaaran [www.DJMaza.Com].mp3


 I tried to parse an MP3 file, but the parse method of the Mp3Parser class is 
 not able to parse it. The parse method does not complete its execution; there 
 is some issue in that method.




[jira] [Updated] (TIKA-1215) Regression: Unable to parse a mp3 file on 1.5 which parsed successfully on 1.4

2014-01-07 Thread Hong-Thai Nguyen (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen updated TIKA-1215:
---

Attachment: TIKA-1215-fix-prefix-namespaces.patch

I made a fix with a test for this issue. Please review and commit it soon. 
Thanks


[jira] [Comment Edited] (TIKA-1216) parse method of Mp3Parser doesn't work for few mp3 files

2014-01-07 Thread Hong-Thai Nguyen (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13864202#comment-13864202
 ] 

Hong-Thai Nguyen edited comment on TIKA-1216 at 1/7/14 3:57 PM:


I've tested this file with a simple test case. It seems this problem is 
identical to TIKA-1215. A patch has been submitted on that issue.
Waiting for review and commit.

Thanks


was (Author: thaichat04):
I've tested this file with a simple test case. It seems this problem is 
identical to TIKA-1215.


[jira] [Commented] (TIKA-1215) Regression: Unable parse a mp3 file on 1.5 which parsed successfully on 1.4

2014-01-02 Thread Hong-Thai Nguyen (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13860246#comment-13860246
 ] 

Hong-Thai Nguyen commented on TIKA-1215:


[~davemeikle], here's a sample test that fails on this file:
{code}
package com.polyspot.document.converter;

import static org.fest.assertions.Assertions.assertThat;

import java.io.ByteArrayOutputStream;
import java.io.InputStream;

import org.apache.commons.io.IOUtils;
import org.apache.tika.config.TikaConfig;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.CompositeParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.apache.tika.sax.ToHTMLContentHandler;
import org.apache.tika.sax.WriteOutContentHandler;
import org.junit.Before;
import org.junit.Test;
import org.xml.sax.ContentHandler;

public class Mp3ParserTest {

  private ConverterConfiguration config;
  private CompositeParser parser;

  @Before
  public void before() throws Exception {
    config = new ConverterConfiguration();
    config.setMimeToConverter("src/test/resources/mimeToConverter.xml");
    config.setSizeLimit(40);
    TikaConfig tikaConf = new TikaConfig(config.getMimeToConverter().trim());
    parser = (CompositeParser) tikaConf.getParser();
  }

  @Test
  public void can_parse_mp3_files() throws Exception {
    ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
    // Always extract HTML by default
    ToHTMLContentHandler toHtmlContentHandler =
        new ToHTMLContentHandler(outputStream, "UTF-8");
    WriteOutContentHandler handler =
        new WriteOutContentHandler(toHtmlContentHandler, 400);
    ContentHandler bodyHandler = new BodyContentHandler(handler);

    InputStream input = getClass().getResourceAsStream("/mp3/test.mp3");
    try {
      // Parsing
      ParseContext context = new ParseContext();
      context.set(Parser.class, parser);
      parser.parse(input, bodyHandler, new Metadata(), context);
    } finally {
      IOUtils.closeQuietly(input);
    }

    String output = outputStream.toString("UTF-8");
    assertThat(output).isNotEmpty(); // fails on 1.5
  }
}
{code}


[jira] [Updated] (TIKA-1215) Regression: Unable to parse a mp3 file on 1.5 which parsed successfully on 1.4

2014-01-02 Thread Hong-Thai Nguyen (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen updated TIKA-1215:
---

Summary: Regression: Unable to parse a mp3 file on 1.5 which parsed 
successfully on 1.4  (was: Regression: Unable parse a mp3 file on 1.5 which 
parsed successfully on 1.4)

 Regression: Unable to parse a mp3 file on 1.5 which parsed successfully on 1.4
 --

 Key: TIKA-1215
 URL: https://issues.apache.org/jira/browse/TIKA-1215
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.5
Reporter: Hong-Thai Nguyen
Priority: Critical
 Attachments: Centres 080805@0650 RTBF Matin Première - A propos des 
 rues de Dublin et Dubreucq.mp3


 With attached file, 1.5 raises this exception on parsing. This file has no 
 problem on 1.4
 {code}
 ...
 Caused by: org.xml.sax.SAXException: Namespace http://www.w3.org/1999/xhtml 
 not declared
   at 
 org.apache.tika.sax.ToXMLContentHandler$ElementInfo.getPrefix(ToXMLContentHandler.java:62)
   at 
 org.apache.tika.sax.ToXMLContentHandler$ElementInfo.getQName(ToXMLContentHandler.java:68)
   at 
 org.apache.tika.sax.ToXMLContentHandler.startElement(ToXMLContentHandler.java:148)
   at 
 org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
   at 
 org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
   at 
 org.apache.tika.sax.xpath.MatchingContentHandler.startElement(MatchingContentHandler.java:60)
   at 
 org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
   at 
 org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
   at 
 org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
   at 
 org.apache.tika.sax.SafeContentHandler.startElement(SafeContentHandler.java:264)
   at 
 org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:254)
   at 
 org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:284)
   at 
 org.apache.tika.sax.XHTMLContentHandler.element(XHTMLContentHandler.java:323)
   at org.apache.tika.parser.mp3.Mp3Parser.parse(Mp3Parser.java:107)
   at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
   at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
   at 
 com.polyspot.document.converter.DocumentConverter.realizeTikaConversion(DocumentConverter.java:221)
   ... 15 more
 {code}




[jira] [Comment Edited] (TIKA-1215) Regression: Unable to parse a mp3 file on 1.5 which parsed successfully on 1.4

2014-01-02 Thread Hong-Thai Nguyen (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13860246#comment-13860246
 ] 

Hong-Thai Nguyen edited comment on TIKA-1215 at 1/2/14 3:11 PM:


[~davemeikle], here's a sample test that fails on this file:
{code}
package com.polyspot.document.converter;

import static org.fest.assertions.Assertions.assertThat;

import java.io.ByteArrayOutputStream;
import java.io.InputStream;

import org.apache.commons.io.IOUtils;
import org.apache.tika.config.TikaConfig;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.CompositeParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.apache.tika.sax.ToHTMLContentHandler;
import org.apache.tika.sax.WriteOutContentHandler;
import org.junit.Before;
import org.junit.Test;
import org.xml.sax.ContentHandler;

public class Mp3ParserTest {

  private ConverterConfiguration config;
  private CompositeParser parser;

  @Before
  public void before() throws Exception {
    config = new ConverterConfiguration();
    config.setMimeToConverter("src/test/resources/mimeToConverter.xml");
    config.setSizeLimit(40);
    TikaConfig tikaConf = new TikaConfig(config.getMimeToConverter().trim());
    parser = (CompositeParser) tikaConf.getParser();
  }

  @Test
  public void can_parse_mp3_files() throws Exception {
    ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
    // Extract always HTML by default
    ToHTMLContentHandler toHtmlContentHandler =
        new ToHTMLContentHandler(outputStream, "UTF-8");
    WriteOutContentHandler handler =
        new WriteOutContentHandler(toHtmlContentHandler, (int) 400);
    ContentHandler bodyHandler = new BodyContentHandler(handler);

    InputStream input = getClass().getResourceAsStream("/mp3/test.mp3");
    try {
      ParseContext context = new ParseContext(); // parsing
      context.set(Parser.class, parser);
      Metadata metadata = new Metadata();
      metadata.add(Metadata.RESOURCE_NAME_KEY, "12345");
      metadata.add(Metadata.CONTENT_TYPE, "audio/mpeg");
      parser.parse(input, bodyHandler, metadata, context);
    } finally {
      IOUtils.closeQuietly(input);
    }

    String output = outputStream.toString("UTF-8");

    assertThat(output).isNotEmpty(); // failed
  }

}
{code}

Here's the stack trace:
{noformat}
org.xml.sax.SAXException: Namespace http://www.w3.org/1999/xhtml not declared
at 
org.apache.tika.sax.ToXMLContentHandler$ElementInfo.getPrefix(ToXMLContentHandler.java:62)
at 
org.apache.tika.sax.ToXMLContentHandler$ElementInfo.getQName(ToXMLContentHandler.java:68)
at 
org.apache.tika.sax.ToXMLContentHandler.startElement(ToXMLContentHandler.java:148)
at 
org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
at 
org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
at 
org.apache.tika.sax.xpath.MatchingContentHandler.startElement(MatchingContentHandler.java:60)
at 
org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
at 
org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
at 
org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
at 
org.apache.tika.sax.SafeContentHandler.startElement(SafeContentHandler.java:264)
at 
org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:254)
at 
org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:284)
at 
org.apache.tika.sax.XHTMLContentHandler.element(XHTMLContentHandler.java:323)
at org.apache.tika.parser.mp3.Mp3Parser.parse(Mp3Parser.java:107)
at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
at 
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
at 
com.polyspot.document.converter.Mp3ParserTest.can_parse_mp3_files(Mp3ParserTest.java:49)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at

[jira] [Comment Edited] (TIKA-1215) Regression: Unable to parse a mp3 file on 1.5 which parsed successfully on 1.4

2014-01-02 Thread Hong-Thai Nguyen (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13860246#comment-13860246
 ] 

Hong-Thai Nguyen edited comment on TIKA-1215 at 1/2/14 3:12 PM:


[~davemeikle], here's a sample test that fails on this file with 1.5-SNAPSHOT 
but passes on 1.4:
{code}
package com.polyspot.document.converter;

import static org.fest.assertions.Assertions.assertThat;

import java.io.ByteArrayOutputStream;
import java.io.InputStream;

import org.apache.commons.io.IOUtils;
import org.apache.tika.config.TikaConfig;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.CompositeParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.apache.tika.sax.ToHTMLContentHandler;
import org.apache.tika.sax.WriteOutContentHandler;
import org.junit.Before;
import org.junit.Test;
import org.xml.sax.ContentHandler;

public class Mp3ParserTest {

  private ConverterConfiguration config;
  private CompositeParser parser;

  @Before
  public void before() throws Exception {
    config = new ConverterConfiguration();
    config.setMimeToConverter("src/test/resources/mimeToConverter.xml");
    config.setSizeLimit(40);
    TikaConfig tikaConf = new TikaConfig(config.getMimeToConverter().trim());
    parser = (CompositeParser) tikaConf.getParser();
  }

  @Test
  public void can_parse_mp3_files() throws Exception {
    ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
    // Extract always HTML by default
    ToHTMLContentHandler toHtmlContentHandler =
        new ToHTMLContentHandler(outputStream, "UTF-8");
    WriteOutContentHandler handler =
        new WriteOutContentHandler(toHtmlContentHandler, (int) 400);
    ContentHandler bodyHandler = new BodyContentHandler(handler);

    InputStream input = getClass().getResourceAsStream("/mp3/test.mp3");
    try {
      ParseContext context = new ParseContext(); // parsing
      context.set(Parser.class, parser);
      Metadata metadata = new Metadata();
      metadata.add(Metadata.RESOURCE_NAME_KEY, "12345");
      metadata.add(Metadata.CONTENT_TYPE, "audio/mpeg");
      parser.parse(input, bodyHandler, metadata, context);
    } finally {
      IOUtils.closeQuietly(input);
    }

    String output = outputStream.toString("UTF-8");

    assertThat(output).isNotEmpty(); // failed
  }

}
{code}

Here's the stack trace:
{noformat}
org.xml.sax.SAXException: Namespace http://www.w3.org/1999/xhtml not declared
at 
org.apache.tika.sax.ToXMLContentHandler$ElementInfo.getPrefix(ToXMLContentHandler.java:62)
at 
org.apache.tika.sax.ToXMLContentHandler$ElementInfo.getQName(ToXMLContentHandler.java:68)
at 
org.apache.tika.sax.ToXMLContentHandler.startElement(ToXMLContentHandler.java:148)
at 
org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
at 
org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
at 
org.apache.tika.sax.xpath.MatchingContentHandler.startElement(MatchingContentHandler.java:60)
at 
org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
at 
org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
at 
org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
at 
org.apache.tika.sax.SafeContentHandler.startElement(SafeContentHandler.java:264)
at 
org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:254)
at 
org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:284)
at 
org.apache.tika.sax.XHTMLContentHandler.element(XHTMLContentHandler.java:323)
at org.apache.tika.parser.mp3.Mp3Parser.parse(Mp3Parser.java:107)
at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
at 
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
at 
com.polyspot.document.converter.Mp3ParserTest.can_parse_mp3_files(Mp3ParserTest.java:49)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at

[jira] [Comment Edited] (TIKA-1215) Regression: Unable to parse a mp3 file on 1.5 which parsed successfully on 1.4

2014-01-02 Thread Hong-Thai Nguyen (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13860246#comment-13860246
 ] 

Hong-Thai Nguyen edited comment on TIKA-1215 at 1/2/14 5:20 PM:


[~davemeikle], here's a sample test that fails on this file with 1.5-SNAPSHOT 
but passes on 1.4:
{code}
package com.polyspot.document.converter;

import static org.fest.assertions.Assertions.assertThat;

import java.io.ByteArrayOutputStream;
import java.io.InputStream;

import org.apache.commons.io.IOUtils;
import org.apache.tika.config.TikaConfig;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.CompositeParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.apache.tika.sax.ToHTMLContentHandler;
import org.apache.tika.sax.WriteOutContentHandler;
import org.junit.Before;
import org.junit.Test;
import org.xml.sax.ContentHandler;

public class Mp3ParserTest {
  private CompositeParser parser;

  @Before
  public void before() throws Exception {
    // Default Tika configuration, no custom mime-to-converter mapping
    TikaConfig tikaConf = new TikaConfig();
    parser = (CompositeParser) tikaConf.getParser();
  }

  @Test
  public void can_parse_mp3_files() throws Exception {
    ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
    // Extract always HTML by default
    ToHTMLContentHandler toHtmlContentHandler =
        new ToHTMLContentHandler(outputStream, "UTF-8");
    WriteOutContentHandler handler =
        new WriteOutContentHandler(toHtmlContentHandler, (int) 400);
    ContentHandler bodyHandler = new BodyContentHandler(handler);

    InputStream input = getClass().getResourceAsStream("/mp3/test.mp3");
    try {
      ParseContext context = new ParseContext(); // parsing
      context.set(Parser.class, parser);
      Metadata metadata = new Metadata();
      metadata.add(Metadata.RESOURCE_NAME_KEY, "12345");
      metadata.add(Metadata.CONTENT_TYPE, "audio/mpeg");
      parser.parse(input, bodyHandler, metadata, context);
    } finally {
      IOUtils.closeQuietly(input);
    }

    String output = outputStream.toString("UTF-8");

    assertThat(output).isNotEmpty(); // failed

    System.out.println(output);
  }
}
{code}

Here's the stack trace:
{noformat}
org.xml.sax.SAXException: Namespace http://www.w3.org/1999/xhtml not declared
at 
org.apache.tika.sax.ToXMLContentHandler$ElementInfo.getPrefix(ToXMLContentHandler.java:62)
at 
org.apache.tika.sax.ToXMLContentHandler$ElementInfo.getQName(ToXMLContentHandler.java:68)
at 
org.apache.tika.sax.ToXMLContentHandler.startElement(ToXMLContentHandler.java:148)
at 
org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
at 
org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
at 
org.apache.tika.sax.xpath.MatchingContentHandler.startElement(MatchingContentHandler.java:60)
at 
org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
at 
org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
at 
org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
at 
org.apache.tika.sax.SafeContentHandler.startElement(SafeContentHandler.java:264)
at 
org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:254)
at 
org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:284)
at 
org.apache.tika.sax.XHTMLContentHandler.element(XHTMLContentHandler.java:323)
at org.apache.tika.parser.mp3.Mp3Parser.parse(Mp3Parser.java:107)
at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
at 
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
at 
com.polyspot.document.converter.Mp3ParserTest.can_parse_mp3_files(Mp3ParserTest.java:49)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at

[jira] [Commented] (TIKA-1152) Process loops infinitely on parsing of a CHM file

2013-12-27 Thread Hong-Thai Nguyen (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13857418#comment-13857418
 ] 

Hong-Thai Nguyen commented on TIKA-1152:


Thanks [~jukkaz], I've checked on trunk and it seems OK now.

 Process loops infinitely on parsing of a CHM file
 -

 Key: TIKA-1152
 URL: https://issues.apache.org/jira/browse/TIKA-1152
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.4
 Environment: Windows/Linux
Reporter: Hong-Thai Nguyen
Assignee: Jukka Zitting
Priority: Critical
 Fix For: 1.5

 Attachments: ChmLzxBlock.java.patch, eventcombmt.chm


 When parsing [the attached CHM file|^eventcombmt.chm] (a Microsoft Help file), 
 the Java process gets stuck.
 {code}
 Thread[main,5,main]
   
 org.apache.tika.parser.chm.lzx.ChmLzxBlock.extractContent(ChmLzxBlock.java:203)
   org.apache.tika.parser.chm.lzx.ChmLzxBlock.init(ChmLzxBlock.java:77)
   
 org.apache.tika.parser.chm.core.ChmExtractor.extractChmEntry(ChmExtractor.java:338)
   
 org.apache.tika.parser.chm.CHMDocumentInformation.getContent(CHMDocumentInformation.java:72)
   
 org.apache.tika.parser.chm.CHMDocumentInformation.getText(CHMDocumentInformation.java:141)
   org.apache.tika.parser.chm.CHM2XHTML.process(CHM2XHTML.java:34)
   org.apache.tika.parser.chm.ChmParser.parse(ChmParser.java:51)
   org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
   org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
   org.apache.tika.parser.AbstractParser.parse(AbstractParser.java:53)
   
 com.polyspot.document.converter.DocumentConverter.realizeConversion(DocumentConverter.java:192)
 ...
 {code}




[jira] [Updated] (TIKA-1215) Regression: Unable parse a mp3 file on 1.5 which parsed successfully on 1.4

2013-12-27 Thread Hong-Thai Nguyen (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen updated TIKA-1215:
---

Attachment: Centres 080805@0650 RTBF Matin Première - A propos des rues de 
Dublin et Dubreucq.mp3

 Regression: Unable parse a mp3 file on 1.5 which parsed successfully on 1.4
 ---

 Key: TIKA-1215
 URL: https://issues.apache.org/jira/browse/TIKA-1215
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.5
Reporter: Hong-Thai Nguyen
Priority: Critical
 Attachments: Centres 080805@0650 RTBF Matin Première - A propos des 
 rues de Dublin et Dubreucq.mp3


 With the attached file, 1.5 raises this exception on parsing. The same file 
 parses without problems on 1.4.
 {code}
 ...
 Caused by: org.xml.sax.SAXException: Namespace http://www.w3.org/1999/xhtml 
 not declared
   at 
 org.apache.tika.sax.ToXMLContentHandler$ElementInfo.getPrefix(ToXMLContentHandler.java:62)
   at 
 org.apache.tika.sax.ToXMLContentHandler$ElementInfo.getQName(ToXMLContentHandler.java:68)
   at 
 org.apache.tika.sax.ToXMLContentHandler.startElement(ToXMLContentHandler.java:148)
   at 
 org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
   at 
 org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
   at 
 org.apache.tika.sax.xpath.MatchingContentHandler.startElement(MatchingContentHandler.java:60)
   at 
 org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
   at 
 org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
   at 
 org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
   at 
 org.apache.tika.sax.SafeContentHandler.startElement(SafeContentHandler.java:264)
   at 
 org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:254)
   at 
 org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:284)
   at 
 org.apache.tika.sax.XHTMLContentHandler.element(XHTMLContentHandler.java:323)
   at org.apache.tika.parser.mp3.Mp3Parser.parse(Mp3Parser.java:107)
   at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
   at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
   at 
 com.polyspot.document.converter.DocumentConverter.realizeTikaConversion(DocumentConverter.java:221)
   ... 15 more
 {code}




[jira] [Created] (TIKA-1215) Regression: Unable parse a mp3 file on 1.5 which parsed successfully on 1.4

2013-12-27 Thread Hong-Thai Nguyen (JIRA)

Hong-Thai Nguyen created TIKA-1215:
--

 Summary: Regression: Unable parse a mp3 file on 1.5 which parsed 
successfully on 1.4
 Key: TIKA-1215
 URL: https://issues.apache.org/jira/browse/TIKA-1215
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.5
Reporter: Hong-Thai Nguyen
Priority: Critical
 Attachments: Centres 080805@0650 RTBF Matin Première - A propos des 
rues de Dublin et Dubreucq.mp3

With the attached file, 1.5 raises this exception on parsing. The same file 
parses without problems on 1.4.
{code}
...
Caused by: org.xml.sax.SAXException: Namespace http://www.w3.org/1999/xhtml not 
declared
at 
org.apache.tika.sax.ToXMLContentHandler$ElementInfo.getPrefix(ToXMLContentHandler.java:62)
at 
org.apache.tika.sax.ToXMLContentHandler$ElementInfo.getQName(ToXMLContentHandler.java:68)
at 
org.apache.tika.sax.ToXMLContentHandler.startElement(ToXMLContentHandler.java:148)
at 
org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
at 
org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
at 
org.apache.tika.sax.xpath.MatchingContentHandler.startElement(MatchingContentHandler.java:60)
at 
org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
at 
org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
at 
org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
at 
org.apache.tika.sax.SafeContentHandler.startElement(SafeContentHandler.java:264)
at 
org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:254)
at 
org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:284)
at 
org.apache.tika.sax.XHTMLContentHandler.element(XHTMLContentHandler.java:323)
at org.apache.tika.parser.mp3.Mp3Parser.parse(Mp3Parser.java:107)
at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
at 
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
at 
com.polyspot.document.converter.DocumentConverter.realizeTikaConversion(DocumentConverter.java:221)
... 15 more
{code}




[jira] [Comment Edited] (TIKA-1215) Regression: Unable parse a mp3 file on 1.5 which parsed successfully on 1.4

2013-12-27 Thread Hong-Thai Nguyen (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13857542#comment-13857542
 ] 

Hong-Thai Nguyen edited comment on TIKA-1215 at 12/27/13 3:59 PM:
--

I built from the latest trunk of git://git.apache.org/tika.git and called it via the Java API


was (Author: thaichat04):
I built on latest trunk of git://git.apache.org/tika.git

 Regression: Unable parse a mp3 file on 1.5 which parsed successfully on 1.4
 ---

 Key: TIKA-1215
 URL: https://issues.apache.org/jira/browse/TIKA-1215
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.5
Reporter: Hong-Thai Nguyen
Priority: Critical
 Attachments: Centres 080805@0650 RTBF Matin Première - A propos des 
 rues de Dublin et Dubreucq.mp3


 With the attached file, 1.5 raises this exception on parsing. The same file 
 parses without problems on 1.4.
 {code}
 ...
 Caused by: org.xml.sax.SAXException: Namespace http://www.w3.org/1999/xhtml 
 not declared
   at 
 org.apache.tika.sax.ToXMLContentHandler$ElementInfo.getPrefix(ToXMLContentHandler.java:62)
   at 
 org.apache.tika.sax.ToXMLContentHandler$ElementInfo.getQName(ToXMLContentHandler.java:68)
   at 
 org.apache.tika.sax.ToXMLContentHandler.startElement(ToXMLContentHandler.java:148)
   at 
 org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
   at 
 org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
   at 
 org.apache.tika.sax.xpath.MatchingContentHandler.startElement(MatchingContentHandler.java:60)
   at 
 org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
   at 
 org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
   at 
 org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
   at 
 org.apache.tika.sax.SafeContentHandler.startElement(SafeContentHandler.java:264)
   at 
 org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:254)
   at 
 org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:284)
   at 
 org.apache.tika.sax.XHTMLContentHandler.element(XHTMLContentHandler.java:323)
   at org.apache.tika.parser.mp3.Mp3Parser.parse(Mp3Parser.java:107)
   at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
   at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
   at 
 com.polyspot.document.converter.DocumentConverter.realizeTikaConversion(DocumentConverter.java:221)
   ... 15 more
 {code}




[jira] [Commented] (TIKA-1152) Process loops infinitely on parsing of a CHM file

2013-12-23 Thread Hong-Thai Nguyen (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13855528#comment-13855528
 ] 

Hong-Thai Nguyen commented on TIKA-1152:


[~gagravarr], or anyone else, could you please have a look at the patch and 
integrate it into trunk before the 1.5 release?
Thanks

 Process loops infinitely on parsing of a CHM file
 -

 Key: TIKA-1152
 URL: https://issues.apache.org/jira/browse/TIKA-1152
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.4
 Environment: Windows/Linux
Reporter: Hong-Thai Nguyen
Priority: Critical
 Fix For: 1.5

 Attachments: ChmLzxBlock.java.patch, eventcombmt.chm


 When parsing [the attached CHM file|^eventcombmt.chm] (a Microsoft Help file), 
 the Java process gets stuck.
 {code}
 Thread[main,5,main]
   
 org.apache.tika.parser.chm.lzx.ChmLzxBlock.extractContent(ChmLzxBlock.java:203)
   org.apache.tika.parser.chm.lzx.ChmLzxBlock.init(ChmLzxBlock.java:77)
   
 org.apache.tika.parser.chm.core.ChmExtractor.extractChmEntry(ChmExtractor.java:338)
   
 org.apache.tika.parser.chm.CHMDocumentInformation.getContent(CHMDocumentInformation.java:72)
   
 org.apache.tika.parser.chm.CHMDocumentInformation.getText(CHMDocumentInformation.java:141)
   org.apache.tika.parser.chm.CHM2XHTML.process(CHM2XHTML.java:34)
   org.apache.tika.parser.chm.ChmParser.parse(ChmParser.java:51)
   org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
   org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
   org.apache.tika.parser.AbstractParser.parse(AbstractParser.java:53)
   
 com.polyspot.document.converter.DocumentConverter.realizeConversion(DocumentConverter.java:192)
 ...
 {code}




[jira] [Commented] (TIKA-1205) Allow PDFParser to fallback to other parser if there is an exception

2013-12-11 Thread Hong-Thai Nguyen (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13845398#comment-13845398
 ] 

Hong-Thai Nguyen commented on TIKA-1205:


Just a (newbie) question: why limit this to PDFParser rather than applying it 
to any parser?
I agree that a fallback is necessary when an exception is thrown. But the worst 
case is an infinite loop while parsing a document.

To cover both cases, could we generalize this into a wrapper that handles 
exceptions and timeouts properly?
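
To make the wrapper idea concrete, here is a minimal sketch using only the JDK. GuardedParse and its run method are hypothetical names, not part of Tika, and the parse work is modelled as a plain Callable rather than a real Parser. It assumes the task can be run on a worker thread and that the fallback should get the same time limit.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper (not a Tika class): guards a parse task with a timeout and a fallback. */
final class GuardedParse {

    private GuardedParse() {}

    /**
     * Runs the primary task on a worker thread. If it throws or exceeds
     * timeoutMillis, the fallback task is run under the same time limit.
     */
    static <T> T run(Callable<T> primary, Callable<T> fallback, long timeoutMillis)
            throws Exception {
        try {
            return runWithTimeout(primary, timeoutMillis);
        } catch (ExecutionException | TimeoutException e) {
            // The primary task failed or took too long: try the fallback once.
            return runWithTimeout(fallback, timeoutMillis);
        }
    }

    private static <T> T runWithTimeout(Callable<T> task, long timeoutMillis)
            throws InterruptedException, ExecutionException, TimeoutException {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "guarded-parse");
            t.setDaemon(true); // a stuck task must not keep the JVM alive
            return t;
        });
        Future<T> future = executor.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Interrupts the worker; this only stops tasks that react to interruption.
            future.cancel(true);
            throw e;
        } finally {
            executor.shutdownNow();
        }
    }
}
```

One caveat: a tight loop that never checks for interruption keeps its worker thread busy even after the timeout fires, so this sketch only unblocks the caller. Reliably stopping such a parse would probably need a separate process.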

 Allow PDFParser to fallback to other parser if there is an exception
 

 Key: TIKA-1205
 URL: https://issues.apache.org/jira/browse/TIKA-1205
 Project: Tika
  Issue Type: Improvement
  Components: parser
Reporter: Tim Allison
Assignee: Tim Allison
Priority: Trivial
 Fix For: 1.5


 With TIKA-1201, there is now an option to use PDFBox's NonSequentialPDFParser 
 instead of the traditional parser for parsing PDF files.  Following the 
 description in PDFBOX-1199, it would be useful to allow fallback to the 
 classic parser if NonSequentialPDFParser throws an IOException.  For the sake 
 of symmetry, I propose a boolean useParserFallbackOnException parameter.  If 
 this parameter is true, and if Tika's PDFParser is using the classic parser, 
 Tika will fall back to the NonSequentialPDFParser if there is an IOException; 
 if this parameter is true and if Tika's PDFParser is using the 
 NonSequentialPDFParser, it will fall back to the classic parser if there is an 
 IOException.
 Many thanks to Hong-Thai for championing the addition of the added 
 NonSequentialPDFParser capability in TIKA-1201, and many thanks to Timo for 
 PDFBox's NonSequentialPDFParser (PDFBOX-1199)!
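
The symmetric fallback proposed in this description could be sketched roughly as follows. PdfLoader and FallbackPdfLoader are hypothetical stand-ins, not PDFBox or Tika types, and useNonSequentialParser is an assumed name for the TIKA-1201 option. Only useParserFallbackOnException comes from the proposal itself.

```java
import java.io.IOException;

/** Hypothetical stand-in for one PDF loading strategy; not a PDFBox or Tika type. */
interface PdfLoader {
    String load(byte[] pdf) throws IOException;
}

/** Sketch of the symmetric fallback: whichever loader is primary, the other one is the fallback. */
final class FallbackPdfLoader implements PdfLoader {
    private final PdfLoader classic;
    private final PdfLoader nonSequential;
    private final boolean useNonSequentialParser;       // assumed name for the TIKA-1201 option
    private final boolean useParserFallbackOnException; // parameter proposed in this issue

    FallbackPdfLoader(PdfLoader classic, PdfLoader nonSequential,
                      boolean useNonSequentialParser,
                      boolean useParserFallbackOnException) {
        this.classic = classic;
        this.nonSequential = nonSequential;
        this.useNonSequentialParser = useNonSequentialParser;
        this.useParserFallbackOnException = useParserFallbackOnException;
    }

    @Override
    public String load(byte[] pdf) throws IOException {
        PdfLoader primary = useNonSequentialParser ? nonSequential : classic;
        PdfLoader secondary = useNonSequentialParser ? classic : nonSequential;
        try {
            return primary.load(pdf);
        } catch (IOException first) {
            if (!useParserFallbackOnException) {
                throw first; // fallback disabled: report the original failure
            }
            try {
                return secondary.load(pdf);
            } catch (IOException second) {
                second.addSuppressed(first); // keep both failures visible to the caller
                throw second;
            }
        }
    }
}
```

In this sketch only an IOException triggers the fallback, as the description proposes. Runtime exceptions from the primary loader propagate unchanged.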



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)
