[jira] [Commented] (TIKA-245) Support of CHM Format
[ https://issues.apache.org/jira/browse/TIKA-245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13889317#comment-13889317 ]

Prashanth Ramaswamy commented on TIKA-245:
--
Nick,

Thanks for your response. Unfortunately, I am constrained from uploading the chm file for which I'm encountering the exception. I may have to see if there are other chm files for which the same exception gets thrown.

Support of CHM Format
-
Key: TIKA-245
URL: https://issues.apache.org/jira/browse/TIKA-245
Project: Tika
Issue Type: New Feature
Components: parser
Environment: All
Reporter: Karl Heinz Marbaise
Assignee: Chris A. Mattmann
Priority: Minor
Fix For: 0.10
Attachments: TIKA-245.oleg.20110806.PATCH, TIKA-245.tikhonov.04082011.patch.txt, TIKA-245.tikhonov.20103107.patch.txt, TIKA-245.tikhonov.20112603.txt, TIKA-245.tikhonov.20112703.txt

It might be a good idea to support the CHM File format of Windows. Some information about http://en.wikipedia.org/wiki/Microsoft_Compiled_HTML_Help#Extracting_to_HTML. The CHM format contains HTML files which can be parsed by Tika. So the only problem is to extract the data from the CHM file.

--
This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (TIKA-1227) Apache Tika 1.4 Duplicate extract data
[ https://issues.apache.org/jira/browse/TIKA-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

vivek joshi updated TIKA-1227:
--
Attachment: tt1.doc

File for which the duplicated text is coming. Duplicate text from the heading DEFINITIONS.

Apache Tika 1.4 Duplicate extract data
--
Key: TIKA-1227
URL: https://issues.apache.org/jira/browse/TIKA-1227
Project: Tika
Issue Type: Bug
Components: general
Affects Versions: 1.4
Environment: Ubuntu12.04, Python 2.7, Apache Tika 1.4
Reporter: vivek joshi
Labels: python, tika, text-extraction, ubuntu
Attachments: tt1.doc

When extracting text using Apache Tika 1.4, the text is getting duplicated.

APACHE_TIKA_PATH = os.path.abspath(os.path.join(PROJECT_ROOT, 'apache_tika/tika-app-1.4.jar'))
sout = subprocess.check_output('java -jar %s -t %s' % (APACHE_TIKA_PATH, document), shell=True)

sout contains duplicate text. The issue occurs for both DOC and PDF files.
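The report builds the command with % formatting and runs it through the shell. A minimal sketch of the same call with the command passed as an argument list is shown here, so that quoting of the jar path and document path cannot affect what Tika receives. The helper names `build_tika_command` and `extract_text` are hypothetical; only `java -jar`, the `-t` flag, and `subprocess.check_output` come from the report. This removes one source of ambiguity and is not a diagnosis of the duplication.

```python
import subprocess


def build_tika_command(jar_path, document):
    """Return the argv list for one tika-app text extraction run."""
    # -t asks tika-app for plain text output, as in the original report.
    return ["java", "-jar", jar_path, "-t", document]


def extract_text(jar_path, document):
    """Run tika-app once and return whatever it wrote to stdout."""
    # A list with no shell=True means no shell quoting is involved,
    # so paths with spaces or special characters are passed through intact.
    return subprocess.check_output(build_tika_command(jar_path, document))
```

Comparing the result of `extract_text(...)` with the output of the same command typed at a terminal is one way to tell whether the duplication comes from Tika or from the surrounding script.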
[jira] [Commented] (TIKA-1227) Apache Tika 1.4 Duplicate extract data
[ https://issues.apache.org/jira/browse/TIKA-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13889387#comment-13889387 ]

Nick Burch commented on TIKA-1227:
--
I've just tried running tika-app directly on the command line, against your file, and I don't see any duplication of DEFINITIONS:

$ java -jar tika-app-1.5-SNAPSHOT.jar --text /tmp/tt1.doc | grep DEFIN
DEFINITIONS
$

I can only suggest you try running the Tika app manually from the command line yourself to check the issue, then investigate your Python code once you're happy with Tika itself.
[jira] [Commented] (TIKA-1227) Apache Tika 1.4 Duplicate extract data
[ https://issues.apache.org/jira/browse/TIKA-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13889395#comment-13889395 ]

vivek joshi commented on TIKA-1227:
---
Thanks Nick Burch,

I tried it on the command line and it runs well, but if I try it from the Python script it gives duplicate text. Please suggest.
[jira] [Closed] (TIKA-1227) Apache Tika 1.4 Duplicate extract data
[ https://issues.apache.org/jira/browse/TIKA-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

vivek joshi closed TIKA-1227.
-
Resolution: Invalid
Fix Version/s: 1.4
[jira] [Resolved] (TIKA-1224) Adding Source code (Java, Groovy, C) parser
[ https://issues.apache.org/jira/browse/TIKA-1224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hong-Thai Nguyen resolved TIKA-1224.
Resolution: Fixed

Adding Source code (Java, Groovy, C) parser
---
Key: TIKA-1224
URL: https://issues.apache.org/jira/browse/TIKA-1224
Project: Tika
Issue Type: Improvement
Components: parser
Affects Versions: 1.5
Reporter: Hong-Thai Nguyen
Priority: Minor

We can parse some source code file formats:
text/x-java-source
text/x-groovy
text/x-c

For HTML rendering of the code, we can use jhighlight: http://www.ohloh.net/p/jhighlight
[jira] [Commented] (TIKA-1224) Adding Source code (Java, Groovy, C) parser
[ https://issues.apache.org/jira/browse/TIKA-1224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13889491#comment-13889491 ]

Hong-Thai Nguyen commented on TIKA-1224:
Committed in r1563902.
[jira] [Created] (TIKA-1228) Embedded files not extracted properly from PDF
Jason Sherman created TIKA-1228:
---
Summary: Embedded files not extracted properly from PDF
Key: TIKA-1228
URL: https://issues.apache.org/jira/browse/TIKA-1228
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 1.4
Environment: CentOS 6.5 VM
Reporter: Jason Sherman

IAW pdfbox example here:
http://svn.apache.org/repos/asf/pdfbox/trunk/examples/src/main/java/org/apache/pdfbox/examples/pdmodel/ExtractEmbeddedFiles.java
the PDF parser does not check for additional entries under Kids node when Names node does not exist.
[jira] [Commented] (TIKA-1228) Embedded files not extracted properly from PDF
[ https://issues.apache.org/jira/browse/TIKA-1228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13889697#comment-13889697 ]

Tim Allison commented on TIKA-1228:
---
I won't have time to fix this for a week or so, but it looks like the client (Tika) needs to look through the kids of embeddedFiles recursively (well, in this file, just one level down) to get the non-null embeddedFileNames. Something like this does pull out the .doc file:

{no-format}
Map<String, COSObjectable> embeddedFileNames = embeddedFiles.getNames();
List<PDNameTreeNode> kids = embeddedFiles.getKids();
for (PDNameTreeNode n : kids){
    Map<String, COSObjectable> kidFileNames = n.getNames();
    processEmbedded(kidFileNames, embeddedExtractor);
}
{no-format}

where processEmbedded is shorthand for the existing code:

{no-format}
if (embeddedFileNames != null){
    ...
}
{no-format}

We can fix this at the Tika level in the short term. I'm not sure if this is the expected behavior in PDFBox. At the least we might want to request that this line in the javadoc to PDDocumentNameDictionary:

(The value in this name tree will be PDComplexFileSpecification objects.)

be changed to

(The value in this name tree or its children will be PDComplexFileSpecification objects.)

Embedded files not extracted properly from PDF
--
Key: TIKA-1228
URL: https://issues.apache.org/jira/browse/TIKA-1228
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 1.4
Environment: CentOS 6.5 VM
Reporter: Jason Sherman
Labels: easyfix
Attachments: pdf_with_doc_and_text_attached.pdf

IAW pdfbox example here:
http://svn.apache.org/repos/asf/pdfbox/trunk/examples/src/main/java/org/apache/pdfbox/examples/pdmodel/ExtractEmbeddedFiles.java
the PDF parser does not check for additional entries under Kids node when Names node does not exist.
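The one-level lookup described in the comment above can be modelled without PDFBox. The sketch below is a toy model in Python, not the PDFBox API: `NameTreeNode` and `collect_embedded` are hypothetical names, and the file names and spec values are invented sample data. It only illustrates the shape of the fix, which is to read the names on the root node (if any) and then the names on each direct child.

```python
class NameTreeNode(object):
    """Toy stand-in for a PDF name tree node (not the PDFBox class)."""

    def __init__(self, names=None, kids=None):
        self.names = names  # dict of name -> file spec, or None
        self.kids = kids    # list of NameTreeNode, or None


def collect_embedded(root):
    """Gather names from the root and from its direct children only."""
    found = {}
    if root.names is not None:
        found.update(root.names)
    for kid in root.kids or []:
        # A child without a Names entry is skipped, like the null check
        # in the existing code.
        if kid.names is not None:
            found.update(kid.names)
    return found


# A tree shaped like the reported file: no Names on the root,
# entries only under Kids.
root = NameTreeNode(kids=[NameTreeNode(names={"attached.doc": "spec-1"}),
                          NameTreeNode(names={"notes.txt": "spec-2"})])
print(sorted(collect_embedded(root)))  # ['attached.doc', 'notes.txt']
```

Entries nested more than one level below the root would be missed by this version, which matches the "just one level down" caveat in the comment.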
[jira] [Comment Edited] (TIKA-1228) Embedded files not extracted properly from PDF
[ https://issues.apache.org/jira/browse/TIKA-1228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13889697#comment-13889697 ]

Tim Allison edited comment on TIKA-1228 at 2/3/14 6:09 PM:
---
I won't have time to fix this for a week or so, but it looks like the client (Tika) needs to look through the kids of embeddedFiles recursively (well, in this file, just one level down) to get the non-null embeddedFileNames. Something like this does pull out the .doc file:

{noformat}
Map<String, COSObjectable> embeddedFileNames = embeddedFiles.getNames();
List<PDNameTreeNode> kids = embeddedFiles.getKids();
for (PDNameTreeNode n : kids){
    Map<String, COSObjectable> kidFileNames = n.getNames();
    processEmbedded(kidFileNames, embeddedExtractor);
}
{noformat}

where processEmbedded is shorthand for the existing code:

{noformat}
if (embeddedFileNames != null){
    ...
}
{noformat}

We can fix this at the Tika level in the short term. I'm not sure if this is the expected behavior in PDFBox. At the least we might want to request that this line in the javadoc to PDDocumentNameDictionary:

(The value in this name tree will be PDComplexFileSpecification objects.)

be changed to

(The value in this name tree or its children will be PDComplexFileSpecification objects.)
[jira] [Comment Edited] (TIKA-1228) Embedded files not extracted properly from PDF
[ https://issues.apache.org/jira/browse/TIKA-1228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13889697#comment-13889697 ]

Tim Allison edited comment on TIKA-1228 at 2/3/14 6:11 PM:
---
I won't have time to fix this for a week or so, but I'll take this unless another committer has time sooner.
[jira] [Resolved] (TIKA-1228) Embedded files not extracted properly from PDF
[ https://issues.apache.org/jira/browse/TIKA-1228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison resolved TIKA-1228.
---
Resolution: Fixed
Fix Version/s: 1.5

Fixed in r1564042.

Thank you, [~agi20dla], for reporting this and diagnosing the cause and solution for this bug!

I'm resolving this for now. I'm waiting to hear back from users@pdfbox to see if we should search recursively for non-null attachment data. The example that you provided does show only checking the children. I'll reopen this issue if we need to switch to full recursion.

Thank you, again.
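The resolution above leaves open whether a full recursive search is needed. As a rough illustration of what "full recursion" would mean, here is a toy depth-first walk over a name tree written as nested Python dicts. The `Names` and `Kids` keys mirror the PDF entry names, but the function name `collect_all_names` and the sample data are hypothetical, and none of this is PDFBox or Tika code.

```python
def collect_all_names(node):
    """Depth-first walk over a name tree given as nested dicts."""
    # Start with the names on this node, if it has any.
    found = dict(node.get("Names") or {})
    # Then descend into every child, however deep the tree goes.
    for kid in node.get("Kids") or []:
        found.update(collect_all_names(kid))
    return found


# One entry sits two levels below the root, one sits directly under it.
tree = {"Kids": [{"Kids": [{"Names": {"deep.doc": "spec-1"}}]},
                 {"Names": {"shallow.txt": "spec-2"}}]}
print(sorted(collect_all_names(tree)))  # ['deep.doc', 'shallow.txt']
```

A children-only check would find `shallow.txt` here but not `deep.doc`, which is the difference the question to users@pdfbox is meant to settle.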
[jira] [Issue Comment Deleted] (TIKA-1228) Embedded files not extracted properly from PDF
[ https://issues.apache.org/jira/browse/TIKA-1228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison updated TIKA-1228:
--
Comment: was deleted

(was: I won't have time to fix this for a week or so, but, I'll take this unless another committer has time sooner.)
[jira] [Commented] (TIKA-1228) Embedded files not extracted properly from PDF
[ https://issues.apache.org/jira/browse/TIKA-1228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13889889#comment-13889889 ]

Jason Sherman commented on TIKA-1228:
-
Thanks for the help.

Another possibly related issue: when I was stepping through the pdfbox code, line 286 throws an exception when running, but processes properly in my evaluation dialog (IntelliJ 13):

namesArray = (COSArray)((COSDictionary)((COSArray)node.getDictionaryObject(COSName.KIDS)).get(0)).getDictionaryObject(COSName.NAMES);

Throws: org.apache.pdfbox.cos.COSObject cannot be cast to org.apache.pdfbox.cos.COSDictionary

Do you want to pass that on to the pdfbox folks, or should I report it separately?
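The exception message says the first Kids entry is a COSObject rather than a COSDictionary. In PDFBox, COSObject is the wrapper used for indirect references, so one possible reading (an assumption, not confirmed anywhere in this thread) is that the entry has to be dereferenced before it is cast. The toy Python model below illustrates that idea only. `IndirectRef` and `resolve` are hypothetical names and the data is invented. On the Java side, `COSArray.getObject(int)` is, as far as I know, the dereferencing counterpart of `get(int)`, which is worth checking against the PDFBox version in use.

```python
class IndirectRef(object):
    """Toy stand-in for an indirect-object wrapper (not PDFBox's COSObject)."""

    def __init__(self, target):
        self.target = target


def resolve(value):
    """Follow indirect references until a direct value is reached."""
    while isinstance(value, IndirectRef):
        value = value.target
    return value


# The first Kids entry is stored as a reference, not as the dictionary itself.
kids = [IndirectRef({"Names": ["a.doc", "spec-1"]})]

# Indexing kids[0] directly would hand back the wrapper; resolving it first
# gives the dictionary that the Names lookup expects.
first_kid = resolve(kids[0])
print(first_kid["Names"])  # ['a.doc', 'spec-1']
```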