[jira] [Commented] (TIKA-245) Support of CHM Format

2014-02-03 Thread Prashanth Ramaswamy (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13889317#comment-13889317
 ] 

Prashanth Ramaswamy commented on TIKA-245:
--

Nick, thanks for your response.  Unfortunately, I am not able to upload the 
CHM file for which I'm encountering the exception.  I will check whether 
other CHM files throw the same exception.

 Support of CHM Format
 -

 Key: TIKA-245
 URL: https://issues.apache.org/jira/browse/TIKA-245
 Project: Tika
  Issue Type: New Feature
  Components: parser
 Environment: All
Reporter: Karl Heinz Marbaise
Assignee: Chris A. Mattmann
Priority: Minor
 Fix For: 0.10

 Attachments: TIKA-245.oleg.20110806.PATCH, 
 TIKA-245.tikhonov.04082011.patch.txt, TIKA-245.tikhonov.20103107.patch.txt, 
 TIKA-245.tikhonov.20112603.txt, TIKA-245.tikhonov.20112703.txt


 It might be a good idea to support the Windows CHM file format. Some 
 information about the format is available at 
 http://en.wikipedia.org/wiki/Microsoft_Compiled_HTML_Help#Extracting_to_HTML. 
 A CHM file contains HTML files, which Tika can already parse, so the only 
 problem is extracting the data from the CHM file.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (TIKA-1227) Apache Tika 1.4 Duplicate extract data

2014-02-03 Thread vivek joshi (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

vivek joshi updated TIKA-1227:
--

Attachment: tt1.doc

File for which the duplicated text appears. 

The duplicated text starts at the heading DEFINITIONS. 

 Apache Tika 1.4 Duplicate extract data
 --

 Key: TIKA-1227
 URL: https://issues.apache.org/jira/browse/TIKA-1227
 Project: Tika
  Issue Type: Bug
  Components: general
Affects Versions: 1.4
 Environment: Ubuntu12.04, Python 2.7, Apache Tika 1.4
Reporter: vivek joshi
  Labels: python, tika,text-extraction, ubuntu
 Attachments: tt1.doc


 When extracting text using Apache Tika 1.4, the text is duplicated.
 APACHE_TIKA_PATH = os.path.abspath(os.path.join(PROJECT_ROOT, 
 'apache_tika/tika-app-1.4.jar'))
 sout = subprocess.check_output("java -jar %s -t %s" % (APACHE_TIKA_PATH, 
 document), shell=True)
 sout contains duplicate text.
 The issue occurs for both DOC and PDF files.





[jira] [Commented] (TIKA-1227) Apache Tika 1.4 Duplicate extract data

2014-02-03 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13889387#comment-13889387
 ] 

Nick Burch commented on TIKA-1227:
--

I've just tried running tika-app directly on the command line against your 
file, and I don't see any duplication of DEFINITIONS:

$ java -jar tika-app-1.5-SNAPSHOT.jar --text /tmp/tt1.doc | grep DEFIN
DEFINITIONS
$

I can only suggest that you run the Tika app manually from the command line 
yourself to check the issue, then investigate your Python code once you're 
happy with Tika itself.

 Apache Tika 1.4 Duplicate extract data
 --

 Key: TIKA-1227
 URL: https://issues.apache.org/jira/browse/TIKA-1227
 Project: Tika
  Issue Type: Bug
  Components: general
Affects Versions: 1.4
 Environment: Ubuntu12.04, Python 2.7, Apache Tika 1.4
Reporter: vivek joshi
  Labels: python, tika,text-extraction, ubuntu
 Attachments: tt1.doc


 When extracting text using Apache Tika 1.4, the text is duplicated.
 APACHE_TIKA_PATH = os.path.abspath(os.path.join(PROJECT_ROOT, 
 'apache_tika/tika-app-1.4.jar'))
 sout = subprocess.check_output("java -jar %s -t %s" % (APACHE_TIKA_PATH, 
 document), shell=True)
 sout contains duplicate text.
 The issue occurs for both DOC and PDF files.





[jira] [Commented] (TIKA-1227) Apache Tika 1.4 Duplicate extract data

2014-02-03 Thread vivek joshi (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13889395#comment-13889395
 ] 

vivek joshi commented on TIKA-1227:
---

Thanks, Nick Burch.

I tried it on the command line and it runs well, but if I try it from the 
Python script, it gives duplicate text.

Please suggest.


 Apache Tika 1.4 Duplicate extract data
 --

 Key: TIKA-1227
 URL: https://issues.apache.org/jira/browse/TIKA-1227
 Project: Tika
  Issue Type: Bug
  Components: general
Affects Versions: 1.4
 Environment: Ubuntu12.04, Python 2.7, Apache Tika 1.4
Reporter: vivek joshi
  Labels: python, tika,text-extraction, ubuntu
 Attachments: tt1.doc


 When extracting text using Apache Tika 1.4, the text is duplicated.
 APACHE_TIKA_PATH = os.path.abspath(os.path.join(PROJECT_ROOT, 
 'apache_tika/tika-app-1.4.jar'))
 sout = subprocess.check_output("java -jar %s -t %s" % (APACHE_TIKA_PATH, 
 document), shell=True)
 sout contains duplicate text.
 The issue occurs for both DOC and PDF files.





[jira] [Closed] (TIKA-1227) Apache Tika 1.4 Duplicate extract data

2014-02-03 Thread vivek joshi (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

vivek joshi closed TIKA-1227.
-

   Resolution: Invalid
Fix Version/s: 1.4

 Apache Tika 1.4 Duplicate extract data
 --

 Key: TIKA-1227
 URL: https://issues.apache.org/jira/browse/TIKA-1227
 Project: Tika
  Issue Type: Bug
  Components: general
Affects Versions: 1.4
 Environment: Ubuntu12.04, Python 2.7, Apache Tika 1.4
Reporter: vivek joshi
  Labels: python, tika,text-extraction, ubuntu
 Fix For: 1.4

 Attachments: tt1.doc


 When extracting text using Apache Tika 1.4, the text is duplicated.
 APACHE_TIKA_PATH = os.path.abspath(os.path.join(PROJECT_ROOT, 
 'apache_tika/tika-app-1.4.jar'))
 sout = subprocess.check_output("java -jar %s -t %s" % (APACHE_TIKA_PATH, 
 document), shell=True)
 sout contains duplicate text.
 The issue occurs for both DOC and PDF files.





[jira] [Resolved] (TIKA-1224) Adding Source code (Java, Groovy, C) parser

2014-02-03 Thread Hong-Thai Nguyen (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen resolved TIKA-1224.


Resolution: Fixed

 Adding Source code (Java, Groovy, C) parser
 ---

 Key: TIKA-1224
 URL: https://issues.apache.org/jira/browse/TIKA-1224
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.5
Reporter: Hong-Thai Nguyen
Priority: Minor

 We can parse some source code file formats:
 text/x-java-source
 text/x-groovy
 text/x-c
 For HTML rendering of the code, we can use jhighlight: 
 http://www.ohloh.net/p/jhighlight





[jira] [Commented] (TIKA-1224) Adding Source code (Java, Groovy, C) parser

2014-02-03 Thread Hong-Thai Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13889491#comment-13889491
 ] 

Hong-Thai Nguyen commented on TIKA-1224:


Committed in r1563902.

 Adding Source code (Java, Groovy, C) parser
 ---

 Key: TIKA-1224
 URL: https://issues.apache.org/jira/browse/TIKA-1224
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.5
Reporter: Hong-Thai Nguyen
Priority: Minor

 We can parse some source code file formats:
 text/x-java-source
 text/x-groovy
 text/x-c
 For HTML rendering of the code, we can use jhighlight: 
 http://www.ohloh.net/p/jhighlight





[jira] [Created] (TIKA-1228) Embedded files not extracted properly from PDF

2014-02-03 Thread Jason Sherman (JIRA)
Jason Sherman created TIKA-1228:
---

 Summary: Embedded files not extracted properly from PDF
 Key: TIKA-1228
 URL: https://issues.apache.org/jira/browse/TIKA-1228
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.4
 Environment: CentOS 6.5 VM
Reporter: Jason Sherman


IAW pdfbox example here:

http://svn.apache.org/repos/asf/pdfbox/trunk/examples/src/main/java/org/apache/pdfbox/examples/pdmodel/ExtractEmbeddedFiles.java

the PDF parser does not check for additional entries under Kids node when Names 
node does not exist.






[jira] [Commented] (TIKA-1228) Embedded files not extracted properly from PDF

2014-02-03 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13889697#comment-13889697
 ] 

Tim Allison commented on TIKA-1228:
---

I won't have time to fix this for a week or so, but it looks like the client 
(Tika) needs to look through the kids of embeddedFiles recursively (well, in 
this file, just one level down) to get the non-null embeddedFileNames.

Something like this does pull out the .doc file:

{no-format}
Map<String, COSObjectable> embeddedFileNames = embeddedFiles.getNames();
List<PDNameTreeNode> kids = embeddedFiles.getKids();
for (PDNameTreeNode n : kids){
    Map<String, COSObjectable> kidNames = n.getNames();
    processEmbedded(kidNames, embeddedExtractor);
}
{no-format}

where processEmbedded is shorthand for the existing code:
{no-format}
if (embeddedFileNames != null){
...
}
{no-format}

We can fix this at the Tika level in the short term.  I'm not sure if this is 
the expected behavior in PDFBox.  At the least we might want to request that 
this line in the javadoc for PDDocumentNameDictionary, "The value in this 
name tree will be PDComplexFileSpecification objects.", be changed to "The 
value in this name tree or its children will be PDComplexFileSpecification 
objects."

 Embedded files not extracted properly from PDF
 --

 Key: TIKA-1228
 URL: https://issues.apache.org/jira/browse/TIKA-1228
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.4
 Environment: CentOS 6.5 VM
Reporter: Jason Sherman
  Labels: easyfix
 Attachments: pdf_with_doc_and_text_attached.pdf


 IAW pdfbox example here:
 http://svn.apache.org/repos/asf/pdfbox/trunk/examples/src/main/java/org/apache/pdfbox/examples/pdmodel/ExtractEmbeddedFiles.java
 the PDF parser does not check for additional entries under Kids node when 
 Names node does not exist.





[jira] [Comment Edited] (TIKA-1228) Embedded files not extracted properly from PDF

2014-02-03 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13889697#comment-13889697
 ] 

Tim Allison edited comment on TIKA-1228 at 2/3/14 6:09 PM:
---

I won't have time to fix this for a week or so, but it looks like the client 
(Tika) needs to look through the kids of embeddedFiles recursively (well, in 
this file, just one level down) to get the non-null embeddedFileNames.

Something like this does pull out the .doc file:

{noformat}
Map<String, COSObjectable> embeddedFileNames = embeddedFiles.getNames();
List<PDNameTreeNode> kids = embeddedFiles.getKids();
for (PDNameTreeNode n : kids){
    Map<String, COSObjectable> kidNames = n.getNames();
    processEmbedded(kidNames, embeddedExtractor);
}
{noformat}

where processEmbedded is shorthand for the existing code:
{noformat}
if (embeddedFileNames != null){
...
}
{noformat}

We can fix this at the Tika level in the short term.  I'm not sure if this is 
the expected behavior in PDFBox.  At the least we might want to request that 
this line in the javadoc for PDDocumentNameDictionary, "The value in this 
name tree will be PDComplexFileSpecification objects.", be changed to "The 
value in this name tree or its children will be PDComplexFileSpecification 
objects."


was (Author: talli...@mitre.org):
I won't have time to fix this for a week or so, but it looks like the client 
(Tika) needs to look through the kids of embeddedFiles recursively (well, in 
this file, just one level down) to get the non-null embeddedFileNames.

Something like this does pull out the .doc file:

{no-format}
Map<String, COSObjectable> embeddedFileNames = embeddedFiles.getNames();
List<PDNameTreeNode> kids = embeddedFiles.getKids();
for (PDNameTreeNode n : kids){
    Map<String, COSObjectable> kidNames = n.getNames();
    processEmbedded(kidNames, embeddedExtractor);
}
{no-format}

where processEmbedded is shorthand for the existing code:
{no-format}
if (embeddedFileNames != null){
...
}
{no-format}

We can fix this at the Tika level in the short term.  I'm not sure if this is 
the expected behavior in PDFBox.  At the least we might want to request that 
this line in the javadoc for PDDocumentNameDictionary, "The value in this 
name tree will be PDComplexFileSpecification objects.", be changed to "The 
value in this name tree or its children will be PDComplexFileSpecification 
objects."

 Embedded files not extracted properly from PDF
 --

 Key: TIKA-1228
 URL: https://issues.apache.org/jira/browse/TIKA-1228
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.4
 Environment: CentOS 6.5 VM
Reporter: Jason Sherman
  Labels: easyfix
 Attachments: pdf_with_doc_and_text_attached.pdf


 IAW pdfbox example here:
 http://svn.apache.org/repos/asf/pdfbox/trunk/examples/src/main/java/org/apache/pdfbox/examples/pdmodel/ExtractEmbeddedFiles.java
 the PDF parser does not check for additional entries under Kids node when 
 Names node does not exist.





[jira] [Comment Edited] (TIKA-1228) Embedded files not extracted properly from PDF

2014-02-03 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13889697#comment-13889697
 ] 

Tim Allison edited comment on TIKA-1228 at 2/3/14 6:11 PM:
---

I won't have time to fix this for a week or so, but I'll take this unless 
another committer has time sooner.


was (Author: talli...@mitre.org):
I won't have time to fix this for a week or so, but it looks like the client 
(Tika) needs to look through the kids of embeddedFiles recursively (well, in 
this file, just one level down) to get the non-null embeddedFileNames.

Something like this does pull out the .doc file:

{noformat}
Map<String, COSObjectable> embeddedFileNames = embeddedFiles.getNames();
List<PDNameTreeNode> kids = embeddedFiles.getKids();
for (PDNameTreeNode n : kids){
    Map<String, COSObjectable> kidNames = n.getNames();
    processEmbedded(kidNames, embeddedExtractor);
}
{noformat}

where processEmbedded is shorthand for the existing code:
{noformat}
if (embeddedFileNames != null){
...
}
{noformat}

We can fix this at the Tika level in the short term.  I'm not sure if this is 
the expected behavior in PDFBox.  At the least we might want to request that 
this line in the javadoc for PDDocumentNameDictionary, "The value in this 
name tree will be PDComplexFileSpecification objects.", be changed to "The 
value in this name tree or its children will be PDComplexFileSpecification 
objects."

 Embedded files not extracted properly from PDF
 --

 Key: TIKA-1228
 URL: https://issues.apache.org/jira/browse/TIKA-1228
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.4
 Environment: CentOS 6.5 VM
Reporter: Jason Sherman
  Labels: easyfix
 Attachments: pdf_with_doc_and_text_attached.pdf


 IAW pdfbox example here:
 http://svn.apache.org/repos/asf/pdfbox/trunk/examples/src/main/java/org/apache/pdfbox/examples/pdmodel/ExtractEmbeddedFiles.java
 the PDF parser does not check for additional entries under Kids node when 
 Names node does not exist.





[jira] [Resolved] (TIKA-1228) Embedded files not extracted properly from PDF

2014-02-03 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-1228.
---

   Resolution: Fixed
Fix Version/s: 1.5

Fixed in r1564042.

Thank you, [~agi20dla], for reporting this and diagnosing the cause and 
solution for this bug!

I'm resolving this for now.  I'm waiting to hear back from users@pdfbox to see 
if we should search recursively for non-null attachment data.  The example that 
you provided does show only checking the children.  I'll reopen this issue if 
we need to switch to full recursion.

Thank you, again.
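
A minimal sketch of what the "full recursion" mentioned above could look 
like.  NameNode and NameTreeWalk are hypothetical stand-ins written for this 
illustration; they are not PDFBox or Tika classes.  The nullable names and 
kids fields are an assumption that loosely mirrors the getNames() and 
getKids() calls used earlier in this thread.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: gathers names from a node and all of its descendants.
final class NameTreeWalk {
    // Depth-first walk that collects every non-null names map, at any depth.
    static Map<String, String> collectNames(NameNode node) {
        Map<String, String> out = new LinkedHashMap<>();
        collectInto(node, out);
        return out;
    }

    private static void collectInto(NameNode node, Map<String, String> out) {
        if (node == null) {
            return;
        }
        if (node.names != null) {
            out.putAll(node.names);
        }
        if (node.kids != null) {
            for (NameNode kid : node.kids) {
                collectInto(kid, out);
            }
        }
    }

    public static void main(String[] args) {
        // Root has no names of its own; attachments sit one and two levels down.
        NameNode grandchild = new NameNode(
                Collections.singletonMap("notes.txt", "text attachment"), null);
        NameNode child = new NameNode(
                Collections.singletonMap("report.doc", "doc attachment"),
                Arrays.asList(grandchild));
        NameNode root = new NameNode(null, Arrays.asList(child));
        System.out.println(collectNames(root).keySet());
    }
}

// Hypothetical stand-in for a name-tree node (not a PDFBox class).
final class NameNode {
    final Map<String, String> names; // null when the node has no names entry
    final List<NameNode> kids;       // null when the node has no kids entry

    NameNode(Map<String, String> names, List<NameNode> kids) {
        this.names = names;
        this.kids = kids;
    }
}
```

Checking only the direct children, as the fix described here does, 
corresponds to stopping this walk after one level.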

 Embedded files not extracted properly from PDF
 --

 Key: TIKA-1228
 URL: https://issues.apache.org/jira/browse/TIKA-1228
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.4
 Environment: CentOS 6.5 VM
Reporter: Jason Sherman
  Labels: easyfix
 Fix For: 1.5

 Attachments: pdf_with_doc_and_text_attached.pdf


 IAW pdfbox example here:
 http://svn.apache.org/repos/asf/pdfbox/trunk/examples/src/main/java/org/apache/pdfbox/examples/pdmodel/ExtractEmbeddedFiles.java
 the PDF parser does not check for additional entries under Kids node when 
 Names node does not exist.





[jira] [Issue Comment Deleted] (TIKA-1228) Embedded files not extracted properly from PDF

2014-02-03 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1228:
--

Comment: was deleted

(was: I won't have time to fix this for a week or so, but I'll take this 
unless another committer has time sooner.)

 Embedded files not extracted properly from PDF
 --

 Key: TIKA-1228
 URL: https://issues.apache.org/jira/browse/TIKA-1228
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.4
 Environment: CentOS 6.5 VM
Reporter: Jason Sherman
  Labels: easyfix
 Attachments: pdf_with_doc_and_text_attached.pdf


 IAW pdfbox example here:
 http://svn.apache.org/repos/asf/pdfbox/trunk/examples/src/main/java/org/apache/pdfbox/examples/pdmodel/ExtractEmbeddedFiles.java
 the PDF parser does not check for additional entries under Kids node when 
 Names node does not exist.





[jira] [Commented] (TIKA-1228) Embedded files not extracted properly from PDF

2014-02-03 Thread Jason Sherman (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13889889#comment-13889889
 ] 

Jason Sherman commented on TIKA-1228:
-

Thanks for the help.  Another, possibly related, issue:
When I was stepping through the PDFBox code, line 286 throws an exception when 
running, but evaluates properly in my evaluation dialog (IntelliJ 13):

namesArray = 
(COSArray)((COSDictionary)((COSArray)node.getDictionaryObject(COSName.KIDS)).get(0)).getDictionaryObject(COSName.NAMES);

Throws:
org.apache.pdfbox.cos.COSObject cannot be cast to 
org.apache.pdfbox.cos.COSDictionary

Do you want to pass that on to the pdfbox folks, or should I report it 
separately?
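
One possible reading of this exception, offered as an assumption rather than 
a diagnosis: entries in a PDF array can be stored as indirect references, 
and PDFBox's COSArray.get(int) appears to return the entry as stored, 
whereas COSArray.getObject(int) dereferences it.  Casting the result of 
get(0) straight to COSDictionary would then fail whenever the entry is a 
COSObject wrapper.  The sketch below models that with hypothetical stand-in 
types (Base, Dict, IndirectRef, DerefSketch); none of them are PDFBox classes.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper showing "resolve first, then check the type".
final class DerefSketch {
    // Follow indirect references until a direct object is reached.
    static Base resolve(Base value) {
        Base current = value;
        while (current instanceof IndirectRef) {
            current = ((IndirectRef) current).target;
        }
        return current;
    }

    // Return the entry as a Dict, or null if it resolves to something else.
    static Dict dictAt(List<Base> array, int index) {
        Base resolved = resolve(array.get(index));
        return resolved instanceof Dict ? (Dict) resolved : null;
    }

    public static void main(String[] args) {
        List<Base> kids = Arrays.<Base>asList(new IndirectRef(new Dict("names")));
        // Casting kids.get(0) directly to Dict would throw ClassCastException,
        // because the stored entry is an IndirectRef wrapping the Dict.
        System.out.println(dictAt(kids, 0).label);
    }
}

// Hypothetical stand-ins for a PDF object model (not PDFBox classes).
interface Base {}

final class Dict implements Base {
    final String label;

    Dict(String label) {
        this.label = label;
    }
}

final class IndirectRef implements Base {
    final Base target;

    IndirectRef(Base target) {
        this.target = target;
    }
}
```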

 Embedded files not extracted properly from PDF
 --

 Key: TIKA-1228
 URL: https://issues.apache.org/jira/browse/TIKA-1228
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.4
 Environment: CentOS 6.5 VM
Reporter: Jason Sherman
  Labels: easyfix
 Fix For: 1.5

 Attachments: pdf_with_doc_and_text_attached.pdf


 IAW pdfbox example here:
 http://svn.apache.org/repos/asf/pdfbox/trunk/examples/src/main/java/org/apache/pdfbox/examples/pdmodel/ExtractEmbeddedFiles.java
 the PDF parser does not check for additional entries under Kids node when 
 Names node does not exist.


