[jira] [Commented] (TIKA-1504) TikaCoreProperties.DATE not populated for XML files

2015-01-06 Thread Badger (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14265864#comment-14265864
 ] 

Badger commented on TIKA-1504:
--

Thanks, I'd come to the same conclusion after experimenting with the Enron 
data. When parsing each email, it's pretty clear that the date sent is the most 
significant date, not the file attribute time, and what was happening all made 
sense. 

I'd just incorrectly assumed that if there is no metadata-derived created date, 
it would use the file time. 
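
For anyone who does want the file time as a fallback, it can be applied by the caller after parsing. A minimal sketch, assuming the extracted metadata has been copied into a plain map; the key name "date" and the class name DateFallback are placeholders for illustration, not Tika identifiers:

```java
import java.time.Instant;
import java.util.Map;

final class DateFallback {
    // Hypothetical key: a stand-in for whatever name the date is stored under.
    static final String DATE_KEY = "date";

    /**
     * Returns the document's own date when the metadata carries one,
     * otherwise the file timestamp supplied by the caller.
     * The embedded value is assumed to be ISO-8601 (e.g. 2001-05-14T16:39:00Z);
     * Instant.parse throws DateTimeParseException for anything else.
     */
    static Instant resolveDate(Map<String, String> metadata, Instant fileModified) {
        String embedded = metadata.get(DATE_KEY);
        if (embedded != null && !embedded.isEmpty()) {
            return Instant.parse(embedded);
        }
        return fileModified;
    }
}
```

The fileModified argument could come from Files.getLastModifiedTime(path).toInstant(). Keeping the fallback in the caller leaves the parser output limited to what the document itself states.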

> TikaCoreProperties.DATE not populated for XML files
> ---
>
> Key: TIKA-1504
> URL: https://issues.apache.org/jira/browse/TIKA-1504
> Project: Tika
>  Issue Type: Bug
>  Components: core, parser
>Affects Versions: 1.5, 1.6
> Environment: Windows 7
>Reporter: Badger
>
> Using the default parser configurations it appears when an XML file is parsed 
>  the meta data property for the creation date is not populated. I'm using 
> TikaCoreProperties.DATE which works for other document types but not xml 
> documents.
> This can be confirmed by dropping any xml file into the tika gui or through 
> code. 
> -- 
> I wasn't sure how to go about reporting this as a bug so signed up for JIRA 
> account, apologies if I was meant to send it in to a dev list for triage. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (TIKA-1504) TikaCoreProperties.DATE not populated for XML files

2015-01-06 Thread Badger (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Badger resolved TIKA-1504.
--
Resolution: Invalid



Re: [VOTE] Apache Tika 1.7 Release

2015-01-06 Thread Nick Burch

On Tue, 6 Jan 2015, Tyler Palsulich wrote:

> A candidate for the Tika 1.7 release is available at:
>    https://dist.apache.org/repos/dist/dev/tika/
>
> The release candidate is a zip archive of the sources in:
>    http://svn.apache.org/repos/asf/tika/tags/1.7-rc2/
>
> The SHA1 checksum of the archive is
>    0307a8367ae6f8b1103824fd11337fd89e24e6a4.
>
> In addition, a staged maven repository is available here:
>
> https://repository.apache.org/content/repositories/orgapachetika-1006/org/apache/tika/


Looks good to me, I'm +1

Nick
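
For anyone checking the candidate before voting, the SHA1 comparison can be done with the JDK alone. A minimal sketch, assuming the archive has already been downloaded and read into memory; the class name Sha1Check is made up for illustration:

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class Sha1Check {
    /** Returns the SHA-1 digest of the data as lowercase hex. */
    static String sha1Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(data)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    /** Compares the computed digest with the published one, ignoring case and surrounding whitespace. */
    static boolean matches(byte[] data, String expectedHex) {
        return sha1Hex(data).equalsIgnoreCase(expectedHex.trim());
    }
}
```

The bytes would come from something like Files.readAllBytes on the downloaded zip, and the expected value is the checksum quoted in the vote mail with its trailing period removed.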


[jira] [Updated] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser

2015-01-06 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1445:
--
Attachment: 03.doc

I'm sorry that I haven't had a chance to kick the tires on the fix for this 
issue.

I just discovered that the current fix is not pulling metadata from embedded 
image files in tika-trunk or tika-1.7-rc2.

Test doc from govdocs1 attached.

We should be extracting these values (at least) in the embedded tiff:

{noformat}
"Data Precision":"8 bits",
"Image Height":"169 pixels",
"Image Width":"752 pixels",
"Number of Components":"3",
"Resolution Units":"inch",
"X Resolution":"300 dots",
"Y Resolution":"300 dots",
"resourceName":"image1.jpg",
"tiff:BitsPerSample":"8",
"tiff:ImageLength":"169",
"tiff:ImageWidth":"752",
"tika.mime.file":"image1.jpg"
{noformat}
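
A check along these lines could back a regression test for the attached file. This is only a sketch: it assumes the embedded image's metadata has been copied into a plain map, and the class name ExpectedMetadataCheck is invented for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class ExpectedMetadataCheck {
    /**
     * Returns one message per expected key whose value is missing
     * or different in the actual metadata. An empty list means every
     * expected value was extracted.
     */
    static List<String> mismatches(Map<String, String> expected, Map<String, String> actual) {
        List<String> problems = new ArrayList<>();
        for (Map.Entry<String, String> e : expected.entrySet()) {
            String found = actual.get(e.getKey());
            if (!e.getValue().equals(found)) {
                problems.add(e.getKey() + ": expected \"" + e.getValue() + "\" but found " + found);
            }
        }
        return problems;
    }
}
```

Extra keys in the actual metadata are ignored on purpose, since the values listed above are described as a minimum.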

> Figure out how to add Image metadata extraction to Tesseract parser
> ---
>
> Key: TIKA-1445
> URL: https://issues.apache.org/jira/browse/TIKA-1445
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Reporter: Chris A. Mattmann
>Assignee: Chris A. Mattmann
> Fix For: 1.8
>
> Attachments: 03.doc, TIKA-1445.Mattmann.101214.patch.txt, 
> TIKA-1445.Palsulich.102614.patch, TIKA-1445_tallison_20141027.patch.txt, 
> TIKA-1445_tallison_v2_20141027.patch, TIKA-1445_tallison_v3_20141027.patch
>
>
> Now that Tesseract is the default image parser in Tika for many image types, 
> consider how to add back in the metadata extraction capabilities by the other 
> Image parsers.





RE: [VOTE] Apache Tika 1.7 Release

2015-01-06 Thread Allison, Timothy B.
-1

I'm sorry that I haven't had a chance to kick the tires on the recent changes 
to the metadata extraction from images until now, but it looks like 1.7-rc2 and 
trunk are not pulling metadata from embedded images.

I've posted a test file from govdocs1 to TIKA-1445.  I may have time tomorrow 
to see what's going on.  I should also have time tomorrow to finish the 
analysis of the comparison between 1.6 and 1.7 on govdocs1.

Sorry for my delay, all!  And even greater apologies if user error is at fault 
and metadata is successfully being extracted from embedded images. :)

Thank you, Tyler, for running this release!


-----Original Message-----
From: Nick Burch [mailto:apa...@gagravarr.org] 
Sent: Tuesday, January 06, 2015 11:36 AM
To: dev@tika.apache.org
Subject: Re: [VOTE] Apache Tika 1.7 Release

On Tue, 6 Jan 2015, Tyler Palsulich wrote:
> A candidate for the Tika 1.7 release is available at:
>    https://dist.apache.org/repos/dist/dev/tika/
>
> The release candidate is a zip archive of the sources in:
>    http://svn.apache.org/repos/asf/tika/tags/1.7-rc2/
>
> The SHA1 checksum of the archive is
>    0307a8367ae6f8b1103824fd11337fd89e24e6a4.
>
> In addition, a staged maven repository is available here:
>
> https://repository.apache.org/content/repositories/orgapachetika-1006/org/apache/tika/

Looks good to me, I'm +1

Nick


[jira] [Comment Edited] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser

2015-01-06 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267101#comment-14267101
 ] 

Tim Allison edited comment on TIKA-1445 at 1/7/15 1:13 AM:
---

I'm sorry that I haven't had a chance to kick the tires on the fix for this 
issue.  This may be a case of user error; perhaps I have to twiddle with the 
parser config file?

I found that the current fix (with default configuration) is not pulling 
metadata from embedded image files in tika-trunk or tika-1.7-rc2.

Test doc from govdocs1 attached.

We should be extracting these values (at least) in the embedded tiff:

{noformat}
"Data Precision":"8 bits",
"Image Height":"169 pixels",
"Image Width":"752 pixels",
"Number of Components":"3",
"Resolution Units":"inch",
"X Resolution":"300 dots",
"Y Resolution":"300 dots",
"resourceName":"image1.jpg",
"tiff:BitsPerSample":"8",
"tiff:ImageLength":"169",
"tiff:ImageWidth":"752",
"tika.mime.file":"image1.jpg"
{noformat}




was (Author: talli...@mitre.org):
I'm sorry that I haven't had a chance to kick the tires on the fix for this 
issue.

I just discovered that the current fix is not pulling metadata from embedded 
image files in tika-trunk or tika-1.7-rc2.

Test doc from govdocs1 attached.

We should be extracting these values (at least) in the embedded tiff:

{noformat}
"Data Precision":"8 bits",
"Image Height":"169 pixels",
"Image Width":"752 pixels",
"Number of Components":"3",
"Resolution Units":"inch",
"X Resolution":"300 dots",
"Y Resolution":"300 dots",
"resourceName":"image1.jpg",
"tiff:BitsPerSample":"8",
"tiff:ImageLength":"169",
"tiff:ImageWidth":"752",
"tika.mime.file":"image1.jpg"
{noformat}



[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser

2015-01-06 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267161#comment-14267161
 ] 

Tim Allison commented on TIKA-1445:
---

Looking into this a bit more: we aren't even getting metadata out of regular 
images. For example, testJPEG.jpg from tika-parser's test-documents yields 
no useful metadata with trunk. It looks like the file isn't even being touched 
by the TesseractOCRParser:

{noformat}
Content-Length: 7686
Content-Type: image/jpeg
X-Parsed-By: org.apache.tika.parser.DefaultParser
resourceName: testJPEG.jpg
{noformat}

Again, my apologies if I need to make modifications to our config...
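
The symptom above, where only the DefaultParser shows up in X-Parsed-By, can be detected mechanically from a metadata dump. A rough sketch, assuming the dump is plain "Key: value" lines as shown; the class and method names are invented for illustration:

```java
import java.util.ArrayList;
import java.util.List;

final class ParsedByCheck {
    /** Collects every X-Parsed-By value from a "Key: value" per-line dump. */
    static List<String> parsedBy(String dump) {
        List<String> parsers = new ArrayList<>();
        for (String line : dump.split("\\R")) {
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // not a key/value line
            }
            String key = line.substring(0, colon).trim();
            if (key.equals("X-Parsed-By")) {
                parsers.add(line.substring(colon + 1).trim());
            }
        }
        return parsers;
    }

    /** True when no parser other than the DefaultParser wrapper handled the file. */
    static boolean onlyDefaultParser(String dump) {
        List<String> parsers = parsedBy(dump);
        return parsers.size() == 1
                && parsers.get(0).equals("org.apache.tika.parser.DefaultParser");
    }
}
```

When a concrete image parser has handled the file, a second X-Parsed-By line naming it would be expected, so onlyDefaultParser returning true is a quick sign that no image parser ran.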



[jira] [Updated] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser

2015-01-06 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1445:
--
Attachment: TIKA-1445_20150106_tallison.patch

There were two problems:

1) The following check aborted the parse before any metadata was extracted when Tesseract is not installed:

{noformat}
if (!ExternalParser.check(checkCmd))
 return;
{noformat}

2) The call to getSupportedTypes in the _TMP_X_PARSERs always returned false 
because of a conflict of class types.

If this modification looks ok, I'll add a few more test cases and commit it.

Side note:  In working on this I realized that both the ImageParser and the 
JpegParser support jpegs. On some files, one parser returns more info than the 
other and vice versa...another case of competing parsers! :)
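
The two problems listed in this comment can be illustrated with stand-in types. This is a sketch under assumptions, not the attached patch: the Type class, the method names, and the idea that the type conflict was a lookup with the wrong class are all guesses made for illustration.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

final class TwoProblemsSketch {
    /** Hypothetical stand-in for a media type class. */
    static final class Type {
        final String name;
        Type(String name) { this.name = name; }
        @Override public boolean equals(Object o) {
            return o instanceof Type && ((Type) o).name.equals(name);
        }
        @Override public int hashCode() { return name.hashCode(); }
    }

    static final Set<Type> SUPPORTED =
            new HashSet<>(Collections.singleton(new Type("image/jpeg")));

    // Problem 1, buggy order: the early return also skips metadata extraction.
    static Map<String, String> parseBuggy(boolean ocrInstalled) {
        Map<String, String> metadata = new LinkedHashMap<>();
        if (!ocrInstalled) {
            return metadata;
        }
        extractImageMetadata(metadata);
        return metadata; // OCR itself is omitted from this sketch
    }

    // Problem 1, fixed order: metadata first, OCR only when available.
    static Map<String, String> parseFixed(boolean ocrInstalled) {
        Map<String, String> metadata = new LinkedHashMap<>();
        extractImageMetadata(metadata);
        if (ocrInstalled) {
            // OCR would run here; omitted from this sketch
        }
        return metadata;
    }

    private static void extractImageMetadata(Map<String, String> metadata) {
        metadata.put("tiff:ImageWidth", "752");
        metadata.put("tiff:ImageLength", "169");
    }

    // Problem 2, one way it can happen: Set.contains accepts any Object,
    // so passing a String compiles but can never equal a Type.
    static boolean supportsBuggy(String mime) {
        return SUPPORTED.contains(mime);
    }

    // Problem 2, fixed: look up with the same class the set holds.
    static boolean supportsFixed(String mime) {
        return SUPPORTED.contains(new Type(mime));
    }
}
```

Neither mistake produces a compile error or an exception, which fits the symptom that images came back with no image metadata rather than failing.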

> Figure out how to add Image metadata extraction to Tesseract parser
> ---
>
> Key: TIKA-1445
> URL: https://issues.apache.org/jira/browse/TIKA-1445
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Reporter: Chris A. Mattmann
>Assignee: Chris A. Mattmann
> Fix For: 1.8
>
> Attachments: 03.doc, TIKA-1445.Mattmann.101214.patch.txt, 
> TIKA-1445.Palsulich.102614.patch, TIKA-1445_20150106_tallison.patch, 
> TIKA-1445_tallison_20141027.patch.txt, TIKA-1445_tallison_v2_20141027.patch, 
> TIKA-1445_tallison_v3_20141027.patch
>
>
> Now that Tesseract is the default image parser in Tika for many image types, 
> consider how to add back in the metadata extraction capabilities by the other 
> Image parsers.


