[jira] [Updated] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser

2015-01-08 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1445:
--
Priority: Blocker  (was: Major)

 Figure out how to add Image metadata extraction to Tesseract parser
 ---

 Key: TIKA-1445
 URL: https://issues.apache.org/jira/browse/TIKA-1445
 Project: Tika
  Issue Type: Bug
  Components: parser
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
Priority: Blocker
 Fix For: 1.7

 Attachments: 03.doc, TIKA-1445.Mattmann.101214.patch.txt, 
 TIKA-1445.Palsulich.102614.patch, TIKA-1445_20150106_tallison.patch, 
 TIKA-1445_tallison_20141027.patch.txt, TIKA-1445_tallison_v2_20141027.patch, 
 TIKA-1445_tallison_v3_20141027.patch


 Now that Tesseract is the default image parser in Tika for many image types, 
 consider how to add back in the metadata extraction capabilities by the other 
 Image parsers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser

2015-01-06 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1445:
--
Attachment: 03.doc

I'm sorry that I haven't had a chance to kick the tires on the fix for this 
issue.

I just discovered that the current fix is not pulling metadata from embedded 
image files in tika-trunk or tika-1.7-rc2.

Test doc from govdocs1 attached.

We should be extracting these values (at least) in the embedded tiff:

{noformat}
Data Precision:8 bits,
Image Height:169 pixels,
Image Width:752 pixels,
Number of Components:3,
Resolution Units:inch,
X Resolution:300 dots,
Y Resolution:300 dots,
resourceName:image1.jpg,
tiff:BitsPerSample:8,
tiff:ImageLength:169,
tiff:ImageWidth:752,
tika.mime.file:image1.jpg
{noformat}

 Figure out how to add Image metadata extraction to Tesseract parser
 ---

 Key: TIKA-1445
 URL: https://issues.apache.org/jira/browse/TIKA-1445
 Project: Tika
  Issue Type: Bug
  Components: parser
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
 Fix For: 1.8

 Attachments: 03.doc, TIKA-1445.Mattmann.101214.patch.txt, 
 TIKA-1445.Palsulich.102614.patch, TIKA-1445_tallison_20141027.patch.txt, 
 TIKA-1445_tallison_v2_20141027.patch, TIKA-1445_tallison_v3_20141027.patch




[jira] [Updated] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser

2014-10-29 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1445:
--
Attachment: TIKA-1445_tallison_v3_20141027.patch

This version subclasses Parser to create an ImageMetaParser class, which our 
current image metadata parsers then extend.

This adds a DefaultImageMetadataparser that is a copy and paste of 
DefaultParser; unfortunately, the static loader can't be overridden.

We now specify regular parsers in the Parser services file and 
ImageMetadataParsers in a separate services file.

I don't like that this creates a new class of parsers, but I can't think of 
another way of guaranteeing that the OCRParser will find an image metadata 
parser correctly.
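
A minimal sketch of the idea above, not the attached patch: image metadata parsers get their own type and their own registry, so the OCR parser can only ever be handed a metadata parser. All names here (ImageMetadataRegistrySketch, ImageMetadataParser, the two static lists) are hypothetical, and the lists merely stand in for the two services files.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

public class ImageMetadataRegistrySketch {

    /** Hypothetical, simplified stand-in for a parser interface. */
    public interface Parser {
        boolean supports(String mediaType);
        String name();
    }

    /** Separate type for image metadata parsers, so lookups are type-safe. */
    public abstract static class ImageMetadataParser implements Parser { }

    /** Illustrative metadata parser that handles JPEG only. */
    public static class JpegMetadataParser extends ImageMetadataParser {
        public boolean supports(String mediaType) { return "image/jpeg".equals(mediaType); }
        public String name() { return "jpeg-metadata"; }
    }

    /** Illustrative OCR parser; a regular parser, not a metadata parser. */
    public static class OcrParser implements Parser {
        public boolean supports(String mediaType) { return mediaType.startsWith("image/"); }
        public String name() { return "ocr"; }
    }

    // Stand-in for the regular Parser services file.
    public static final List<Parser> REGULAR_PARSERS =
            Arrays.<Parser>asList(new OcrParser());

    // Stand-in for the separate image metadata services file.
    public static final List<ImageMetadataParser> IMAGE_METADATA_PARSERS =
            Arrays.<ImageMetadataParser>asList(new JpegMetadataParser());

    /** Searches only the metadata registry, so the OCR parser can never be returned. */
    public static Optional<ImageMetadataParser> findImageMetadataParser(String mediaType) {
        for (ImageMetadataParser p : IMAGE_METADATA_PARSERS) {
            if (p.supports(mediaType)) {
                return Optional.of(p);
            }
        }
        return Optional.empty();
    }
}
```

The cost is the one noted above: a second class of parsers and a second registry to maintain.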

 Figure out how to add Image metadata extraction to Tesseract parser
 ---

 Key: TIKA-1445
 URL: https://issues.apache.org/jira/browse/TIKA-1445
 Project: Tika
  Issue Type: Bug
  Components: parser
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
 Fix For: 1.8

 Attachments: TIKA-1445.Mattmann.101214.patch.txt, 
 TIKA-1445.Palsulich.102614.patch, TIKA-1445_tallison_20141027.patch.txt, 
 TIKA-1445_tallison_v2_20141027.patch, TIKA-1445_tallison_v3_20141027.patch




[jira] [Updated] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser

2014-10-27 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1445:
--
Attachment: TIKA-1445_tallison_20141027.patch.txt

Something along these lines?

 Figure out how to add Image metadata extraction to Tesseract parser
 ---

 Key: TIKA-1445
 URL: https://issues.apache.org/jira/browse/TIKA-1445
 Project: Tika
  Issue Type: Bug
  Components: parser
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
 Fix For: 1.8

 Attachments: TIKA-1445.Mattmann.101214.patch.txt, 
 TIKA-1445.Palsulich.102614.patch, TIKA-1445_tallison_20141027.patch.txt




[jira] [Updated] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser

2014-10-26 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich updated TIKA-1445:
--
Attachment: TIKA-1445.Palsulich.102614.patch

Here is an updated patch with the above idea. I created a new public method in 
CompositeParser and DefaultParser -- {{getAllParsersFor(ParseContext, 
MediaType)}} -- which returns a list of all Parsers that support the given type. 
This list is then searched from TesseractOCRParser for a second Parser for the 
image being parsed.

I created a dummy BodyContentHandler to drop all content from the second Parser.

Thoughts?
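
For illustration only, the approach described above might look roughly like the following. This is a sketch under simplifying assumptions, not the attached patch: SimpleParser, the two fake parsers, and parseImage are hypothetical, and getAllParsersFor here takes a plain list and a media type string rather than a ParseContext and a MediaType.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class SecondParserSketch {

    /** Hypothetical, heavily simplified stand-in for a Tika Parser. */
    public interface SimpleParser {
        Set<String> supportedTypes();
        void parse(byte[] data, StringBuilder content, Map<String, String> metadata);
    }

    /** Stand-in OCR parser: emits text but no image metadata. */
    public static class FakeOcrParser implements SimpleParser {
        public Set<String> supportedTypes() { return Collections.singleton("image/jpeg"); }
        public void parse(byte[] data, StringBuilder content, Map<String, String> metadata) {
            content.append("recognized text");
        }
    }

    /** Stand-in image parser: emits metadata plus content we do not want. */
    public static class FakeMetadataParser implements SimpleParser {
        public Set<String> supportedTypes() { return Collections.singleton("image/jpeg"); }
        public void parse(byte[] data, StringBuilder content, Map<String, String> metadata) {
            content.append("content to be dropped");
            metadata.put("tiff:ImageWidth", "752"); // illustrative value
        }
    }

    /** Returns every registered parser that supports the type, in registration order. */
    public static List<SimpleParser> getAllParsersFor(List<SimpleParser> registered,
                                                      String mediaType) {
        List<SimpleParser> matches = new ArrayList<>();
        for (SimpleParser p : registered) {
            if (p.supportedTypes().contains(mediaType)) {
                matches.add(p);
            }
        }
        return matches;
    }

    /** Runs OCR, then the first other matching parser with a throwaway content sink. */
    public static void parseImage(List<SimpleParser> registered, SimpleParser ocr,
                                  String mediaType, byte[] data,
                                  StringBuilder content, Map<String, String> metadata) {
        ocr.parse(data, content, metadata);
        for (SimpleParser candidate : getAllParsersFor(registered, mediaType)) {
            if (candidate != ocr) {
                // The second parser's content is discarded; only its metadata is kept.
                candidate.parse(data, new StringBuilder(), metadata);
                break;
            }
        }
    }
}
```

The sketch passes a byte array to both parsers, which sidesteps the question of who consumes the stream; a real implementation would have to handle that separately.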

 Figure out how to add Image metadata extraction to Tesseract parser
 ---

 Key: TIKA-1445
 URL: https://issues.apache.org/jira/browse/TIKA-1445
 Project: Tika
  Issue Type: Bug
  Components: parser
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
 Fix For: 1.8

 Attachments: TIKA-1445.Mattmann.101214.patch.txt, 
 TIKA-1445.Palsulich.102614.patch




[jira] [Updated] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser

2014-10-24 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1445:

Fix Version/s: (was: 1.7)
   1.8

- push to 1.8

 Figure out how to add Image metadata extraction to Tesseract parser
 ---

 Key: TIKA-1445
 URL: https://issues.apache.org/jira/browse/TIKA-1445
 Project: Tika
  Issue Type: Bug
  Components: parser
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
 Fix For: 1.8

 Attachments: TIKA-1445.Mattmann.101214.patch.txt




[jira] [Updated] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser

2014-10-12 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1445:

Attachment: TIKA-1445.Mattmann.101214.patch.txt

- Thinking about this, it may not be the right solution, since each parser may 
consume the InputStream when generating metadata. I may need to think about 
this more, but here's where I left off.
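
One possible way around the stream-consumption concern, sketched with the JDK only and not taken from the patch: read the stream into memory once and hand each parser its own fresh stream over the same bytes. StreamReuseSketch, StreamConsumer, runAll, and drain are hypothetical names, and buffering whole images in memory is an assumption that may not suit large files.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

public class StreamReuseSketch {

    /** Hypothetical stand-in for a parser that reads a stream and returns a result. */
    public interface StreamConsumer {
        int consume(InputStream in) throws IOException;
    }

    /** Reads the whole stream into memory so it can be replayed. */
    public static byte[] buffer(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            out.write(chunk, 0, n);
        }
        return out.toByteArray();
    }

    /** Gives every consumer a fresh stream over the same buffered bytes. */
    public static List<Integer> runAll(InputStream in, List<StreamConsumer> consumers) {
        try {
            byte[] bytes = buffer(in);
            List<Integer> results = new ArrayList<>();
            for (StreamConsumer c : consumers) {
                results.add(c.consume(new ByteArrayInputStream(bytes)));
            }
            return results;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Example consumer: drains the stream completely and counts the bytes read. */
    public static int drain(InputStream in) throws IOException {
        int count = 0;
        while (in.read() != -1) {
            count++;
        }
        return count;
    }
}
```

With this arrangement the second consumer sees the full input even though the first one read its stream to the end.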

 Figure out how to add Image metadata extraction to Tesseract parser
 ---

 Key: TIKA-1445
 URL: https://issues.apache.org/jira/browse/TIKA-1445
 Project: Tika
  Issue Type: Bug
  Components: parser
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
 Fix For: 1.7

 Attachments: TIKA-1445.Mattmann.101214.patch.txt

