Re: OCR with tika-server

2014-10-03 Thread Mattmann, Chris A (3980)
Kevin, glad it is now fixed for you!

If you get a chance, please feel free to document
this on the wiki:

https://wiki.apache.org/tika/TikaOCR


You can sign up for an account, and then I can grant
you permissions to edit the file. Let me know!

Cheers,
Chris



++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++






-----Original Message-----
From: kevin slote 
Reply-To: "dev@tika.apache.org" 
Date: Friday, October 3, 2014 at 4:10 PM
To: "dev@tika.apache.org" 
Subject: Re: OCR with tika-server

>Hi all,
>
>I just confirmed that the problem was that my version of tesseract was too
>old.
>Maybe it would be a good idea to put something in the canRun method at the
>top of the tesseract unit test to also check that the installed version of
>tesseract is recent enough?
>
>Older versions of tesseract do not have a "-v" or "--version" flag.  So
>maybe use ProcessBuilder to run that command and parse the output to see
>if it returned an error?
>
>Thanks for everyone's help.
>
>On Fri, Oct 3, 2014 at 2:30 PM, kevin slote  wrote:
>
>> Thanks for following up!
>>
>> I was trying to dig deeper before I responded.
>>
>> Tyler,
>>
>> I followed those instructions.  My version of Tesseract does not ocr the
>> google logo because it is not a tiff.  I used imagemagick to convert it
>> to a tif and tesseract returned
>> "check_legal_image_size:Error:Only 1,2,4,5,6,8 bpp are supported:32"
>> error which usually means it needs to be re-sized with imagemagick.
>>
>>
>> Chris,
>>
>> I wrote a python wrapper for tesseract that can parse the documents that
>> were in your test-document repository concerning OCR (testOCR.pdf, etc.).
>> It looks like right now, in TesseractOCRParser.java, the command line
>> argument that is passed to the OS points to a .tmp file in /tmp/.
>>
>> So the command that is executed is
>>
>>"tesseract /tmp/apache-tika-2409864150710514587.tmp
>> /tmp/apache-tika-1277985370508249503.tmp -l eng -psm 1"
>>
>> This is not working for me.  When I grab those .tmp files and try to ocr
>> them from the command line, tesseract gets thrown for a loop.
>>
>> From what I can tell, the tesseract I have installed can only handle
>> .tif files.
>> I can back this up by citing the tesseract page:
>> https://code.google.com/p/tesseract-ocr/wiki/ReadMe
>>
>>  If Tesseract isn't available for your distribution, or you want to use
>> a newer version than they offer, you can compile your own. Note that
>> older versions of Tesseract only supported processing .tiff files.
>>
>> So, I think that upgrading tesseract or moving to Ubuntu 12 or higher
>> will solve my problems.
>>
>> I will let the listserv know if that fixes it.
>>
>>
>> Kevin Slote
>>
>>
>>
>> On Wed, Oct 1, 2014 at 5:13 PM, Mattmann, Chris A (3980) <
>> chris.a.mattm...@jpl.nasa.gov> wrote:
>>
>>> What type of image is it, Kevin?
>>>
>>> If it’s a TIFF, you need to install tesseract with special lib tiff
>>> parameters. See:
>>>
>>> https://gist.github.com/henrik/1967035
>>>
>>>
>>> Can you parse the image file with tesseract by itself, without
>>> Tika’s tmp image?
>>>
>>> -----Original Message-----
>>> From: , "Paul M   (398J)" 
>>> Reply-To: "dev@tika.apache.org" 
>>> Date: Wednesday, October 1, 2014 at 1:47 PM
>>> To: "" 
>>> Subject: Re: OCR with tika-server
>>>
>>> >Nothing to be embarrassed about at all, Kevin. I actually thought
>>> >maybe it was just a typo issue and I randomly happened to catch that.
>>> >I've definitely done that one before myself.
>>> >
>>> >Bummed that was not the problem.
>>> >
>>> >--Paul
>>> >
>>> >On Oct 1, 2014, at 1:00 PM, kevin slote 
>>> > wrote:
>>> >
>>> >> What I wrote there did have a typo in it. (It's not every day you
>>> >> get to embarrass yourself in front of a bunch of guys from NASA)
>>> 

[jira] [Commented] (TIKA-1369) Date parsing and thread safety in ImageMetadataExtractor

2014-10-03 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14158912#comment-14158912
 ] 

Hudson commented on TIKA-1369:
--

UNSTABLE: Integrated in tika-trunk-jdk1.6 #224 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.6/224/])
Fix for TIKA-1369: Resolve thread safety issue in ImageMetadataExtractor. 
Contributed by Vilmos Papp . This closes #15. 
(mattmann: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1629347)
* /tika/trunk/CHANGES.txt
* 
/tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/image/ImageMetadataExtractor.java


> Date parsing and thread safety in ImageMetadataExtractor
> --------------------------------------------------------
>
>                 Key: TIKA-1369
>                 URL: https://issues.apache.org/jira/browse/TIKA-1369
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.5
>         Environment: OS X 10.9.4 Java 7_60
>            Reporter: John Gibson
>            Assignee: Chris A. Mattmann
>            Priority: Critical
>             Fix For: 1.7
>
>
> The {{ImageMetadataExtractor}} uses a static instance of 
> {{SimpleDateFormat}}.  This is not thread safe.
> {code:title=ImageMetadataExtractor.java}
> static class ExifHandler implements DirectoryHandler {
>     private static final SimpleDateFormat DATE_UNSPECIFIED_TZ =
>             new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss");
>     ...
>     public void handleDateTags(Directory directory, Metadata metadata)
>             throws MetadataException {
>         // Date/Time Original overrides value from ExifDirectory.TAG_DATETIME
>         Date original = null;
>         if (directory.containsTag(ExifSubIFDDirectory.TAG_DATETIME_ORIGINAL)) {
>             original = directory.getDate(ExifSubIFDDirectory.TAG_DATETIME_ORIGINAL);
>             // Unless we have GPS time we don't know the time zone so date must be set
>             // as ISO 8601 datetime without timezone suffix (no Z or +/-)
>             if (original != null) {
>                 // Same time zone as Metadata Extractor uses
>                 String datetimeNoTimeZone = DATE_UNSPECIFIED_TZ.format(original);
>                 metadata.set(TikaCoreProperties.CREATED, datetimeNoTimeZone);
>                 metadata.set(Metadata.ORIGINAL_DATE, datetimeNoTimeZone);
>             }
>         }
>     ...
> {code}
> This is not the first time that SDF has caused problems: TIKA-495, TIKA-864. 
> In the discussion there, the idea of using alternative thread-safe (and 
> faster) formatters from either Joda time or Commons Lang was dismissed 
> because they would add too many dependencies. Given that Tika already has a 
> fairly large laundry list of dependencies to parse content, adding one more 
> JAR to make sure things don't break is probably a good idea.
> In addition, because no timezone or locale is specified by either Tika's 
> formatter or the call to com.drew.metadata.Directory, it can wreak havoc 
> during randomized testing. Given that the timezone is unknown, why not just 
> default it to UTC and let the caller guess the timezone? As it stands I have 
> to reparse all of the dates into UTC to get stable behavior across timezones.
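
For illustration only: one dependency-free way to address both points in the
description above is to give each thread its own SimpleDateFormat and pin it
to UTC and a fixed locale. This is a sketch under stated assumptions, not
necessarily what was committed for this issue. The class name
ThreadSafeExifDates and the method formatNoTimeZone are hypothetical, and the
"yyyy-MM-dd'T'HH:mm:ss" pattern is assumed from the ISO 8601 comment in the
quoted code.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical helper, not part of Tika: a per-thread formatter pinned to UTC.
final class ThreadSafeExifDates {

    // SimpleDateFormat keeps mutable state, so each thread gets its own instance.
    private static final ThreadLocal<SimpleDateFormat> DATE_UNSPECIFIED_TZ =
            new ThreadLocal<SimpleDateFormat>() {
                @Override
                protected SimpleDateFormat initialValue() {
                    // Pattern assumed from the "ISO 8601 without timezone suffix" comment.
                    SimpleDateFormat format =
                            new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss", Locale.ROOT);
                    // Fixing the zone makes output independent of the JVM default zone.
                    format.setTimeZone(TimeZone.getTimeZone("UTC"));
                    return format;
                }
            };

    private ThreadSafeExifDates() {
    }

    /** Formats the date as an ISO 8601 date-time without a zone suffix, in UTC. */
    static String formatNoTimeZone(Date date) {
        return DATE_UNSPECIFIED_TZ.get().format(date);
    }
}
```

A caller would replace DATE_UNSPECIFIED_TZ.format(original) with
ThreadSafeExifDates.formatNoTimeZone(original). The ThreadLocal trades a small
amount of memory per thread for avoiding both synchronization and an extra JAR.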



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1369) Date parsing and thread safety in ImageMetadataExtractor

2014-10-03 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14158904#comment-14158904
 ] 

Hudson commented on TIKA-1369:
--

UNSTABLE: Integrated in tika-trunk-jdk1.7 #245 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/245/])
Fix for TIKA-1369: Resolve thread safety issue in ImageMetadataExtractor. 
Contributed by Vilmos Papp . This closes #15. 
(mattmann: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1629347)
* /tika/trunk/CHANGES.txt
* 
/tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/image/ImageMetadataExtractor.java




[jira] [Resolved] (TIKA-1369) Date parsing and thread safety in ImageMetadataExtractor

2014-10-03 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann resolved TIKA-1369.
-
   Resolution: Fixed
Fix Version/s: 1.7
 Assignee: Chris A. Mattmann

merged in r1629347.



[jira] [Commented] (TIKA-1369) Date parsing and thread safety in ImageMetadataExtractor

2014-10-03 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14158895#comment-14158895
 ] 

Chris A. Mattmann commented on TIKA-1369:
-

I merged the pull request in r1629347. Thanks [~papgyo]! If [~rgauss] sees 
other things to merge, or update, please go for it, just trying to close out 
our Github issues and thank people for their contributions!



[jira] [Commented] (TIKA-1369) Date parsing and thread safety in ImageMetadataExtractor

2014-10-03 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14158893#comment-14158893
 ] 

ASF GitHub Bot commented on TIKA-1369:
--

Github user asfgit closed the pull request at:

https://github.com/apache/tika/pull/15




[GitHub] tika pull request: TIKA-1369 Resolve thread safety issue in ImageM...

2014-10-03 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/tika/pull/15


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Commented] (TIKA-1354) ForkParser doesn't work in OSGI container

2014-10-03 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14158821#comment-14158821
 ] 

Hudson commented on TIKA-1354:
--

UNSTABLE: Integrated in tika-trunk-jdk1.7 #244 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/244/])
Fix for TIKA-1354 Register ForkParser Service in Activator. Contributed by 
Michal Hlavac . This closes #13. (mattmann: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1629339)
* /tika/trunk/CHANGES.txt
* /tika/trunk/tika-bundle/src/test/java/org/apache/tika/bundle/BundleIT.java
* 
/tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/internal/Activator.java


> ForkParser doesn't work in OSGI container
> -----------------------------------------
>
>                 Key: TIKA-1354
>                 URL: https://issues.apache.org/jira/browse/TIKA-1354
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.6
>            Reporter: Michal Hlavac
>
> I can't find a way to run ForkParser in an OSGI container.





[jira] [Commented] (TIKA-1435) Update rome dependency to 1.5

2014-10-03 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14158822#comment-14158822
 ] 

Hudson commented on TIKA-1435:
--

UNSTABLE: Integrated in tika-trunk-jdk1.7 #244 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/244/])
Fix for TIKA-1435: Upgrade Rome to 1.5 contributed by Johannes Mockenhaupt 
. This closes #16. (mattmann: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1629338)
* /tika/trunk/tika-parsers/pom.xml
* 
/tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/feed/FeedParser.java


> Update rome dependency to 1.5
> -----------------------------
>
>                 Key: TIKA-1435
>                 URL: https://issues.apache.org/jira/browse/TIKA-1435
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.6
>            Reporter: Johannes Mockenhaupt
>            Priority: Minor
>             Fix For: 1.7
>
>
> Rome 1.5 has been released to Sonatype 
> (https://github.com/rometools/rome/issues/183), though the website 
> (http://rometools.github.io/rome/) is blissfully ignorant of that. The update 
> is mostly maintenance: adopting slf4j and generics, as well as moving the 
> namespace from _com.sun.syndication_ to _com.rometools_. PR upcoming.





[jira] [Commented] (TIKA-1435) Update rome dependency to 1.5

2014-10-03 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14158754#comment-14158754
 ] 

Hudson commented on TIKA-1435:
--

UNSTABLE: Integrated in tika-trunk-jdk1.6 #222 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.6/222/])
Fix for TIKA-1435: Upgrade Rome to 1.5 contributed by Johannes Mockenhaupt 
. This closes #16. (mattmann: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1629338)
* /tika/trunk/tika-parsers/pom.xml
* 
/tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/feed/FeedParser.java




[jira] [Commented] (TIKA-1354) ForkParser doesn't work in OSGI container

2014-10-03 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14158753#comment-14158753
 ] 

Hudson commented on TIKA-1354:
--

UNSTABLE: Integrated in tika-trunk-jdk1.6 #222 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.6/222/])
Fix for TIKA-1354 Register ForkParser Service in Activator. Contributed by 
Michal Hlavac . This closes #13. (mattmann: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1629339)
* /tika/trunk/CHANGES.txt
* /tika/trunk/tika-bundle/src/test/java/org/apache/tika/bundle/BundleIT.java
* 
/tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/internal/Activator.java




[jira] [Commented] (TIKA-1126) text/html procuder for tika-server

2014-10-03 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14158743#comment-14158743
 ] 

ASF GitHub Bot commented on TIKA-1126:
--

Github user asfgit closed the pull request at:

https://github.com/apache/tika/pull/3


> text/html procuder for tika-server
> ----------------------------------
>
>                 Key: TIKA-1126
>                 URL: https://issues.apache.org/jira/browse/TIKA-1126
>             Project: Tika
>          Issue Type: Improvement
>          Components: server
>    Affects Versions: 1.4
>            Reporter: Ali Mosavian
>            Priority: Trivial
>             Fix For: 1.4
>
>         Attachments: tika_server_html_output.patch
>
>
> The /tika resource handler of tika-server can only produce text/plain. This 
> patch adds support for producing text/html.





[GitHub] tika pull request: Similar to TIKA-1126, this commit adds the abil...

2014-10-03 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/tika/pull/3




[jira] [Commented] (TIKA-1354) ForkParser doesn't work in OSGI container

2014-10-03 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14158737#comment-14158737
 ] 

Chris A. Mattmann commented on TIKA-1354:
-

Patch merged and committed to trunk in r1629339. Thank you [~hlavki]!



[jira] [Commented] (TIKA-1354) ForkParser doesn't work in OSGI container

2014-10-03 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14158734#comment-14158734
 ] 

ASF GitHub Bot commented on TIKA-1354:
--

Github user asfgit closed the pull request at:

https://github.com/apache/tika/pull/13




[GitHub] tika pull request: [TIKA-1354] Register ForkParser service in Acti...

2014-10-03 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/tika/pull/13




[jira] [Resolved] (TIKA-1435) Update rome dependency to 1.5

2014-10-03 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann resolved TIKA-1435.
-
Resolution: Fixed

- fixed in r1629338. Thanks Johannes Mockenhaupt !



[jira] [Commented] (TIKA-1435) Update rome dependency to 1.5

2014-10-03 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14158724#comment-14158724
 ] 

ASF GitHub Bot commented on TIKA-1435:
--

Github user asfgit closed the pull request at:

https://github.com/apache/tika/pull/16




[GitHub] tika pull request: TIKA-1435: Upgrade Rome to 1.5

2014-10-03 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/tika/pull/16




Re: OCR with tika-server

2014-10-03 Thread kevin slote
Hi all,

I just confirmed that the problem was that my version of tesseract was too
old.
Maybe it would be a good idea to put something in the canRun method at the
top of the tesseract unit test to also check that the installed version of
tesseract is recent enough?

Older versions of tesseract do not have a "-v" or "--version" flag.  So
maybe use ProcessBuilder to run that command and parse the output to see if
it returned an error?
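
A rough sketch of what such a check could look like, for illustration only.
The class TesseractVersionCheck and its methods are hypothetical and are not
part of Tika or its tests. It also assumes that a tesseract build which
understands "-v" prints a banner such as "tesseract 3.02.02" on either stdout
or stderr; the exact banner text is an assumption.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tika: probes the tesseract binary for a version banner.
final class TesseractVersionCheck {

    // Assumed banner shape, e.g. "tesseract 3.02.02" or "tesseract-3.01".
    private static final Pattern VERSION =
            Pattern.compile("tesseract[\\s-]+(\\d+(?:\\.\\d+)+)", Pattern.CASE_INSENSITIVE);

    private TesseractVersionCheck() {
    }

    /** Returns the version number found in the output, or null if there is none. */
    static String parseVersion(String output) {
        if (output == null) {
            return null;
        }
        Matcher matcher = VERSION.matcher(output);
        return matcher.find() ? matcher.group(1) : null;
    }

    /** Runs the command and returns its combined stdout and stderr, or null if it cannot run. */
    static String run(String... command) {
        try {
            ProcessBuilder builder = new ProcessBuilder(command);
            builder.redirectErrorStream(true); // the banner may be written to stderr
            Process process = builder.start();
            StringBuilder output = new StringBuilder();
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    output.append(line).append('\n');
                }
            }
            process.waitFor();
            return output.toString();
        } catch (IOException e) {
            return null; // binary missing or not executable
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        }
    }

    /** True only if "tesseract -v" runs and prints something that looks like a version. */
    static boolean hasVersionFlag() {
        return parseVersion(run("tesseract", "-v")) != null;
    }
}
```

A canRun-style guard could then skip the OCR tests when hasVersionFlag()
returns false, which covers both a missing binary and a build too old to
understand "-v".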

Thanks for everyone's help.


Re: OCR with tika-server

2014-10-03 Thread kevin slote
Thanks for following up!

I was trying to dig deeper before I responded.

Tyler,

I followed those instructions.  My version of Tesseract does not ocr the
google logo because it is not a tiff.  I used imagemagick to convert it to
a tif and tesseract returned "check_legal_image_size:Error:Only 1,2,4,5,6,8
bpp are supported:32" error which usually means it needs to be re-sized
with imagemagick.
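
The error text quoted above refers to bits per pixel (32 is not in the
supported list), so one possible workaround is to reduce the bit depth before
handing the image to tesseract. The following is only a sketch under that
assumption, using the standard java.awt classes; BitDepthReducer and toGray8
are hypothetical names, and this is not what ImageMagick or Tika does
internally.

```java
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;

final class BitDepthReducer {

    /** Redraws the source image into an 8-bit grayscale image of the same size. */
    static BufferedImage toGray8(BufferedImage src) {
        BufferedImage gray = new BufferedImage(
                src.getWidth(), src.getHeight(), BufferedImage.TYPE_BYTE_GRAY);
        Graphics2D g = gray.createGraphics();
        try {
            g.drawImage(src, 0, 0, null); // color conversion happens during the draw
        } finally {
            g.dispose();
        }
        return gray;
    }

    public static void main(String[] args) {
        // A 32 bpp ARGB image stands in for the converted logo.
        BufferedImage argb = new BufferedImage(10, 10, BufferedImage.TYPE_INT_ARGB);
        BufferedImage gray = toGray8(argb);
        System.out.println(argb.getColorModel().getPixelSize()
                + " bpp in, " + gray.getColorModel().getPixelSize() + " bpp out");
    }
}
```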


Chris,

I wrote a Python wrapper for tesseract that can parse the OCR documents in
your test-document repository (testOCR.pdf, etc.). It looks like right now,
in TesseractOCRParser.java, the command line argument that is passed to the
OS points to a .tmp file in /tmp/.

So the command that is executed is

   "tesseract /tmp/apache-tika-2409864150710514587.tmp
/tmp/apache-tika-1277985370508249503.tmp -l eng -psm 1"

This is not working for me.  When I grab those .tmp files and try to ocr
them from the command line, tesseract gets thrown for a loop.
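
If an older tesseract picks the image type from the file name, one thing to
try is giving the temporary input file an image extension instead of .tmp.
The sketch below only illustrates that idea and is not the code in
TesseractOCRParser.java; OcrTempFiles, createInput and buildCommand are
hypothetical names, and the argument list simply mirrors the command quoted
above.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

final class OcrTempFiles {

    /** Creates a temp file named apache-tika-<random>.<extension>. */
    static Path createInput(String extension) throws IOException {
        return Files.createTempFile("apache-tika-", "." + extension);
    }

    /** Builds the same argument list as the command quoted in the message. */
    static List<String> buildCommand(Path input, Path outputBase) {
        return Arrays.asList(
                "tesseract", input.toString(), outputBase.toString(),
                "-l", "eng", "-psm", "1");
    }

    public static void main(String[] args) throws IOException {
        Path input = createInput("tif");
        try {
            // Print the command instead of running it; tesseract may not be installed.
            System.out.println(String.join(" ", buildCommand(input, input.resolveSibling("out"))));
        } finally {
            Files.deleteIfExists(input);
        }
    }
}
```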

From what I can tell, the tesseract I have installed can only handle
.tif files.
I can back this up by citing the tesseract page:
https://code.google.com/p/tesseract-ocr/wiki/ReadMe

 If Tesseract isn't available for your distribution, or you want to use a
newer version than they offer, you can compile your own. Note that older
versions of Tesseract only supported processing .tiff files.

So, I think that upgrading tesseract or moving to ubuntu 12 or higher will
solve my problems.

I will let the listserv know if that fixes it.


Kevin Slote




Re: OCR with tika-server

2014-10-03 Thread Mattmann, Chris A (3980)
Hi Kevin just checking back - did you get it working?

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++






-Original Message-
From: , Chris Mattmann 
Reply-To: "dev@tika.apache.org" 
Date: Wednesday, October 1, 2014 at 2:13 PM
To: "dev@tika.apache.org" 
Subject: Re: OCR with tika-server

>What type of image is it, Kevin?
>
>If it’s a TIFF, you need to install tesseract with special lib tiff
>parameters. See:
>
>https://gist.github.com/henrik/1967035
>
>
>Can you parse the image file with tesseract by itself, without
>Tika’s tmp image?
>
>-Original Message-
>From: "Ramirez, Paul M (398J)"
>Reply-To: "dev@tika.apache.org" 
>Date: Wednesday, October 1, 2014 at 1:47 PM
>To: "" 
>Subject: Re: OCR with tika-server
>
>>Nothing to be embarrassed about at all Kevin. I actually thought maybe it
>>was just a typo issue and I randomly happen to catch that. I've
>>definitely done that one before myself.
>>
>>Bummed that was not the problem.
>>
>>--Paul
>>
>>On Oct 1, 2014, at 1:00 PM, kevin slote 
>> wrote:
>>
>>> What I wrote there did have a typo in it. (It's not every day you get to
>>> embarrass yourself in front of a bunch of guys from NASA)
>>>
>>> But that was not what I had in my terminal when I tested it.
>>>
>>> The actual PATH was:
>>>
>>> "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/bin/tesseract"
>>>
>>> I think what was actually wrong with the path is that I added the entire
>>> path to the tesseract executable, which was in my /usr/bin/ directory,
>>> instead of just the directory where tesseract lives.  Is this true?
>>>
>>> I deleted the hard coding from TesseractOCRConfig.java and then printed
>>> config.getTesseractPath() to stdout.  This field was empty.
>>>
>>> However, I have tesseract installed system wide on this ubuntu vm.
>>>
>>> So the canRun method evaluated as true whether or not the tesseractPath was
>>> configured correctly.
>>>
>>> I have been slowly trying to debug this all day.  It looks like tika is
>>> making a tmp file with the .tmp suffix.
>>>
>>> I commented out some of the code so that the files remained in /tmp/.
>>>
>>> It looks like tesseract doesn't like that.
>>>
>>> I tried to ocr these .tmp files to see if I could isolate what was going
>>> wrong for me.
>>> 
>>> 
>>> 
>>> kslote@ubuntu:~/tika/tika$ tesseract /tmp/apache-tika-7112319184053570698.tmp out
>>> Tesseract Open Source OCR Engine
>>> name_to_image_type:Error:Unrecognized image type:/tmp/apache-tika-7112319184053570698.tmp
>>> IMAGE::read_header:Error:Can't read this image type:/tmp/apache-tika-7112319184053570698.tmp
>>> tesseract:Error:Read of file failed:/tmp/apache-tika-7112319184053570698.tmp
>>> Segmentation fault
>>>
>>> On the wiki it mentions something about getting tesseract to work with
>>> .tiff files.  For whatever reason, the tesseract I have installed only
>>> works for .tiff files.  Would you recommend that I reinstall tesseract
>>> from source?
>>> 
>>> On Tue, Sep 30, 2014 at 7:28 PM, Ramirez, Paul M (398J) <
>>> paul.m.rami...@jpl.nasa.gov> wrote:
>>> 
>>>> Is that a typo in your path to tesseract?
>>>>
>>>> /urs/bin/tesseract => /usr/bin/tesseract
>>>>
>>>> --Paul
>>>>
>>>>> On Sep 30, 2014, at 1:48 PM, "kevin slote"  wrote:
>>>>>
>>>>> Unfortunately, that did not do it either.
>>>>>
>>>>> I did:
>>>>>
>>>>>  $ export PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/urs/bin/tesseract
>>>>>
>>>>> Here is the output from printenv
>>>>>
>>>>> kslote@ubuntu:~/tika/tika$ printenv
>>>>> SHELL=/bin/bash
>>>>> USERNAME=kslote
>>>>> XDG_CONFIG_DIRS=/etc/xdg/xdg-gnome:/etc/xdg
>>>>> DESKTOP_SESSION=gnome
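
On the PATH question raised in the quoted thread: entries in PATH are
directories that get searched for the executable, so an entry such as
/usr/bin/tesseract (a file) contributes nothing, while /usr/bin already
covers it. A small sketch of that lookup rule follows; PathLookup and
findOnPath are hypothetical names used only for illustration and are not part
of Tika.

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

final class PathLookup {

    /** Returns the first "<dir>/<executable>" found by scanning a PATH-style value, or null. */
    static Path findOnPath(String pathValue, String executable) {
        if (pathValue == null || pathValue.isEmpty()) {
            return null;
        }
        for (String entry : pathValue.split(File.pathSeparator)) {
            if (entry.isEmpty()) {
                continue;
            }
            Path dir = Paths.get(entry);
            if (!Files.isDirectory(dir)) {
                continue; // an entry that points at a file is skipped
            }
            Path candidate = dir.resolve(executable);
            if (Files.isRegularFile(candidate) && Files.isExecutable(candidate)) {
                return candidate;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Prints the resolved location, or "null" if tesseract is not on the PATH.
        System.out.println(findOnPath(System.getenv("PATH"), "tesseract"));
    }
}
```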

[jira] [Comment Edited] (TIKA-1427) PDF Images don't appear in structured view

2014-10-03 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14158223#comment-14158223
 ] 

Tim Allison edited comment on TIKA-1427 at 10/3/14 5:33 PM:


On at least one test doc, I'm getting correct behavior:
{noformat}
<html xmlns="http://www.w3.org/1999/xhtml">


...


 
What is a generic drug?
...
generic drugs. 



...
Generic Drugs: Safe. Effective. FDA Approved.


Local Disk
Generic Drugs



{noformat}

Can you attach an example of a file that is failing? 


> PDF Images don't appear in structured view
> --
>
> Key: TIKA-1427
> URL: https://issues.apache.org/jira/browse/TIKA-1427
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.6
>Reporter: James Baker
>Assignee: Tim Allison
>  Labels: pdf
>
> When viewing, say, a Word Document, any images appear in the 'structured 
> view' of the document as <img> tags. The same is not true of PDF documents, 
> and we lose both the fact that there is an image present, and where it is in 
> the document.
> Some discussion of this issue in the comments of TIKA-1396.





[jira] [Commented] (TIKA-1427) PDF Images don't appear in structured view

2014-10-03 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14158223#comment-14158223
 ] 

Tim Allison commented on TIKA-1427:
---

On at least one test doc, I'm getting correct behavior:
{noformat}
<html xmlns="http://www.w3.org/1999/xhtml">


...


 
What is a generic drug?
...
generic drugs. 



...
Generic Drugs: Safe. Effective. FDA Approved.


Local Disk
Generic Drugs





Can you attach an example of a file that is failing? 



[jira] [Commented] (TIKA-1427) PDF Images don't appear in structured view

2014-10-03 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14158170#comment-14158170
 ] 

Tim Allison commented on TIKA-1427:
---

We're currently iterating through the images once we hit the bottom of the 
page.  I don't have the skill/knowledge/time to do the math to figure out where 
the images are in relationship to the text.  Sorry!  If there's example code 
somewhere, let us know.

The single img tag is an issue, tho.  Let me take a look.
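
As a rough illustration of the placement idea discussed in this issue, and
not PDFBox or Tika code: if each text run and each image on a page carried a
vertical coordinate, the two lists could be merged by that coordinate rather
than appending the images at the end of the page. PageOrdering, Item and
interleave are hypothetical names, the y values are assumed to follow the PDF
convention where larger y is higher on the page, and sorting on y alone
ignores columns and overlapping elements.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class PageOrdering {

    /** A piece of page content (markup for a text run or an image) with its vertical position. */
    static final class Item {
        final String markup;
        final float y;

        Item(String markup, float y) {
            this.markup = markup;
            this.y = y;
        }
    }

    /** Merges text runs and images into one top-to-bottom sequence of markup. */
    static List<String> interleave(List<Item> text, List<Item> images) {
        List<Item> all = new ArrayList<>(text);
        all.addAll(images);
        // Larger y first; List.sort is stable, so ties keep text before images.
        all.sort(Comparator.comparingDouble((Item i) -> i.y).reversed());
        List<String> out = new ArrayList<>();
        for (Item item : all) {
            out.add(item.markup);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Item> text = new ArrayList<>();
        text.add(new Item("<p>first paragraph</p>", 700f));
        text.add(new Item("<p>second paragraph</p>", 300f));
        List<Item> images = new ArrayList<>();
        images.add(new Item("<img/>", 500f));
        for (String markup : interleave(text, images)) {
            System.out.println(markup);
        }
    }
}
```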



[jira] [Commented] (TIKA-1427) PDF Images don't appear in structured view

2014-10-03 Thread James Baker (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14157932#comment-14157932
 ] 

James Baker commented on TIKA-1427:
---

Also, it looks like it only extracts one image per page? At least, it's only 
putting one image tag at the bottom of each page if there are multiple images 
on a page.



[jira] [Comment Edited] (TIKA-1427) PDF Images don't appear in structured view

2014-10-03 Thread James Baker (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14157916#comment-14157916
 ] 

James Baker edited comment on TIKA-1427 at 10/3/14 11:58 AM:
-

Thanks for your work on this Tim. Image extraction is working and <img> tags 
are being inserted into the structured view, but it is inserting them at the 
bottom of each page. Is it not possible to have them inserted at the correct 
location within the document?


was (Author: james.d.baker):
Image extraction is working and <img> tags are being inserted into the 
structured view, but it is inserting them at the bottom of each page. Is it not 
possible to have them inserted at the correct location within the document?



[jira] [Commented] (TIKA-1427) PDF Images don't appear in structured view

2014-10-03 Thread James Baker (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14157916#comment-14157916
 ] 

James Baker commented on TIKA-1427:
---

Image extraction is working and <img> tags are being inserted into the 
structured view, but it is inserting them at the bottom of each page. Is it not 
possible to have them inserted at the correct location within the document?



[jira] [Commented] (TIKA-1435) Update rome dependency to 1.5

2014-10-03 Thread Johannes Mockenhaupt (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14157838#comment-14157838
 ] 

Johannes Mockenhaupt commented on TIKA-1435:


PR: https://github.com/apache/tika/pull/16



[jira] [Issue Comment Deleted] (TIKA-1435) Update rome dependency to 1.5

2014-10-03 Thread Johannes Mockenhaupt (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Johannes Mockenhaupt updated TIKA-1435:
---
Comment: was deleted

(was: PR: https://github.com/apache/tika/pull/16)



[GitHub] tika pull request: TIKA-1435: Upgrade Rome to 1.5

2014-10-03 Thread jotomo
GitHub user jotomo opened a pull request:

https://github.com/apache/tika/pull/16

TIKA-1435: Upgrade Rome to 1.5

Adopt new namespace and enjoy generics.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/jotomo/tika rome-1.5

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/tika/pull/16.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16


commit b6e3a51be79efc04fdd643378f67b2f7d3bc5af4
Author: Johannes Mockenhaupt 
Date:   2014-10-02T22:17:55Z

TIKA-1435: Upgrade Rome to 1.5

Adopt new namespace and enjoy generics.






[jira] [Commented] (TIKA-1435) Update rome dependency to 1.5

2014-10-03 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14157837#comment-14157837
 ] 

ASF GitHub Bot commented on TIKA-1435:
--

GitHub user jotomo opened a pull request:

https://github.com/apache/tika/pull/16

TIKA-1435: Upgrade Rome to 1.5

Adopt new namespace and enjoy generics.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/jotomo/tika rome-1.5

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/tika/pull/16.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16


commit b6e3a51be79efc04fdd643378f67b2f7d3bc5af4
Author: Johannes Mockenhaupt 
Date:   2014-10-02T22:17:55Z

TIKA-1435: Upgrade Rome to 1.5

Adopt new namespace and enjoy generics.






[jira] [Created] (TIKA-1435) Update rome dependency to 1.5

2014-10-03 Thread Johannes Mockenhaupt (JIRA)
Johannes Mockenhaupt created TIKA-1435:
--

 Summary: Update rome dependency to 1.5
 Key: TIKA-1435
 URL: https://issues.apache.org/jira/browse/TIKA-1435
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.6
Reporter: Johannes Mockenhaupt
Priority: Minor
 Fix For: 1.7


Rome 1.5 has been released to Sonatype 
(https://github.com/rometools/rome/issues/183), though the website 
(http://rometools.github.io/rome/) is blissfully ignorant of that. The update 
is mostly maintenance, adopting slf4j and generics as well as moving the 
namespace from _com.sun.syndication_ to _com.rometools_. PR upcoming.


