[jira] [Commented] (TIKA-2261) TikaOcr giving different result across platforms
[ https://issues.apache.org/jira/browse/TIKA-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15858992#comment-15858992 ]

Sandeepan commented on TIKA-2261:
---------------------------------

[~talli...@mitre.org] Where do I find rotation.py? Could you please point me to its PyPI location? I am not able to figure out which package it is.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
[jira] [Commented] (TIKA-2261) TikaOcr giving different result across platforms
[ https://issues.apache.org/jira/browse/TIKA-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15858988#comment-15858988 ]

Sandeepan commented on TIKA-2261:
---------------------------------

[~talli...@mitre.org] I'll try that. Thanks.

[~lfcnassif] Actually, tesseract is on the path and AutoDetectParser is returning OCRed data, but the output is slightly different from the output on the Mac. I've attached the image on which I ran the code. I am pasting only the first two lines to keep it short.

Mac [-~-]
With the sorrow of living so great, the sorrow of punishment had to be piti-
less. We lived for the day and died for it.

=== Ubuntu ===
WiLh Lhe somw of living so greal. Lhe sorrow of punishmem had to he pikir
less. We lived fox Lhe day and died fox ir
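The gap between the two transcriptions quoted above can be put into a number, which makes it easier to tell whether a configuration change on Ubuntu helps. The following is a minimal sketch, not part of Tika: the class name `OcrDiff` and its methods are hypothetical, and treating the Mac output as the reference text is an assumption made only for illustration. It computes the Levenshtein (edit) distance between two strings and divides it by the reference length to get a rough character error rate.

```java
/** Hypothetical helper for comparing two OCR transcriptions; not part of Tika. */
final class OcrDiff {

    /** Levenshtein distance computed with two rolling rows of the DP table. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Edit distance divided by the reference length (assumption: reference is the "good" text). */
    static double errorRate(String reference, String candidate) {
        if (reference.isEmpty()) {
            return candidate.isEmpty() ? 0.0 : 1.0;
        }
        return (double) editDistance(reference, candidate) / reference.length();
    }

    public static void main(String[] args) {
        // First sentence of each transcription, copied from the comment above.
        String mac = "With the sorrow of living so great,";
        String ubuntu = "WiLh Lhe somw of living so greal.";
        System.out.println("edit distance: " + editDistance(mac, ubuntu));
        System.out.println("error rate:    " + errorRate(mac, ubuntu));
    }
}
```

Running the same measurement on the full Tika output from each platform gives a single figure to track while changing one setting at a time.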
[jira] [Updated] (TIKA-2261) TikaOcr giving different result across platforms
[ https://issues.apache.org/jira/browse/TIKA-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sandeepan updated TIKA-2261:
----------------------------
    Attachment: 4.png

Output for this file on Mac vs. Ubuntu (first two lines only).

Mac [-~-]
With the sorrow of living so great, the sorrow of punishment had to be piti-
less. We lived for the day and died for it.

=== Ubuntu ===
WiLh Lhe somw of living so greal. Lhe sorrow of punishmem had to he pikir
less. We lived fox Lhe day and died fox ir
[jira] [Created] (TIKA-2261) TikaOcr giving different result across platforms
Sandeepan created TIKA-2261:
-------------------------------

             Summary: TikaOcr giving different result across platforms
                 Key: TIKA-2261
                 URL: https://issues.apache.org/jira/browse/TIKA-2261
             Project: Tika
          Issue Type: Bug
    Affects Versions: 1.14
            Reporter: Sandeepan

Hi,

I am using Tika to parse every type of file, and it works great for non-image files.

My local machine is a Mac, but I deploy on Ubuntu 14.04. On the command line, I get the same result on both platforms.

Example command:

tesseract 3.jpg ouput -l eng -psm 1 txt

But when I run OCR through Java code, I get very different results, and the quality is worse on Ubuntu.

Sample code:

AutoDetectParser parser = new AutoDetectParser();
BodyContentHandler handler = new BodyContentHandler(-1);
Metadata metadata = new Metadata();
FileInputStream in = new FileInputStream(path);
parser.parse(in, handler, metadata);
parsedText = handler.toString();

On Mac:

$ tesseract -v
tesseract 3.04.01
 leptonica-1.74.1
  libjpeg 8d : libpng 1.6.28 : libtiff 4.0.7 : zlib 1.2.8

On Ubuntu:

ubuntu@ubuntu-4gb-postprocess:~$ tesseract -v
tesseract 3.04.01
 leptonica-1.74.1
  libjpeg 8d : libpng 1.6.28 : libtiff 4.0.7 : zlib 1.2.8

I am not able to figure out what the issue is.
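One way to narrow down a report like this is to run the exact command-line invocation from Java on both machines and compare its output with what Tika returns. The sketch below is only an illustration under stated assumptions: the class `TesseractCli` and its methods are hypothetical and not part of Tika, the arguments simply mirror the example command in the report, and the program assumes the `tesseract` binary is on the `PATH`. If the direct invocation matches across platforms while the Tika output does not, that would suggest looking at how the OCR step is configured rather than at Tesseract itself.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that launches the tesseract CLI directly; not part of Tika. */
final class TesseractCli {

    /** Builds the argument list in the same order as the command quoted in the report. */
    static List<String> buildCommand(String image, String outputBase, String language, int psm) {
        List<String> command = new ArrayList<>();
        command.add("tesseract");
        command.add(image);
        command.add(outputBase);
        command.add("-l");
        command.add(language);
        command.add("-psm");
        command.add(Integer.toString(psm));
        command.add("txt");
        return command;
    }

    /** Runs the command, forwarding its output to this process, and returns the exit code. */
    static int run(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command).inheritIO().start();
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        List<String> command = buildCommand("3.jpg", "ouput", "eng", 1);
        // Log the exact command so the two platforms can be compared line by line.
        System.out.println(String.join(" ", command));
        System.out.println("exit code: " + run(command));
    }
}
```

Printing the joined command before running it makes any difference in arguments between the two environments visible in the logs.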
[jira] [Commented] (TIKA-1422) org.apache.tika.parser.mail.RFC822ParserTest fails
[ https://issues.apache.org/jira/browse/TIKA-1422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15858006#comment-15858006 ]

Sandeepan commented on TIKA-1422:
---------------------------------

[~thaichat04] I am also getting different results when using Tesseract through Tika on Mac vs. Ubuntu. From the command line, it gives the same result on both platforms. Were you able to find the reason?

> org.apache.tika.parser.mail.RFC822ParserTest fails
> --------------------------------------------------
>
>                 Key: TIKA-1422
>                 URL: https://issues.apache.org/jira/browse/TIKA-1422
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Chris A. Mattmann
>            Assignee: Chris A. Mattmann
>              Labels: memex
>             Fix For: 1.7
>
>         Attachments: TIKA-1422.Mattmann.100114.patch.txt, TIKA-1422.Mattmann.100414.patch.txt, TIKA-1422.oleg.20141021.patch, TIKA-1422.palsulich.100414.patch, TIKA-1422.palsulich.100714.patch
>
>
> I'm seeing test failures from:
> {noformat}
> Results :
> Failed tests:   testMultipart(org.apache.tika.parser.mail.RFC822ParserTest): (..)
> Tests run: 538, Failures: 1, Errors: 0, Skipped: 1
> {noformat}
> CentOS6 VM image, running:
> {noformat}
> [mattmann@memex tika]$ java -version
> java version "1.7.0_67"
> Java(TM) SE Runtime Environment (build 1.7.0_67-b01)
> Java HotSpot(TM) 64-Bit Server VM (build 24.65-b04, mixed mode)
> [mattmann@memex tika]$ mvn -version
> Apache Maven 3.2.1 (ea8b2b07643dbb1b84b6d16e1f08391b666bc1e9; 2014-02-14T09:37:52-08:00)
> Maven home: /usr/share/apache-maven
> Java version: 1.7.0_65, vendor: Oracle Corporation
> Java home: /data/home/mattmann/dist/jdk1.7.0_65/jre
> Default locale: en_US, platform encoding: UTF-8
> OS name: "linux", version: "2.6.32-431.23.3.el6.centos.plus.x86_64", arch: "amd64", family: "unix"
> [mattmann@memex tika]$
> {noformat}
> Here are the surefire reports - no clue what's up here:
> {noformat}
> [mattmann@memex tika]$ more tika-parsers/target/surefire-reports/org.apache.tika.parser.mail.RFC822ParserTest.txt
> -------------------------------------------------------------------------------
> Test set: org.apache.tika.parser.mail.RFC822ParserTest
> -------------------------------------------------------------------------------
> Tests run: 8, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.699 sec <<< FAILURE!
> testMultipart(org.apache.tika.parser.mail.RFC822ParserTest)  Time elapsed: 0.152 sec  <<< FAILURE!
> org.mockito.exceptions.verification.TooManyActualInvocations:
> xHTMLContentHandler.startElement(
>     "http://www.w3.org/1999/xhtml",
>     "div",
>     "div",
>     isA(org.xml.sax.Attributes)
> );
> Wanted 4 times but was 5
>         at org.apache.tika.parser.mail.RFC822ParserTest.testMultipart(RFC822ParserTest.java:87)
> Caused by: org.mockito.exceptions.cause.UndesiredInvocation:
> Undesired invocation:
>         at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>         at org.apache.tika.sax.SafeContentHandler.startElement(SafeContentHandler.java:264)
>         at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:254)
>         at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>         at org.apache.tika.sax.xpath.MatchingContentHandler.startElement(MatchingContentHandler.java:60)
>         at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>         at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>         at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>         at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>         at org.apache.tika.sax.SafeContentHandler.startElement(SafeContentHandler.java:264)
>         at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:254)
>         at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:284)
>         at org.apache.tika.parser.ocr.TesseractOCRParser.extractOutput(TesseractOCRParser.java:243)
>         at org.apache.tika.parser.ocr.TesseractOCRParser.parse(TesseractOCRParser.java:155)
>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:247)
>         at org.apache.tika.parser.mail.MailContentHandler.body(MailContentHandler.java:102)
>         at org.apache.james.mime4j.parser.MimeStreamParser.parse(MimeStreamParser.java:133)
>         at org.apache.tika.parser.mail.RFC822Parser.parse(RFC822Parser.java:76)
>         at org.apache.tika.parser.mail.RFC822ParserTest.testMultipart(RFC822ParserTest.java:84)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.Native