[jira] [Closed] (TIKA-2347) Underlined text is not decorated as such when extracting from word documents

2019-04-04 Thread Konstantin Gribov (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Konstantin Gribov closed TIKA-2347. > Underlined text is not decorated as such when extracting from word documents

[jira] [Resolved] (TIKA-2601) Invalid XHTML output for some WORD documents

2019-04-04 Thread Konstantin Gribov (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Konstantin Gribov resolved TIKA-2601. Resolution: Duplicate. I mark it as a duplicate of TIKA-2555, which I'm currently looking

[jira] [Commented] (TIKA-2847) OutOfMemoryError - tika1.19.1.jar

2019-04-04 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16810288#comment-16810288 ] Tim Allison commented on TIKA-2847: Last hope: {noformat} PDFParserConfig pdfParserConfig = new

[jira] [Commented] (TIKA-2749) OCR on PDFs should "just work" out of the box

2019-04-04 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16810256#comment-16810256 ] Tim Allison commented on TIKA-2749: [~rossj], this is very helpful...any recs on how to detect "not a

[jira] [Commented] (TIKA-2749) OCR on PDFs should "just work" out of the box

2019-04-04 Thread Ross Johnson (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16810172#comment-16810172 ] Ross Johnson commented on TIKA-2749: OCRing the inlined images directly can be tricky, in my

[jira] [Commented] (TIKA-2847) OutOfMemoryError - tika1.19.1.jar

2019-04-04 Thread Ashish Tiwari (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16810165#comment-16810165 ] Ashish Tiwari commented on TIKA-2847: Yes, TikaInputStream.get(infile) gave me the same error.

[jira] [Assigned] (TIKA-2555) Text with [underline] + [another format] in word document generates overlapping html tags.

2019-04-04 Thread Konstantin Gribov (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Konstantin Gribov reassigned TIKA-2555: Assignee: Konstantin Gribov > Text with [underline] + [another format] in word

[jira] [Comment Edited] (TIKA-2749) OCR on PDFs should "just work" out of the box

2019-04-04 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16809855#comment-16809855 ] Tim Allison edited comment on TIKA-2749 at 4/4/19 5:47 PM: There are several

[jira] [Commented] (TIKA-2847) OutOfMemoryError - tika1.19.1.jar

2019-04-04 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16810150#comment-16810150 ] Tim Allison commented on TIKA-2847: Sorry. I meant {{TikaInputStream.get(infile)}}...

[jira] [Commented] (TIKA-2847) OutOfMemoryError - tika1.19.1.jar

2019-04-04 Thread Ashish Tiwari (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16810146#comment-16810146 ] Ashish Tiwari commented on TIKA-2847: Thanks, Tim. Setting "setUseSAXDocxExtractor" to true worked for

[jira] [Commented] (TIKA-2847) OutOfMemoryError - tika1.19.1.jar

2019-04-04 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16810131#comment-16810131 ] Tim Allison commented on TIKA-2847: Try opening the InputStream with

[jira] [Comment Edited] (TIKA-2749) OCR on PDFs should "just work" out of the box

2019-04-04 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16809855#comment-16809855 ] Tim Allison edited comment on TIKA-2749 at 4/4/19 5:13 PM: There are several

[jira] [Comment Edited] (TIKA-2749) OCR on PDFs should "just work" out of the box

2019-04-04 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16809855#comment-16809855 ] Tim Allison edited comment on TIKA-2749 at 4/4/19 5:12 PM: There are several

[jira] [Comment Edited] (TIKA-2749) OCR on PDFs should "just work" out of the box

2019-04-04 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16809855#comment-16809855 ] Tim Allison edited comment on TIKA-2749 at 4/4/19 5:08 PM: There are several

[jira] [Commented] (TIKA-2749) OCR on PDFs should "just work" out of the box

2019-04-04 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16810071#comment-16810071 ] Tim Allison commented on TIKA-2749: Thank you, [~tilman]. Fixed. > OCR on PDFs should "just work" out

[jira] [Comment Edited] (TIKA-2749) OCR on PDFs should "just work" out of the box

2019-04-04 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16809855#comment-16809855 ] Tim Allison edited comment on TIKA-2749 at 4/4/19 4:49 PM: There are several

[jira] [Commented] (TIKA-2749) OCR on PDFs should "just work" out of the box

2019-04-04 Thread Tilman Hausherr (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16810059#comment-16810059 ] Tilman Hausherr commented on TIKA-2749: You probably mean "vector graphics". > OCR on PDFs should

[jira] [Commented] (TIKA-2847) OutOfMemoryError - tika1.19.1.jar

2019-04-04 Thread Ashish Tiwari (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16809992#comment-16809992 ] Ashish Tiwari commented on TIKA-2847: Please find the code snippet below; it is used for

[jira] [Commented] (TIKA-2840) windows batch file not detected

2019-04-04 Thread chandra (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16809949#comment-16809949 ] chandra commented on TIKA-2840: Hi Tim, it looks like simple batch files that start with upper-case @ECHO

[jira] [Commented] (TIKA-2847) OutOfMemoryError - tika1.19.1.jar

2019-04-04 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16809928#comment-16809928 ] Tim Allison commented on TIKA-2847: How are you loading the PDF? Can you attach it/share it? You may

[jira] [Commented] (TIKA-2847) OutOfMemoryError - tika1.19.1.jar

2019-04-04 Thread Ashish Tiwari (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16809904#comment-16809904 ] Ashish Tiwari commented on TIKA-2847: Thanks, Tim. I will check by setting the SAX docx parser, but what in

[jira] [Comment Edited] (TIKA-2749) OCR on PDFs should "just work" out of the box

2019-04-04 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16809855#comment-16809855 ] Tim Allison edited comment on TIKA-2749 at 4/4/19 1:44 PM: There are several

[jira] [Commented] (TIKA-2749) OCR on PDFs should "just work" out of the box

2019-04-04 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16809855#comment-16809855 ] Tim Allison commented on TIKA-2749: There are several reasons why one might want to run OCR on a PDF