RE: Query on correct use of 'fileUrl' in TikaJAXRS Server to extract document at remote url - my request is not working

2016-09-15 Thread John Dougrez-Lewis
Ok, subject to the two security safeguards discussed, if people are ok with this, please can the 'fileUrl' functionality be scheduled to be added back in the next release? -Original Message- From: Allison, Timothy B. [mailto:talli...@mitre.org] Sent: 14 September 2016 17:55 To:

[jira] [Updated] (TIKA-2080) PDFParser tika-parsers-1.13.jar not parsing Japanese and Chinese Characters correctly

2016-09-15 Thread Kaleb Akalework (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kaleb Akalework updated TIKA-2080: -- Attachment: nihao2.pdf This is the input file I used > PDFParser tika-parsers-1.13.jar not

[jira] [Commented] (TIKA-2080) PDFParser tika-parsers-1.13.jar not parsing Japanese and Chinese Characters correctly

2016-09-15 Thread Kaleb Akalework (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15494305#comment-15494305 ] Kaleb Akalework commented on TIKA-2080: --- Opened ticket at PDFBOX under Tim Allison's advice >

[jira] [Commented] (TIKA-2080) PDFParser tika-parsers-1.13.jar not parsing Japanese and Chinese Characters correctly

2016-09-15 Thread Kaleb Akalework (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15494300#comment-15494300 ] Kaleb Akalework commented on TIKA-2080: --- Under Tim Allison's advice, I opened a ticket under PDFBOX >

[jira] [Commented] (TIKA-2080) PDFParser tika-parsers-1.13.jar not parsing Japanese and Chinese Characters correctly

2016-09-15 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15494287#comment-15494287 ] Tim Allison commented on TIKA-2080: --- Under More->Attach Files. Make sure to share it on the PDFBox

[jira] [Commented] (TIKA-2080) PDFParser tika-parsers-1.13.jar not parsing Japanese and Chinese Characters correctly

2016-09-15 Thread Kaleb Akalework (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15494272#comment-15494272 ] Kaleb Akalework commented on TIKA-2080: --- How can I share it? I don't see how to upload a file. >

[jira] [Commented] (TIKA-2080) PDFParser tika-parsers-1.13.jar not parsing Japanese and Chinese Characters correctly

2016-09-15 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15494269#comment-15494269 ] Tim Allison commented on TIKA-2080: --- Please open a new issue on PDFBox's JIRA and link it to this one.

[jira] [Comment Edited] (TIKA-2080) PDFParser tika-parsers-1.13.jar not parsing Japanese and Chinese Characters correctly

2016-09-15 Thread Kaleb Akalework (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15494056#comment-15494056 ] Kaleb Akalework edited comment on TIKA-2080 at 9/15/16 5:45 PM: Thanks. I

[jira] [Comment Edited] (TIKA-2080) PDFParser tika-parsers-1.13.jar not parsing Japanese and Chinese Characters correctly

2016-09-15 Thread Kaleb Akalework (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15494056#comment-15494056 ] Kaleb Akalework edited comment on TIKA-2080 at 9/15/16 5:44 PM: Thanks. I

[jira] [Commented] (TIKA-2080) PDFParser tika-parsers-1.13.jar not parsing Japanese and Chinese Characters correctly

2016-09-15 Thread Kaleb Akalework (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15494056#comment-15494056 ] Kaleb Akalework commented on TIKA-2080: --- Thanks. I still see the problem with the new PDFBox 2.0.3

Re: PDF with embedded attachments and Tika 2.0 modularity

2016-09-15 Thread Bob Paulin
Hi Sergey, I definitely get the challenges. In fact recently we merged the PDF module into the Multimedia module due to the tight coupling around the TesseractOCR[1] [2]. We could look into separating the PDF parser out again but I'm a bit short on a simple way to do it with TesseractOCR in

[jira] [Commented] (TIKA-2055) Exception on parsing .docx file

2016-09-15 Thread Hudson (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15493943#comment-15493943 ] Hudson commented on TIKA-2055: -- SUCCESS: Integrated in Jenkins build Tika-trunk #1100 (See

[jira] [Commented] (TIKA-2069) Extract Macro text from Microsoft Office documents

2016-09-15 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15493924#comment-15493924 ] Nick Burch commented on TIKA-2069: -- Yes! If you wrote a VB Script, and zipped it up, it'd be a

[jira] [Commented] (TIKA-2080) PDFParser tika-parsers-1.13.jar not parsing Japanese and Chinese Characters correctly

2016-09-15 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15493904#comment-15493904 ] Tim Allison commented on TIKA-2080: --- I just updated our wiki (see link above) to include the literal

[jira] [Commented] (TIKA-2080) PDFParser tika-parsers-1.13.jar not parsing Japanese and Chinese Characters correctly

2016-09-15 Thread Kaleb Akalework (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15493861#comment-15493861 ] Kaleb Akalework commented on TIKA-2080: --- So far I have been using the parser directly from Tika, but

[jira] [Commented] (TIKA-1194) Missing text from MS Word (DOC) file

2016-09-15 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15493888#comment-15493888 ] Tim Allison commented on TIKA-1194: --- Y, this is still failing. > Missing text from MS Word (DOC) file >

[jira] [Resolved] (TIKA-1437) encoding issue in AutoDetectReader

2016-09-15 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-1437. --- Resolution: Cannot Reproduce Accents seem to work as expected with trunk. This may have been fixed

[jira] [Commented] (TIKA-1829) org.apache.tika.parser.ocr.TesseractOCRParser.getSupportedTypes(TesseractOCRParser.java:92) NPE

2016-09-15 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15493869#comment-15493869 ] Tim Allison commented on TIKA-1829: --- When would the parseContext be null? Sorry for our delay! >

[jira] [Commented] (TIKA-1760) PDF index fulltext fails.

2016-09-15 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15493838#comment-15493838 ] Tim Allison commented on TIKA-1760: --- We've upgraded to PDFBox 2.0 as of Tika 1.13. Can you confirm that

[jira] [Created] (TIKA-2080) PDFParser tika-parsers-1.13.jar not parsing Japanese and Chinese Characters correctly

2016-09-15 Thread Kaleb Akalework (JIRA)
Kaleb Akalework created TIKA-2080: - Summary: PDFParser tika-parsers-1.13.jar not parsing Japanese and Chinese Characters correctly Key: TIKA-2080 URL: https://issues.apache.org/jira/browse/TIKA-2080

RE: PDF with embedded attachments and Tika 2.0 modularity

2016-09-15 Thread Allison, Timothy B.
Sergey, your point is well taken. Y, you'd need most parsers, but you can _probably_ live without advanced or scientific (sorry, Chris!). I'd be hesitant to change the structure much. We should definitely document this well, though! -Original Message- From: Sergey Beryozkin

tika-2.x-windows - Build # 46 - Still Failing

2016-09-15 Thread Apache Jenkins Server
The Apache Jenkins build system has built tika-2.x-windows (build #46) Status: Still Failing Check console output at https://builds.apache.org/job/tika-2.x-windows/46/ to view the results.

PDF with embedded attachments and Tika 2.0 modularity

2016-09-15 Thread Sergey Beryozkin
Hi All, As Tim educated me, PDF (and indeed other formats) may have all sorts of embedded attachments. In my demo I've been working with Tika 2.0-SNAPSHOT which offers a nice option for users to pick up only individual parsers. So I've added PDFParser & OpenDocumentParser and tika-core to the

[jira] [Resolved] (TIKA-1864) org.apache.poi.hssf.record.formula.UnaryPlusPtg package for tika-app-1.10

2016-09-15 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-1864. --- Resolution: Won't Fix Question for users list. > org.apache.poi.hssf.record.formula.UnaryPlusPtg

[jira] [Commented] (TIKA-1997) Problem in Tika().detect for xml file signed in CADES

2016-09-15 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15493792#comment-15493792 ] Tim Allison commented on TIKA-1997: --- [~gagravarr], any recommendations on this one? > Problem in

[jira] [Resolved] (TIKA-1838) Just a quick question regarding compatibility

2016-09-15 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-1838. --- Resolution: Won't Fix Question for users' list > Just a quick question regarding compatibility >

[jira] [Commented] (TIKA-2069) Extract Macro text from Microsoft Office documents

2016-09-15 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15493780#comment-15493780 ] Tim Allison commented on TIKA-2069: --- Makes sense, although I'd prefer to write one parser rather than

[jira] [Resolved] (TIKA-2055) Exception on parsing .docx file

2016-09-15 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-2055. --- Resolution: Fixed Fix Version/s: 1.14 2.0 > Exception on parsing .docx file

RE: Tika 1.14?

2016-09-15 Thread Allison, Timothy B.
Let me touch back in a month. ;) Looks like PDFBox 2.0.3 and POI-3.15-beta3 or POI-3.15-final will be out shortly. Any blockers/wishes on 1.14? -Original Message- From: lewis john mcgibbney [mailto:lewi...@apache.org] Sent: Friday, August 12, 2016 7:51 PM To: dev@tika.apache.org

[jira] [Commented] (TIKA-2079) Unknown embedded image file in ppt

2016-09-15 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15493303#comment-15493303 ] Tim Allison commented on TIKA-2079: --- First five bytes of the attached files: 00 01 00 00 B4 > Unknown

[jira] [Updated] (TIKA-2079) Unknown embedded image file in ppt

2016-09-15 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-2079: -- Description: We recently modified how we're extracting OLE wrapped embedded objects within ppts. On a

[jira] [Updated] (TIKA-2079) Unknown embedded image file in ppt

2016-09-15 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-2079: -- Attachment: Root Entry_46.ttf Root Entry_44.ttf Root Entry_41.ttf

[jira] [Created] (TIKA-2079) Unknown embedded image file in ppt

2016-09-15 Thread Tim Allison (JIRA)
Tim Allison created TIKA-2079: - Summary: Unknown embedded image file in ppt Key: TIKA-2079 URL: https://issues.apache.org/jira/browse/TIKA-2079 Project: Tika Issue Type: Improvement

[jira] [Commented] (TIKA-2058) Memory Leak in Tika version 1.13 when parsing millions of files

2016-09-15 Thread Tim Barrett (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15492536#comment-15492536 ] Tim Barrett commented on TIKA-2058: --- private void processFileEmbeddedInMsg(InformationGranule msgGranule,

[jira] [Commented] (TIKA-2058) Memory Leak in Tika version 1.13 when parsing millions of files

2016-09-15 Thread Tim Barrett (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15492526#comment-15492526 ] Tim Barrett commented on TIKA-2058: --- Note the poifsSileSyetm.close that is commented out there. I think

[jira] [Commented] (TIKA-2058) Memory Leak in Tika version 1.13 when parsing millions of files

2016-09-15 Thread Tim Barrett (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15492522#comment-15492522 ] Tim Barrett commented on TIKA-2058: --- private void processMsgEmbeddedInMsg(InformationGranule msgGranule,