[jira] [Commented] (TIKA-2069) Extract Macro text from Microsoft Office documents
[ https://issues.apache.org/jira/browse/TIKA-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15504643#comment-15504643 ] Tim Allison commented on TIKA-2069: --- Just realized that we might want to handle extraction of Actions and/or javascript from PDFs in a similar way? New+related ticket if anyone has an interest? > Extract Macro text from Microsoft Office documents > -- > > Key: TIKA-2069 > URL: https://issues.apache.org/jira/browse/TIKA-2069 > Project: Tika > Issue Type: Improvement > Components: detector, parser >Affects Versions: 1.13 > Environment: RHEL 5.x, Apache Tomcat >Reporter: Jeff Swindle > Labels: features > Attachments: excel-macro.PNG, test-macro-doc.docm, > test-macro-doc.docm-tika-app-output.txt, word-macro.PNG, xlsmacro.xlsm, > xlsmacro.xlsm.tika-app-output.txt > > > Tika supports macro-enabled Microsoft Office documents by extracting metadata > and contents, however, macros within the document are not in the metadata or > content output. > Desire is to have the macro text extracted also. > Info regarding macro extraction: http://www.decalage.info/vba_tools -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-2084) Create resettable OutputStream to support "backoff on exception" strategy
[ https://issues.apache.org/jira/browse/TIKA-2084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-2084: -- Description: If we want a backoff on exception strategy, "try xmlparser, if that fails, try the TXTParser", we may want to have a resettable outputstream/contenthandler to clear what had been written by the first parser. (was: If we want a backoff on exception strategy, "try xmlparser, if that fails, try the TXTParser", we'll may want to have a resettable outputstream/contenthandler to clear what had been written by the first parser.) > Create resettable OutputStream to support "backoff on exception" strategy > - > > Key: TIKA-2084 > URL: https://issues.apache.org/jira/browse/TIKA-2084 > Project: Tika > Issue Type: New Feature > Components: core >Reporter: Tim Allison > > If we want a backoff on exception strategy, "try xmlparser, if that fails, > try the TXTParser", we may want to have a resettable > outputstream/contenthandler to clear what had been written by the first > parser. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-2084) Create resettable OutputStream to support "backoff on exception" strategy
[ https://issues.apache.org/jira/browse/TIKA-2084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-2084: -- Description: If we want a backoff on exception strategy, "try xmlparser, if that fails, try the TXTParser", we'll may want to have a resettable outputstream/contenthandler to clear what had been written by the first parser. (was: If we want a backoff on exception strategy, "try xmlparser, if that fails, try the TXTParser", we'll need to have a resettable outputstream/contenthandler to clear what had been written by the first parser.) > Create resettable OutputStream to support "backoff on exception" strategy > - > > Key: TIKA-2084 > URL: https://issues.apache.org/jira/browse/TIKA-2084 > Project: Tika > Issue Type: New Feature > Components: core >Reporter: Tim Allison > > If we want a backoff on exception strategy, "try xmlparser, if that fails, > try the TXTParser", we'll may want to have a resettable > outputstream/contenthandler to clear what had been written by the first > parser. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-2084) Create resettable OutputStream to support "backoff on exception" strategy
[ https://issues.apache.org/jira/browse/TIKA-2084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15504592#comment-15504592 ] Tim Allison commented on TIKA-2084: --- Good point. Thank you. > Create resettable OutputStream to support "backoff on exception" strategy > - > > Key: TIKA-2084 > URL: https://issues.apache.org/jira/browse/TIKA-2084 > Project: Tika > Issue Type: New Feature > Components: core >Reporter: Tim Allison > > If we want a backoff on exception strategy, "try xmlparser, if that fails, > try the TXTParser", we'll need to have a resettable > outputstream/contenthandler to clear what had been written by the first > parser. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Re: Plans for the first Tika 2.0 release
I think that could work! I've also created a custom filter that might help
https://issues.apache.org/jira/browse/TIKA-2083?filter=12338448
Logic is as follows:
project = TIKA AND affectedVersion = 2.0 AND priority >= Blocker AND status != Closed AND status != Fixed
- Bob
On 9/19/2016 1:40 PM, Allison, Timothy B. wrote:
>> Should we create a tika-2_0-blocker label to differentiate from regular "blockers"?
> How about a single master issue: TIKA-2085. What else do we need to add?
[jira] [Commented] (TIKA-2084) Create resettable OutputStream to support "backoff on exception" strategy
[ https://issues.apache.org/jira/browse/TIKA-2084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15504321#comment-15504321 ] Luis Filipe Nassif commented on TIKA-2084: -- I think the reset could be optional, because in some cases the first parser, even when it throws an exception, can extract valuable content, for example when the exception is thrown while parsing the last page of a docx or pdf (when the flag to catch exceptions per page is not set). > Create resettable OutputStream to support "backoff on exception" strategy > - > > Key: TIKA-2084 > URL: https://issues.apache.org/jira/browse/TIKA-2084 > Project: Tika > Issue Type: New Feature > Components: core >Reporter: Tim Allison > > If we want a backoff on exception strategy, "try xmlparser, if that fails, > try the TXTParser", we'll need to have a resettable > outputstream/contenthandler to clear what had been written by the first > parser. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
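For readers following TIKA-2084, here is a minimal sketch of the "backoff on exception" idea with the reset made optional, as suggested in the comment above. It is not Tika code: `BackoffSketch`, `SketchParser` and `parseWithBackoff` are hypothetical names, and the sketch assumes the output can be buffered in memory (a plain `java.io.ByteArrayOutputStream`) rather than streamed straight to a content handler.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.List;

/** Hypothetical sketch; none of these names are part of the Tika API. */
final class BackoffSketch {

    /** Stand-in for a parser: writes extracted text to out and may fail part-way. */
    interface SketchParser {
        void parse(String input, OutputStream out) throws IOException;
    }

    /**
     * Tries each parser in order and returns the buffered text of the first
     * one that completes. When a parser fails, whatever it already wrote is
     * discarded if resetOnFailure is true and kept if it is false.
     */
    static String parseWithBackoff(String input, List<SketchParser> parsers,
                                   boolean resetOnFailure) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        IOException last = null;
        for (SketchParser parser : parsers) {
            try {
                parser.parse(input, buffer);
                return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
            } catch (IOException e) {
                last = e;
                if (resetOnFailure) {
                    buffer.reset(); // clear what the failed parser had written
                }
            }
        }
        // Every parser failed (or none was configured).
        throw new IllegalStateException("all parsers failed", last);
    }
}
```

With `resetOnFailure` set to `false`, partial output from the failed parser stays in front of the fallback parser's output, which is the trade-off the comment describes: valuable partial content is preserved, at the cost of possibly duplicated or inconsistent text.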
RE: Plans for the first Tika 2.0 release
> Should we create a tika-2_0-blocker label to differentiate from regular "blockers"?
How about a single master issue: TIKA-2085. What else do we need to add?
[jira] [Updated] (TIKA-1509) Create configurable strategies for composite parsers
[ https://issues.apache.org/jira/browse/TIKA-1509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1509: -- Issue Type: Sub-task (was: Improvement) Parent: TIKA-2085 > Create configurable strategies for composite parsers > > > Key: TIKA-1509 > URL: https://issues.apache.org/jira/browse/TIKA-1509 > Project: Tika > Issue Type: Sub-task >Reporter: Tim Allison > > Several parsers can handle the same mime type, and we are currently ordering > which parser is chosen (roughly) by the alphabetic order of the parser class > name. > Let's allow users to configure strategies for picking parsers. > See and contribute to full discussion here: > http://wiki.apache.org/tika/CompositeParserDiscussion -- This message was sent by Atlassian JIRA (v6.3.4#6332)
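As an illustration of what "configurable strategies for picking parsers" in TIKA-1509 could mean, the sketch below contrasts the roughly alphabetical behaviour described in the issue with a user-supplied preference list. All names (`ParserChoiceSketch`, `SelectionStrategy`, `preferring`) and the `org.example` class names are hypothetical and are not taken from Tika or from the wiki discussion.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch; not the design proposed on the Tika wiki. */
final class ParserChoiceSketch {

    /** Picks one parser class name among those that handle a given MIME type. */
    interface SelectionStrategy {
        String choose(List<String> candidateClassNames);
    }

    /**
     * Approximates the current behaviour described in the issue: the first
     * class name in alphabetical order wins. Assumes a non-empty candidate list.
     */
    static final SelectionStrategy ALPHABETICAL = candidates -> {
        List<String> sorted = new ArrayList<>(candidates);
        sorted.sort(Comparator.naturalOrder());
        return sorted.get(0);
    };

    /**
     * User-configured strategy: the first preferred name that is actually a
     * candidate wins; if none matches, fall back to alphabetical order.
     */
    static SelectionStrategy preferring(List<String> preferredOrder) {
        return candidates -> {
            for (String name : preferredOrder) {
                if (candidates.contains(name)) {
                    return name;
                }
            }
            return ALPHABETICAL.choose(candidates);
        };
    }
}
```

Keeping the strategy behind a single-method interface means additional policies could be plugged in without touching the code that looks up candidates.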
[jira] [Updated] (TIKA-2084) Create resettable OutputStream to support "backoff on exception" strategy
[ https://issues.apache.org/jira/browse/TIKA-2084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-2084: -- Issue Type: New Feature (was: Sub-task) Parent: (was: TIKA-1509) > Create resettable OutputStream to support "backoff on exception" strategy > - > > Key: TIKA-2084 > URL: https://issues.apache.org/jira/browse/TIKA-2084 > Project: Tika > Issue Type: New Feature > Components: core >Reporter: Tim Allison > > If we want a backoff on exception strategy, "try xmlparser, if that fails, > try the TXTParser", we'll need to have a resettable > outputstream/contenthandler to clear what had been written by the first > parser. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata
[ https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1607: -- Issue Type: Sub-task (was: Improvement) Parent: TIKA-2085 > Introduce new arbitrary object key/values data structure for persistence of > Tika Metadata > - > > Key: TIKA-1607 > URL: https://issues.apache.org/jira/browse/TIKA-1607 > Project: Tika > Issue Type: Sub-task > Components: core, metadata >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Critical > Fix For: 1.14 > > Attachments: TIKA-1607_bytes_dom_values.patch, > TIKA-1607v1_rough_rough.patch, TIKA-1607v2_rough_rough.patch, > TIKA-1607v3.patch > > > I am currently working on implementing more comprehensive extraction and > enhancement of the Tika support for phone number extraction and metadata > modeling. > Right now we utilize the String[] multivalued support available within Tika > to persist phone numbers as > {code} > Metadata: String: String[] > Metadata: phonenumbers: number1, number2, number3, ... > {code} > I would like to propose we extend multi-valued support outside of the > String[] paradigm by implementing a more abstract Collection of Objects such > that we could consider and implement the phone number use case as follows > {code} > Metadata: String: Object > {code} > Where Object could be a Collection, e.g. > {code} > Metadata: phonenumbers: [(+162648743476: (LibPN-CountryCode : US), > (LibPN-NumberType: International), (etc: etc)...), (+1292611054: > (LibPN-CountryCode : UK), (LibPN-NumberType: International), (etc: etc)...), > (etc)] > {code} > There are obvious backwards compatibility issues with this approach... > additionally it is a fundamental change to the core Metadata API. I hope, however, that > the Mapping is flexible enough to allow me to model > Tika Metadata the way I want. > Any comments folks? Thanks > Lewis -- This message was sent by Atlassian JIRA (v6.3.4#6332)
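To make the `Metadata: String: Object` proposal in TIKA-1607 concrete, here is a small sketch of a key-to-objects store that still offers a `String[]` view for backwards compatibility. This is an assumption-laden illustration, not the patch attached to the issue: `RichMetadataSketch`, `PhoneNumber`, `getObjects` and `getValues` are hypothetical names; only the example keys and the first phone number come from the issue text.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch; not the Tika Metadata class or the attached patches. */
final class RichMetadataSketch {

    /** Each key maps to an ordered list of arbitrary objects. */
    private final Map<String, List<Object>> values = new LinkedHashMap<>();

    void add(String key, Object value) {
        values.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
    }

    /** Rich view: the stored objects themselves (empty list if the key is absent). */
    List<Object> getObjects(String key) {
        return values.getOrDefault(key, new ArrayList<>());
    }

    /** Backwards-compatible view: every value flattened to a String. */
    String[] getValues(String key) {
        List<Object> objects = getObjects(key);
        String[] out = new String[objects.size()];
        for (int i = 0; i < out.length; i++) {
            out[i] = String.valueOf(objects.get(i));
        }
        return out;
    }

    /** Example structured value: a number plus free-form attributes. */
    static final class PhoneNumber {
        final String number;
        final Map<String, String> attributes = new LinkedHashMap<>();

        PhoneNumber(String number) {
            this.number = number;
        }

        PhoneNumber with(String name, String value) {
            attributes.put(name, value);
            return this;
        }

        @Override
        public String toString() {
            return number; // the flattened form seen by String[] callers
        }
    }
}
```

A caller could store `new RichMetadataSketch.PhoneNumber("+162648743476").with("LibPN-CountryCode", "US")` under `phonenumbers`; code that only understands strings would still see the bare number through `getValues`, which is one possible way to soften the compatibility concern raised in the issue.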
[jira] [Updated] (TIKA-1974) Tika 2.0 - remove deprecated metadata properties
[ https://issues.apache.org/jira/browse/TIKA-1974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1974: -- Issue Type: Sub-task (was: Task) Parent: TIKA-2085 > Tika 2.0 - remove deprecated metadata properties > > > Key: TIKA-1974 > URL: https://issues.apache.org/jira/browse/TIKA-1974 > Project: Tika > Issue Type: Sub-task >Reporter: Tim Allison > > We have quite a few metadata properties that are deprecated. We should > remove them for 2.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-2083) Tika 2.0 - Audit master branch against 2.x branch
[ https://issues.apache.org/jira/browse/TIKA-2083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-2083: -- Issue Type: Sub-task (was: Task) Parent: TIKA-2085 > Tika 2.0 - Audit master branch against 2.x branch > - > > Key: TIKA-2083 > URL: https://issues.apache.org/jira/browse/TIKA-2083 > Project: Tika > Issue Type: Sub-task >Affects Versions: 2.0 >Reporter: Bob Paulin >Assignee: Bob Paulin >Priority: Blocker > Fix For: 2.0 > > > At this point Tika has been doing parallel development on master and the 2.x > for about 9 months. We should audit commit logs for that time to make a best > effort to identify any commits that may not have been applied in 2.x. This > task should be done prior to the 2.0 release -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (TIKA-2085) Tika 2.0 -- Overarching task list for what we need to do before 2.0
Tim Allison created TIKA-2085: - Summary: Tika 2.0 -- Overarching task list for what we need to do before 2.0 Key: TIKA-2085 URL: https://issues.apache.org/jira/browse/TIKA-2085 Project: Tika Issue Type: Task Reporter: Tim Allison Let's use this issue to track issues that absolutely, positively have to be completed before we release Tika 2.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
RE: Plans for the first Tika 2.0 release
>> 1) Implement various strategies for chaining multiple parsers against individual files. Much of this has been implemented, but what's holding us up on this one (I think?) is a resettable outputstream.
> I think we need a JIRA for this. Is there any existing design ideas on how this would be achieved?
Opened TIKA-2084 as subtask of TIKA-1509
>> 2) Rich metadata (TIKA-1607)
> This is great. I think we need to ensure we have JIRAs for all the features we consider blockers and label them as such. This looks like there's a lot of good discussion. It also references TIKA-1903 so is that also a Tika 2.0 blocker?
TIKA-1903 is not a blocker on 2.0, and may be obviated by TIKA-1607.
>> 1) Get rid of old metadata tags in favor of "new" Dublin core
> Need JIRA?
Sorry, opened a good while ago: TIKA-1974
> If we can't get a date we should at least try to eliminate the ???. I think we need to close down the feature set.
Y, completely agree. Should we create a tika-2_0-blocker label to differentiate from regular "blockers"?
[jira] [Created] (TIKA-2084) Create resettable OutputStream to support "backoff on exception" strategy
Tim Allison created TIKA-2084: - Summary: Create resettable OutputStream to support "backoff on exception" strategy Key: TIKA-2084 URL: https://issues.apache.org/jira/browse/TIKA-2084 Project: Tika Issue Type: Sub-task Components: core Reporter: Tim Allison If we want a backoff on exception strategy, "try xmlparser, if that fails, try the TXTParser", we'll need to have a resettable outputstream/contenthandler to clear what had been written by the first parser. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (TIKA-2083) Tika 2.0 - Audit master branch against 2.x branch
Bob Paulin created TIKA-2083: Summary: Tika 2.0 - Audit master branch against 2.x branch Key: TIKA-2083 URL: https://issues.apache.org/jira/browse/TIKA-2083 Project: Tika Issue Type: Task Affects Versions: 2.0 Reporter: Bob Paulin Assignee: Bob Paulin Priority: Blocker Fix For: 2.0 At this point Tika has been doing parallel development on master and the 2.x for about 9 months. We should audit commit logs for that time to make a best effort to identify any commits that may not have been applied in 2.x. This task should be done prior to the 2.0 release -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Re: Plans for the first Tika 2.0 release
Thanks Tim! Replies in line.
- Bob
On 9/19/2016 12:33 PM, Allison, Timothy B. wrote:
> Bob,
> As always, thank you for driving 2.0!
>> My concern is we have been dual maintaining 2 branches for about 9 months. I think the longer we do this the more risk there is that we miss something.
> Agreed. I think we're already missing a few things.
Yikes is there a way we can audit what we might have missed? Perhaps we need a JIRA to do an audit of the commits in master and do a best effort of what might have been missed? I can create the JIRA for this.
>> Would it make sense to at least put a date out there for a feature cut off?
> I'd be hesitant to do this. To my mind, the key is the actual features and devs who have time to implement them.
Ok this is a start to understand what the blocking features are. The key will be creating concrete JIRAs for them and identifying where we are at.
> For me, the blocking new features are:
> 1) Implement various strategies for chaining multiple parsers against individual files. Much of this has been implemented, but what's holding us up on this one (I think?) is a resettable outputstream.
I think we need a JIRA for this. Is there any existing design ideas on how this would be achieved?
> 2) Rich metadata (TIKA-1607)
This is great. I think we need to ensure we have JIRAs for all the features we consider blockers and label them as such. This looks like there's a lot of good discussion. It also references TIKA-1903 so is that also a Tika 2.0 blocker?
> The blocking tasks:
> 1) Get rid of old metadata tags in favor of "new" Dublin core
Need JIRA?
> 2) ???
If we can't get a date we should at least try to eliminate the ???. I think we need to close down the feature set.
> I'm full up on other stuff at the moment, perhaps after we get 1.14 out, I can turn to 2.0-specific development. What else do we have to do? Anyone else have some time?
Yes please would be great to see if there are people that want to own work on the above features. Once we have JIRAs we can post to the Apache Help Wanted page as well. Thanks!
> Cheers,
> Tim
> -----Original Message-----
> From: Bob Paulin [mailto:b...@bobpaulin.com]
> Sent: Monday, September 19, 2016 10:32 AM
> To: dev@tika.apache.org
> Subject: Re: Plans for the first Tika 2.0 release
> Hi, I think it's a good thing to discuss. I know there are other features that are targeted for 2.0. Do we have a general sense of where those features are at? My concern is we have been dual maintaining 2 branches for about 9 months. I think the longer we do this the more risk there is that we miss something. Would it make sense to at least put a date out there for a feature cut off? There's always 3.0 if things are not close to being ready.
> - Bob
RE: Plans for the first Tika 2.0 release
Bob, As always, thank you for driving 2.0! > My concern is we have been dual maintaining 2 branches for about 9 months. I > think the longer we do this the more risk there is that we miss something. Agreed. I think we're already missing a few things. > Would it make sense to at least put a date out there for a feature cut off? I'd be hesitant to do this. To my mind, the key is the actual features and devs who have time to implement them. For me, the blocking new features are: 1) Implement various strategies for chaining multiple parsers against individual files. Much of this has been implemented, but what's holding us up on this one (I think?) is a resettable outputstream. 2) Rich metadata (TIKA-1607) The blocking tasks: 1) Get rid of old metadata tags in favor of "new" Dublin core 2) ??? I'm full up on other stuff at the moment, perhaps after we get 1.14 out, I can turn to 2.0-specific development. What else do we have to do? Anyone else have some time? Cheers, Tim -Original Message- From: Bob Paulin [mailto:b...@bobpaulin.com] Sent: Monday, September 19, 2016 10:32 AM To: dev@tika.apache.org Subject: Re: Plans for the first Tika 2.0 release Hi, I think it's a good thing to discuss. I know there are other features that are targeted for 2.0. Do we have a general sense of where those features are at? My concern is we have been dual maintaining 2 branches for about 9 months. I think the longer we do this the more risk there is that we miss something. Would it make sense to at least put a date out there for a feature cut off? There's always 3.0 if things are not close to being ready. - Bob
[jira] [Commented] (TIKA-1997) Problem in Tika().detect for xml file signed in CADES
[ https://issues.apache.org/jira/browse/TIKA-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15504023#comment-15504023 ] Nick Burch commented on TIKA-1997: -- Running your file through the openssl tool {{ asn1parse }} shows that its first object is of type {{ pkcs7-signedData }}. It also shows the signature from {{ INFOCERT SPA }}. So it does look to be a signed PKCS7 file, and hence Tika appears to be doing the right thing, unless I've misunderstood something about PKCS7 files and/or the asn1 dump output? > Problem in Tika().detect for xml file signed in CADES > - > > Key: TIKA-1997 > URL: https://issues.apache.org/jira/browse/TIKA-1997 > Project: Tika > Issue Type: Sub-task > Components: detector >Affects Versions: 1.13 > Environment: JDK 1.7 >Reporter: Michele Andreano >Priority: Blocker > Attachments: test.xml.p7m > > > When I submit to Tika an XML file signed in P7M format, I expect Tika to return the MIME type application/pkcs7-mime, but instead it gives me application/pkcs7-signature. > How is it possible? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Re: Plans for the first Tika 2.0 release
Hi, I think it's a good thing to discuss. I know there are other features that are targeted for 2.0. Do we have a general sense of where those features are at? My concern is we have been dual maintaining 2 branches for about 9 months. I think the longer we do this the more risk there is that we miss something. Would it make sense to at least put a date out there for a feature cut off? There's always 3.0 if things are not close to being ready. - Bob On 9/19/2016 4:32 AM, Sergey Beryozkin wrote: Hi All Back in May I updated one of our CXF demos on the master 3.2 branch to depend on Tika 2.0 SNAPSHOT to verify the new module system works well. It is feasible that CXF 3.2.0 may be released by the end of the year or early next year. As far as Tika 2.0 dependencies are concerned it will be easy for me to update the demo to temporarily depend on Tika 1.13 or 1.14. But if Tika 2.0 is released by the time CXF 3.2 is about to be released then I'll be happy to keep 2.0 deps. Are there any plans to get Tika 2.0 out in the next few months ? Cheers, Sergey
tika-2.x-windows - Build # 48 - Still Failing
The Apache Jenkins build system has built tika-2.x-windows (build #48) Status: Still Failing Check console output at https://builds.apache.org/job/tika-2.x-windows/48/ to view the results.
[jira] [Commented] (TIKA-2015) MAPIMessage String fileName constructor leaves file open
[ https://issues.apache.org/jira/browse/TIKA-2015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15503616#comment-15503616 ] Hudson commented on TIKA-2015: -- FAILURE: Integrated in Jenkins build tika-2.x-windows #48 (See [https://builds.apache.org/job/tika-2.x-windows/48/]) TIKA-2015 -- upgrade to PDFBox 2.0.3 (tallison: rev 1b32e31864829acd80620763cf3c4d928b8d8346) * (edit) tika-parser-modules/pom.xml * (edit) CHANGES.txt > MAPIMessage String fileName constructor leaves file open > > > Key: TIKA-2015 > URL: https://issues.apache.org/jira/browse/TIKA-2015 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.11, 1.12 >Reporter: Tim Barrett > > When extracting attachments from MSG resources, using the MAPIMessage constructor > with a string file path of the msg resource leads to an open file handle on the msg file > that is never closed. There is no way to close it, as MAPIMessage does not > have a close method. This behaviour first manifests itself in version 1.11 > and all subsequent versions (1.12, 1.13). Use LSOF or file-leak-detector to > reproduce this: create an instance of MAPIMessage using the string constructor; > file-leak-detector will show the open file being created at that point, and the file > handle is then never dropped. > Using the input stream constructor is a workaround, as this allows the calling > program to cleanly close the input stream. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-2051) Upgrade to PDFBox 2.0.3 when available
[ https://issues.apache.org/jira/browse/TIKA-2051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15503579#comment-15503579 ] Hudson commented on TIKA-2051: -- SUCCESS: Integrated in Jenkins build Tika-trunk #1102 (See [https://builds.apache.org/job/Tika-trunk/1102/]) TIKA-2051 -- upgrade to PDFBox 2.0.3 (tallison: rev 07aea36f71c17236782bff0b61855578722d933e) * (edit) tika-parsers/pom.xml * (edit) CHANGES.txt > Upgrade to PDFBox 2.0.3 when available > -- > > Key: TIKA-2051 > URL: https://issues.apache.org/jira/browse/TIKA-2051 > Project: Tika > Issue Type: Improvement > Components: parser >Reporter: Tim Allison >Assignee: Tim Allison >Priority: Minor > Fix For: 2.0, 1.14 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-2015) MAPIMessage String fileName constructor leaves file open
[ https://issues.apache.org/jira/browse/TIKA-2015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15503577#comment-15503577 ] Tim Allison commented on TIKA-2015: --- Doh. Typo in commit message. Should have been TIKA-2051. > MAPIMessage String fileName constructor leaves file open > > > Key: TIKA-2015 > URL: https://issues.apache.org/jira/browse/TIKA-2015 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.11, 1.12 >Reporter: Tim Barrett > > When extracting attachments from MSG resources, using the MAPIMessage constructor > with a string file path of the msg resource leads to an open file handle on the msg file > that is never closed. There is no way to close it, as MAPIMessage does not > have a close method. This behaviour first manifests itself in version 1.11 > and all subsequent versions (1.12, 1.13). Use LSOF or file-leak-detector to > reproduce this: create an instance of MAPIMessage using the string constructor; > file-leak-detector will show the open file being created at that point, and the file > handle is then never dropped. > Using the input stream constructor is a workaround, as this allows the calling > program to cleanly close the input stream. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-2015) MAPIMessage String fileName constructor leaves file open
[ https://issues.apache.org/jira/browse/TIKA-2015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15503535#comment-15503535 ] Hudson commented on TIKA-2015: -- SUCCESS: Integrated in Jenkins build tika-2.x #144 (See [https://builds.apache.org/job/tika-2.x/144/]) TIKA-2015 -- upgrade to PDFBox 2.0.3 (tallison: rev 1b32e31864829acd80620763cf3c4d928b8d8346) * (edit) tika-parser-modules/pom.xml * (edit) CHANGES.txt > MAPIMessage String fileName constructor leaves file open > > > Key: TIKA-2015 > URL: https://issues.apache.org/jira/browse/TIKA-2015 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.11, 1.12 >Reporter: Tim Barrett > > When extracting attachments from MSG resources, using the MAPIMessage constructor > with a string file path of the msg resource leads to an open file handle on the msg file > that is never closed. There is no way to close it, as MAPIMessage does not > have a close method. This behaviour first manifests itself in version 1.11 > and all subsequent versions (1.12, 1.13). Use LSOF or file-leak-detector to > reproduce this: create an instance of MAPIMessage using the string constructor; > file-leak-detector will show the open file being created at that point, and the file > handle is then never dropped. > Using the input stream constructor is a workaround, as this allows the calling > program to cleanly close the input stream. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-2082) Upgrade to PDFBox 2.0.3
[ https://issues.apache.org/jira/browse/TIKA-2082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15503467#comment-15503467 ] Tim Allison commented on TIKA-2082: --- No need to apologize whatsoever. Thank you for the ping! > Upgrade to PDFBox 2.0.3 > --- > > Key: TIKA-2082 > URL: https://issues.apache.org/jira/browse/TIKA-2082 > Project: Tika > Issue Type: Improvement > Components: parser >Affects Versions: 1.13 >Reporter: Luis Filipe Nassif > Fix For: 2.0, 1.14 > > > PDFBox 2.0.3 was released with a number of fixes. Tika should upgrade. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-2045) TIKA crashes / runs out of memory on simple PDF
[ https://issues.apache.org/jira/browse/TIKA-2045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-2045: -- Fix Version/s: 1.14 2.0 > TIKA crashes / runs out of memory on simple PDF > --- > > Key: TIKA-2045 > URL: https://issues.apache.org/jira/browse/TIKA-2045 > Project: Tika > Issue Type: Bug > Components: core >Affects Versions: 1.13 > Environment: Linux, Java 8 >Reporter: Egbert > Fix For: 2.0, 1.14 > > > We're using TIKA embedded in a webcrawler and today I've encountered a PDF > that results in OutOfMemory errors while being processed by TIKA. > It's a small, 1 page PDF file, so I don't think that it should consume that > much memory. > I verified the problem by using the GUI from the tika-app-1.13.jar file and > that results in the same error on the same file. The file can be found at: > http://www.spesmea.nl/pdf/algemene_voorwaarden_bbztcn_2010_nl.pdf > If I can help by providing any additional information, please let me know. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (TIKA-2045) TIKA crashes / runs out of memory on simple PDF
[ https://issues.apache.org/jira/browse/TIKA-2045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-2045. --- Resolution: Fixed Upgraded to PDFBox 2.0.3. > TIKA crashes / runs out of memory on simple PDF > --- > > Key: TIKA-2045 > URL: https://issues.apache.org/jira/browse/TIKA-2045 > Project: Tika > Issue Type: Bug > Components: core >Affects Versions: 1.13 > Environment: Linux, Java 8 >Reporter: Egbert > > We're using TIKA embedded in a webcrawler and today I've encountered a PDF > that results in OutOfMemory errors while being processed by TIKA. > It's a small, 1 page PDF file, so I don't think that it should consume that > much memory. > I verified the problem by using the GUI from the tika-app-1.13.jar file and > that results in the same error on the same file. The file can be found at: > http://www.spesmea.nl/pdf/algemene_voorwaarden_bbztcn_2010_nl.pdf > If I can help by providing any additional information, please let me know. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (TIKA-2051) Upgrade to PDFBox 2.0.3 when available
[ https://issues.apache.org/jira/browse/TIKA-2051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-2051. --- Resolution: Fixed Fix Version/s: 1.14 2.0 > Upgrade to PDFBox 2.0.3 when available > -- > > Key: TIKA-2051 > URL: https://issues.apache.org/jira/browse/TIKA-2051 > Project: Tika > Issue Type: Improvement > Components: parser >Reporter: Tim Allison >Assignee: Tim Allison >Priority: Minor > Fix For: 2.0, 1.14 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-2082) Upgrade to PDFBox 2.0.3
[ https://issues.apache.org/jira/browse/TIKA-2082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15503458#comment-15503458 ] Luis Filipe Nassif commented on TIKA-2082: -- Sorry Tim, did not see Tika-2051 > Upgrade to PDFBox 2.0.3 > --- > > Key: TIKA-2082 > URL: https://issues.apache.org/jira/browse/TIKA-2082 > Project: Tika > Issue Type: Improvement > Components: parser >Affects Versions: 1.13 >Reporter: Luis Filipe Nassif > Fix For: 2.0, 1.14 > > > PDFBox 2.0.3 was released with a number of fixes. Tika should upgrade. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (TIKA-2082) Upgrade to PDFBox 2.0.3
[ https://issues.apache.org/jira/browse/TIKA-2082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-2082. --- Resolution: Duplicate Fix Version/s: 2.0 Building locally now before I commit (should be 10-15 minutes). Thank you! > Upgrade to PDFBox 2.0.3 > --- > > Key: TIKA-2082 > URL: https://issues.apache.org/jira/browse/TIKA-2082 > Project: Tika > Issue Type: Improvement > Components: parser >Affects Versions: 1.13 >Reporter: Luis Filipe Nassif > Fix For: 2.0, 1.14 > > > PDFBox 2.0.3 was released with a number of fixes. Tika should upgrade. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (TIKA-2082) Upgrade to PDFBox 2.0.3
Luis Filipe Nassif created TIKA-2082: Summary: Upgrade to PDFBox 2.0.3 Key: TIKA-2082 URL: https://issues.apache.org/jira/browse/TIKA-2082 Project: Tika Issue Type: Improvement Components: parser Affects Versions: 1.13 Reporter: Luis Filipe Nassif Fix For: 1.14 PDFBox 2.0.3 was released with a number of fixes. Tika should upgrade. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Plans for the first Tika 2.0 release
Hi All Back in May I updated one of our CXF demos on the master 3.2 branch to depend on Tika 2.0 SNAPSHOT to verify the new module system works well. It is feasible that CXF 3.2.0 may be released by the end of the year or early next year. As far as Tika 2.0 dependencies are concerned it will be easy for me to update the demo to temporarily depend on Tika 1.13 or 1.14. But if Tika 2.0 is released by the time CXF 3.2 is about to be released then I'll be happy to keep 2.0 deps. Are there any plans to get Tika 2.0 out in the next few months ? Cheers, Sergey