[jira] [Commented] (TIKA-2069) Extract Macro text from Microsoft Office documents

2016-09-19 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15504643#comment-15504643
 ] 

Tim Allison commented on TIKA-2069:
---

Just realized that we might want to handle extraction of Actions and/or 
JavaScript from PDFs in a similar way?  New+related ticket if anyone has an 
interest?

> Extract Macro text from Microsoft Office documents
> --
>
> Key: TIKA-2069
> URL: https://issues.apache.org/jira/browse/TIKA-2069
> Project: Tika
>  Issue Type: Improvement
>  Components: detector, parser
>Affects Versions: 1.13
> Environment: RHEL 5.x, Apache Tomcat
>Reporter: Jeff Swindle
>  Labels: features
> Attachments: excel-macro.PNG, test-macro-doc.docm, 
> test-macro-doc.docm-tika-app-output.txt, word-macro.PNG, xlsmacro.xlsm, 
> xlsmacro.xlsm.tika-app-output.txt
>
>
> Tika supports macro-enabled Microsoft Office documents by extracting metadata 
> and contents; however, macros within the document are not in the metadata or 
> content output.
> The desire is to have the macro text extracted as well.
> Info regarding macro extraction: http://www.decalage.info/vba_tools



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TIKA-2084) Create resettable OutputStream to support "backoff on exception" strategy

2016-09-19 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-2084:
--
Description: If we want a backoff on exception strategy, "try xmlparser, if 
that fails, try the TXTParser", we may want to have a resettable 
outputstream/contenthandler to clear what had been written by the first parser. 
 (was: If we want a backoff on exception strategy, "try xmlparser, if that 
fails, try the TXTParser", we'll may want to have a resettable 
outputstream/contenthandler to clear what had been written by the first parser.)

> Create resettable OutputStream to support "backoff on exception" strategy
> -
>
> Key: TIKA-2084
> URL: https://issues.apache.org/jira/browse/TIKA-2084
> Project: Tika
>  Issue Type: New Feature
>  Components: core
>Reporter: Tim Allison
>
> If we want a backoff on exception strategy, "try xmlparser, if that fails, 
> try the TXTParser", we may want to have a resettable 
> outputstream/contenthandler to clear what had been written by the first 
> parser.





[jira] [Updated] (TIKA-2084) Create resettable OutputStream to support "backoff on exception" strategy

2016-09-19 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-2084:
--
Description: If we want a backoff on exception strategy, "try xmlparser, if 
that fails, try the TXTParser", we'll may want to have a resettable 
outputstream/contenthandler to clear what had been written by the first parser. 
 (was: If we want a backoff on exception strategy, "try xmlparser, if that 
fails, try the TXTParser", we'll need to have a resettable 
outputstream/contenthandler to clear what had been written by the first parser.)

> Create resettable OutputStream to support "backoff on exception" strategy
> -
>
> Key: TIKA-2084
> URL: https://issues.apache.org/jira/browse/TIKA-2084
> Project: Tika
>  Issue Type: New Feature
>  Components: core
>Reporter: Tim Allison
>
> If we want a backoff on exception strategy, "try xmlparser, if that fails, 
> try the TXTParser", we'll may want to have a resettable 
> outputstream/contenthandler to clear what had been written by the first 
> parser.





[jira] [Commented] (TIKA-2084) Create resettable OutputStream to support "backoff on exception" strategy

2016-09-19 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15504592#comment-15504592
 ] 

Tim Allison commented on TIKA-2084:
---

Good point.  Thank you.

> Create resettable OutputStream to support "backoff on exception" strategy
> -
>
> Key: TIKA-2084
> URL: https://issues.apache.org/jira/browse/TIKA-2084
> Project: Tika
>  Issue Type: New Feature
>  Components: core
>Reporter: Tim Allison
>
> If we want a backoff on exception strategy, "try xmlparser, if that fails, 
> try the TXTParser", we'll need to have a resettable 
> outputstream/contenthandler to clear what had been written by the first 
> parser.





Re: Plans for the first Tika 2.0 release

2016-09-19 Thread Bob Paulin

I think that could work!  I've also created a custom filter that might help

https://issues.apache.org/jira/browse/TIKA-2083?filter=12338448

Logic is as follows:

project = TIKA AND affectedVersion = 2.0 AND priority >= Blocker AND 
status != Closed AND status != Fixed



- Bob


On 9/19/2016 1:40 PM, Allison, Timothy B. wrote:

>> Should we create a tika-2_0-blocker label to differentiate from regular 
>> "blockers"?
>
> How about a single master issue: TIKA-2085.
>
> What else do we need to add?




[jira] [Commented] (TIKA-2084) Create resettable OutputStream to support "backoff on exception" strategy

2016-09-19 Thread Luis Filipe Nassif (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15504321#comment-15504321
 ] 

Luis Filipe Nassif commented on TIKA-2084:
--

I think the reset could be optional, because in some cases the first parser, even 
when it throws an exception, can extract valuable content, for example, when the 
exception is thrown while parsing the last page of a docx or pdf (when the flag 
to catch exceptions per page is not set).
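
A rough sketch of what an optional reset might look like at the content-handler level, using only the SAX classes that ship with the JDK. ParseStep, ResettableTextHandler, OptionalReset and the resetOnFailure flag are hypothetical names invented for this illustration; none of them are Tika API.

```java
import java.util.Arrays;
import java.util.List;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical parsing step; stands in for a real parser. */
interface ParseStep {
    void run(ContentHandler handler) throws SAXException;
}

/** Collects character events and can forget them on request. */
class ResettableTextHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    /** Drops everything collected so far. */
    void reset() {
        text.setLength(0);
    }

    String text() {
        return text.toString();
    }
}

class OptionalReset {
    /** Sends a string to the handler as a single characters() event. */
    static void emit(ContentHandler handler, String s) throws SAXException {
        char[] chars = s.toCharArray();
        handler.characters(chars, 0, chars.length);
    }

    /** Runs steps in order until one succeeds; partial output is kept unless resetOnFailure is set. */
    static String parse(List<ParseStep> steps, boolean resetOnFailure) {
        ResettableTextHandler handler = new ResettableTextHandler();
        for (ParseStep step : steps) {
            try {
                step.run(handler);
                break; // first success wins
            } catch (SAXException e) {
                if (resetOnFailure) {
                    handler.reset(); // discard what the failed step wrote
                }
            }
        }
        return handler.text();
    }

    public static void main(String[] args) {
        ParseStep failsOnLastPage = h -> {
            emit(h, "pages 1-9 ");
            throw new SAXException("last page is corrupt");
        };
        ParseStep fallback = h -> emit(h, "fallback text");
        List<ParseStep> steps = Arrays.asList(failsOnLastPage, fallback);

        System.out.println(parse(steps, true));  // partial output discarded
        System.out.println(parse(steps, false)); // partial output kept
    }
}
```

With the flag off, this sketch still runs the fallback and appends its output after the partial content; whether the fallback should run at all in that case is a separate design decision.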

> Create resettable OutputStream to support "backoff on exception" strategy
> -
>
> Key: TIKA-2084
> URL: https://issues.apache.org/jira/browse/TIKA-2084
> Project: Tika
>  Issue Type: New Feature
>  Components: core
>Reporter: Tim Allison
>
> If we want a backoff on exception strategy, "try xmlparser, if that fails, 
> try the TXTParser", we'll need to have a resettable 
> outputstream/contenthandler to clear what had been written by the first 
> parser.





RE: Plans for the first Tika 2.0 release

2016-09-19 Thread Allison, Timothy B.
> Should we create a tika-2_0-blocker label to differentiate from regular 
> "blockers"?

How about a single master issue: TIKA-2085.

What else do we need to add?


[jira] [Updated] (TIKA-1509) Create configurable strategies for composite parsers

2016-09-19 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1509:
--
Issue Type: Sub-task  (was: Improvement)
Parent: TIKA-2085

> Create configurable strategies for composite parsers
> 
>
> Key: TIKA-1509
> URL: https://issues.apache.org/jira/browse/TIKA-1509
> Project: Tika
>  Issue Type: Sub-task
>Reporter: Tim Allison
>
> Several parsers can handle the same mime type, and we are currently ordering 
> which parser is chosen (roughly) by the alphabetic order of the parser class 
> name.
> Let's allow users to configure strategies for picking parsers.
> See and contribute to full discussion here: 
> http://wiki.apache.org/tika/CompositeParserDiscussion
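
One way to picture a configurable strategy is as a small interface with interchangeable implementations. The sketch below is only an illustration: ParserCandidate, SelectionStrategy, the priority field and the com.example class names are all hypothetical, and nothing here reflects how Tika actually selects parsers beyond the alphabetical ordering mentioned in the issue.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

/** Hypothetical description of one parser able to handle a given mime type. */
class ParserCandidate {
    final String className;
    final int priority; // assumed user-configured weight; higher wins

    ParserCandidate(String className, int priority) {
        this.className = className;
        this.priority = priority;
    }
}

/** Hypothetical strategy interface; not part of Tika's API. */
interface SelectionStrategy {
    ParserCandidate choose(List<ParserCandidate> candidates);
}

class SelectionStrategies {
    private static final Comparator<ParserCandidate> BY_NAME =
            Comparator.comparing((ParserCandidate c) -> c.className);

    /** Approximates the ordering described in the issue: alphabetical by class name. */
    static final SelectionStrategy ALPHABETICAL =
            candidates -> Collections.min(candidates, BY_NAME);

    /** Highest configured priority wins; ties fall back to class name. */
    static final SelectionStrategy HIGHEST_PRIORITY =
            candidates -> Collections.min(candidates,
                    Comparator.comparingInt((ParserCandidate c) -> c.priority)
                            .reversed()
                            .thenComparing(BY_NAME));

    public static void main(String[] args) {
        List<ParserCandidate> candidates = Arrays.asList(
                new ParserCandidate("com.example.AlphaParser", 1),
                new ParserCandidate("com.example.BetaParser", 10));

        System.out.println(ALPHABETICAL.choose(candidates).className);
        System.out.println(HIGHEST_PRIORITY.choose(candidates).className);
    }
}
```

Keeping the alphabetical ordering as one strategy among several would let existing behaviour remain the default while users opt in to something else.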





[jira] [Updated] (TIKA-2084) Create resettable OutputStream to support "backoff on exception" strategy

2016-09-19 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-2084:
--
Issue Type: New Feature  (was: Sub-task)
Parent: (was: TIKA-1509)

> Create resettable OutputStream to support "backoff on exception" strategy
> -
>
> Key: TIKA-2084
> URL: https://issues.apache.org/jira/browse/TIKA-2084
> Project: Tika
>  Issue Type: New Feature
>  Components: core
>Reporter: Tim Allison
>
> If we want a backoff on exception strategy, "try xmlparser, if that fails, 
> try the TXTParser", we'll need to have a resettable 
> outputstream/contenthandler to clear what had been written by the first 
> parser.





[jira] [Updated] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata

2016-09-19 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1607:
--
Issue Type: Sub-task  (was: Improvement)
Parent: TIKA-2085

> Introduce new arbitrary object key/values data structure for persistence of 
> Tika Metadata
> -
>
> Key: TIKA-1607
> URL: https://issues.apache.org/jira/browse/TIKA-1607
> Project: Tika
>  Issue Type: Sub-task
>  Components: core, metadata
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Critical
> Fix For: 1.14
>
> Attachments: TIKA-1607_bytes_dom_values.patch, 
> TIKA-1607v1_rough_rough.patch, TIKA-1607v2_rough_rough.patch, 
> TIKA-1607v3.patch
>
>
> I am currently working on implementing more comprehensive extraction and 
> enhancement of the Tika support for Phone number extraction and metadata 
> modeling.
> Right now we utilize the String[] multivalued support available within Tika 
> to persist phone numbers as 
> {code}
> Metadata: String: String[]
> Metadata: phonenumbers: number1, number2, number3, ...
> {code}
> I would like to propose we extend multi-valued support outside of the 
> String[] paradigm by implementing a more abstract Collection of Objects such 
> that we could consider and implement the phone number use case as follows
> {code}
> Metadata: String:  Object
> {code}
> Where Object could be a Collection e.g.
> {code}
> Metadata: phonenumbers: [(+162648743476: (LibPN-CountryCode : US), 
> (LibPN-NumberType: International), (etc: etc)...), (+1292611054: 
> (LibPN-CountryCode : UK), (LibPN-NumberType: International), (etc: etc)...) 
> (etc)] 
> {code}
> There are obvious backwards compatibility issues with this approach... 
> additionally it is a fundamental change to the core Metadata API. I hope that 
> the <String, Object> Mapping however is flexible enough to allow me to model 
> Tika Metadata the way I want.
> Any comments folks? Thanks
> Lewis
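
To make the proposal concrete, here is a minimal sketch of a String-to-Object store that also offers a String[] view for older callers. It is an assumption-laden illustration only: ObjectMetadata and PhoneNumber are hypothetical classes, not Tika's Metadata API, and the attribute names are simply copied from the example in the issue.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical structured value; the attribute keys are illustrative only. */
class PhoneNumber {
    final String number;
    final Map<String, String> attributes = new LinkedHashMap<>();

    PhoneNumber(String number) {
        this.number = number;
    }

    PhoneNumber with(String key, String value) {
        attributes.put(key, value);
        return this;
    }

    @Override
    public String toString() {
        return number; // what a String-only caller would see
    }
}

/** Hypothetical String -> Object store; not Tika's Metadata class. */
class ObjectMetadata {
    private final Map<String, List<Object>> values = new LinkedHashMap<>();

    void add(String name, Object value) {
        values.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    /** Typed view: returns only the values that are instances of the requested type. */
    <T> List<T> get(String name, Class<T> type) {
        List<T> result = new ArrayList<>();
        for (Object v : values.getOrDefault(name, new ArrayList<>())) {
            if (type.isInstance(v)) {
                result.add(type.cast(v));
            }
        }
        return result;
    }

    /** String[] view for older callers: every value rendered via toString(). */
    String[] getValues(String name) {
        List<Object> stored = values.getOrDefault(name, new ArrayList<>());
        String[] out = new String[stored.size()];
        for (int i = 0; i < out.length; i++) {
            out[i] = String.valueOf(stored.get(i));
        }
        return out;
    }

    public static void main(String[] args) {
        ObjectMetadata md = new ObjectMetadata();
        md.add("phonenumbers", new PhoneNumber("+162648743476")
                .with("LibPN-CountryCode", "US")
                .with("LibPN-NumberType", "International"));

        List<PhoneNumber> numbers = md.get("phonenumbers", PhoneNumber.class);
        System.out.println(numbers.get(0).attributes.get("LibPN-CountryCode"));
        System.out.println(String.join(", ", md.getValues("phonenumbers")));
    }
}
```

The String[] view is one possible answer to the backwards compatibility concern: existing callers keep seeing flat strings while new callers ask for typed values.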





[jira] [Updated] (TIKA-1974) Tika 2.0 - remove deprecated metadata properties

2016-09-19 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1974:
--
Issue Type: Sub-task  (was: Task)
Parent: TIKA-2085

> Tika 2.0 - remove deprecated metadata properties
> 
>
> Key: TIKA-1974
> URL: https://issues.apache.org/jira/browse/TIKA-1974
> Project: Tika
>  Issue Type: Sub-task
>Reporter: Tim Allison
>
> We have quite a few metadata properties that are deprecated.  We should 
> remove them for 2.0.





[jira] [Updated] (TIKA-2083) Tika 2.0 - Audit master branch against 2.x branch

2016-09-19 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-2083:
--
Issue Type: Sub-task  (was: Task)
Parent: TIKA-2085

> Tika 2.0 - Audit master branch against 2.x branch
> -
>
> Key: TIKA-2083
> URL: https://issues.apache.org/jira/browse/TIKA-2083
> Project: Tika
>  Issue Type: Sub-task
>Affects Versions: 2.0
>Reporter: Bob Paulin
>Assignee: Bob Paulin
>Priority: Blocker
> Fix For: 2.0
>
>
> At this point Tika has been doing parallel development on master and the 2.x 
> branch for about 9 months.  We should audit commit logs for that time to make 
> a best effort to identify any commits that may not have been applied in 2.x.  
> This task should be done prior to the 2.0 release.





[jira] [Created] (TIKA-2085) Tika 2.0 -- Overarching task list for what we need to do before 2.0

2016-09-19 Thread Tim Allison (JIRA)
Tim Allison created TIKA-2085:
-

 Summary: Tika 2.0 -- Overarching task list for what we need to do 
before 2.0
 Key: TIKA-2085
 URL: https://issues.apache.org/jira/browse/TIKA-2085
 Project: Tika
  Issue Type: Task
Reporter: Tim Allison


Let's use this issue to track issues that absolutely, positively have to be 
completed before we release Tika 2.0.





RE: Plans for the first Tika 2.0 release

2016-09-19 Thread Allison, Timothy B.
>> 1) Implement various strategies for chaining multiple parsers against 
>> individual files.  Much of this has been implemented, but what's holding us 
>> up on this one (I think?) is a resettable outputstream.
> I think we need a JIRA for this.  Are there any existing design ideas on how 
> this would be achieved?

Opened TIKA-2084 as subtask of TIKA-1509

>> 2) Rich metadata (TIKA-1607)
> This is great.  I think we need to ensure we have JIRAs for all the features we 
> consider blockers and label them as such.  This looks like there's a lot of 
> good discussion.  It also references TIKA-1903 so is that also a Tika 2.0 
> blocker?

TIKA-1903 is not a blocker on 2.0, and may be obviated by TIKA-1607.

>> 1) Get rid of old metadata tags in favor of "new" Dublin core
> Need JIRA?

Sorry, opened a good while ago: TIKA-1974

> If we can't get a date we should at least try to eliminate the ???. I think 
> we need to close down the feature set.

Y, completely agree.

Should we create a tika-2_0-blocker label to differentiate from regular 
"blockers"?


[jira] [Created] (TIKA-2084) Create resettable OutputStream to support "backoff on exception" strategy

2016-09-19 Thread Tim Allison (JIRA)
Tim Allison created TIKA-2084:
-

 Summary: Create resettable OutputStream to support "backoff on 
exception" strategy
 Key: TIKA-2084
 URL: https://issues.apache.org/jira/browse/TIKA-2084
 Project: Tika
  Issue Type: Sub-task
  Components: core
Reporter: Tim Allison


If we want a backoff on exception strategy, "try xmlparser, if that fails, try 
the TXTParser", we'll need to have a resettable outputstream/contenthandler to 
clear what had been written by the first parser.
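
As a starting point for discussion, the idea can be sketched with an in-memory buffer that is only copied to the real destination once a parser succeeds. This is a minimal illustration under stated assumptions: StreamParser, ResettableOutputStream and BackoffOnException are hypothetical names, StreamParser is not Tika's Parser interface, and buffering everything in memory may not be acceptable for large documents.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for a parser; not Tika's Parser interface. */
interface StreamParser {
    void parse(String input, OutputStream out) throws IOException;
}

/** Buffers everything in memory so that a failed attempt can be discarded. */
class ResettableOutputStream extends OutputStream {
    private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();

    @Override
    public void write(int b) {
        buffer.write(b);
    }

    @Override
    public void write(byte[] b, int off, int len) {
        buffer.write(b, off, len);
    }

    /** Drops whatever has been written so far. */
    void reset() {
        buffer.reset();
    }

    /** Copies the buffered bytes to the real destination. */
    void commitTo(OutputStream sink) throws IOException {
        buffer.writeTo(sink);
    }
}

class BackoffOnException {
    /** Tries each parser in order; the output of a failed parser is cleared. */
    static void parse(String input, List<StreamParser> parsers, OutputStream sink)
            throws IOException {
        ResettableOutputStream out = new ResettableOutputStream();
        IOException last = null;
        for (StreamParser parser : parsers) {
            try {
                parser.parse(input, out);
                out.commitTo(sink);
                return;
            } catch (IOException e) {
                last = e;
                out.reset(); // discard the partial output before backing off
            }
        }
        throw last != null ? last : new IOException("no parsers configured");
    }

    public static void main(String[] args) throws IOException {
        StreamParser strict = (in, o) -> {
            o.write("<partial>".getBytes(StandardCharsets.UTF_8));
            throw new IOException("malformed XML");
        };
        StreamParser plainText = (in, o) -> o.write(in.getBytes(StandardCharsets.UTF_8));

        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        parse("hello", Arrays.asList(strict, plainText), sink);
        System.out.println(sink.toString("UTF-8")); // prints "hello"
    }
}
```

Committing only after success means the sink never sees output from a failed parser, at the cost of holding one parser's full output in memory.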





[jira] [Created] (TIKA-2083) Tika 2.0 - Audit master branch against 2.x branch

2016-09-19 Thread Bob Paulin (JIRA)
Bob Paulin created TIKA-2083:


 Summary: Tika 2.0 - Audit master branch against 2.x branch
 Key: TIKA-2083
 URL: https://issues.apache.org/jira/browse/TIKA-2083
 Project: Tika
  Issue Type: Task
Affects Versions: 2.0
Reporter: Bob Paulin
Assignee: Bob Paulin
Priority: Blocker
 Fix For: 2.0


At this point Tika has been doing parallel development on master and the 2.x 
branch for about 9 months.  We should audit commit logs for that time to make a 
best effort to identify any commits that may not have been applied in 2.x.  This 
task should be done prior to the 2.0 release.
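
One possible first pass at such an audit is to compare the JIRA keys mentioned in commit subjects on each branch. The sketch below assumes the subjects have already been exported as plain strings (for example with `git log --format=%s` on each branch); BranchAudit and the sample subjects are hypothetical. It is best effort only: a commit with a mistyped or missing key will be misreported.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BranchAudit {
    private static final Pattern KEY = Pattern.compile("TIKA-\\d+");

    /** Collects every TIKA-nnnn key mentioned in the given commit subjects. */
    static Set<String> keys(List<String> subjects) {
        Set<String> found = new TreeSet<>();
        for (String subject : subjects) {
            Matcher m = KEY.matcher(subject);
            while (m.find()) {
                found.add(m.group());
            }
        }
        return found;
    }

    /** Keys that appear in master commit subjects but in no 2.x commit subject. */
    static Set<String> missingFrom2x(List<String> master, List<String> branch2x) {
        Set<String> missing = keys(master);
        missing.removeAll(keys(branch2x));
        return missing;
    }

    public static void main(String[] args) {
        // Illustrative subjects only; real input would come from the commit logs.
        List<String> master = Arrays.asList(
                "TIKA-0001 -- example change A",
                "TIKA-0002 -- example change B");
        List<String> branch2x = Arrays.asList(
                "TIKA-0001 -- example change A");

        System.out.println(missingFrom2x(master, branch2x));
    }
}
```

The output is a candidate list for manual review rather than a definitive answer, since a key can appear on both branches without the changes being equivalent.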





Re: Plans for the first Tika 2.0 release

2016-09-19 Thread Bob Paulin

Thanks Tim!  Replies in line.

- Bob
On 9/19/2016 12:33 PM, Allison, Timothy B. wrote:

> Bob,
>    As always, thank you for driving 2.0!
>
>> My concern is we have been dual maintaining 2 branches for about 9 months.  I 
>> think the longer we do this the more risk there is that we miss something.
>
> Agreed.  I think we're already missing a few things.

Yikes, is there a way we can audit what we might have missed? Perhaps we 
need a JIRA to do an audit of the commits in master and make a best effort 
to identify what might have been missed?  I can create the JIRA for this.

>> Would it make sense to at least put a date out there for a feature cut off?
>
> I'd be hesitant to do this.  To my mind, the key is the actual features and 
> devs who have time to implement them.

OK, this is a start to understanding what the blocking features are. The key 
will be creating concrete JIRAs for them and identifying where we are at.

> For me, the blocking new features are:
>
> 1) Implement various strategies for chaining multiple parsers against 
> individual files.  Much of this has been implemented, but what's holding us up 
> on this one (I think?) is a resettable outputstream.

I think we need a JIRA for this.  Are there any existing design ideas on 
how this would be achieved?

> 2) Rich metadata (TIKA-1607)

This is great.  I think we need to ensure we have JIRAs for all the 
features we consider blockers and label them as such.  This looks like 
there's a lot of good discussion.  It also references TIKA-1903 so is 
that also a Tika 2.0 blocker?

> The blocking tasks:
> 1) Get rid of old metadata tags in favor of "new" Dublin core

Need JIRA?

> 2) ???

If we can't get a date we should at least try to eliminate the ???. I 
think we need to close down the feature set.

> I'm full up on other stuff at the moment, perhaps after we get 1.14 out, I can 
> turn to 2.0-specific development.
>
> What else do we have to do? Anyone else have some time?

Yes please, it would be great to see if there are people who want to own 
work on the above features.  Once we have JIRAs we can post to the 
Apache Help Wanted page as well.

Thanks!

> Cheers,
>
> Tim
>
> -Original Message-
> From: Bob Paulin [mailto:b...@bobpaulin.com]
> Sent: Monday, September 19, 2016 10:32 AM
> To: dev@tika.apache.org
> Subject: Re: Plans for the first Tika 2.0 release
>
> Hi,
>
> I think it's a good thing to discuss.  I know there are other features that are 
> targeted for 2.0.  Do we have a general sense of where those features are at?  
> My concern is we have been dual maintaining 2 branches for about 9 months.  I 
> think the longer we do this the more risk there is that we miss something.  
> Would it make sense to at least put a date
> out there for a feature cut off?   There's always 3.0 if things are not
> close to being ready.
>
>
> - Bob






RE: Plans for the first Tika 2.0 release

2016-09-19 Thread Allison, Timothy B.
Bob,
  As always, thank you for driving 2.0!

> My concern is we have been dual maintaining 2 branches for about 9 months.  I 
> think the longer we do this the more risk there is that we miss something.  

Agreed.  I think we're already missing a few things.

> Would it make sense to at least put a date out there for a feature cut off?

I'd be hesitant to do this.  To my mind, the key is the actual features and 
devs who have time to implement them.

For me, the blocking new features are:

1) Implement various strategies for chaining multiple parsers against 
individual files.  Much of this has been implemented, but what's holding us up 
on this one (I think?) is a resettable outputstream.

2) Rich metadata (TIKA-1607)

The blocking tasks:
1) Get rid of old metadata tags in favor of "new" Dublin core
2) ???

I'm full up on other stuff at the moment, perhaps after we get 1.14 out, I can 
turn to 2.0-specific development.

What else do we have to do? Anyone else have some time?

Cheers,

   Tim

-Original Message-
From: Bob Paulin [mailto:b...@bobpaulin.com] 
Sent: Monday, September 19, 2016 10:32 AM
To: dev@tika.apache.org
Subject: Re: Plans for the first Tika 2.0 release

Hi,

I think it's a good thing to discuss.  I know there are other features that are 
targeted for 2.0.  Do we have a general sense of where those features are at?  
My concern is we have been dual maintaining 2 branches for about 9 months.  I 
think the longer we do this the more risk there is that we miss something.  
Would it make sense to at least put a date 
out there for a feature cut off?   There's always 3.0 if things are not 
close to being ready.


- Bob




[jira] [Commented] (TIKA-1997) Problem in Tika().detect for xml file signed in CADES

2016-09-19 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15504023#comment-15504023
 ] 

Nick Burch commented on TIKA-1997:
--

Running your file through the openssl tool {{ asn1parse }} shows that its 
first object is of type {{ pkcs7-signedData }}. It also shows 
the signature from {{ INFOCERT SPA }}. So, it does look to be a signed PKCS7 
file, and hence Tika appears to be doing the right thing.

Unless I've misunderstood something about PKCS7 files and/or the asn1 dump 
output?

> Problem in Tika().detect for xml file signed in CADES
> -
>
> Key: TIKA-1997
> URL: https://issues.apache.org/jira/browse/TIKA-1997
> Project: Tika
>  Issue Type: Sub-task
>  Components: detector
>Affects Versions: 1.13
> Environment: JDK 1.7
>Reporter: Michele Andreano
>Priority: Blocker
> Attachments: test.xml.p7m
>
>
> When I submit to Tika an XML file signed in P7M format, I expect Tika to 
> return the mimetype application/pkcs7-mime; instead it gives me 
> application/pkcs7-signature.
> How is it possible?





Re: Plans for the first Tika 2.0 release

2016-09-19 Thread Bob Paulin

Hi,

I think it's a good thing to discuss.  I know there are other features 
that are targeted for 2.0.  Do we have a general sense of where those 
features are at?  My concern is we have been dual maintaining 2 branches 
for about 9 months.  I think the longer we do this the more risk there 
is that we miss something.  Would it make sense to at least put a date 
out there for a feature cut off?   There's always 3.0 if things are not 
close to being ready.



- Bob


On 9/19/2016 4:32 AM, Sergey Beryozkin wrote:

> Hi All
>
> Back in May I updated one of our CXF demos on the master 3.2 branch to 
> depend on Tika 2.0 SNAPSHOT to verify the new module system works well.
> It is feasible that CXF 3.2.0 may be released by the end of the year 
> or early next year.
> As far as Tika 2.0 dependencies are concerned it will be easy for me 
> to update the demo to temporarily depend on Tika 1.13 or 1.14. But if 
> Tika 2.0 is released by the time CXF 3.2 is about to be released then 
> I'll be happy to keep 2.0 deps.
>
> Are there any plans to get Tika 2.0 out in the next few months?
>
> Cheers, Sergey








tika-2.x-windows - Build # 48 - Still Failing

2016-09-19 Thread Apache Jenkins Server
The Apache Jenkins build system has built tika-2.x-windows (build #48)

Status: Still Failing

Check console output at https://builds.apache.org/job/tika-2.x-windows/48/ to 
view the results.

[jira] [Commented] (TIKA-2015) MAPIMessage String fileName constructor leaves file open

2016-09-19 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15503616#comment-15503616
 ] 

Hudson commented on TIKA-2015:
--

FAILURE: Integrated in Jenkins build tika-2.x-windows #48 (See 
[https://builds.apache.org/job/tika-2.x-windows/48/])
TIKA-2015 -- upgrade to PDFBox 2.0.3 (tallison: rev 
1b32e31864829acd80620763cf3c4d928b8d8346)
* (edit) tika-parser-modules/pom.xml
* (edit) CHANGES.txt


> MAPIMessage String fileName constructor leaves file open
> 
>
> Key: TIKA-2015
> URL: https://issues.apache.org/jira/browse/TIKA-2015
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.11, 1.12
>Reporter: Tim Barrett
>
> When extracting attachments from MSG resources, using MAPIMessage constructor 
> with string file path of msg resource leads to an open file handle on msg file 
> that is never closed, there is no way to close this as MAPIMessage does not 
> have a close method. This behaviour first manifests itself in version 1.11 
> and all subsequent versions (1.12, 1.13). Use LSOF or file-leak-detector to 
> reproduce this - create instance of MAPIMessage using string constructor - 
> file-leak-detector will show the open file being created at that point, file 
> handle is then never dropped.
> Using input stream constructor is a workaround as this allows the calling 
> program to cleanly close the input stream. 
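
The workaround amounts to the caller owning the stream and closing it itself. The sketch below shows that pattern with the JDK only; StubMessage is a hypothetical stand-in for a message class with an input stream constructor (it is not the real MAPIMessage), and CallerOwnedStream is likewise invented for this illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical stand-in for a message class that reads from a caller-owned stream. */
class StubMessage {
    final int size;

    StubMessage(InputStream in) throws IOException {
        int total = 0;
        byte[] buf = new byte[4096];
        for (int read; (read = in.read(buf)) != -1; ) {
            total += read;
        }
        size = total;
    }
}

class CallerOwnedStream {
    /** Opens the file, hands the stream to the message class, and always closes it. */
    static int sizeOf(Path msgFile) throws IOException {
        try (InputStream in = Files.newInputStream(msgFile)) {
            return new StubMessage(in).size;
        } // the file handle is released here, even if the constructor throws
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("example", ".msg");
        Files.write(tmp, new byte[] {1, 2, 3});
        System.out.println(sizeOf(tmp));
        Files.delete(tmp); // succeeds because no handle is left open
    }
}
```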





[jira] [Commented] (TIKA-2051) Upgrade to PDFBox 2.0.3 when available

2016-09-19 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15503579#comment-15503579
 ] 

Hudson commented on TIKA-2051:
--

SUCCESS: Integrated in Jenkins build Tika-trunk #1102 (See 
[https://builds.apache.org/job/Tika-trunk/1102/])
TIKA-2051 -- upgrade to PDFBox 2.0.3 (tallison: rev 
07aea36f71c17236782bff0b61855578722d933e)
* (edit) tika-parsers/pom.xml
* (edit) CHANGES.txt


> Upgrade to PDFBox 2.0.3 when available
> --
>
> Key: TIKA-2051
> URL: https://issues.apache.org/jira/browse/TIKA-2051
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Minor
> Fix For: 2.0, 1.14
>
>






[jira] [Commented] (TIKA-2015) MAPIMessage String fileName constructor leaves file open

2016-09-19 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15503577#comment-15503577
 ] 

Tim Allison commented on TIKA-2015:
---

Doh.  Typo in commit message.  Should have been TIKA-2051.

> MAPIMessage String fileName constructor leaves file open
> 
>
> Key: TIKA-2015
> URL: https://issues.apache.org/jira/browse/TIKA-2015
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.11, 1.12
>Reporter: Tim Barrett
>
> When extracting attachments from MSG resources, using MAPIMessage constructor 
> with string file path of msg resource leads to an open file handle on msg file 
> that is never closed, there is no way to close this as MAPIMessage does not 
> have a close method. This behaviour first manifests itself in version 1.11 
> and all subsequent versions (1.12, 1.13). Use LSOF or file-leak-detector to 
> reproduce this - create instance of MAPIMessage using string constructor - 
> file-leak-detector will show the open file being created at that point, file 
> handle is then never dropped.
> Using input stream constructor is a workaround as this allows the calling 
> program to cleanly close the input stream. 





[jira] [Commented] (TIKA-2015) MAPIMessage String fileName constructor leaves file open

2016-09-19 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15503535#comment-15503535
 ] 

Hudson commented on TIKA-2015:
--

SUCCESS: Integrated in Jenkins build tika-2.x #144 (See 
[https://builds.apache.org/job/tika-2.x/144/])
TIKA-2015 -- upgrade to PDFBox 2.0.3 (tallison: rev 
1b32e31864829acd80620763cf3c4d928b8d8346)
* (edit) tika-parser-modules/pom.xml
* (edit) CHANGES.txt


> MAPIMessage String fileName constructor leaves file open
> 
>
> Key: TIKA-2015
> URL: https://issues.apache.org/jira/browse/TIKA-2015
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.11, 1.12
>Reporter: Tim Barrett
>
> When extracting attachments from MSG resources, using MAPIMessage constructor 
> with string file path of msg resource leads to an open file handle on msg file 
> that is never closed, there is no way to close this as MAPIMessage does not 
> have a close method. This behaviour first manifests itself in version 1.11 
> and all subsequent versions (1.12, 1.13). Use LSOF or file-leak-detector to 
> reproduce this - create instance of MAPIMessage using string constructor - 
> file-leak-detector will show the open file being created at that point, file 
> handle is then never dropped.
> Using input stream constructor is a workaround as this allows the calling 
> program to cleanly close the input stream. 





[jira] [Commented] (TIKA-2082) Upgrade to PDFBox 2.0.3

2016-09-19 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15503467#comment-15503467
 ] 

Tim Allison commented on TIKA-2082:
---

No need to apologize whatsoever.  Thank you for the ping!

> Upgrade to PDFBox 2.0.3
> ---
>
> Key: TIKA-2082
> URL: https://issues.apache.org/jira/browse/TIKA-2082
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.13
>Reporter: Luis Filipe Nassif
> Fix For: 2.0, 1.14
>
>
> PDFBox 2.0.3 was released with a number of fixes. Tika should upgrade.





[jira] [Updated] (TIKA-2045) TIKA crashes / runs out of memory on simple PDF

2016-09-19 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-2045:
--
Fix Version/s: 1.14
   2.0

> TIKA crashes / runs out of memory on simple PDF
> ---
>
> Key: TIKA-2045
> URL: https://issues.apache.org/jira/browse/TIKA-2045
> Project: Tika
>  Issue Type: Bug
>  Components: core
>Affects Versions: 1.13
> Environment: Linux, Java 8
>Reporter: Egbert
> Fix For: 2.0, 1.14
>
>
> We're using TIKA embedded in a webcrawler and today I've encountered a PDF 
> that results in OutOfMemory errors while being processed by TIKA.
> It's a small, 1 page PDF file, so I don't think that it should consume that 
> much memory.
> I verified the problem by using the GUI from the tika-app-1.13.jar file and 
> that results in the same error on the same file. The file can be found at:
> http://www.spesmea.nl/pdf/algemene_voorwaarden_bbztcn_2010_nl.pdf
> If I can help by providing any additional information, please let me know.





[jira] [Resolved] (TIKA-2045) TIKA crashes / runs out of memory on simple PDF

2016-09-19 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-2045.
---
Resolution: Fixed

Upgraded to PDFBox 2.0.3.

> TIKA crashes / runs out of memory on simple PDF
> ---
>
> Key: TIKA-2045
> URL: https://issues.apache.org/jira/browse/TIKA-2045
> Project: Tika
>  Issue Type: Bug
>  Components: core
>Affects Versions: 1.13
> Environment: Linux, Java 8
>Reporter: Egbert
>
> We're using TIKA embedded in a webcrawler and today I've encountered a PDF 
> that results in OutOfMemory errors while being processed by TIKA.
> It's a small, 1 page PDF file, so I don't think that it should consume that 
> much memory.
> I verified the problem by using the GUI from the tika-app-1.13.jar file and 
> that results in the same error on the same file. The file can be found at:
> http://www.spesmea.nl/pdf/algemene_voorwaarden_bbztcn_2010_nl.pdf
> If I can help by providing any additional information, please let me know.





[jira] [Resolved] (TIKA-2051) Upgrade to PDFBox 2.0.3 when available

2016-09-19 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-2051.
---
   Resolution: Fixed
Fix Version/s: 1.14
   2.0

> Upgrade to PDFBox 2.0.3 when available
> --
>
> Key: TIKA-2051
> URL: https://issues.apache.org/jira/browse/TIKA-2051
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Minor
> Fix For: 2.0, 1.14
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-2082) Upgrade to PDFBox 2.0.3

2016-09-19 Thread Luis Filipe Nassif (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15503458#comment-15503458
 ] 

Luis Filipe Nassif commented on TIKA-2082:
--

Sorry Tim, did not see TIKA-2051

> Upgrade to PDFBox 2.0.3
> ---
>
> Key: TIKA-2082
> URL: https://issues.apache.org/jira/browse/TIKA-2082
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.13
>Reporter: Luis Filipe Nassif
> Fix For: 2.0, 1.14
>
>
> PDFBox 2.0.3 was released with a number of fixes. Tika should upgrade.





[jira] [Resolved] (TIKA-2082) Upgrade to PDFBox 2.0.3

2016-09-19 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-2082.
---
   Resolution: Duplicate
Fix Version/s: 2.0

Building locally now before I commit (should be 10-15 minutes).  Thank you!

> Upgrade to PDFBox 2.0.3
> ---
>
> Key: TIKA-2082
> URL: https://issues.apache.org/jira/browse/TIKA-2082
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.13
>Reporter: Luis Filipe Nassif
> Fix For: 2.0, 1.14
>
>
> PDFBox 2.0.3 was released with a number of fixes. Tika should upgrade.





[jira] [Created] (TIKA-2082) Upgrade to PDFBox 2.0.3

2016-09-19 Thread Luis Filipe Nassif (JIRA)
Luis Filipe Nassif created TIKA-2082:


 Summary: Upgrade to PDFBox 2.0.3
 Key: TIKA-2082
 URL: https://issues.apache.org/jira/browse/TIKA-2082
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.13
Reporter: Luis Filipe Nassif
 Fix For: 1.14


PDFBox 2.0.3 was released with a number of fixes. Tika should upgrade.





Plans for the first Tika 2.0 release

2016-09-19 Thread Sergey Beryozkin

Hi All

Back in May I updated one of our CXF demos on the master 3.2 branch to 
depend on Tika 2.0 SNAPSHOT to verify the new module system works well.
It is feasible that CXF 3.2.0 may be released by the end of the year or 
early next year.
As far as Tika 2.0 dependencies are concerned it will be easy for me to 
update the demo to temporarily depend on Tika 1.13 or 1.14. But if Tika 
2.0 is released by the time CXF 3.2 is about to be released then I'll be 
happy to keep 2.0 deps.

Are there any plans to get Tika 2.0 out in the next few months?

Cheers, Sergey