Re: 1.7 release?

2014-12-18 Thread Thomas Ledoux
Hi, it might be worth waiting until POI 3.11-FINAL is released so that the
Tika release does not depend on a beta version. It's due on Sunday, fixes
a lot of old Office parsing issues, and just needs the patch in TIKA-1469 to
work properly.

Regards
  Thomas

2014-12-18 21:54 GMT+01:00 Tyler Palsulich :
>
> Hi All,
>
> It's been a few months, so I just want to follow up on this thread. We've
> resolved/closed 51 issues for v1.7 [0]. There are two on JIRA marked as 1.7
> (TIKA-1465 and TIKA-894). Do we still want to aim for 1.7 with TIKA-1445?
> Has anyone tried their hand at the suggested (significant) fix?
>
> Are there any other issues someone would like to fit in?
>
> Cheers,
> Tyler
>
> [0] -
>
> https://issues.apache.org/jira/browse/TIKA/fixforversion/12327096/?selectedTab=com.atlassian.jira.jira-projects-plugin:version-issues-panel
>
> On Tue, Oct 28, 2014 at 1:46 AM, Mattmann, Chris A (3980) <
> chris.a.mattm...@jpl.nasa.gov> wrote:
> >
> > Thanks Tim saw your patch and am looking now.
> >
> > ++
> > Chris Mattmann, Ph.D.
> > Chief Architect
> > Instrument Software and Science Data Systems Section (398)
> > NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> > Office: 168-519, Mailstop: 168-527
> > Email: chris.a.mattm...@nasa.gov
> > WWW:  http://sunset.usc.edu/~mattmann/
> > ++
> > Adjunct Associate Professor, Computer Science Department
> > University of Southern California, Los Angeles, CA 90089 USA
> > ++
> >
> >
> >
> >
> >
> >
> > -Original Message-
> > From: , "Timothy B." 
> > Reply-To: "dev@tika.apache.org" 
> > Date: Monday, October 27, 2014 at 12:30 PM
> > To: "dev@tika.apache.org" 
> > Subject: RE: 1.7 release?
> >
> > >Sounds good.  As long as the default behavior remains the same, I'm
> > >happy.  I'm going to play with a combination of your patch and Tyler's
> > >and see what the ramifications are for embedded docs.
> > >
> > >To confirm, the OCR integration is fantastic.  Thank you and Tyler!
> > >
> > >
> > >Best,
> > >
> > >   Tim
> > >
> > >-Original Message-
> > >From: Mattmann, Chris A (3980) [mailto:chris.a.mattm...@jpl.nasa.gov]
> > >Sent: Friday, October 24, 2014 5:36 PM
> > >To: dev@tika.apache.org
> > >Subject: Re: 1.7 release?
> > >
> > >Hey Tim,
> > >
> > >What do you think about my existing patch for 1445? For example to
> > >just call all the parsers? I thought I was seeing behavior that was
> > >slow because of that, but it turned out to be Tesseract and my machine
> > >at the time?
> > >
> > >I think my patch for 1445 may be enough, and we should get the metadata
> > >I think? Thoughts?
> > >
> > >I honestly think we need to deliver Tesseract in 1.7. We're close. I'll
> > >even take it upon myself to try and experiment with the idea of multiple
> > >parsers being called. I think a simple solution to the metadata key
> > >conflict issue is simply to have a policy to add values (by default) and
> > >replace if a property is set in ParseContext. Some simple updates to
> > >CompositeParser would allow this.
> > >
> > >Thoughts?
> > >
> > >Cheers,
> > >Chris
> > >
> > >-Original Message-
> > >From: , "Timothy B." 
> > >Reply-To: "dev@tika.apache.org" 
> > >Date: Friday, October 24, 2014 at 2:24 PM
> > >To: "dev@tika.apache.org" 
> > >Subject: RE: 1.7 release?
> > >
> > >>Sorry for coming late to the game on the implications of TIKA-1445.  I
> > >>don't want to hold up the release of 1.7.
> > >>
> > >>However, would it be possible to return to the legacy default behavior of
> > >>extracting metadata from images?
> > >>
> > >>We can then document on the OCR parser page on the wiki that you need to
> > >>install Tesseract _and_ make a change in the parser/mime config file. If
> > >>you want this new capability, it will take a small bit of work until we
> > >>solve TIKA-1445.
> > >>
> > >>I worry that the current behavior of 1.7 would be surprising to most
> > >>non-dev users (well, even to at least one dev :) ).
> > >>
> > >>Cheers,
> > >>
> > >>  Tim
> > >>
> > >>
> > >>From: Oleg Tikhonov [olegtikho...@gmail.com]
> > >>Sent: Friday, Octo

Re: 1.7 release?

2014-12-22 Thread Thomas Ledoux
+1 for going.
Many thanks to Tyler, and to Nick for taking on the POI upgrade.

So many Christmas gifts, in advance or just after :-)

Merry Christmas to all

2014-12-22 19:59 GMT+01:00 Mattmann, Chris A (3980) <
chris.a.mattm...@jpl.nasa.gov>:

> WOOO HOO! Go Tyler go! :0) Merry Christmas bud.
>
> -Original Message-
> From: Tyler Palsulich 
> Reply-To: "dev@tika.apache.org" 
> Date: Monday, December 22, 2014 at 10:57 AM
> To: "dev@tika.apache.org" 
> Subject: Re: 1.7 release?
>
> >Hi All,
> >
> >Nick added the temporary fix for TIKA-1445 and made the POI updates for
> >TIKA-1469 (thanks!). And, I'll volunteer to be the Release Manager for 1.7! :)
> >
> >I'll start the process this weekend or a couple days into the new year.
> >
> >Cheers,
> >Tyler
> >On Dec 18, 2014 9:45 PM, "Mattmann, Chris A (3980)" <
> >chris.a.mattm...@jpl.nasa.gov> wrote:
> >
> >> +1
> >>
> >> -Original Message-
> >> From: Tyler Palsulich 
> >> Reply-To: "dev@tika.apache.org" 
> >> Date: Thursday, December 18, 2014 at 9:15 PM
> >> To: "dev@tika.apache.org" 
> >> Subject: Re: 1.7 release?
> >>
> >> >I'm OK with trying the fix in 1.8 (or 1.7 if people feel strongly). As Nick
> >> >just recommended, I'll try adding metadata extraction to Tesseract soon,
> >> >then adding the extensible solution in 1.8.
> >> >
> >> >Tyler
> >> >
> >> >On Thu, Dec 18, 2014 at 11:58 PM, Mattmann, Chris A (3980) <
> >> >chris.a.mattm...@jpl.nasa.gov> wrote:
> >> >>
> >> >> I haven’t tried my hand at it - been super busy. tyler if you have a
> >> >> chance go for it, I think that’s the remaining blocker.
> >> >>
> >> >> -Original Message-
> >> >> From: Tyler Palsulich 
> >> >> Reply-To: "dev@tika.apache.org" 
> >> >> Date: Thursday, December 18, 2014 at 12:54 PM
> >> >> To: "dev@tika.apache.org" 
> >> >> Subject: Re: 1.7 release?
> >> >>
> >> >> >Hi All,
> >> >> >
> >> >> >It's been a few months, so I just want to follow up on this thread. We've
> >> >> >resolved/closed 51 issues for v1.7 [0]. There are two on JIRA marked as 1.7
> >> >> >(TIKA-1465 and TIKA-894). Do we still want to aim for 1.7 with TIKA-1445?
> >> >> >Has anyone tried their hand at the suggested (significant) fix?
> >> >> >
> >> >> >Are there any other issues someone would like to fit in?
> >> >> >
> >> >> >Cheers,
> >> >> >Tyler
> >> >> >
> >> >> >[0] -
> >> >> >https://issues.apache.org/jira/browse/TIKA/fixforversion/12327096/?selectedTab=com.atlassian.jira.jira-projects-plugin:version-issues-panel
> >> >> >
> >> >> >On Tue, Oct 28, 2014 at 1:46 AM, Mattmann, Chris A (3980) <
> >> >> >chris.a.mattm...@jpl.nasa.gov> wrote:
> >> >> >>
> >> >> >> Thanks Tim saw your patch and am looking now.
> >> >> >>

Re: [VOTE] Apache Tika 1.7 Release

2015-01-14 Thread Thomas Ledoux
+1, works for me

2015-01-13 9:23 GMT+01:00 Tyler Palsulich :

> Hi Folks,
>
> Let's mark this RC#2 as failed and shift the vote to the updated RC#3
> (http://markmail.org/message/m5gpgmr7hedgpjdj), which has Tesseract metadata
> fixes and David's test fix.
>
> Thanks,
> Tyler
>
> On Thu, Jan 8, 2015 at 6:25 AM, Peter Bowyer 
> wrote:
>
> > +1.
> >
> > Worked great once I manually edited
> > tika-parsers/src/main/resources/org/apache/tika/parser/pdf/PDFParser.properties
> > and set useNonSequentialParser to true
> >
> > Peter
> >
>


[jira] [Commented] (TIKA-1622) Expose Tika LanguageIdentifier via Tika Server

2015-05-05 Thread Thomas Ledoux (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14528383#comment-14528383
 ] 

Thomas Ledoux commented on TIKA-1622:
-

A very minor observation for [~chrismattmann]: the French sentence is incorrect. 
In French, there is no cedilla on the 'c' in front of an 'i', but one is needed 
in front of an 'a'. The sentence should be: "comme ci comme ça". Many thanks for 
the great job ;-)

> Expose Tika LanguageIdentifier via Tika Server
> --
>
> Key: TIKA-1622
> URL: https://issues.apache.org/jira/browse/TIKA-1622
> Project: Tika
>  Issue Type: Bug
>  Components: languageidentifier, server
>Reporter: Chris A. Mattmann
>Assignee: Chris A. Mattmann
> Fix For: 1.9
>
>
> The LanguageIdentifier in Tika should be exposed via Tika JAX-RS.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TIKA-1622) Expose Tika LanguageIdentifier via Tika Server

2015-05-12 Thread Thomas Ledoux (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Ledoux updated TIKA-1622:

Attachment: TIKA-1622-commeci.patch

Hi, here is the trivial patch. Hope it applies.

> Expose Tika LanguageIdentifier via Tika Server
> --
>
> Key: TIKA-1622
> URL: https://issues.apache.org/jira/browse/TIKA-1622
> Project: Tika
>  Issue Type: Bug
>  Components: languageidentifier, server
>Reporter: Chris A. Mattmann
>Assignee: Chris A. Mattmann
> Fix For: 1.9
>
> Attachments: TIKA-1622-commeci.patch
>
>
> The LanguageIdentifier in Tika should be exposed via Tika JAX-RS.





[jira] [Updated] (TIKA-1622) Expose Tika LanguageIdentifier via Tika Server

2015-05-14 Thread Thomas Ledoux (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Ledoux updated TIKA-1622:

Attachment: TIKA-1622-cestcommeci.patch

Apologies for not testing my first patch.
I suppose the confusion comes from the Italian 'come ci'. Nevertheless, giving a 
little more context does work. 
So here is a second patch that does work:
{code}
---
 T E S T S
---
Running org.apache.tika.server.LanguageResourceTest
Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 1.734 sec - in 
org.apache.tika.server.LanguageResourceTest

Results :

Tests run: 4, Failures: 0, Errors: 0, Skipped: 0

[INFO] 
[INFO] BUILD SUCCESS
[INFO] 
{code}

> Expose Tika LanguageIdentifier via Tika Server
> --
>
> Key: TIKA-1622
> URL: https://issues.apache.org/jira/browse/TIKA-1622
> Project: Tika
>  Issue Type: Bug
>  Components: languageidentifier, server
>Reporter: Chris A. Mattmann
>Assignee: Chris A. Mattmann
> Fix For: 1.9
>
> Attachments: TIKA-1622-cestcommeci.patch, TIKA-1622-commeci.patch
>
>
> The LanguageIdentifier in Tika should be exposed via Tika JAX-RS.





[jira] [Commented] (TIKA-1232) Add PDF version to PDFParser output

2014-02-08 Thread Thomas Ledoux (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13895528#comment-13895528
 ] 

Thomas Ledoux commented on TIKA-1232:
-

Regarding XMP output from Tika and the inclusion of the version: in the case of 
PDF, dedicated ontologies are defined.
Namely, in the http://ns.adobe.com/pdf/1.3/ namespace, there is a 
pdf:PDFVersion property.
It can even be refined in the case of PDF/A, where the conformance level can be 
given using the http://www.aiim.org/pdfa/ns/id/ namespace in the property 
pdfaid:conformance (see TN0008). There are similar properties, 
pdfx:GTS_PDFXVersion and pdfx:GTS_PDFXConformance, in the 
http://ns.adobe.com/pdfx/1.3 namespace for PDF/X files.

However, all these properties are only available for PDF formats and would break 
the idea of having a generic metadata map exposed by Tika.
So I agree with Andrew's proposal of using a "version" parameter in the MIME 
type, which is allowed in XMP.
Indeed, the XMP definition of the value of dc:format is a MIMEType following 
IETF RFC 2045 section 5.1. 

Finally, in order to prevent the confusion of client code that Andrew raises, 
we could take advantage of the repeatability of the dc:format attribute and 
output two dc:format values: the first being the "normal" Content-Type and the 
second being the Extended-Content-Type.
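
To make the two-value idea concrete, here is a minimal sketch in plain Java. It 
is only an illustration: the class DcFormatSketch, its method names and the 
map-of-lists used as a stand-in for a multi-valued metadata map are all 
hypothetical and are not Tika's Metadata API. The parameter syntax follows 
RFC 2045 section 5.1.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; this is not Tika's Metadata class.
final class DcFormatSketch {

    /** Appends a "version" parameter to a MIME type, RFC 2045 style. */
    static String withVersion(String mimeType, String version) {
        return mimeType + "; version=" + version;
    }

    /** Emits two dc:format values: the plain content type first, the extended one second. */
    static Map<String, List<String>> describe(String mimeType, String version) {
        List<String> formats = new ArrayList<>();
        formats.add(mimeType);                       // the "normal" Content-Type
        formats.add(withVersion(mimeType, version)); // the extended content type
        Map<String, List<String>> metadata = new LinkedHashMap<>();
        metadata.put("dc:format", formats);
        return metadata;
    }

    public static void main(String[] args) {
        // prints {dc:format=[application/pdf, application/pdf; version=1.4]}
        System.out.println(describe("application/pdf", "1.4"));
    }
}
```

Client code that only understands the plain content type can read the first 
value and ignore the second.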


> Add PDF version to PDFParser output
> ---
>
> Key: TIKA-1232
> URL: https://issues.apache.org/jira/browse/TIKA-1232
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.5
> Environment: JDK6
>Reporter: William Palmer
>Assignee: Tim Allison
>Priority: Minor
> Attachments: pdfversion.patch
>
>
> I'd like to identify the PDF version of files, this is not currently reported 
> by the PDFParser although the information is available via PDFBox.  I have 
> attached a patch that adds the format version to the Metadata object.
> However, I am not familiar enough with the Tika source to know if an 
> alternative metadata key should be used, or this new one added.
> Comments welcome.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Created] (TIKA-1467) pdf:encrypted:false with encrypted pdf

2014-11-07 Thread Thomas Ledoux (JIRA)
Thomas Ledoux created TIKA-1467:
---

 Summary: pdf:encrypted:false with encrypted pdf
 Key: TIKA-1467
 URL: https://issues.apache.org/jira/browse/TIKA-1467
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.6
 Environment: $java -version
java version "1.6.0_25"
Java(TM) SE Runtime Environment (build 1.6.0_25-b06)
Java HotSpot(TM) Client VM (build 20.0-b11, mixed mode, sharing)
Reporter: Thomas Ledoux


When extracting metadata from the encryption_noprinting.pdf file found in the 
pdfCabinetOfHorrors 
(https://github.com/openplanets/format-corpus/tree/master/pdfCabinetOfHorrors) 
with:

$java -jar tika-app-1.7-20141105.092424-471.jar -j encryption_noprinting.pdf

we get the log message
INFO - Document is encrypted

but the resulting JSON has: "pdf:encrypted":"false"

Looking at the PDFParser, it seems that the log message is emitted while 
reading the PDF, but by the time the metadata is retrieved the PDF is no longer 
encrypted. The fact that the document was encrypted should be retained so that 
it can be added to the metadata.
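
A rough sketch of the kind of fix I have in mind, in plain Java. Everything 
here is hypothetical: FakePdf is a stand-in and not the PDFBox API, and 
extractMetadata is not the actual PDFParser code. The only point is to remember 
the encryption state before anything decrypts the document.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; not the PDFParser or PDFBox code.
final class EncryptionFlagSketch {

    /** Stand-in for a loaded PDF whose flag is cleared by decryption. */
    static final class FakePdf {
        private boolean encrypted;

        FakePdf(boolean encrypted) {
            this.encrypted = encrypted;
        }

        boolean isEncrypted() {
            return encrypted;
        }

        /** After decryption the document no longer reports itself as encrypted. */
        void decrypt() {
            encrypted = false;
        }
    }

    static Map<String, String> extractMetadata(FakePdf pdf) {
        // Remember the original state before the document is decrypted.
        boolean wasEncrypted = pdf.isEncrypted();
        if (wasEncrypted) {
            pdf.decrypt();
        }
        Map<String, String> metadata = new LinkedHashMap<>();
        // Report the remembered state, not the post-decryption one.
        metadata.put("pdf:encrypted", Boolean.toString(wasEncrypted));
        return metadata;
    }

    public static void main(String[] args) {
        // prints {pdf:encrypted=true}
        System.out.println(extractMetadata(new FakePdf(true)));
    }
}
```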






[jira] [Commented] (TIKA-1497) tika-server cannot output JSON

2014-12-18 Thread Thomas Ledoux (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14252443#comment-14252443
 ] 

Thomas Ledoux commented on TIKA-1497:
-

Any chance we could get XMP by asking for application/rdf+xml responses?
I know it is one more case to track, but it would be a really nice feature. 

Thanks, Thomas.
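
For illustration, this is how a client could ask for a given representation 
using the JDK's java.net.http API (Java 11+). It mirrors the PUT to 
http://localhost:9998/meta from the issue description; the class and method 
names are made up for this sketch, and whether the server honours 
application/rdf+xml is exactly the open question, so that value is an 
assumption. The request is only built here, not sent.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical client-side sketch; builds the request without sending it.
final class MetaRequestSketch {

    /** Builds a PUT to the /meta endpoint asking for the given media type. */
    static HttpRequest metaRequest(byte[] document, String accept) {
        return HttpRequest.newBuilder(URI.create("http://localhost:9998/meta"))
                .header("Accept", accept) // content negotiation happens here
                .PUT(HttpRequest.BodyPublishers.ofByteArray(document))
                .build();
    }

    public static void main(String[] args) {
        // An empty body stands in for the bytes of the PDF file.
        HttpRequest request = metaRequest(new byte[0], "application/rdf+xml");
        System.out.println(request.method() + " " + request.uri()
                + " Accept=" + request.headers().firstValue("Accept").orElse(""));
    }
}
```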

> tika-server cannot output JSON
> --
>
> Key: TIKA-1497
> URL: https://issues.apache.org/jira/browse/TIKA-1497
> Project: Tika
>  Issue Type: Improvement
>  Components: server
>Reporter: Peter Bowyer
> Attachments: TIKA-1497.patch, TIKA-1497v2.patch
>
>
> I would like the response from 
> curl -X PUT -T /path/to/file.pdf http://localhost:9998/meta
> to be JSON and not CSV?.
> I've discovered JSONMessageBodyWriter.java 
> (https://github.com/apache/tika/blob/af19f3ea04792cad81b428f1df9f5ebbb2501913/tika-server/src/main/java/org/apache/tika/server/JSONMessageBodyWriter.java)
>  so I think the functionality is present, tried adding --header "Accept: 
> application/json" to the cURL call, in line with the documentation for 
> outputting CSV, but no luck so far.
> According to [~sergey_beryozkin]
> "I see MetadataResource returning StreamingOutput and it has 
> @Produces(text/csv) only. As such this MBW has no effect at the moment.
> We can update MetadataResource to return Metadata directly if 
> application/json is requested or update MetadataResource to directly convert 
> Metadata to JSON in case of JSON being accepted."





[jira] [Commented] (TIKA-1497) tika-server cannot output JSON

2014-12-18 Thread Thomas Ledoux (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14253109#comment-14253109
 ] 

Thomas Ledoux commented on TIKA-1497:
-

[~talli...@apache.org], that was fast and efficient. Very much appreciated. 
Thanks



[jira] [Created] (TIKA-4028) Add detection for common subtitle format

2023-05-04 Thread Thomas Ledoux (Jira)
Thomas Ledoux created TIKA-4028:
---

 Summary: Add detection for common subtitle format
 Key: TIKA-4028
 URL: https://issues.apache.org/jira/browse/TIKA-4028
 Project: Tika
  Issue Type: Improvement
  Components: core
Reporter: Thomas Ledoux


Add detection and reporting of common subtitle formats.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)