Re: [DISCUSS] Prepare Release 1.5?

2014-01-09 Thread David Meikle
Hi, 

On 29 Dec 2013, at 11:41, David Meikle loo...@gmail.com wrote:

> Hi Guys,
>
> Some questions have popped up about when a new 1.5 release will be
> available.
>
> I have some free cycles over the next couple of weeks to prepare one and I
> believe Chris has some too, so in preparation for that, what do we need to do
> to make the current trunk releasable as version 1.5?
>
> For me, the following issue needs to be fixed before release:
> TIKA-1198 - the change to using multi-parts appears to have broken our
> current guidance on usage significantly.
>
> Is there anything else others think is a must before rolling a release?
>
> I was also thinking we could do some quick work to include the following
> issues:
> TIKA-1059
> TIKA-985, TIKA-980
>
> I don’t want to hold things up, so if we sort out people's must-haves I think
> we should roll a release.
>
> @Chris - I know you had free cycles and volunteered, so I will defer to you on
> the release management side of things.  That said, I'm happy to take it on if
> that helps.
>
> Cheers,
> Dave

I'm conscious it was the festive period recently, so I'm wondering whether 
anyone has had further thoughts on this?

Cheers,
Dave

[jira] [Reopened] (TIKA-1216) parse method of Mp3Parser doesn't work for few mp3 files

2014-01-09 Thread Sumeet Gorab (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sumeet Gorab reopened TIKA-1216:



Hi Tim Allison,

The reported bug is not a duplicate of TIKA-1215, because in TIKA-1215 the parse 
method throws an exception, but in TIKA-1216 there is no exception during execution.


Thanks & Regards
Sumeet Gorab


 parse method of Mp3Parser doesn't work for few mp3 files
 

 Key: TIKA-1216
 URL: https://issues.apache.org/jira/browse/TIKA-1216
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.4
 Environment: Windows 7 ultimate 32-bit OS, Java 1.7
Reporter: Sumeet Gorab
Priority: Blocker
  Labels: patch
 Fix For: 1.5

 Attachments: 05 - Dharti - Sarkaaran [www.DJMaza.Com].mp3


 Trying to parse an MP3 file fails: the parse method of the Mp3Parser class is 
 not able to parse the file. The parse method is not able to complete its 
 execution; there is some issue in that method.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


RE: Extract thumbnail from openxml office files

2014-01-09 Thread Hong-Thai Nguyen
Hi Ray & all,

Searching the issue tracker, I found that an issue already exists: 
https://issues.apache.org/jira/browse/TIKA-90
Maybe now is the time to implement it.

Thanks,

Hong-Thai

-Original Message-
From: Ray Gauss II [mailto:ray.ga...@alfresco.com] 
Sent: Wednesday, 8 January 2014 11:49
To: dev@tika.apache.org
Subject: Re: Extract thumbnail from openxml office files

Hi Hong-Thai,

It’s certainly worth investigating.  Several other formats can have embedded 
thumbnails as well so we could implement a generic thumbnail property.

We could probably store as something like a Base64 encoded string, but we’d 
likely want to place limits on the size and may need a thumbnail internet media 
type field as well to assist in decoding.

Unless others feel differently, I would say open a JIRA where we could start 
discussing the design of such a feature.

Thanks!

Ray
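
To make the Base64 idea above a bit more concrete, here is a minimal sketch of storing a thumbnail as an encoded string together with a media type field and a size limit. Everything named here is an assumption for illustration: the property names `thumbnail:data` and `thumbnail:mediaType`, the 64 KB cap, and the plain `Map` standing in for a metadata object are not part of Tika. It uses `java.util.Base64`, so it assumes Java 8 or later.

```java
import java.util.Base64;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: none of these names are part of Tika.
class ThumbnailMetadataSketch {

    // Assumed cap on the raw thumbnail size; the thread does not fix a value.
    static final int MAX_THUMBNAIL_BYTES = 64 * 1024;

    // Hypothetical property names, invented for this example.
    static final String THUMBNAIL_DATA = "thumbnail:data";
    static final String THUMBNAIL_TYPE = "thumbnail:mediaType";

    /** Stores the thumbnail as Base64 plus its media type, or skips it if too large. */
    static boolean putThumbnail(Map<String, String> metadata, byte[] image, String mediaType) {
        if (image.length > MAX_THUMBNAIL_BYTES) {
            return false; // over the limit: leave the metadata untouched
        }
        metadata.put(THUMBNAIL_DATA, Base64.getEncoder().encodeToString(image));
        metadata.put(THUMBNAIL_TYPE, mediaType);
        return true;
    }

    /** Decodes the thumbnail bytes back out of the metadata map, or returns null. */
    static byte[] getThumbnail(Map<String, String> metadata) {
        String encoded = metadata.get(THUMBNAIL_DATA);
        return encoded == null ? null : Base64.getDecoder().decode(encoded);
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new LinkedHashMap<>();
        byte[] fakeImage = {(byte) 0x89, 0x50, 0x4E, 0x47}; // placeholder bytes, not a real image
        boolean stored = putThumbnail(metadata, fakeImage, "image/png");
        System.out.println(stored + " " + metadata.get(THUMBNAIL_TYPE));
        System.out.println(getThumbnail(metadata).length);
    }
}
```

The media type travels next to the encoded data so that a consumer knows how to decode the bytes, which is the reason for the extra field suggested above.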


On January 8, 2014 at 5:36:32 AM, Hong-Thai Nguyen 
(hong-thai.ngu...@polyspot.com) wrote:
  
> Hi all,
> I want to extract the thumbnail image included in Open XML office files.
> Apparently, we can do it with openxml4j:
> http://openxmldeveloper.org/blog/b/openxmldeveloper/archive/2006/11/21/openxmlandjava.aspx
> The question is: should we include the thumbnail in the default metadata
> list of the OOXML parsing result?
>
> Thanks
>
> Hong-Thai
  
  



[jira] [Commented] (TIKA-90) Allow thumbnails as document metadata

2014-01-09 Thread Hong-Thai Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13866498#comment-13866498
 ] 

Hong-Thai Nguyen commented on TIKA-90:
--

Useful for Open XML Office & OpenOffice files and some others with embedded 
thumbnails.

 Allow thumbnails as document metadata
 -

 Key: TIKA-90
 URL: https://issues.apache.org/jira/browse/TIKA-90
 Project: Tika
  Issue Type: New Feature
  Components: general
Reporter: Jukka Zitting

 It would be nice if parser components could produce thumbnail images and 
 other non-string metadata when parsing documents.
 To do this, we could either generalize the current Metadata methods, or 
 introduce new methods for handling such non-string metadata.





RE: Extract thumbnail from openxml office files

2014-01-09 Thread Nick Burch

On Thu, 9 Jan 2014, Hong-Thai Nguyen wrote:

> By searching on issues, I found the issue already created:
> https://issues.apache.org/jira/browse/TIKA-90


I'm not sure if the metadata is the right place to return this. Some 
formats offer a small thumbnail, others can offer a small thumbnail for 
every page, and at least one can include a full-size image of the first 
page.


Would we not be better off exposing these embedded renderings via the 
existing embedded resources handling, with some sort of handy way to 
identify what something is (eg this is a full-size PNG of page 1, this is 
a jpg thumbnail of page 3)?


Nick


Re: [DISCUSS] Prepare Release 1.5?

2014-01-09 Thread Chris Mattmann
Hey Dave,

I kind of got bogged down and haven't had time to release. If someone
else does have time and wants to pick this up, +1 for it!

Cheers,
Chris








Re: Extract thumbnail from openxml office files

2014-01-09 Thread Mattmann, Chris A (398J)
Hi Hong-Thai,

+1 to using cardinality to help denote more complex metadata relationships,
at least until we get past prior discussions on Metadata and namespacing.

See the wiki here for some prior thoughts:
http://wiki.apache.org/tika/MetadataDiscussion


I know our metadata structure is simple -- it was purposefully designed that
way. Even though very complex and hierarchical metadata structures existed at
the time and could have been leveraged, they were passed over in favor of a
simple approach, e.g., key/multi-value (note the distinction from plain
key/value).

Thanks!

Cheers,
Chris
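
As a rough illustration of the key/multi-value point, here is a minimal sketch of a flat map in which one key can carry several values. The class name `MultiValueMetadataSketch` and its methods are stand-ins invented for this example, not Tika's own Metadata class; it assumes Java 8 or later.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: a stand-in for the idea, not Tika's Metadata class.
class MultiValueMetadataSketch {
    private final Map<String, List<String>> values = new LinkedHashMap<>();

    /** Appends a value under the key, keeping any earlier values. */
    void add(String key, String value) {
        values.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
    }

    /** Returns every value for the key in insertion order; empty if absent. */
    List<String> getValues(String key) {
        return Collections.unmodifiableList(
                values.getOrDefault(key, Collections.emptyList()));
    }

    /** Returns the first value, like a plain key/value lookup, or null if absent. */
    String get(String key) {
        List<String> all = values.get(key);
        return all == null ? null : all.get(0);
    }

    public static void main(String[] args) {
        MultiValueMetadataSketch metadata = new MultiValueMetadataSketch();
        metadata.add("author", "First Author");
        metadata.add("author", "Second Author");
        System.out.println(metadata.get("author"));
        System.out.println(metadata.getValues("author"));
    }
}
```

The structure stays flat: repeated values express cardinality, but nothing here can say that a given first name belongs with a given last name, which is the limitation the hierarchical proposals are trying to address.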



-Original Message-
From: Hong-Thai Nguyen hong-thai.ngu...@polyspot.com
Reply-To: dev@tika.apache.org dev@tika.apache.org
Date: Thursday, January 9, 2014 8:36 AM
To: dev@tika.apache.org dev@tika.apache.org
Subject: RE: Extract thumbnail from openxml office files

Hi Nick,
You're beginning a very interesting topic about the foundation of our metadata
concept :)
I agree with you that metadata is not the best place to store the thumbnail
result. Until now, our metadata has been a simple map of key:values. This
structure is not really flexible in some cases. For example, we may want to
store authors' information, where each author has a first name and a last name.
Ideally, we could have something like a struct:
Person:
   FirstName
   LastName

Another example is our future thumbnail. We could have a 'thumbnail' metadata
entry with a hierarchical structure like:
Thumbnail:
   Dimension
   Width
   Length
   MimeType
   Extension
   Pages
   Description

That needs a huge refactoring of our core model. Another solution is to keep
the thumbnail result as a list, List<byte[]>, instead of a single value. Each
element is the thumbnail of a page. If the list has only 1 element, that means
there's only a thumbnail of the first page.

Hong-Thai




RE: Extract thumbnail from openxml office files

2014-01-09 Thread Nick Burch

On Thu, 9 Jan 2014, Hong-Thai Nguyen wrote:
> I agree with you that metadata is not the best place to store thumbnail
> result. Until now, our metadata is simple map with key:values. This
> structure is not really flexible in some cases.


Currently, we have four kinds of things that we return for content:
 * Type
 * Metadata
 * Content, as xhtml
 * Any resources embedded in it (eg nested documents, images etc)

I'm not disputing that our Metadata setup could use some more work to make 
it richer (within reason!); what I'm not sure of is that an expanded metadata 
system is the right place to put thumbnails and full-page renderings. 
Those feel a lot more like embedded resources to me.


> Another example is for our future thumbnail. If we can have a metadata
> 'thumbnail' with hierarchical structure like:
>
> Thumbnail:
>   Dimension
>   Width
>   Length
>   MimeType
>   Extension
>   Pages
>   Description


If we returned the thumbnail as an embedded resource, you'd get the type + 
full metadata on the image (not just width/length), along with extension 
etc. If we had a common naming scheme for them, possibly with some custom 
metadata keys, we could return the page number it applies to, along with 
whether it's a thumbnail or a full-size rendering (some formats have one, 
the other, or both).


Are you able to explain how your scheme would be simpler and easier to use 
than returning them as embedded resources?


Nick
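
For illustration, a common naming scheme plus a few custom metadata keys could look roughly like the sketch below. This is only one possible shape under stated assumptions: the key names (`rendering:kind`, `rendering:page`), the resource-name pattern, and the use of a plain `Map` for the per-resource metadata are all invented here and were not agreed anywhere in this thread.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: the names and keys here are invented, not Tika conventions.
class EmbeddedRenderingSketch {

    enum Kind { THUMBNAIL, FULL_SIZE }

    // Assumed metadata keys used to identify what an embedded rendering is.
    static final String KEY_NAME = "resourceName";
    static final String KEY_TYPE = "Content-Type";
    static final String KEY_KIND = "rendering:kind";
    static final String KEY_PAGE = "rendering:page";

    /** Builds a predictable resource name such as "thumbnail-page-3.jpg". */
    static String resourceName(Kind kind, int page, String extension) {
        String prefix = (kind == Kind.THUMBNAIL) ? "thumbnail" : "rendering";
        return prefix + "-page-" + page + "." + extension;
    }

    /** Describes one embedded rendering through the metadata sent along with it. */
    static Map<String, String> describe(Kind kind, int page, String mediaType, String extension) {
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put(KEY_NAME, resourceName(kind, page, extension));
        metadata.put(KEY_TYPE, mediaType);
        metadata.put(KEY_KIND, kind.name());
        metadata.put(KEY_PAGE, Integer.toString(page));
        return metadata;
    }

    public static void main(String[] args) {
        // "This is a full-size PNG of page 1, this is a jpg thumbnail of page 3."
        System.out.println(describe(Kind.FULL_SIZE, 1, "image/png", "png"));
        System.out.println(describe(Kind.THUMBNAIL, 3, "image/jpeg", "jpg"));
    }
}
```

A consumer that does not care about renderings can skip anything carrying the kind key, while one that does can pick by page number without guessing from the file name.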


RE: Extract thumbnail from openxml office files

2014-01-09 Thread Hong-Thai Nguyen
I'm convinced that using embedded resources is a better solution. Thanks, Nick.
@Matt, I wasn't aware that we had already reflected on the metadata structure. Interesting.

We should adapt the TIKA-90 title & description. I hope to take the initiative on 
this work.

Hong-Thai




RE: Extract thumbnail from openxml office files

2014-01-09 Thread Nick Burch

On Thu, 9 Jan 2014, Hong-Thai Nguyen wrote:

> I'm convinced that using embedded resources is a better solution.


OK, sounds like we have a consensus and can go ahead with it, great!

One outstanding query is what name we should give to these when we return 
them as embedded resources, and if we should include a special key/value 
in the metadata that we send with them to identify them?


The source code for Alfresco has examples of extracting thumbnails and 
full images from a number of formats, along with tests. Firstly this could 
be a good source of inspiration of what formats to go for, and how to do 
it. Secondly, with a number of Alfrescans involved in the project, we 
might even be able to get the key bits of logic from the code + tests 
contributed into Tika, to speed things up :)


Nick


[jira] [Commented] (TIKA-1217) Integrate with Java-7 FileTypeDetector API

2014-01-09 Thread Jukka Zitting (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13866722#comment-13866722
 ] 

Jukka Zitting commented on TIKA-1217:
-

Nice idea!

I think putting such a feature into a separate tika-java7 component (included in 
the build only when using Java 7 or higher) is the best solution for now, as 
otherwise we'd need to raise the requirements on build environments. Once we do 
that at some point in the future, the component can be merged into tika-core.

 Integrate with Java-7 FileTypeDetector API
 --

 Key: TIKA-1217
 URL: https://issues.apache.org/jira/browse/TIKA-1217
 Project: Tika
  Issue Type: New Feature
  Components: detector, mime
Reporter: Peter Ansell

 It would be useful if Tika natively provided Java-7 FileTypeDetector [1] 
 implementations. Adding the corresponding 
 META-INF/services/java.nio.file.spi.FileTypeDetector files would allow the 
 use of Files.probeContentType [2] without any specific links to Tika for this 
 functionality.
 If you do not want to rely on Java-7 for the core, then this could be added 
 as an extension module.
 [1] 
 http://docs.oracle.com/javase/7/docs/api/java/nio/file/spi/FileTypeDetector.html
 [2] 
 http://docs.oracle.com/javase/7/docs/api/java/nio/file/Files.html#probeContentType(java.nio.file.Path)
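
As a rough sketch of what such an implementation involves, the example below subclasses the Java 7 `FileTypeDetector` and recognises one format from its leading bytes. The class `MagicBytesFileTypeDetector` and its PDF-only check are made up for illustration; this is not the detector from the attached patch and it does not call Tika at all.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.spi.FileTypeDetector;

// Hypothetical detector for illustration; not the class from the TIKA-1217 patch.
class MagicBytesFileTypeDetector extends FileTypeDetector {

    @Override
    public String probeContentType(Path path) throws IOException {
        byte[] header = new byte[4];
        int total = 0;
        try (InputStream in = Files.newInputStream(path)) {
            // Keep reading until the header is full or the file ends.
            while (total < header.length) {
                int n = in.read(header, total, header.length - total);
                if (n < 0) {
                    break;
                }
                total += n;
            }
        }
        // "%PDF" is the leading magic of a PDF file.
        if (total == 4 && header[0] == '%' && header[1] == 'P'
                && header[2] == 'D' && header[3] == 'F') {
            return "application/pdf";
        }
        return null; // unknown type: lets the next installed detector try
    }

    public static void main(String[] args) throws IOException {
        Path sample = Files.createTempFile("sample", ".bin");
        try {
            Files.write(sample, "%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII));
            System.out.println(new MagicBytesFileTypeDetector().probeContentType(sample));
        } finally {
            Files.delete(sample);
        }
    }
}
```

For `Files.probeContentType` to pick up a detector like this, the class would need to be public and its fully qualified name listed in a `META-INF/services/java.nio.file.spi.FileTypeDetector` file on the classpath, as the description above proposes. The sketch calls the detector directly to stay self-contained.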





RE: Extract thumbnail from openxml office files

2014-01-09 Thread Hong-Thai Nguyen
Thanks a lot, Nick. That's a great reference. BTW, I may be wrong, but thumbnail 
handling in Alfresco seems quite complex because Alfresco can call external 
thumbnail generation with PDFBox or PDFRender. I'm defining the DoD by retaining 
some of its main features in TIKA-90.
Could you point me to an example of returning an embedded document in Tika parsers?

Thanks

Hong-Thai




RE: Extract thumbnail from openxml office files

2014-01-09 Thread Nick Burch

On Thu, 9 Jan 2014, Hong-Thai Nguyen wrote:
> BTW, I may be wrong, but thumbnail handling in Alfresco seems quite
> complex because Alfresco can call external thumbnail generation with
> PDFBox or PDFRender


It can do, yes, but there are also dedicated classes to pull out most of 
the common thumbnails from common office formats that have them; that was 
the bit I had in mind.


> Could you point me to an example of returning an embedded document in Tika
> parsers?


To see the output side, your best bet is the -z option to Tika App. For 
the parser side, look at something like AbstractPOIFSExtractor (esp the 
handleEmbedded methods) or look at PackageParser (almost all the content 
from that is embedded resources)


Nick


[jira] [Commented] (TIKA-1216) parse method of Mp3Parser doesn't work for few mp3 files

2014-01-09 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13866916#comment-13866916
 ] 

Tim Allison commented on TIKA-1216:
---

Agreed.  I didn't think this was a duplicate.  It is fixed, though, in trunk?  
If so, let's close this issue.



[jira] [Updated] (TIKA-1217) Integrate with Java-7 FileTypeDetector API

2014-01-09 Thread Peter Ansell (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Ansell updated TIKA-1217:
---

Attachment: TIKA-1217.patch

Patch to add FileTypeDetector implementation



[jira] [Commented] (TIKA-1217) Integrate with Java-7 FileTypeDetector API

2014-01-09 Thread Peter Ansell (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13867207#comment-13867207
 ] 

Peter Ansell commented on TIKA-1217:


Patch can also be reviewed at GitHub:

https://github.com/ansell/tika/compare/apache:trunk...ansell:TIKA-1217



[jira] [Commented] (TIKA-1217) Integrate with Java-7 FileTypeDetector API

2014-01-09 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13867228#comment-13867228
 ] 

Nick Burch commented on TIKA-1217:
--

Minor thing, but the section "// Then open an InputStream if necessary" would 
probably be more efficient if you used a File rather than a Stream. TikaInputStream 
will open the stream as needed, but for those things which need a File it'll be 
more efficient if the File is known (otherwise it'll have to spool to a temp file).



[jira] [Updated] (TIKA-1217) Integrate with Java-7 FileTypeDetector API

2014-01-09 Thread Peter Ansell (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Ansell updated TIKA-1217:
---

Attachment: TIKA-1217-v2.patch

New version of patch checking File instead of InputStream



[jira] [Commented] (TIKA-1217) Integrate with Java-7 FileTypeDetector API

2014-01-09 Thread Peter Ansell (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13867244#comment-13867244
 ] 

Peter Ansell commented on TIKA-1217:


Nick: New version of the patch uses Path.toFile() to get a file reference 
instead of Files.newInputStream.
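
For readers following along, the difference between the two calls can be sketched as below. The helper class `PathSourceSketch` is illustrative only and is not taken from the patch; it relies on the documented behaviour that `Path.toFile()` throws `UnsupportedOperationException` for paths that do not belong to the default file system provider.

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Illustrative helper, not code from the TIKA-1217 patch.
class PathSourceSketch {

    /** Returns the backing File when the path has one, otherwise null. */
    static File fileOrNull(Path path) {
        try {
            return path.toFile();
        } catch (UnsupportedOperationException e) {
            // Paths from non-default providers (such as a zip file system) have no File.
            return null;
        }
    }

    /** Reports which kind of source a caller could hand on to a detector. */
    static String describeSource(Path path) throws IOException {
        File file = fileOrNull(path);
        if (file != null) {
            return "file"; // consumer can use the File directly, no temp copy needed
        }
        try (InputStream in = Files.newInputStream(path)) {
            return "stream"; // fallback: only a stream is available
        }
    }

    public static void main(String[] args) throws IOException {
        Path sample = Files.createTempFile("sample", ".txt");
        try {
            System.out.println(describeSource(sample));
        } finally {
            Files.delete(sample);
        }
    }
}
```

Keeping a stream fallback is a design choice of this sketch: a detector may be asked about paths inside archives or other custom file systems, where no `File` exists.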
