Re: [DISCUSS] Prepare Release 1.5?
Hi,

On 29 Dec 2013, at 11:41, David Meikle loo...@gmail.com wrote:

> Hi Guys,
>
> There have been some questions popping up around when a new 1.5 release will be available. I have some free cycles over the next couple of weeks to prepare one, and I believe Chris has some too, so in preparation for that: what do we need to do to make the current trunk releasable as version 1.5?
>
> For me, the following issue needs to be fixed before release:
>
> TIKA-1198 - the change to using multi-parts appears to have broken our current guidance on usage significantly.
>
> Is there anything else others think is a must before rolling a release?
>
> I was also thinking we could do some quick work to include the following issues: TIKA-1059, TIKA-985, TIKA-980
>
> I don’t want to hold things up, so once we have sorted people's mandatory issues I think we should roll a release.
>
> @Chris - I know you had free cycles and volunteered, so I will defer to you on the release management side of things. That said, I'm happy to take it on if that helps.
>
> Cheers,
> Dave

Conscious that it was the festive period of late, I'm wondering if anyone has had further thoughts on this?

Cheers,
Dave
[jira] [Reopened] (TIKA-1216) parse method of Mp3Parser doesn't work for few mp3 files
[ https://issues.apache.org/jira/browse/TIKA-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sumeet Gorab reopened TIKA-1216:

Hi Tim Allison,

The reported bug is not a duplicate of TIKA-1215, because in TIKA-1215 the parse method throws an exception, whereas in TIKA-1216 no exception is thrown during execution.

Thanks and Regards,
Sumeet Gorab

> parse method of Mp3Parser doesn't work for few mp3 files
>
> Key: TIKA-1216
> URL: https://issues.apache.org/jira/browse/TIKA-1216
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.4
> Environment: Windows 7 ultimate 32-bit OS, Java 1.7
> Reporter: Sumeet Gorab
> Priority: Blocker
> Labels: patch
> Fix For: 1.5
> Attachments: 05 - Dharti - Sarkaaran [www.DJMaza.Com].mp3
>
> When trying to parse an MP3 file, the parse method of the Mp3Parser class is not able to parse it. The parse method does not complete its execution; there is some issue in that method.

-- This message was sent by Atlassian JIRA (v6.1.5#6160)
RE: Extract thumbnail from openxml office files
Hi Ray and all,

While searching the issues, I found that one has already been created: https://issues.apache.org/jira/browse/TIKA-90

Maybe now is the time to implement it.

Thanks,
Hong-Thai

-----Original Message-----
From: Ray Gauss II [mailto:ray.ga...@alfresco.com]
Sent: Wednesday, 8 January 2014 11:49
To: dev@tika.apache.org
Subject: Re: Extract thumbnail from openxml office files

Hi Hong-Thai,

It’s certainly worth investigating. Several other formats can have embedded thumbnails as well, so we could implement a generic thumbnail property.

We could probably store it as something like a Base64-encoded string, but we’d likely want to place limits on the size and may need a thumbnail internet media type field as well to assist in decoding.

Unless others feel differently, I would say open a JIRA where we could start discussing the design of such a feature.

Thanks!
Ray

On January 8, 2014 at 5:36:32 AM, Hong-Thai Nguyen (hong-thai.ngu...@polyspot.com) wrote:

> Hi all,
>
> I want to extract the thumbnail image included in Open XML office files. Apparently, we can do it with openxml4j: http://openxmldeveloper.org/blog/b/openxmldeveloper/archive/2006/11/21/openxmlandjava.aspx
>
> The question is: should we integrate the thumbnail into the default metadata list of the OOXML parsing result?
>
> Thanks
> Hong-Thai
[jira] [Commented] (TIKA-90) Allow thumbnails as document metadata
[ https://issues.apache.org/jira/browse/TIKA-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13866498#comment-13866498 ]

Hong-Thai Nguyen commented on TIKA-90:

Useful for Open XML Office and OpenOffice files, and some others with an embedded thumbnail.

> Allow thumbnails as document metadata
>
> Key: TIKA-90
> URL: https://issues.apache.org/jira/browse/TIKA-90
> Project: Tika
> Issue Type: New Feature
> Components: general
> Reporter: Jukka Zitting
>
> It would be nice if parser components could produce thumbnail images and other non-string metadata when parsing documents. To do this, we could either generalize the current Metadata methods, or introduce new methods for handling such non-string metadata.
RE: Extract thumbnail from openxml office files
On Thu, 9 Jan 2014, Hong-Thai Nguyen wrote:

> By searching on issues, I found the issue already created: https://issues.apache.org/jira/browse/TIKA-90

I'm not sure if the metadata is the right place to return this. Some formats offer a small thumbnail, others can offer a small thumbnail for every page, and at least one can include a full-size image of the first page.

Would we not be better off exposing these embedded renderings via the existing embedded resources handling, with some sort of handy way to identify what something is (e.g. this is a full-size PNG of page 1, this is a jpg thumbnail of page 3)?

Nick
Re: [DISCUSS] Prepare Release 1.5?
Hey Dave,

I kind of got bogged down and haven't had time to release. If someone else does have time and wants to pick this up, +1 for it!

Cheers,
Chris

-----Original Message-----
From: David Meikle loo...@gmail.com
Reply-To: dev@tika.apache.org
Date: Thursday, January 9, 2014 3:46 AM
To: dev@tika.apache.org
Subject: Re: [DISCUSS] Prepare Release 1.5?

Hi,

On 29 Dec 2013, at 11:41, David Meikle loo...@gmail.com wrote:

> Hi Guys,
>
> There have been some questions pop up around when a new 1.5 release will be available. I have some free cycles over the next couple of weeks to prepare one and I believe Chris has some too, so in preparation for that what do we need to do to make the current trunk releasable as version 1.5?
>
> For me the following issue need to be fixed before release:
>
> TIKA-1198 - the change to using multi-parts appears to have broken our current guidance on usage significantly.
>
> Is there anything else others think is a must before rolling a release?
>
> I was also thinking we could do some quick work to include the following issues: TIKA-1059, TIKA-985, TIKA-980
>
> I don't want to hold things up, so if we sort peoples mandatories I think we should roll a release.
>
> @Chris - I know you had free cycles and volunteered so will defer to you on the release management side of things. That said happy to take it on if that helps.
>
> Cheers,
> Dave

Conscious it was the festive period of late, so wondering if anyone has had further thoughts on this?

Cheers,
Dave
Re: Extract thumbnail from openxml office files
Hi Hong-Thai,

+1 to using cardinality to help denote more complex metadata relationships, at least until we get past prior discussions on Metadata and name spacing. See the wiki here for some past thoughts: http://wiki.apache.org/tika/MetadataDiscussion

I know our metadata structure is simple. It was purposefully designed that way: at the time, very complex and hierarchical metadata structures existed and could have been leveraged, but they were passed over in favor of a simple approach, i.e. key to multi-value (note the distinction from key to value).

Thanks!

Cheers,
Chris

-----Original Message-----
From: Hong-Thai Nguyen hong-thai.ngu...@polyspot.com
Reply-To: dev@tika.apache.org
Date: Thursday, January 9, 2014 8:36 AM
To: dev@tika.apache.org
Subject: RE: Extract thumbnail from openxml office files

Hi Nick,

You're beginning a very interesting topic about the foundation of our metadata concept :)

I agree with you that metadata is not the best place to store the thumbnail result. Until now, our metadata has been a simple map of key:values. This structure is not really flexible in some cases. For example, suppose we want to store author information, where each author has a first name and a last name. Ideally, we could have something like a struct:

Person:
  FirstName
  LastName

Another example is our future thumbnail. We could have a 'thumbnail' metadata entry with a hierarchical structure like:

Thumbnail:
  Dimension
    Width
    Length
  MimeType
  Extension
  Pages
  Description

That needs a huge refactoring of our core model. Another solution is to keep the thumbnail result as a list, List<byte[]>, instead of a single value. Each element is the thumbnail of a page. If the list has only 1 element, it means there is only a thumbnail of the first page.

Hong-Thai

-----Original Message-----
From: Nick Burch [mailto:apa...@gagravarr.org]
Sent: Thursday, 9 January 2014 12:11
To: dev@tika.apache.org
Subject: RE: Extract thumbnail from openxml office files

On Thu, 9 Jan 2014, Hong-Thai Nguyen wrote:

> By searching on issues, I found the issue already created: https://issues.apache.org/jira/browse/TIKA-90

I'm not sure if the metadata is the right place to return this. Some formats offer a small thumbnail, others can offer a small thumbnail for every page, and at least one can include a full-size image of the first page.

Would we not be better off exposing these embedded renderings via the existing embedded resources handling, with some sort of handy way to identify what something is (e.g. this is a full-size PNG of page 1, this is a jpg thumbnail of page 3)?

Nick
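To make the key to multi-value point and the author example above more concrete, here is a minimal sketch. It is not Tika's Metadata class: `FlatMetadata`, its methods, the key names and the sample author names are all hypothetical and invented for illustration. It shows how two authors end up as parallel value lists, related only by position, which is the limitation being discussed.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of a flat key -> multi-value store.
// This is NOT Tika's Metadata class; all names here are invented.
class FlatMetadata {
    private final Map<String, List<String>> values = new LinkedHashMap<>();

    // Appends a value under the key, keeping insertion order.
    void add(String key, String value) {
        List<String> list = values.get(key);
        if (list == null) {
            list = new ArrayList<>();
            values.put(key, list);
        }
        list.add(value);
    }

    // Returns every value stored for the key, or an empty list.
    List<String> getValues(String key) {
        List<String> list = values.get(key);
        if (list == null) {
            return Collections.emptyList();
        }
        return Collections.unmodifiableList(list);
    }

    // Two authors flattened into parallel lists: the first and last name
    // of one person are related only by sharing the same index.
    static FlatMetadata twoAuthors() {
        FlatMetadata metadata = new FlatMetadata();
        metadata.add("author.firstName", "Ada");
        metadata.add("author.lastName", "Lovelace");
        metadata.add("author.firstName", "Alan");
        metadata.add("author.lastName", "Turing");
        return metadata;
    }
}
```

Nothing in this structure stops the two lists from drifting out of step, which is one reason a hierarchical Person or Thumbnail entry looks attractive, at the cost of the core refactoring mentioned above.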
RE: Extract thumbnail from openxml office files
On Thu, 9 Jan 2014, Hong-Thai Nguyen wrote:

> I agree with you that metadata is not the best place to store thumbnail result. Until now, our metadata is simple map with key:values. This structure is not really flexible in some cases.

Currently, we have four kinds of things that we return for content:

* Type
* Metadata
* Content, as xhtml
* Any resources embedded in it (e.g. nested documents, images etc.)

I'm not disputing that our Metadata setup could use some more work to make it richer (within reason!). What I'm not sure of is whether an expanded metadata system is the right place to put thumbnails and full-page renderings. Those feel a lot more like embedded resources to me.

> Another example is for our future thumbnail. If we can have a metadata 'thumbnail' with hierarchical structure like:
>
> Thumbnail:
>   Dimension
>     Width
>     Length
>   MimeType
>   Extension
>   Pages
>   Description

If we returned the thumbnail as an embedded resource, you'd get the type + full metadata on the image (not just width/length), along with extension etc. If we had a common naming scheme for them, possibly with some custom metadata keys, we could return the page number it applies to, along with whether it's a thumbnail or a full-size rendering (some formats have one, the other, or both).

Are you able to explain how your scheme would be simpler and easier to use than returning them as embedded resources?

Nick
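One way to picture the "common naming scheme plus custom metadata keys" idea is the sketch below. Everything in it is an assumption made for illustration: the class `EmbeddedRendering`, the key names `rendering:kind` and `rendering:page`, and the resource-name pattern were not proposed in the thread and are not existing Tika constants.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical naming scheme for embedded renderings. The key names and
// the resource-name pattern are invented for this sketch only.
class EmbeddedRendering {
    enum Kind { THUMBNAIL, FULL_SIZE }

    static final String KIND_KEY = "rendering:kind"; // assumed key name
    static final String PAGE_KEY = "rendering:page"; // assumed key name

    // Builds a predictable resource name such as "thumbnail-page-3.jpg".
    static String resourceName(Kind kind, int page, String extension) {
        String prefix = (kind == Kind.THUMBNAIL) ? "thumbnail" : "rendering";
        return prefix + "-page-" + page + "." + extension;
    }

    // Metadata that would travel with the embedded resource, so a consumer
    // can tell what the rendering is without inspecting the image bytes.
    static Map<String, String> describe(Kind kind, int page, String mediaType) {
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put("Content-Type", mediaType);
        metadata.put(KIND_KEY, kind.name());
        metadata.put(PAGE_KEY, Integer.toString(page));
        return metadata;
    }
}
```

Under this scheme, a consumer would only need the two extra keys to distinguish "a jpg thumbnail of page 3" from "a full-size PNG of page 1"; the image's own dimensions would come from parsing the embedded resource as usual.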
RE: Extract thumbnail from openxml office files
I'm convinced that using embedded resources is the better solution. Thanks, Nick.

@Matt, I wasn't aware that we had already reflected on the metadata structure. Interesting.

We should adapt the TIKA-90 title and description. I hope to take the initiative on this work.

Hong-Thai

-----Original Message-----
From: Nick Burch [mailto:apa...@gagravarr.org]
Sent: Thursday, 9 January 2014 15:25
To: dev@tika.apache.org
Subject: RE: Extract thumbnail from openxml office files

On Thu, 9 Jan 2014, Hong-Thai Nguyen wrote:

> I agree with you that metadata is not the best place to store thumbnail result. Until now, our metadata is simple map with key:values. This structure is not really flexible in some cases.

Currently, we have four kinds of things that we return for content:

* Type
* Metadata
* Content, as xhtml
* Any resources embedded in it (e.g. nested documents, images etc.)

I'm not disputing that our Metadata setup could use some more work to make it richer (within reason!). What I'm not sure of is whether an expanded metadata system is the right place to put thumbnails and full-page renderings. Those feel a lot more like embedded resources to me.

> Another example is for our future thumbnail. If we can have a metadata 'thumbnail' with hierarchical structure like:
>
> Thumbnail:
>   Dimension
>     Width
>     Length
>   MimeType
>   Extension
>   Pages
>   Description

If we returned the thumbnail as an embedded resource, you'd get the type + full metadata on the image (not just width/length), along with extension etc. If we had a common naming scheme for them, possibly with some custom metadata keys, we could return the page number it applies to, along with whether it's a thumbnail or a full-size rendering (some formats have one, the other, or both).

Are you able to explain how your scheme would be simpler and easier to use than returning them as embedded resources?

Nick
RE: Extract thumbnail from openxml office files
On Thu, 9 Jan 2014, Hong-Thai Nguyen wrote:

> I'm convinced that using embedded resources is a better solution.

OK, sounds like we have a consensus and can go ahead with it, great!

One outstanding query is what name we should give to these when we return them as embedded resources, and whether we should include a special key/value in the metadata that we send with them to identify them.

The source code for Alfresco has examples of extracting thumbnails and full images from a number of formats, along with tests. Firstly, this could be a good source of inspiration for which formats to go for, and how to do it. Secondly, with a number of Alfrescans involved in the project, we might even be able to get the key bits of logic from the code + tests contributed into Tika, to speed things up :)

Nick
[jira] [Commented] (TIKA-1217) Integrate with Java-7 FileTypeDetector API
[ https://issues.apache.org/jira/browse/TIKA-1217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13866722#comment-13866722 ]

Jukka Zitting commented on TIKA-1217:

Nice idea! I think putting such a feature into a separate tika-java7 component (included in the build only when using Java 7 or higher) is the best solution for now, as otherwise we'd need to raise the requirements on build environments. Once we do raise them at some point in the future, the component can be merged into tika-core.

> Integrate with Java-7 FileTypeDetector API
>
> Key: TIKA-1217
> URL: https://issues.apache.org/jira/browse/TIKA-1217
> Project: Tika
> Issue Type: New Feature
> Components: detector, mime
> Reporter: Peter Ansell
>
> It would be useful if Tika natively provided Java-7 FileTypeDetector [1] implementations. Adding the corresponding META-INF/services/java.nio.file.spi.FileTypeDetector files would allow the use of Files.probeContentType [2] without any specific links to Tika for this functionality.
> If you do not want to rely on Java-7 for the core, then this could be added as an extension module.
>
> [1] http://docs.oracle.com/javase/7/docs/api/java/nio/file/spi/FileTypeDetector.html
> [2] http://docs.oracle.com/javase/7/docs/api/java/nio/file/Files.html#probeContentType(java.nio.file.Path)
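For readers unfamiliar with the Java 7 SPI mentioned in the issue, the following is a minimal sketch of a `FileTypeDetector` subclass. It is not the attached patch and does not call Tika: the class name `SniffingFileTypeDetector` and its tiny magic-number table are hypothetical stand-ins, used so the example runs on the JDK alone. A Tika-backed implementation would presumably delegate to Tika's detection instead of `sniff`.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.spi.FileTypeDetector;

// Hypothetical stand-in for a Tika-backed detector: it checks a few magic
// numbers itself so that the example needs nothing beyond the JDK.
class SniffingFileTypeDetector extends FileTypeDetector {

    // Maps leading bytes to a media type. Returning null means "unknown",
    // which lets the JDK fall through to other installed detectors.
    static String sniff(byte[] head, int length) {
        if (startsWith(head, length, 0x89, 'P', 'N', 'G')) return "image/png";
        if (startsWith(head, length, '%', 'P', 'D', 'F')) return "application/pdf";
        if (startsWith(head, length, 'P', 'K', 0x03, 0x04)) return "application/zip";
        return null;
    }

    private static boolean startsWith(byte[] data, int length, int... magic) {
        if (length < magic.length) return false;
        for (int i = 0; i < magic.length; i++) {
            if ((data[i] & 0xFF) != magic[i]) return false;
        }
        return true;
    }

    @Override
    public String probeContentType(Path path) throws IOException {
        byte[] head = new byte[8];
        int filled = 0;
        try (InputStream in = Files.newInputStream(path)) {
            int read;
            // Keep reading until the buffer is full or the file ends.
            while (filled < head.length
                    && (read = in.read(head, filled, head.length - filled)) > 0) {
                filled += read;
            }
        }
        return sniff(head, filled);
    }
}
```

As the issue describes, a detector like this is picked up by `Files.probeContentType` once its fully qualified class name is listed in a `META-INF/services/java.nio.file.spi.FileTypeDetector` file on the class path; calling code then needs no reference to the implementing library.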
RE: Extract thumbnail from openxml office files
Thanks a lot, Nick. That's a great reference.

BTW, I may be wrong, but thumbnail handling in Alfresco seems quite complex, because Alfresco can call external thumbnail generation with PDFBox or PDFRender. I'm defining the DoD in TIKA-90 by retaining some of the main features from it.

Could you point me to an example of returning an embedded document in Tika parsers?

Thanks
Hong-Thai

-----Original Message-----
From: Nick Burch [mailto:apa...@gagravarr.org]
Sent: Thursday, 9 January 2014 15:49
To: dev@tika.apache.org
Subject: RE: Extract thumbnail from openxml office files

On Thu, 9 Jan 2014, Hong-Thai Nguyen wrote:

> I'm convinced that using embedded resources is a better solution.

OK, sounds like we have a consensus and can go ahead with it, great!

One outstanding query is what name we should give to these when we return them as embedded resources, and whether we should include a special key/value in the metadata that we send with them to identify them.

The source code for Alfresco has examples of extracting thumbnails and full images from a number of formats, along with tests. Firstly, this could be a good source of inspiration for which formats to go for, and how to do it. Secondly, with a number of Alfrescans involved in the project, we might even be able to get the key bits of logic from the code + tests contributed into Tika, to speed things up :)

Nick
RE: Extract thumbnail from openxml office files
On Thu, 9 Jan 2014, Hong-Thai Nguyen wrote:

> BTW, I may be wrong to say that thumbnail handling in Alfresco is quite complex because Alfresco can call external thumbnail generation with PDFBox or PDFRender

It can do, yes, but there are also dedicated classes to pull out most of the common thumbnails from common office formats that have them; that was the bit I had in mind when referencing it.

> Could you guide me to an example of returning embedded document in Tika parsers?

To see the output side, your best bet is the -z option to Tika App. For the parser side, look at something like AbstractPOIFSExtractor (esp. the handleEmbedded methods) or look at PackageParser (almost all the content from that is embedded resources).

Nick
[jira] [Commented] (TIKA-1216) parse method of Mp3Parser doesn't work for few mp3 files
[ https://issues.apache.org/jira/browse/TIKA-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13866916#comment-13866916 ]

Tim Allison commented on TIKA-1216:

Agreed. I didn't think this was a duplicate. It is fixed, though, in trunk? If so, let's close this issue.
[jira] [Updated] (TIKA-1217) Integrate with Java-7 FileTypeDetector API
[ https://issues.apache.org/jira/browse/TIKA-1217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Ansell updated TIKA-1217:

Attachment: TIKA-1217.patch

Patch to add FileTypeDetector implementation
[jira] [Commented] (TIKA-1217) Integrate with Java-7 FileTypeDetector API
[ https://issues.apache.org/jira/browse/TIKA-1217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13867207#comment-13867207 ]

Peter Ansell commented on TIKA-1217:

Patch can also be reviewed at GitHub: https://github.com/ansell/tika/compare/apache:trunk...ansell:TIKA-1217
[jira] [Commented] (TIKA-1217) Integrate with Java-7 FileTypeDetector API
[ https://issues.apache.org/jira/browse/TIKA-1217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13867228#comment-13867228 ]

Nick Burch commented on TIKA-1217:

Minor thing, but the section marked "// Then open an InputStream if necessary" would probably be more efficient if you used a File rather than a stream. TikaInputStream will open the stream as needed, but for those things which need a File it'll be more efficient if the File is known (otherwise it'll have to spool to a temp file).
[jira] [Updated] (TIKA-1217) Integrate with Java-7 FileTypeDetector API
[ https://issues.apache.org/jira/browse/TIKA-1217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Ansell updated TIKA-1217:

Attachment: TIKA-1217-v2.patch

New version of patch checking File instead of InputStream
[jira] [Commented] (TIKA-1217) Integrate with Java-7 FileTypeDetector API
[ https://issues.apache.org/jira/browse/TIKA-1217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13867244#comment-13867244 ]

Peter Ansell commented on TIKA-1217:

Nick: New version of the patch uses Path.toFile() to get a file reference instead of Files.newInputStream.
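One caveat with the File-based approach is that `Path.toFile()` is documented to throw `UnsupportedOperationException` when the path does not belong to the default file system provider (for example, a path inside a zip file system). The sketch below shows one possible guard; it is an illustration only, not what the v2 patch does, and the class and method names (`PathSource`, `fileOrNull`) are hypothetical.

```java
import java.io.File;
import java.nio.file.FileSystems;
import java.nio.file.Path;

// Hypothetical helper: prefer a File when one exists, so that code needing
// random access does not have to spool a stream to a temporary file.
class PathSource {

    // Returns the backing File when the path lives on the default file
    // system, or null when the caller should fall back to
    // Files.newInputStream(path) instead.
    static File fileOrNull(Path path) {
        if (path.getFileSystem() == FileSystems.getDefault()) {
            return path.toFile();
        }
        return null;
    }
}
```

A detector could try `fileOrNull` first and only open a stream when it returns null, keeping the efficiency gain Nick describes while still handling paths from other providers.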