[jira] [Comment Edited] (TIKA-1208) Migrate Any23 mime contributions to Tika
[ https://issues.apache.org/jira/browse/TIKA-1208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13866116#comment-13866116 ]

Lewis John McGibbney edited comment on TIKA-1208 at 1/9/14 12:39 AM:

OK Peter, let's work on this. I confirm that I am also getting compile errors. I'll push to your GitHub branch and we can take it from there. Thank you.

> Migrate Any23 mime contributions to Tika
>
> Key: TIKA-1208
> URL: https://issues.apache.org/jira/browse/TIKA-1208
> Project: Tika
> Issue Type: Sub-task
> Components: mime
> Reporter: Lewis John McGibbney
> Fix For: 1.5
> Attachments: TIKA-1208.patch
>
> We begin with one of the most obvious areas in which there is overlap.
> In short, the appeal of this package is the addition of detection
> for the following types:
> - text/n3
> - text/rdf+n3
> - application/n3
> - text/x-nquads
> - text/rdf+nq
> - text/nq
> - application/nq
> - text/turtle
> - application/x-turtle
> - application/turtle
> - application/trix
>
> Therefore, although both Tika and Any23 perform MIME-type-related
> tasks, there is a contribution to be made. This involves the transfer of
> code pertaining to pattern recognition, MIME type XML definitions within
> tika-mimetypes.xml, and a Purifier implementation that removes any
> blank characters at the head of a file that might prevent its
> MIME type detection.

-- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Comment Edited] (TIKA-1208) Migrate Any23 mime contributions to Tika
[ https://issues.apache.org/jira/browse/TIKA-1208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13866080#comment-13866080 ]

Lewis John McGibbney edited comment on TIKA-1208 at 1/9/14 12:35 AM:

Hi [~p_ansell],

I have been working on a patch for this issue, which I did not wish to push to Jira yet. However, I've been taken off course by bugs in a Gora branch, and I would like all of us (the Any23 team) to propose this together, if possible.

I attach a patch for migrating the Any23 mime package to Tika. It retains the Purifier concept of cleaning documents before they are processed for MIME/media type detection. I've not touched the Tika API or the Detect API within this implementation because (I personally think) the code migration would be harder to complete if we attempted to change the well-known and well-designed 'detect' and base 'Tika' APIs, e.g. by adding a Purifier parameter to method signatures.

This means that if we are to retain the concept of the Purifier interface, then implementations are detector-specific. Right now all we can offer (from Any23) is the WhiteSpacePurifier, which is OK, but implementing the functionality in this manner is NOT configurable, e.g. if someone wished to pass a custom Purifier as a parameter to detect(InputStream, Metadata, Purifier). I personally think that if other Purifiers were introduced, we could revisit this issue and possibly propose a change to various Tika interfaces so that detectors can accept a Purifier as a parameter.

Apart from that, this (WIP) patch introduces an Any23Detector, which basically stems from the TikaMIMETypeDetector we maintained in Any23. Please comment on this, as I am not sure this is the right way to proceed; there are most likely issues with the implementation I have coded. THIS PATCH IS MERELY A START. I would really appreciate input from the Any23 team on whether I am implementing the Any23 mime code in a way we think is suitable for migration to tika-core.

It should also be noted that the last time I ran this patch against Tika trunk there were issues with detection of 'semantic' MIME types. Hopefully this is a start we can build on. I am committed to getting this code suitable for proposal to Tika.

N.B. This patch also addresses ALL of the Java elements that cause warnings across the entire codebase, so it looks like a lot more than it actually is.

Any comments are VERY much appreciated.
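To make the Purifier idea in the comment above concrete, here is a minimal sketch of stripping leading blank characters from a stream before detection. It is not the Any23 WhiteSpacePurifier or any Tika API: the class name LeadingWhitespaceSkipper and its method are hypothetical, and the sketch assumes the stream supports mark/reset and that "blank" means whatever Character.isWhitespace accepts for a single byte.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only; not Any23 or Tika code.
class LeadingWhitespaceSkipper {

    // Advances the stream past any leading whitespace bytes, leaving it
    // positioned at the first non-whitespace byte (or at end of stream).
    static void skipLeadingWhitespace(InputStream in) {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        try {
            while (true) {
                in.mark(1);              // remember the position before the next byte
                int b = in.read();
                if (b == -1) {
                    return;              // only whitespace (or nothing) in the stream
                }
                if (!Character.isWhitespace(b)) {
                    in.reset();          // un-read the first meaningful byte
                    return;
                }
            }
        } catch (IOException e) {
            // Wrapped only to keep this sketch easy to call.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] doc = " \n\t@prefix ex: <http://example.org/> .".getBytes(StandardCharsets.US_ASCII);
        InputStream in = new ByteArrayInputStream(doc);
        skipLeadingWhitespace(in);
        // The stream now starts at '@', so a magic-byte match at offset 0 can apply.
        System.out.println((char) in.read());
    }
}
```

A detector that owns such a helper can call it before matching magic bytes, which is the detector-specific arrangement described above; passing the purifier in as a parameter would need an API change.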
[jira] [Commented] (TIKA-1208) Migrate Any23 mime contributions to Tika
[ https://issues.apache.org/jira/browse/TIKA-1208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13866116#comment-13866116 ]

Lewis John McGibbney commented on TIKA-1208:

OK Peter lets work on this
[jira] [Comment Edited] (TIKA-1208) Migrate Any23 mime contributions to Tika
[ https://issues.apache.org/jira/browse/TIKA-1208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13866080#comment-13866080 ]

Lewis John McGibbney edited comment on TIKA-1208 at 1/9/14 12:28 AM:

Hi [~p_ansell],

I have been working on a patch for this issue, which I did not wish to push to Jira yet. However, I've been taken off course by bugs in a Gora branch.

I attach a patch for migrating the Any23 mime package to Tika. It retains the Purifier concept of cleaning documents before they are processed for MIME/media type detection. I've not touched the Tika API or the Detect API within this implementation because (I personally think) the code migration would be harder to complete if we attempted to change the well-known and well-designed 'detect' and base 'Tika' APIs.

This means that Purifier implementations are detector-specific. Right now all we can offer is the WhiteSpacePurifier, which is OK, but it is NOT configurable, e.g. if someone wished to pass a Purifier as a parameter to detect(InputStream, Metadata, Purifier). I think that if other Purifiers were introduced, we could revisit this issue.

Apart from that, this (WIP) patch introduces an Any23Detector, which basically stems from the Tika detector we maintained in Any23. Please comment on this, as I am not sure this is the right way to proceed. THIS PATCH IS MERELY A START. I need input from the Any23 team on whether I am implementing the Any23 mime code in the correct way.

It should also be noted that the last time I ran this patch against Tika trunk there were issues with detection of 'semantic' MIME types. Hopefully this is a start we can build on. I am committed to getting this code suitable for proposal to Tika.

N.B. This patch also addresses ALL of the Java elements that cause warnings across the entire codebase.

Any comments are VERY much appreciated.
[jira] [Commented] (TIKA-1208) Migrate Any23 mime contributions to Tika
[ https://issues.apache.org/jira/browse/TIKA-1208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13866101#comment-13866101 ]

Peter Ansell commented on TIKA-1208:

The patch applies cleanly to the current trunk but it doesn't compile:

    [INFO] Compiling 40 source files to /home/ans025/gitrepos/tika/tika-core/target/test-classes
    [INFO] -
    [ERROR] COMPILATION ERROR :
    [INFO] -
    [ERROR] /home/ans025/gitrepos/tika/tika-core/src/test/java/org/apache/tika/detect/Any23DetectorTest.java:[432,66] error: cannot find symbol
    [ERROR] class Any23DetectorTest
    /home/ans025/gitrepos/tika/tika-core/src/test/java/org/apache/tika/detect/Any23DetectorTest.java:[448,37] error: cannot find symbol
    [INFO] 2 errors

I am not sure what the two broken lines should be changed to, as I am not familiar with the Tika codebase at this point.

I have put the patch on GitHub to work on it if that is easier for you (you are a collaborator on the repository):

https://github.com/ansell/tika/tree/TIKA-1208
[jira] [Updated] (TIKA-1208) Migrate Any23 mime contributions to Tika
[ https://issues.apache.org/jira/browse/TIKA-1208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney updated TIKA-1208:

Attachment: TIKA-1208.patch

Hi [~p_ansell],

I have been working on a patch for this issue, which I did not wish to push to Jira yet. However, I've been taken off course by bugs in a Gora branch.

I attach a patch for migrating the Any23 mime package to Tika. It retains the Purifier concept of cleaning documents before they are processed for MIME/media type detection. I've not touched the Tika API or the Detect API within this implementation because (I personally think) the code migration would be harder to complete if we attempted to change the well-known and well-designed 'detect' and base 'Tika' APIs.

This means that Purifier implementations are detector-specific. Right now all we can offer is the WhiteSpacePurifier, which is OK, but it is NOT configurable, e.g. if someone wished to pass a Purifier as a parameter to detect(InputStream, Metadata, Purifier). I think that if other Purifiers were introduced, we could revisit this issue.

Apart from that, this (WIP) patch introduces an Any23Detector, which basically stems from the Tika detector we maintained in Any23. Please comment on this, as I am not sure this is the right way to proceed. THIS PATCH IS MERELY A START. I need input from the Any23 team on whether I am implementing the Any23 mime code in the correct way.

It should also be noted that the last time I ran this patch against Tika trunk there were issues with detection of 'semantic' MIME types. Hopefully this is a start we can build on. I am committed to getting this code suitable for proposal to Tika.

Any comments are VERY much appreciated.
[jira] [Created] (TIKA-1217) Integrate with Java-7 FileTypeDetector API
Peter Ansell created TIKA-1217:

Summary: Integrate with Java-7 FileTypeDetector API
Key: TIKA-1217
URL: https://issues.apache.org/jira/browse/TIKA-1217
Project: Tika
Issue Type: New Feature
Components: detector, mime
Reporter: Peter Ansell

It would be useful if Tika natively provided Java-7 FileTypeDetector [1] implementations. Adding the corresponding META-INF/services/java.nio.file.spi.FileTypeDetector files would allow the use of Files.probeContentType [2] without any specific links to Tika for this functionality.

If you do not want to rely on Java-7 for the core, then this could be added as an extension module.

[1] http://docs.oracle.com/javase/7/docs/api/java/nio/file/spi/FileTypeDetector.html
[2] http://docs.oracle.com/javase/7/docs/api/java/nio/file/Files.html#probeContentType(java.nio.file.Path)

-- This message was sent by Atlassian JIRA (v6.1.5#6160)
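As a rough illustration of the service-provider mechanism this issue refers to, the sketch below subclasses java.nio.file.spi.FileTypeDetector using only the JDK. The class name ExtensionFileTypeDetector and its extension table are hypothetical stand-ins; a real Tika-backed provider would delegate to Tika's detection instead of a lookup table, and that wiring is not shown here.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.spi.FileTypeDetector;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical provider, for illustration only. To install it, a file named
//   META-INF/services/java.nio.file.spi.FileTypeDetector
// on the class path would contain the fully qualified name of this class.
class ExtensionFileTypeDetector extends FileTypeDetector {

    // Illustrative extension-to-type table (an assumption, not Tika's data).
    private static final Map<String, String> TYPES = new HashMap<>();
    static {
        TYPES.put("n3", "text/n3");
        TYPES.put("nq", "text/x-nquads");
        TYPES.put("ttl", "text/turtle");
        TYPES.put("trix", "application/trix");
    }

    // Returns the media type for a known extension, or null when the type is
    // not recognized, which is the contract of FileTypeDetector.
    @Override
    public String probeContentType(Path path) {
        Path fileName = path.getFileName();
        if (fileName == null) {
            return null;
        }
        String name = fileName.toString();
        int dot = name.lastIndexOf('.');
        if (dot < 0 || dot == name.length() - 1) {
            return null;
        }
        return TYPES.get(name.substring(dot + 1).toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        // Called directly here; once registered as a service, callers would
        // reach it through Files.probeContentType(path) instead.
        System.out.println(new ExtensionFileTypeDetector().probeContentType(Paths.get("data.ttl")));
    }
}
```

Keeping such a provider in a separate module, as the issue suggests, would let the core stay free of a Java 7 requirement.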
[jira] [Commented] (TIKA-1208) Migrate Any23 mime contributions to Tika
[ https://issues.apache.org/jira/browse/TIKA-1208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13866050#comment-13866050 ]

Peter Ansell commented on TIKA-1208:

I don't think any new MIME types have been added since 2.7.5. Most of them were added in 2.7.0 but I think some of them may have been added in 2.7.1. Any23 should be fine to bump to the current release 2.7.9, as we have not to my knowledge added any new interface methods in the patch releases that would complicate the bump. 2.8.0 will be a bit of a bump, as it is where we are updating to RDF-1.1, but it is still in alpha form.
Re: Extract thumbnail from openxml office files
Hi Hong-Thai,

It’s certainly worth investigating. Several other formats can have embedded thumbnails as well, so we could implement a generic thumbnail property. We could probably store it as something like a Base64-encoded string, but we’d likely want to place limits on the size and may need a thumbnail internet media type field as well to assist in decoding.

Unless others feel differently, I would say open a JIRA where we could start discussing the design of such a feature.

Thanks!

Ray

On January 8, 2014 at 5:36:32 AM, Hong-Thai Nguyen (hong-thai.ngu...@polyspot.com) wrote:

> Hi all,
>
> I want to extract thumbnail image included in Open XML office
> files. Apparently, we can do it by openxml4j:
> http://openxmldeveloper.org/blog/b/openxmldeveloper/archive/2006/11/21/openxmlandjava.aspx
>
> The question is : should we integrate thumbnail in default metadata
> list of ooxml parsing result ?
>
> Thanks
>
> Hong-Thai
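The storage idea in the reply above could look roughly like the following sketch, which uses only java.util.Base64 from the JDK. Everything specific in it is an assumption made for illustration: the class name ThumbnailProperty, the property keys "thumbnail:data" and "thumbnail:mediaType", and the 64 KiB limit are not existing Tika metadata names or agreed values.

```java
import java.util.Base64;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, for illustration only; not part of Tika.
class ThumbnailProperty {

    // Assumed upper bound on the raw thumbnail size; the thread leaves the
    // actual limit open.
    static final int MAX_THUMBNAIL_BYTES = 64 * 1024;

    // Returns the properties to store for a thumbnail, or an empty map when
    // the image is missing or larger than the limit.
    static Map<String, String> toProperties(byte[] image, String mediaType) {
        Map<String, String> props = new LinkedHashMap<>();
        if (image == null || image.length == 0 || image.length > MAX_THUMBNAIL_BYTES) {
            return props;
        }
        // Base64 keeps the binary image representable as a string value.
        props.put("thumbnail:data", Base64.getEncoder().encodeToString(image));
        // The media type tells a consumer how to decode the bytes.
        props.put("thumbnail:mediaType", mediaType);
        return props;
    }

    public static void main(String[] args) {
        Map<String, String> props = toProperties(new byte[] {1, 2, 3}, "image/jpeg");
        System.out.println(props);
    }
}
```

Base64 inflates the payload by about a third, so a limit on the raw byte count also bounds the size of the stored string.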
Extract thumbnail from openxml office files
Hi all,

I want to extract the thumbnail image included in Open XML office files. Apparently, we can do it with openxml4j:
http://openxmldeveloper.org/blog/b/openxmldeveloper/archive/2006/11/21/openxmlandjava.aspx

The question is: should we include the thumbnail in the default metadata list of the OOXML parsing result?

Thanks

Hong-Thai