[jira] [Comment Edited] (TIKA-1208) Migrate Any23 mime contributions to Tika

2014-01-08 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13866116#comment-13866116
 ] 

Lewis John McGibbney edited comment on TIKA-1208 at 1/9/14 12:39 AM:
-

OK Peter, let's work on this. I confirm that I am also getting compile errors. 
I'll push to your GitHub branch and we can take it from there. Thank you.


was (Author: lewismc):
OK Peter lets work on this

> Migrate Any23 mime contributions to Tika
> 
>
> Key: TIKA-1208
> URL: https://issues.apache.org/jira/browse/TIKA-1208
> Project: Tika
>  Issue Type: Sub-task
>  Components: mime
>Reporter: Lewis John McGibbney
> Fix For: 1.5
>
> Attachments: TIKA-1208.patch
>
>
> We begin with one of the most obvious areas in which there
> is overlap.
> In short, the appeal of this package is the addition of detection 
> for the following types:
>  - text/n3
>  - text/rdf+n3
>  - application/n3
>  - text/x-nquads
>  - text/rdf+nq
>  - text/nq
>  - application/nq
>  - text/turtle
>  - application/x-turtle
>  - application/turtle
>  - application/trix
>  
> Therefore, although both Tika and Any23 perform Mimetype-related
> tasks, there is a contribution to be made. This involves the transfer of
> code pertaining to pattern recognition, Mimetype XML definitions within 
> tika-mimetypes.xml and a Purifier implementation that removes any 
> blank characters at the start of a file that might 
> prevent its MIME Type detection.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Comment Edited] (TIKA-1208) Migrate Any23 mime contributions to Tika

2014-01-08 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13866080#comment-13866080
 ] 

Lewis John McGibbney edited comment on TIKA-1208 at 1/9/14 12:35 AM:
-

Hi [~p_ansell], I have been working on a patch for this issue... which I did 
not wish to push to Jira... however I've been taken off course by bugs in a 
Gora branch and I would like all of us (the Any23 team) to propose this... if 
possible.

I attach a patch for migrating the Any23 mime package to Tika which retains the 
Purifier concept of cleaning documents prior to them being processed for 
mime/mediaType detection. I've not touched the Tika API or the Detect API 
within this implementation as (I personally think) the code migration would be 
harder to complete if we attempted to change the well-known and well-designed 
'detect' and base 'Tika' APIs, e.g. by adding a Purifier parameter to 
method signatures.

This therefore means that if we are to retain the concept of the Purifier 
interface, then implementations are detector specific... right now all we can 
offer (from Any23) is the WhiteSpacePurifier, which is OK... but implementing 
the functionality in this manner is NOT configurable, e.g. if someone wished to 
pass a custom Purifier as a parameter to detect(InputStream, Metadata, 
Purifier). I personally think that if other Purifiers were to be introduced 
then we could revisit this issue and possibly propose a change to various Tika 
interfaces so that detectors can accept a Purifier as a parameter.
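
The Purifier idea above can be illustrated with a minimal sketch. Everything named here (StreamPurifier, LeadingWhitespacePurifier, PurifierSketch and its toy "@prefix" check) is hypothetical and chosen for illustration only; it is not the Any23 Purifier, the Tika Detector API, or the code in the attached patch. It only shows the general shape of stripping leading blanks before detection and of passing the purifier in as a parameter.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

/** Hypothetical interface: not the Any23 or Tika API, only an illustration. */
interface StreamPurifier {
    /** Advances a mark-capable stream past content that detection should ignore. */
    void purify(InputStream in) throws IOException;
}

/** Skips leading whitespace so detection starts at the first meaningful byte. */
final class LeadingWhitespacePurifier implements StreamPurifier {
    @Override
    public void purify(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        while (true) {
            in.mark(1);
            int b = in.read();
            if (b == -1) {
                return; // empty stream, or nothing but whitespace
            }
            // Assumes an ASCII-compatible encoding, so one byte is one blank character.
            if (!Character.isWhitespace(b)) {
                in.reset(); // un-read the first non-blank byte
                return;
            }
        }
    }
}

final class PurifierSketch {
    /** Toy stand-in for a real detector: only recognises a leading "@prefix". */
    static String detect(InputStream in, StreamPurifier purifier) throws IOException {
        purifier.purify(in);
        byte[] head = new byte[7];
        int filled = 0;
        while (filled < head.length) {
            int n = in.read(head, filled, head.length - filled);
            if (n == -1) {
                break;
            }
            filled += n;
        }
        String start = new String(head, 0, filled, StandardCharsets.US_ASCII);
        return start.equals("@prefix") ? "text/turtle" : "application/octet-stream";
    }

    /** Convenience wrapper for in-memory text; wraps I/O failures as unchecked. */
    static String detectText(String text) {
        try {
            return detect(new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8)),
                    new LeadingWhitespacePurifier());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(detectText("\n\t  @prefix ex: <http://example.org/> ."));
    }
}
```

Because the purifier is an argument rather than something hard-wired into the detector, a caller could swap in a different implementation, which is the configurability trade-off discussed above.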

Apart from that, this (WIP) patch introduces an Any23Detector which basically 
stems from the TikaMIMETypeDetector we maintained in Any23... please comment on 
this as I am not sure if this is the right way to proceed... there are most 
likely issues with the implementation I have coded.

THIS PATCH IS MERELY A START... I would really appreciate input from the Any23 
team to see if I am 'attempting' to implement the Any23 mime code in the 
correct way that we think is suitable for migration to tika-core.

It should also be noted that the last time I ran this patch with Tika trunk 
there were issues with detection of 'semantic' mime types. 

Hopefully this is a start which we can build from. I am committed to getting 
this code suitable for proposal to Tika.

N.B. This patch also addresses ALL of the Java elements that cause warnings 
across the entire codebase, so it looks like a lot more than it actually 
is.

Any comments are VERY much appreciated.   


[jira] [Commented] (TIKA-1208) Migrate Any23 mime contributions to Tika

2014-01-08 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13866116#comment-13866116
 ] 

Lewis John McGibbney commented on TIKA-1208:


OK Peter, let's work on this



[jira] [Comment Edited] (TIKA-1208) Migrate Any23 mime contributions to Tika

2014-01-08 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13866080#comment-13866080
 ] 

Lewis John McGibbney edited comment on TIKA-1208 at 1/9/14 12:28 AM:
-

Hi [~p_ansell], I have been working on a patch for this issue... which I did 
not wish to push to Jira... however I've been taken off course by bugs in a 
Gora branch.

I attach a patch for migrating the Any23 mime package to Tika which retains the 
Purifier concept of cleaning documents prior to them being processed for 
mime/mediaType detection. I've not touched the Tika API or the Detect API 
within this implementation as (I personally think) the code migration would be 
harder to complete if we attempted to change the well-known and well-designed 
'detect' and base 'Tika' APIs.

This therefore means that Purifier implementations are detector specific... 
right now all we can offer is the WhiteSpacePurifier, which is OK... but it is 
NOT configurable, e.g. if someone wished to pass a Purifier as a parameter to 
detect(InputStream, Metadata, Purifier)... and I think that if other 
Purifiers were to be introduced then we could revisit this issue.

Apart from that, this (WIP) patch introduces an Any23Detector which basically 
stems from the Tika detector we maintained in Any23... please comment on this 
as I am not sure if this is the right way to proceed...

THIS PATCH IS MERELY A START... I need input from the Any23 team to see if I am 
'attempting' to implement the Any23 mime code in the correct way.

It should also be noted that the last time I ran this patch with Tika trunk 
there were issues with detection of 'semantic' mime types.

Hopefully this is a start which we can build from. I am committed to getting 
this code suitable for proposal to Tika.

N.B. This patch also addresses ALL of the Java elements that cause warnings 
across the entire codebase. 

Any comments are VERY much appreciated.




[jira] [Commented] (TIKA-1208) Migrate Any23 mime contributions to Tika

2014-01-08 Thread Peter Ansell (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13866101#comment-13866101
 ] 

Peter Ansell commented on TIKA-1208:


The patch applies cleanly to the current trunk but it doesn't compile:

[INFO] Compiling 40 source files to /home/ans025/gitrepos/tika/tika-core/target/test-classes
[INFO] -
[ERROR] COMPILATION ERROR : 
[INFO] -
[ERROR] /home/ans025/gitrepos/tika/tika-core/src/test/java/org/apache/tika/detect/Any23DetectorTest.java:[432,66] error: cannot find symbol
[ERROR]  class Any23DetectorTest
/home/ans025/gitrepos/tika/tika-core/src/test/java/org/apache/tika/detect/Any23DetectorTest.java:[448,37] error: cannot find symbol
[INFO] 2 errors 

I am not sure what the two broken lines should be changed to, as I am not 
familiar with the Tika codebase at this point.

I have put the patch on GitHub to work on it if that is easier for you (you are 
a collaborator on the repository):

https://github.com/ansell/tika/tree/TIKA-1208



[jira] [Updated] (TIKA-1208) Migrate Any23 mime contributions to Tika

2014-01-08 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated TIKA-1208:
---

Attachment: TIKA-1208.patch

Hi [~p_ansell], I have been working on a patch for this issue... which I did 
not wish to push to Jira... however I've been taken off course by bugs in a 
Gora branch.

I attach a patch for migrating the Any23 mime package to Tika which retains the 
Purifier concept of cleaning documents prior to them being processed for 
mime/mediaType detection. I've not touched the Tika API or the Detect API 
within this implementation as (I personally think) the code migration would be 
harder to complete if we attempted to change the well-known and well-designed 
'detect' and base 'Tika' APIs.

This therefore means that Purifier implementations are detector specific... 
right now all we can offer is the WhiteSpacePurifier, which is OK... but it is 
NOT configurable, e.g. if someone wished to pass a Purifier as a parameter to 
detect(InputStream, Metadata, Purifier)... and I think that if other 
Purifiers were to be introduced then we could revisit this issue.

Apart from that, this (WIP) patch introduces an Any23Detector which basically 
stems from the Tika detector we maintained in Any23... please comment on this 
as I am not sure if this is the right way to proceed...

THIS PATCH IS MERELY A START... I need input from the Any23 team to see if I am 
'attempting' to implement the Any23 mime code in the correct way.

It should also be noted that the last time I ran this patch with Tika trunk 
there were issues with detection of 'semantic' mime types.

Hopefully this is a start which we can build from. I am committed to getting 
this code suitable for proposal to Tika.

Any comments are VERY much appreciated.



[jira] [Created] (TIKA-1217) Integrate with Java-7 FileTypeDetector API

2014-01-08 Thread Peter Ansell (JIRA)
Peter Ansell created TIKA-1217:
--

 Summary: Integrate with Java-7 FileTypeDetector API
 Key: TIKA-1217
 URL: https://issues.apache.org/jira/browse/TIKA-1217
 Project: Tika
  Issue Type: New Feature
  Components: detector, mime
Reporter: Peter Ansell


It would be useful if Tika natively provided Java-7 FileTypeDetector [1] 
implementations. Adding the corresponding 
META-INF/services/java.nio.file.spi.FileTypeDetector files would allow the use 
of Files.probeContentType [2] without any specific links to Tika for this 
functionality.

If you do not want to rely on Java-7 for the core, then this could be added as 
an extension module.
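
As a rough illustration of the proposal, the sketch below subclasses java.nio.file.spi.FileTypeDetector from the JDK. The class name SketchFileTypeDetector, the probeText helper and the toy "@prefix" content check are hypothetical stand-ins; a Tika-backed implementation would presumably delegate to Tika's own detection instead, which is not shown here because it is not part of this issue yet.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.spi.FileTypeDetector;

/**
 * Hypothetical detector used only to show the SPI shape.
 * To be discovered through a META-INF/services entry, a real implementation
 * would need to be a public class with a public no-argument constructor.
 */
class SketchFileTypeDetector extends FileTypeDetector {

    /** Returns a media type string, or null when the type is not recognised. */
    @Override
    public String probeContentType(Path path) throws IOException {
        if (!Files.isRegularFile(path)) {
            return null;
        }
        byte[] head = new byte[7];
        int filled = 0;
        try (InputStream in = Files.newInputStream(path)) {
            while (filled < head.length) {
                int n = in.read(head, filled, head.length - filled);
                if (n == -1) {
                    break;
                }
                filled += n;
            }
        }
        // Toy content check standing in for real detection logic.
        String start = new String(head, 0, filled, StandardCharsets.US_ASCII);
        return start.equals("@prefix") ? "text/turtle" : null;
    }

    /** Writes the text to a temporary file, probes it, then cleans up. */
    static String probeText(String content) {
        try {
            Path tmp = Files.createTempFile("sketch", ".dat");
            try {
                Files.write(tmp, content.getBytes(StandardCharsets.UTF_8));
                return new SketchFileTypeDetector().probeContentType(tmp);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(probeText("@prefix ex: <http://example.org/> ."));
    }
}
```

With a provider-configuration file named META-INF/services/java.nio.file.spi.FileTypeDetector listing the implementation's fully qualified class name, Files.probeContentType would consult the detector without the calling code referring to it directly.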

[1] 
http://docs.oracle.com/javase/7/docs/api/java/nio/file/spi/FileTypeDetector.html
[2] 
http://docs.oracle.com/javase/7/docs/api/java/nio/file/Files.html#probeContentType(java.nio.file.Path)






[jira] [Commented] (TIKA-1208) Migrate Any23 mime contributions to Tika

2014-01-08 Thread Peter Ansell (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13866050#comment-13866050
 ] 

Peter Ansell commented on TIKA-1208:


I don't think any new MIME types have been added since 2.7.5. Most of them were 
added in 2.7.0 but I think some of them may have been added in 2.7.1.

Any23 should be fine to bump to the current release 2.7.9, as we have not to my 
knowledge added any new interface methods in the patch releases that would 
complicate the bump.

2.8.0 will be a bit of a bump, as it is where we are updating to RDF-1.1, but 
it is still in alpha form.



Re: Extract thumbnail from openxml office files

2014-01-08 Thread Ray Gauss II
Hi Hong-Thai,

It’s certainly worth investigating.  Several other formats can have embedded 
thumbnails as well so we could implement a generic thumbnail property.

We could probably store as something like a Base64 encoded string, but we’d 
likely want to place limits on the size and may need a thumbnail internet media 
type field as well to assist in decoding.
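
A minimal sketch of that idea, assuming a plain string map stands in for the metadata object: the key names ("thumbnail:data", "thumbnail:mediaType") and the 64 KiB limit are made up for illustration and are not existing Tika properties.

```java
import java.util.Base64;
import java.util.LinkedHashMap;
import java.util.Map;

final class ThumbnailMetadataSketch {
    // Hypothetical limit and key names; a real design would settle these in JIRA.
    static final int MAX_THUMBNAIL_BYTES = 64 * 1024;
    static final String THUMBNAIL_DATA = "thumbnail:data";
    static final String THUMBNAIL_TYPE = "thumbnail:mediaType";

    /** Stores the thumbnail as Base64 plus its media type; rejects empty or oversized images. */
    static boolean putThumbnail(Map<String, String> metadata, byte[] image, String mediaType) {
        if (image == null || image.length == 0 || image.length > MAX_THUMBNAIL_BYTES) {
            return false;
        }
        metadata.put(THUMBNAIL_DATA, Base64.getEncoder().encodeToString(image));
        metadata.put(THUMBNAIL_TYPE, mediaType);
        return true;
    }

    /** Decodes the stored thumbnail, or returns null when none is present. */
    static byte[] readThumbnail(Map<String, String> metadata) {
        String encoded = metadata.get(THUMBNAIL_DATA);
        return encoded == null ? null : Base64.getDecoder().decode(encoded);
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new LinkedHashMap<>();
        boolean stored = putThumbnail(metadata, new byte[] {1, 2, 3}, "image/jpeg");
        System.out.println(stored + " " + metadata);
    }
}
```

Keeping the media type next to the encoded bytes lets a consumer decode the thumbnail without sniffing the image format.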

Unless others feel differently, I would say open a JIRA where we could start 
discussing the design of such a feature.

Thanks!

Ray


On January 8, 2014 at 5:36:32 AM, Hong-Thai Nguyen 
(hong-thai.ngu...@polyspot.com) wrote:
>  
> Hi all,
> I want to extract the thumbnail image embedded in Open XML office  
> files. Apparently, this can be done with openxml4j: 
> http://openxmldeveloper.org/blog/b/openxmldeveloper/archive/2006/11/21/openxmlandjava.aspx
>   
> The question is: should we include the thumbnail in the default metadata  
> list of the OOXML parsing result?
>  
>  
> Thanks
>  
> Hong-Thai
>  
>  



Extract thumbnail from openxml office files

2014-01-08 Thread Hong-Thai Nguyen
Hi all,
I want to extract the thumbnail image embedded in Open XML office files. 
Apparently, this can be done with openxml4j: 
http://openxmldeveloper.org/blog/b/openxmldeveloper/archive/2006/11/21/openxmlandjava.aspx
The question is: should we include the thumbnail in the default metadata list of 
the OOXML parsing result?


Thanks

Hong-Thai