[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata

2015-08-06 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14660283#comment-14660283
 ] 

Chris A. Mattmann commented on TIKA-1607:
-

I'm confused about Ray's Tika FFMPEG that you're talking about. Also, [~gostep], 
the stuff being talked about here (how to handle properties and typed values, 
names, etc.) is, I think, precisely what you were trying to get at with your 
proposal, so you should probably comment here. 

Good job on actually producing code for this, [~talli...@apache.org]. I'd like to 
take a look at it more before commenting further. One thing I know too is that 
the [OODT Metadata 
Object|https://oodt.jpl.nasa.gov/jira/si/jira.issueviews:issue-html/OODT-303/OODT-303.html]
 discussion that we had internally at JPL a long time ago is EXTREMELY similar 
to this one and it should be considered. I pointed [~lewismc] at this during 
the initial discussion.

 Introduce new arbitrary object key/values data structure for persistence of 
 Tika Metadata
 -

 Key: TIKA-1607
 URL: https://issues.apache.org/jira/browse/TIKA-1607
 Project: Tika
  Issue Type: Improvement
  Components: core, metadata
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
Priority: Critical
 Fix For: 1.10

 Attachments: TIKA-1607v1_rough_rough.patch, 
 TIKA-1607v2_rough_rough.patch, TIKA-1607v3.patch


 I am currently working on implementing more comprehensive extraction and 
 enhancement of the Tika support for phone number extraction and metadata 
 modeling.
 Right now we utilize the String[] multivalued support available within Tika 
 to persist phone numbers as 
 {code}
 Metadata: String: String[]
 Metadata: phonenumbers: number1, number2, number3, ...
 {code}
 I would like to propose we extend multi-valued support outside of the 
 String[] paradigm by implementing a more abstract Collection of Objects such 
 that we could consider and implement the phone number use case as follows
 {code}
 Metadata: String:  Object
 {code}
 Where Object could be a Collection<HashMap<String/Property, 
 HashMap<String/Property, String/Int/Long>>> e.g.
 {code}
 Metadata: phonenumbers: [(+162648743476: (LibPN-CountryCode : US), 
 (LibPN-NumberType: International), (etc: etc)...), (+1292611054: 
 (LibPN-CountryCode : UK), (LibPN-NumberType: International), (etc: etc)...), 
 (etc)] 
 {code}
 There are obvious backwards compatibility issues with this approach. 
 Additionally, it is a fundamental change to the core Metadata API. I hope, 
 however, that the String-to-Object mapping is flexible enough to allow me to 
 model Tika Metadata the way I want.
 Any comments folks? Thanks
 Lewis
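
 For illustration, the String: Object mapping proposed above might look like
 the following Java sketch. It is not taken from the attached patches: the
 class ObjectMetadataSketch and its set/get/phoneNumber helpers are
 hypothetical names, and plain java.util collections stand in for Tika's
 Metadata class. The phone numbers and sub-property keys are copied from the
 example above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a String -> Object metadata store; not Tika's Metadata class.
final class ObjectMetadataSketch {
    private final Map<String, Object> values = new LinkedHashMap<>();

    void set(String key, Object value) {
        values.put(key, value);
    }

    Object get(String key) {
        return values.get(key);
    }

    // Builds one entry of the form: number -> (sub-property -> value).
    static Map<String, Map<String, String>> phoneNumber(
            String number, String countryCode, String numberType) {
        Map<String, String> props = new LinkedHashMap<>();
        props.put("LibPN-CountryCode", countryCode);
        props.put("LibPN-NumberType", numberType);
        Map<String, Map<String, String>> entry = new LinkedHashMap<>();
        entry.put(number, props);
        return entry;
    }

    public static void main(String[] args) {
        List<Map<String, Map<String, String>>> numbers = new ArrayList<>();
        numbers.add(phoneNumber("+162648743476", "US", "International"));
        numbers.add(phoneNumber("+1292611054", "UK", "International"));

        ObjectMetadataSketch metadata = new ObjectMetadataSketch();
        metadata.set("phonenumbers", numbers);
        // Readers get back a bare Object and must know what to cast it to.
        System.out.println(metadata.get("phonenumbers"));
    }
}
```

 Storing bare Object values keeps the map flexible, but every reader then has
 to cast, which is one face of the backwards compatibility concern noted above.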



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: 1.10 release missing license headers noted by Daniel Gruno

2015-08-06 Thread Mattmann, Chris A (3980)
I think we may have exclusions here since they are test resources?

Not sure. Will check. Also thinking of upgrading to DRAT (instead of
RAT):

http://github.com/chrismattmann/drat/

See all the prezos, etc., there for why.

Cheers,
Chris

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++





-----Original Message-----
From: Nick Burch <apa...@gagravarr.org>
Reply-To: "dev@tika.apache.org" <dev@tika.apache.org>
Date: Thursday, August 6, 2015 at 12:43 PM
To: "dev@tika.apache.org" <dev@tika.apache.org>
Cc: Daniel Gruno <humbed...@apache.org>
Subject: Re: 1.10 release missing license headers noted by Daniel Gruno

On Thu, 6 Aug 2015, Mattmann, Chris A (3980) wrote:
> From Twitter:
> https://paste.apache.org/1CPH
>
> Don’t have to fix now, but would be good to fix for 1.11.

Don't we have Apache Creadur (formerly Rat) set up on the build? If so, how
did it pass? If not, can someone turn it on ASAP? :)

Nick



Re: 1.10 release missing license headers noted by Daniel Gruno

2015-08-06 Thread Nick Burch

On Thu, 6 Aug 2015, Mattmann, Chris A (3980) wrote:

> I think we may have exclusions here since they are test resources?

The tika-parsers/src/test/resources/test-documents/ shouldn't have headers 
at all, and the txt.Charset ones are taken from Icu4j so have their 
original license header on them, but the rest should do.

> Not sure. Will check. Also thinking of upgrading to DRAT (instead of
> RAT):
>
> http://github.com/chrismattmann/drat/

Can't you just get voted onto the Creadur PMC then get that pushed 
upstream? :)


Nick


Re: 1.10 release missing license headers noted by Daniel Gruno

2015-08-06 Thread Mattmann, Chris A (3980)
-----Original Message-----

From: Nick Burch <apa...@gagravarr.org>
Reply-To: "dev@tika.apache.org" <dev@tika.apache.org>
Date: Thursday, August 6, 2015 at 2:25 PM
To: "dev@tika.apache.org" <dev@tika.apache.org>
Cc: Daniel Gruno <humbed...@apache.org>
Subject: Re: 1.10 release missing license headers noted by Daniel Gruno


>> Not sure. Will check. Also thinking of upgrading to DRAT (instead of
>> RAT):
>>
>> http://github.com/chrismattmann/drat/
>
> Can't you just get voted onto the Creadur PMC then get that pushed
> upstream? :)

I thought about that - or perhaps going the Incubator route.
I had a decent community around this that didn’t include the Creadur
folks.

We’ll see :)

Cheers,
Chris




1.10 release missing license headers noted by Daniel Gruno

2015-08-06 Thread Mattmann, Chris A (3980)
From Twitter:
https://paste.apache.org/1CPH


Don’t have to fix now, but would be good to fix for 1.11.

Cheers,
Chris

P.S. Thanks for the catch Daniel!






Re: 1.10 release missing license headers noted by Daniel Gruno

2015-08-06 Thread Tom Barber
It works well as well, doesn't it? I mean: "hey guys, we're missing licence
headers". Normally I'd probably reply with an expletive; now I can just
reply "oh DRAT".

I'll get my coat.
On 6 Aug 2015 22:44, Mattmann, Chris A (3980) 
chris.a.mattm...@jpl.nasa.gov wrote:

 -Original Message-

 From: Nick Burch apa...@gagravarr.org
 Reply-To: dev@tika.apache.org dev@tika.apache.org
 Date: Thursday, August 6, 2015 at 2:25 PM
 To: dev@tika.apache.org dev@tika.apache.org
 Cc: Daniel Gruno humbed...@apache.org
 Subject: Re: 1.10 release missing license headers noted by Daniel Gruno

 
  Not sure. Will check. Also thinking of upgrading to DRAT (instead of
  RAT):
 
  http://github.com/chrismattmann/drat/
 
 Can't you just get voted onto the Creadur PMC then get that pushed
 upstream? :)

 I thought about that - or perhaps going the Incubator route.
 I had a decent community around this that didn’t include the Creadur
 folks.

 We’ll see :)

 Cheers,
 Chris





[jira] [Created] (TIKA-1704) Update tika documentation for configuring ServiceLoader

2015-08-06 Thread Bob Paulin (JIRA)
Bob Paulin created TIKA-1704:


 Summary: Update tika documentation for configuring ServiceLoader
 Key: TIKA-1704
 URL: https://issues.apache.org/jira/browse/TIKA-1704
 Project: Tika
  Issue Type: Improvement
  Components: documentation
Affects Versions: 1.11
Reporter: Bob Paulin
Priority: Minor


Update documentation to account for changes in TIKA-1700





[jira] [Updated] (TIKA-1704) Update tika documentation for configuring ServiceLoader

2015-08-06 Thread Bob Paulin (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bob Paulin updated TIKA-1704:
-
Attachment: TIKA-1704-DOCS.patch



[jira] [Commented] (TIKA-1704) Update tika documentation for configuring ServiceLoader

2015-08-06 Thread Bob Paulin (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14661285#comment-14661285
 ] 

Bob Paulin commented on TIKA-1704:
--

[~gagravarr] Not sure if the format here is completely correct.  Let me know if 
you have any feedback.



[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata

2015-08-06 Thread David Smiley (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14661353#comment-14661353
 ] 

David Smiley commented on TIKA-1607:


TIKA isn't my area of expertise, but I think it should try and expose metadata 
using types that don't require dependencies, except for perhaps XML DOM or 
whatever JSON's DOM equivalent is (I don't think there is one in the JDK).  WKT 
strings could make sense as a spatial type specifically; for simple points I 
wouldn't though.
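
A WKT string can be produced with nothing beyond the JDK, which is the appeal 
of the suggestion above. The following is only a sketch: the WktSketch class 
and its polygon method are hypothetical names, and coordinates are assumed to 
be given as x y pairs (longitude, then latitude).

```java
import java.util.Arrays;

// Hypothetical helper that renders a closed ring of points as a WKT POLYGON string.
final class WktSketch {
    static String polygon(double[][] ring) {
        // A WKT polygon ring repeats its first point as its last point.
        if (ring.length < 4 || !Arrays.equals(ring[0], ring[ring.length - 1])) {
            throw new IllegalArgumentException("ring must be closed and have at least 4 points");
        }
        StringBuilder sb = new StringBuilder("POLYGON ((");
        for (int i = 0; i < ring.length; i++) {
            if (i > 0) {
                sb.append(", ");
            }
            sb.append(ring[i][0]).append(' ').append(ring[i][1]);
        }
        return sb.append("))").toString();
    }

    public static void main(String[] args) {
        double[][] ring = {{30, 10}, {40, 40}, {20, 40}, {30, 10}};
        // prints: POLYGON ((30.0 10.0, 40.0 40.0, 20.0 40.0, 30.0 10.0))
        System.out.println(polygon(ring));
    }
}
```

The resulting value is an ordinary String, so it would fit the existing 
metadata model without new dependencies.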



Re: 1.10 release missing license headers noted by Daniel Gruno

2015-08-06 Thread Nick Burch

On Thu, 6 Aug 2015, Tom Barber wrote:

> It works well as well doesn't it I mean hey guys we're missing licence
> headers normally I'd probably reply with an expletive, now I can just
> reply oh DRAT.
>
> I'll get my coat.

Isn't it Chris saying oh DRAT?

If Apache Creadur, an ASF community project, finds a problem, then the 
community has to fix it.

If Chris's DRAT finds it, then Chris has to fix it.

Right? ;-)

Nick


Re: 1.10 release missing license headers noted by Daniel Gruno

2015-08-06 Thread Mattmann, Chris A (3980)
Chris, Tyler Palsulich, Lewis John McGibbney, Mike Joyce, and
I think a few others :-) I have a postdoc, Ji-Hyun working on
it right now too :-)






-----Original Message-----
From: Nick Burch <apa...@gagravarr.org>
Reply-To: "dev@tika.apache.org" <dev@tika.apache.org>
Date: Thursday, August 6, 2015 at 3:37 PM
To: "dev@tika.apache.org" <dev@tika.apache.org>
Cc: Daniel Gruno <humbed...@apache.org>
Subject: Re: 1.10 release missing license headers noted by Daniel Gruno

On Thu, 6 Aug 2015, Tom Barber wrote:
> It works well as well doesn't it I mean hey guys we're missing licence
> headers normally I'd probably reply with an expletive, now I can just
> reply oh DRAT.
>
> I'll get my coat.

Isn't it Chris saying oh DRAT?

Apache Creadur finds a problem, an ASF community project, then the
community has to fix it

If Chris's DRAT finds it, then Chris has to fix it

Right? ;-)

Nick



[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata

2015-08-06 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14660304#comment-14660304
 ] 

Tim Allison commented on TIKA-1607:
---

[~chrismattmann], any and all feedback would be great.  The link you sent 
requires a nasa login.  I'm not a rocket scientist, no luck. :( :)




[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata

2015-08-06 Thread Ray Gauss II (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14660441#comment-14660441
 ] 

Ray Gauss II commented on TIKA-1607:


To clarify, the work mentioned above that uses an XPath-like syntax is only a 
workaround for mapping structured metadata into the current 'flat' metadata 
model in Tika.

I fully support moving towards a structured metadata store in a 2.0 timeframe. 
(maybe that's now?)

This is simply restating some of what's already been said, but there are many 
aspects to consider during that refactoring:
* Moving towards properly namespacing metadata (even if, for now, our 
serialization of it only contains a prefix)
* Backwards compatibility for simple string key/values
* Enabling easy serialization to XML and JSON
* Enabling easy discovery of at least top level elements
* Lightweight dependencies in tika-core
* Possible representation of binary data
* Not re-inventing the wheel

Given the above, perhaps we'd want to consider using Java DOM 
({{org.w3c.dom.*}}) classes programmatically as a metadata store, appending and 
getting child nodes, etc. rather than hard coding POJOs for each metadata 
standard we want to support.

I'll try to find some time to put together an example patch for that approach 
in the next few days.
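
The org.w3c.dom approach suggested above could be sketched roughly as follows. 
This is only an illustration using the JDK's built-in DOM classes; the 
namespace URI, the pn prefix, the element and attribute names, and the 
DomMetadataSketch class are all assumptions and are not part of any Tika patch.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical example of using a DOM Document as a structured metadata store.
final class DomMetadataSketch {
    // Placeholder namespace; not an agreed Tika schema.
    static final String NS = "http://example.org/tika/phone";

    static Document build() {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder().newDocument();
            Element root = doc.createElementNS(NS, "pn:phonenumbers");
            doc.appendChild(root);
            // Structured values are appended as child nodes rather than hard-coded POJOs.
            root.appendChild(number(doc, "+162648743476", "US"));
            root.appendChild(number(doc, "+1292611054", "UK"));
            return doc;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("no DOM implementation available", e);
        }
    }

    static Element number(Document doc, String value, String countryCode) {
        Element el = doc.createElementNS(NS, "pn:number");
        el.setAttribute("countryCode", countryCode);
        el.setTextContent(value);
        return el;
    }

    public static void main(String[] args) {
        Document doc = build();
        // Discovery of elements by namespace and local name.
        NodeList numbers = doc.getElementsByTagNameNS(NS, "number");
        for (int i = 0; i < numbers.getLength(); i++) {
            Element el = (Element) numbers.item(i);
            System.out.println(el.getTextContent() + " " + el.getAttribute("countryCode"));
        }
    }
}
```

Because the DOM classes ship with the JDK, a store like this would add no new 
dependency to tika-core, and serialization to XML comes almost for free; JSON 
output would still need a mapping layer.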



[jira] [Commented] (TIKA-1678) PDF metadata extraction fails to spot UTF-16 encoded title

2015-08-06 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14660623#comment-14660623
 ] 

Tim Allison commented on TIKA-1678:
---

I found vaguely similar numbers against govdocs1+slice of Common Crawl: 1293 
out of 500k had \376\377 starting the title field and 14 files had another 
PDFEncoding encoding in the title field without the BOM.
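
As a rough sketch of the check described above: a PDF text string that starts 
with the bytes \376\377 (0xFE 0xFF) carries a UTF-16BE byte order mark and can 
be decoded accordingly. The class and method names below are hypothetical, and 
ISO-8859-1 is used only as a stand-in for PDFDocEncoding, which differs from it 
for some code points. This is not the code that Tika or PDFBox uses.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical decoder for the raw bytes of a PDF text string.
final class PdfTextStringSketch {
    // True if the bytes start with the UTF-16BE byte order mark 0xFE 0xFF (octal \376\377).
    static boolean hasUtf16BeBom(byte[] raw) {
        return raw.length >= 2 && (raw[0] & 0xFF) == 0xFE && (raw[1] & 0xFF) == 0xFF;
    }

    static String decode(byte[] raw) {
        if (hasUtf16BeBom(raw)) {
            // Skip the two BOM bytes and decode the rest as big-endian UTF-16.
            return new String(raw, 2, raw.length - 2, StandardCharsets.UTF_16BE);
        }
        // Assumption: ISO-8859-1 approximates PDFDocEncoding for this sketch.
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // \376\377\000T\000e\000t\000t\000i, the author value from the issue
        byte[] author = {(byte) 0xFE, (byte) 0xFF, 0, 'T', 0, 'e', 0, 't', 0, 't', 0, 'i'};
        System.out.println(decode(author)); // prints: Tetti
    }
}
```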

Thank you, again for raising this!

 PDF metadata extraction fails to spot UTF-16 encoded title
 --

 Key: TIKA-1678
 URL: https://issues.apache.org/jira/browse/TIKA-1678
 Project: Tika
  Issue Type: Bug
  Components: metadata
Affects Versions: 1.9
Reporter: Andrew Jackson
Priority: Minor
 Fix For: 1.10


 When extracting metadata from PDFs, we see some odd behaviour for a minority 
 of the documents. The PDF metadata can be encoded as UTF-16 octets, but is 
 not always being decoded as such.
 A specific example is here: 
 http://mqug.org.uk/downloads/201207/201207%20-%20TEC02%20-%20Introduction%20to%20Worklight.pdf
 Which contains this (literal file content):
 {noformat}
 443 0 obj
 <</Type/Metadata
 /Subtype/XML/Length 1978>>stream
 <?xpacket begin='' id='W5M0MpCehiHzreSzNTczkc9d'?>
 <?adobe-xap-filters esc="CRLF"?>
 <x:xmpmeta xmlns:x='adobe:ns:meta/' x:xmptk='XMP toolkit 2.9.1-13, framework 
 1.6'>
 <rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#' 
 xmlns:iX='http://ns.adobe.com/iX/1.0/'>
 <rdf:Description rdf:about='ac9f232e-d341-11e1--ba905bfc4694' 
 xmlns:pdf='http://ns.adobe.com/pdf/1.3/' 
 pdf:Producer='\376\377\000B\000u\000l\000l\000z\000i\000p\000 
 \000P\000D\000F\000 \000P\000r\000i\000n\000t\000e\000r\000 \000/\000 
 \000w\000w\000w\000.\000b\000u\000l\000l\000z\000i\000p\000.\000c\000o\000m\000
  \000/\000 \000F\000r\000e\000e\000w\000a\000r\000e\000 
 \000E\000d\000i\000t\000i\000o\000n'/>
 <rdf:Description rdf:about='ac9f232e-d341-11e1--ba905bfc4694' 
 xmlns:xmp='http://ns.adobe.com/xap/1.0/'><xmp:ModifyDate>2012-07-18T15:38:01+01:00</xmp:ModifyDate>
 <xmp:CreateDate>2012-07-18T15:38:01+01:00</xmp:CreateDate>
 <xmp:CreatorTool>UnknownApplication</xmp:CreatorTool></rdf:Description>
 <rdf:Description rdf:about='ac9f232e-d341-11e1--ba905bfc4694' 
 xmlns:xapMM='http://ns.adobe.com/xap/1.0/mm/' 
 xapMM:DocumentID='ac9f232e-d341-11e1--ba905bfc4694'/>
 <rdf:Description rdf:about='ac9f232e-d341-11e1--ba905bfc4694' 
 xmlns:dc='http://purl.org/dc/elements/1.1/' 
 dc:format='application/pdf'><dc:title><rdf:Alt><rdf:li 
 xml:lang='x-default'>\376\377\000M\000i\000c\000r\000o\000s\000o\000f\000t\000
  \000P\000o\000w\000e\000r\000P\000o\000i\000n\000t\000 \000-\000 
 \000I\000n\000t\000r\000o\000d\000u\000c\000t\000i\000o\000n\000 
 \000t\000o\000 \000W\000o\000r\000k\000l\000i\000g\000h\000t\000 
 \000\(\000S\000R\000D\000\)\000.\000p\000p\000t\000x</rdf:li></rdf:Alt></dc:title><dc:creator><rdf:Seq><rdf:li>\376\377\000T\000e\000t\000t\000i</rdf:li></rdf:Seq></dc:creator></rdf:Description>
 </rdf:RDF>
 </x:xmpmeta>
 <?xpacket end='w'?>
 endstream
 endobj
 2 0 obj
 <</Producer(\376\377\000B\000u\000l\000l\000z\000i\000p\000 
 \000P\000D\000F\000 \000P\000r\000i\000n\000t\000e\000r\000 \000/\000 
 \000w\000w\000w\000.\000b\000u\000l\000l\000z\000i\000p\000.\000c\000o\000m\000
  \000/\000 \000F\000r\000e\000e\000w\000a\000r\000e\000 
 \000E\000d\000i\000t\000i\000o\000n)
 /CreationDate(D:20120718153801+01'00')
 /ModDate(D:20120718153801+01'00')
 /Title(\376\377\000M\000i\000c\000r\000o\000s\000o\000f\000t\000 
 \000P\000o\000w\000e\000r\000P\000o\000i\000n\000t\000 \000-\000 
 \000I\000n\000t\000r\000o\000d\000u\000c\000t\000i\000o\000n\000 
 \000t\000o\000 \000W\000o\000r\000k\000l\000i\000g\000h\000t\000 
 \000\(\000S\000R\000D\000\)\000.\000p\000p\000t\000x)
 /Author(\376\377\000T\000e\000t\000t\000i)>>endobj
 {noformat} 
 Presumably, embedding these UTF-16 octet sequences in the XMP RDF is an 
 error, but the ones encoded in the actual PDF metadata fields should be 
 extracted accurately.
 When extracted, we get:
 {noformat}
 ...
 dc:title: \376\377\000M\000i\000c\000r\000o\000s\000o\000f\000t\000 
 \000P\000o\000w\000e\000r\000P\000o\000i\000n\000t\000 \000-\000 
 \000I\000n\000t\000r\000o\000d\000u\000c\000t\000i\000o\000n\000 
 \000t\000o\000 \000W\000o\000r\000k\000l\000i\000g\000h\000t\000 
 \000\(\000S\000R\000D\000\)\000.\000p\000p\000t\000x
 title: \376\377\000M\000i\000c\000r\000o\000s\000o\000f\000t\000 
 \000P\000o\000w\000e\000r\000P\000o\000i\000n\000t\000 \000-\000 
 \000I\000n\000t\000r\000o\000d\000u\000c\000t\000i\000o\000n\000 
 \000t\000o\000 \000W\000o\000r\000k\000l\000i\000g\000h\000t\000 
 \000\(\000S\000R\000D\000\)\000.\000p\000p\000t\000x
 meta:author: \376\377\000T\000e\000t\000t\000i
 meta:author: Tetti
 ...
 {noformat}
 So, the author appears to be decoded correctly once, but the title is not. Is 
 

[jira] [Updated] (TIKA-1607) Introduce new arbitrary object key/values data structure for persitsence of Tika Metadata

2015-08-06 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1607:
--
Attachment: TIKA-1607v3.patch

This patch adds examples for a MultilingualValue and demo/hack examples of a 
phone number value (to meet the initial example) and a multimedia tracks 
example.  

I've fixed the Json serialization so that it can handle serialization of more 
complex objects, and MetadataValues are no longer required to parse their own 
string as part of serialization.

There is still some stink around requiring a string representation in the base 
class...perhaps move back to abstract class for the base MetadataValue and use 
a StringMetadataValue for those metadata values that can reasonably be 
represented by a single string.
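
One way to picture the abstract-base-class idea above is the following Java 
sketch. It is not the code in TIKA-1607v3.patch; the names MetadataValueSketch, 
StringValueSketch and MultilingualValueSketch, and the hand-rolled JSON output, 
are assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical base type: it does not require a single-string representation.
abstract class MetadataValueSketch {
    // Each value renders itself, so serialization never re-parses a string form.
    abstract String toJson();

    // Minimal JSON string quoting: escapes backslashes and double quotes only.
    static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }
}

// For values that can reasonably be represented by one string.
final class StringValueSketch extends MetadataValueSketch {
    private final String value;

    StringValueSketch(String value) {
        this.value = value;
    }

    String getString() {
        return value;
    }

    @Override
    String toJson() {
        return quote(value);
    }
}

// For values that hold one text per language and have no single string form.
final class MultilingualValueSketch extends MetadataValueSketch {
    private final Map<String, String> byLanguage = new LinkedHashMap<>();

    MultilingualValueSketch put(String language, String text) {
        byLanguage.put(language, text);
        return this;
    }

    @Override
    String toJson() {
        StringBuilder sb = new StringBuilder("{");
        for (Map.Entry<String, String> e : byLanguage.entrySet()) {
            if (sb.length() > 1) {
                sb.append(',');
            }
            sb.append(quote(e.getKey())).append(':').append(quote(e.getValue()));
        }
        return sb.append('}').toString();
    }
}

final class MetadataValueSketchDemo {
    public static void main(String[] args) {
        MetadataValueSketch title = new MultilingualValueSketch()
                .put("en", "Title")
                .put("fr", "Titre");
        System.out.println(title.toJson()); // prints: {"en":"Title","fr":"Titre"}
    }
}
```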

The phone number and mediatracks examples are purely for demo purposes.  We 
should integrate/translate [~rgauss]'s 
[tika-ffmpeg|https://github.com/AlfrescoLabs/tika-ffmpeg] properly later.



[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persitsence of Tika Metadata

2015-08-06 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14660093#comment-14660093
 ] 

Tim Allison commented on TIKA-1607:
---

Y, I agree that we should push the parsers to do as much as possible.  I think 
whether we push complexity into the values or into the properties, the users 
will still have to take the time to learn about the options.  

In favor of my proposal: the values have actual Java object values with 
primitives, etc.  The user/Metadata object is not responsible for converting 
those strings to actual Java values (e.g. getDate/getInt)...the knowledge for 
those underlying values is put into the values and the API for those values.  
We could have enums and other typed/checked objects.  

Y, the user has to learn what methods are available, but the user has to learn 
about the sub-properties of the properties, too.
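
To make the trade-off above concrete, here is a minimal sketch of a typed value 
whose fields are already Java types, so the caller does no string conversion. 
PhoneNumberValueSketch, its NumberType enum and its accessors are hypothetical 
and are not taken from the patch.

```java
import java.util.Locale;

// Hypothetical typed metadata value; not part of Tika.
final class PhoneNumberValueSketch {
    enum NumberType { INTERNATIONAL, NATIONAL, UNKNOWN }

    private final String number;
    private final String countryCode;
    private final NumberType type;

    PhoneNumberValueSketch(String number, String countryCode, NumberType type) {
        this.number = number;
        this.countryCode = countryCode;
        this.type = type;
    }

    String getNumber() {
        return number;
    }

    String getCountryCode() {
        return countryCode;
    }

    NumberType getType() {
        return type;
    }

    // The parser converts the raw label once; unrecognized labels map to UNKNOWN.
    static NumberType parseType(String raw) {
        try {
            return NumberType.valueOf(raw.trim().toUpperCase(Locale.ROOT));
        } catch (IllegalArgumentException e) {
            return NumberType.UNKNOWN;
        }
    }

    public static void main(String[] args) {
        PhoneNumberValueSketch value = new PhoneNumberValueSketch(
                "+162648743476", "US", parseType("International"));
        // Client code compares enum constants instead of matching strings.
        if (value.getType() == NumberType.INTERNATIONAL) {
            System.out.println(value.getNumber() + " is international ("
                    + value.getCountryCode() + ")");
        }
    }
}
```

The knowledge of what a valid number type is lives in the value class, which 
is the point made above; the cost is that users must learn one small API per 
value type.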


For the record, I really don't like the doubling up of responsibility for 
checking whether a given property can go with a given value in my proposal.  
And, the patch is still quite rough.

As you suggest, it would help to see what the client code would look like for 
the PhoneNumber, MultiLingual and MediaTrack examples.  Would there be a way to 
encode a geoshape?  What would that look like?



[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata

2015-08-06 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14660145#comment-14660145
 ] 

Tim Allison commented on TIKA-1607:
---

Doh! A related point: binary values. At some point I think Jukka(?) suggested 
putting thumbnails of an embedded document in that doc's metadata
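
A binary value such as a thumbnail could be carried as a byte array and 
Base64-encoded only when it is serialized. The sketch below assumes 
hypothetical names (BinaryValueSketch, mediaType, toJson) and uses 
java.util.Base64 from Java 8; it is not part of the patch.

```java
import java.util.Base64;

// Hypothetical binary metadata value, e.g. a thumbnail of an embedded document.
final class BinaryValueSketch {
    private final String mediaType;
    private final byte[] data;

    BinaryValueSketch(String mediaType, byte[] data) {
        this.mediaType = mediaType;
        this.data = data.clone(); // defensive copy so callers cannot mutate the value
    }

    byte[] getData() {
        return data.clone();
    }

    String toJson() {
        // Assumes mediaType contains no characters that need JSON escaping.
        return "{\"mediaType\":\"" + mediaType + "\",\"base64\":\""
                + Base64.getEncoder().encodeToString(data) + "\"}";
    }

    public static void main(String[] args) {
        BinaryValueSketch thumb = new BinaryValueSketch("image/png", new byte[] {1, 2, 3});
        System.out.println(thumb.toJson()); // prints: {"mediaType":"image/png","base64":"AQID"}
    }
}
```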



[jira] [Updated] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata

2015-08-06 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1607:
--
Summary: Introduce new arbitrary object key/values data structure for 
persistence of Tika Metadata  (was: Introduce new arbitrary object key/values 
data structure for persitsence of Tika Metadata)
