[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata
[ https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14660283#comment-14660283 ]

Chris A. Mattmann commented on TIKA-1607:
-----------------------------------------

I'm confused about Ray's Tika FFMPEG that you're talking about. Also, [~gostep], the stuff being talked about here (how to handle properties and typed values, names, etc.) is precisely what I think you were trying to get at with your proposal and so forth, so you should probably comment here.

Good job on actually producing code for this, [~talli...@apache.org]. I'd like to take a look at it more before commenting further. One thing I know too is that the [OODT Metadata Object|https://oodt.jpl.nasa.gov/jira/si/jira.issueviews:issue-html/OODT-303/OODT-303.html] discussion that we had internally at JPL a long time ago is EXTREMELY similar to this one, and it should be considered. I pointed [~lewismc] at this during the initial discussion.

> Introduce new arbitrary object key/values data structure for persistence of Tika Metadata
> -----------------------------------------------------------------------------------------
>
> Key: TIKA-1607
> URL: https://issues.apache.org/jira/browse/TIKA-1607
> Project: Tika
> Issue Type: Improvement
> Components: core, metadata
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Priority: Critical
> Fix For: 1.10
> Attachments: TIKA-1607v1_rough_rough.patch, TIKA-1607v2_rough_rough.patch, TIKA-1607v3.patch
>
> I am currently working on implementing more comprehensive extraction and enhancement of the Tika support for phone number extraction and metadata modeling.
> Right now we utilize the String[] multivalued support available within Tika to persist phone numbers as
> {code}
> Metadata: String: String[]
> Metadata: phonenumbers: number1, number2, number3, ...
> {code}
> I would like to propose we extend multi-valued support outside of the String[] paradigm by implementing a more abstract Collection of Objects such that we could consider and implement the phone number use case as follows
> {code}
> Metadata: String: Object
> {code}
> Where Object could be a Collection<HashMap<String/Property, HashMap<String/Property, String/Int/Long>>>, e.g.
> {code}
> Metadata: phonenumbers: [(+162648743476: (LibPN-CountryCode : US), (LibPN-NumberType: International), (etc: etc)...), (+1292611054: (LibPN-CountryCode : UK), (LibPN-NumberType: International), (etc: etc)...) (etc)]
> {code}
> There are obvious backwards compatibility issues with this approach... additionally it is a fundamental change to the core Metadata API. I hope that the String, Object mapping however is flexible enough to allow me to model Tika Metadata the way I want.
> Any comments folks?
> Thanks
> Lewis

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
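The String-to-Object mapping proposed in the issue description above could be sketched in Java roughly as follows. This is a minimal illustration only, not the attached patch and not part of the Tika API: the class `ObjectMetadataSketch` and its methods are hypothetical names, and the flattening rule used for backwards compatibility (return the map keys, i.e. the phone numbers, as plain strings) is an assumption.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a String -> Object metadata container (not the Tika API).
final class ObjectMetadataSketch {
    private final Map<String, Object> values = new LinkedHashMap<>();

    void set(String key, Object value) {
        values.put(key, value);
    }

    Object get(String key) {
        return values.get(key);
    }

    // Assumed backwards-compatible view: flatten whatever is stored to String[].
    // For a collection of maps, the map keys (e.g. the phone numbers) are returned.
    String[] getValues(String key) {
        Object value = values.get(key);
        List<String> flat = new ArrayList<>();
        if (value instanceof Collection) {
            for (Object item : (Collection<?>) value) {
                if (item instanceof Map) {
                    for (Object k : ((Map<?, ?>) item).keySet()) {
                        flat.add(String.valueOf(k));
                    }
                } else {
                    flat.add(String.valueOf(item));
                }
            }
        } else if (value != null) {
            flat.add(String.valueOf(value));
        }
        return flat.toArray(new String[0]);
    }

    // Builds one entry shaped like the example: number -> (property -> value).
    static Map<String, Object> phoneNumber(String number, String countryCode, String numberType) {
        Map<String, Object> properties = new LinkedHashMap<>();
        properties.put("LibPN-CountryCode", countryCode);
        properties.put("LibPN-NumberType", numberType);
        Map<String, Object> entry = new LinkedHashMap<>();
        entry.put(number, properties);
        return entry;
    }
}
```

Under these assumptions, a caller that only understands the old String[] model can still ask for `getValues("phonenumbers")` and receive the bare numbers, while newer code can call `get("phonenumbers")` and walk the nested maps.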
Re: 1.10 release missing license headers noted by Daniel Gruno
I think we may have exclusions here since they are test resources? Not sure. Will check.

Also thinking of upgrading to DRAT (instead of RAT): http://github.com/chrismattmann/drat/

See all the prezos, etc., there for why.

Cheers,
Chris

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++

-----Original Message-----
From: Nick Burch apa...@gagravarr.org
Reply-To: dev@tika.apache.org dev@tika.apache.org
Date: Thursday, August 6, 2015 at 12:43 PM
To: dev@tika.apache.org dev@tika.apache.org
Cc: Daniel Gruno humbed...@apache.org
Subject: Re: 1.10 release missing license headers noted by Daniel Gruno

> On Thu, 6 Aug 2015, Mattmann, Chris A (3980) wrote:
>> From Twitter: https://paste.apache.org/1CPH
>>
>> Don’t have to fix now, but would be good to fix for 1.11.
>
> Don't we have Apache Creadur (formerly Rat) setup on the build? If so, how did it pass? If not, can someone turn it on ASAP? :)
>
> Nick
Re: 1.10 release missing license headers noted by Daniel Gruno
On Thu, 6 Aug 2015, Mattmann, Chris A (3980) wrote:
> I think we may have exclusions here since they are test resources?

The tika-parsers/src/test/resources/test-documents/ files shouldn't have headers at all, and the txt.Charset ones are taken from Icu4j so have their original license header on them, but the rest should do.

> Not sure. Will check. Also thinking of upgrading to DRAT (instead of RAT): http://github.com/chrismattmann/drat/

Can't you just get voted onto the Creadur PMC then get that pushed upstream? :)

Nick
Re: 1.10 release missing license headers noted by Daniel Gruno
-----Original Message-----
From: Nick Burch apa...@gagravarr.org
Reply-To: dev@tika.apache.org dev@tika.apache.org
Date: Thursday, August 6, 2015 at 2:25 PM
To: dev@tika.apache.org dev@tika.apache.org
Cc: Daniel Gruno humbed...@apache.org
Subject: Re: 1.10 release missing license headers noted by Daniel Gruno

>> Not sure. Will check. Also thinking of upgrading to DRAT (instead of RAT): http://github.com/chrismattmann/drat/
>
> Can't you just get voted onto the Creadur PMC then get that pushed upstream? :)

I thought about that - or perhaps going the Incubator route. I had a decent community around this that didn’t include the Creadur folks. We’ll see :)

Cheers,
Chris
1.10 release missing license headers noted by Daniel Gruno
From Twitter: https://paste.apache.org/1CPH

Don’t have to fix now, but would be good to fix for 1.11.

Cheers,
Chris

P.S. Thanks for the catch Daniel!
Re: 1.10 release missing license headers noted by Daniel Gruno
It works well as well, doesn't it? I mean, "hey guys, we're missing licence headers": normally I'd probably reply with an expletive, now I can just reply "oh DRAT".

I'll get my coat.

On 6 Aug 2015 22:44, "Mattmann, Chris A (3980)" chris.a.mattm...@jpl.nasa.gov wrote:

>>> Not sure. Will check. Also thinking of upgrading to DRAT (instead of RAT): http://github.com/chrismattmann/drat/
>>
>> Can't you just get voted onto the Creadur PMC then get that pushed upstream? :)
>
> I thought about that - or perhaps going the Incubator route. I had a decent community around this that didn’t include the Creadur folks. We’ll see :)
>
> Cheers,
> Chris
[jira] [Created] (TIKA-1704) Update tika documentation for configuring ServiceLoader
Bob Paulin created TIKA-1704:
-----------------------------

Summary: Update tika documentation for configuring ServiceLoader
Key: TIKA-1704
URL: https://issues.apache.org/jira/browse/TIKA-1704
Project: Tika
Issue Type: Improvement
Components: documentation
Affects Versions: 1.11
Reporter: Bob Paulin
Priority: Minor

Update documentation to account for changes in TIKA-1700

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1704) Update tika documentation for configuring ServiceLoader
[ https://issues.apache.org/jira/browse/TIKA-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bob Paulin updated TIKA-1704:
-----------------------------
Attachment: TIKA-1704-DOCS.patch
[jira] [Commented] (TIKA-1704) Update tika documentation for configuring ServiceLoader
[ https://issues.apache.org/jira/browse/TIKA-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14661285#comment-14661285 ]

Bob Paulin commented on TIKA-1704:
----------------------------------

[~gagravarr] Not sure if the format here is completely correct. Let me know if you have any feedback.
[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata
[ https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14661353#comment-14661353 ]

David Smiley commented on TIKA-1607:
------------------------------------

TIKA isn't my area of expertise, but I think it should try and expose metadata using types that don't require dependencies, except for perhaps XML DOM or whatever JSON's DOM equivalent is (I don't think there is one in the JDK). WKT strings could make sense as a spatial type specifically; for simple points I wouldn't though.
Re: 1.10 release missing license headers noted by Daniel Gruno
On Thu, 6 Aug 2015, Tom Barber wrote:
> It works well as well, doesn't it? I mean, "hey guys, we're missing licence headers": normally I'd probably reply with an expletive, now I can just reply "oh DRAT".
>
> I'll get my coat.

Isn't it Chris saying oh DRAT?

If Apache Creadur, an ASF community project, finds a problem, then the community has to fix it.
If Chris's DRAT finds it, then Chris has to fix it.

Right? ;-)

Nick
Re: 1.10 release missing license headers noted by Daniel Gruno
Chris, Tyler Palsulich, Lewis John McGibbney, Mike Joyce, and I think a few others :-)

I have a postdoc, Ji-Hyun, working on it right now too :-)

-----Original Message-----
From: Nick Burch apa...@gagravarr.org
Reply-To: dev@tika.apache.org dev@tika.apache.org
Date: Thursday, August 6, 2015 at 3:37 PM
To: dev@tika.apache.org dev@tika.apache.org
Cc: Daniel Gruno humbed...@apache.org
Subject: Re: 1.10 release missing license headers noted by Daniel Gruno

> Isn't it Chris saying oh DRAT?
>
> Apache Creadur finds a problem, an ASF community project, then the community has to fix it
> If Chris's DRAT finds it, then Chris has to fix it
>
> Right? ;-)
>
> Nick
[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata
[ https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14660304#comment-14660304 ]

Tim Allison commented on TIKA-1607:
-----------------------------------

[~chrismattmann], any and all feedback would be great. The link you sent requires a NASA login. I'm not a rocket scientist, no luck. :( :)
[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata
[ https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14660441#comment-14660441 ]

Ray Gauss II commented on TIKA-1607:
------------------------------------

To clarify, the work mentioned above that uses an XPath-like syntax is only a workaround for mapping structured metadata into the current 'flat' metadata model in Tika.

I fully support moving towards a structured metadata store in a 2.0 timeframe (maybe that's now?).

This is simply restating some of what's already been said, but there are many aspects to consider during that refactoring:
* Moving towards properly namespacing metadata (even if, for now, our serialization of it only contains a prefix)
* Backwards compatibility for simple string key/values
* Enabling easy serialization to XML and JSON
* Enabling easy discovery of at least top level elements
* Lightweight dependencies in tika-core
* Possible representation of binary data
* Not re-inventing the wheel

Given the above, perhaps we'd want to consider using Java DOM ({{org.w3c.dom.*}}) classes programmatically as a metadata store, appending and getting child nodes, etc. rather than hard coding POJOs for each metadata standard we want to support.

I'll try to find some time to put together an example patch for that approach in the next few days.
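The idea in the comment above, using the JDK's `org.w3c.dom` classes programmatically as a metadata store, might look something like the following. This is only a rough sketch under assumptions and is not the patch Ray mentions: the class `DomMetadataSketch`, the `metadata` root element name and the `add`/`getValues` methods are all hypothetical. It uses only standard JDK DOM calls, so it adds no dependency to a core module.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical sketch: a namespaced metadata store backed by a JDK DOM document.
final class DomMetadataSketch {
    static final String DC_NS = "http://purl.org/dc/elements/1.1/";

    private final Document doc;
    private final Element root;

    DomMetadataSketch() {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            doc = factory.newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("No DOM implementation available", e);
        }
        root = doc.createElement("metadata"); // assumed root element name
        doc.appendChild(root);
    }

    // Appends a namespaced element with text content and returns it so that
    // callers can attach attributes or child nodes for structured values.
    Element add(String namespaceUri, String qualifiedName, String text) {
        Element element = doc.createElementNS(namespaceUri, qualifiedName);
        element.setTextContent(text);
        root.appendChild(element);
        return element;
    }

    // Simple string view for backwards compatibility: every value stored
    // under the given namespace and local name, in insertion order.
    String[] getValues(String namespaceUri, String localName) {
        NodeList nodes = root.getElementsByTagNameNS(namespaceUri, localName);
        String[] result = new String[nodes.getLength()];
        for (int i = 0; i < nodes.getLength(); i++) {
            result[i] = nodes.item(i).getTextContent();
        }
        return result;
    }
}
```

One attraction of this shape is that serialization to XML comes with the JDK, and the flat `getValues` view keeps simple string key/value callers working; whether it fits the other bullets (JSON, binary data) is left open here.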
[jira] [Commented] (TIKA-1678) PDF metadata extraction fails to spot UTF-16 encoded title
[ https://issues.apache.org/jira/browse/TIKA-1678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14660623#comment-14660623 ]

Tim Allison commented on TIKA-1678:
-----------------------------------

I found vaguely similar numbers against govdocs1+slice of Common Crawl: 1293 out of 500k had \376\377 starting the title field and 14 files had another PDFEncoding encoding in the title field without the BOM.

Thank you again for raising this!

> PDF metadata extraction fails to spot UTF-16 encoded title
> ----------------------------------------------------------
>
> Key: TIKA-1678
> URL: https://issues.apache.org/jira/browse/TIKA-1678
> Project: Tika
> Issue Type: Bug
> Components: metadata
> Affects Versions: 1.9
> Reporter: Andrew Jackson
> Priority: Minor
> Fix For: 1.10
>
> When extracting metadata from PDFs, we see some odd behaviour for a minority of the documents. The PDF metadata can be encoded as UTF-16 octets, but is not always being decoded as such.
> A specific example is here: http://mqug.org.uk/downloads/201207/201207%20-%20TEC02%20-%20Introduction%20to%20Worklight.pdf
> Which contains this (literal file content):
> {noformat}
> 443 0 obj
> <</Type/Metadata
> /Subtype/XML/Length 1978>>stream
> <?xpacket begin='' id='W5M0MpCehiHzreSzNTczkc9d'?>
> <?adobe-xap-filters esc="CRLF"?>
> <x:xmpmeta xmlns:x='adobe:ns:meta/' x:xmptk='XMP toolkit 2.9.1-13, framework 1.6'>
> <rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#' xmlns:iX='http://ns.adobe.com/iX/1.0/'>
> <rdf:Description rdf:about='ac9f232e-d341-11e1--ba905bfc4694' xmlns:pdf='http://ns.adobe.com/pdf/1.3/' pdf:Producer='\376\377\000B\000u\000l\000l\000z\000i\000p\000 \000P\000D\000F\000 \000P\000r\000i\000n\000t\000e\000r\000 \000/\000 \000w\000w\000w\000.\000b\000u\000l\000l\000z\000i\000p\000.\000c\000o\000m\000 \000/\000 \000F\000r\000e\000e\000w\000a\000r\000e\000 \000E\000d\000i\000t\000i\000o\000n'/>
> <rdf:Description rdf:about='ac9f232e-d341-11e1--ba905bfc4694' xmlns:xmp='http://ns.adobe.com/xap/1.0/'><xmp:ModifyDate>2012-07-18T15:38:01+01:00</xmp:ModifyDate>
> <xmp:CreateDate>2012-07-18T15:38:01+01:00</xmp:CreateDate>
> <xmp:CreatorTool>UnknownApplication</xmp:CreatorTool></rdf:Description>
> <rdf:Description rdf:about='ac9f232e-d341-11e1--ba905bfc4694' xmlns:xapMM='http://ns.adobe.com/xap/1.0/mm/' xapMM:DocumentID='ac9f232e-d341-11e1--ba905bfc4694'/>
> <rdf:Description rdf:about='ac9f232e-d341-11e1--ba905bfc4694' xmlns:dc='http://purl.org/dc/elements/1.1/' dc:format='application/pdf'><dc:title><rdf:Alt><rdf:li xml:lang='x-default'>\376\377\000M\000i\000c\000r\000o\000s\000o\000f\000t\000 \000P\000o\000w\000e\000r\000P\000o\000i\000n\000t\000 \000-\000 \000I\000n\000t\000r\000o\000d\000u\000c\000t\000i\000o\000n\000 \000t\000o\000 \000W\000o\000r\000k\000l\000i\000g\000h\000t\000 \000\(\000S\000R\000D\000\)\000.\000p\000p\000t\000x</rdf:li></rdf:Alt></dc:title><dc:creator><rdf:Seq><rdf:li>\376\377\000T\000e\000t\000t\000i</rdf:li></rdf:Seq></dc:creator></rdf:Description>
> </rdf:RDF>
> </x:xmpmeta>
> <?xpacket end='w'?>
> endstream
> endobj
> 2 0 obj
> <</Producer(\376\377\000B\000u\000l\000l\000z\000i\000p\000 \000P\000D\000F\000 \000P\000r\000i\000n\000t\000e\000r\000 \000/\000 \000w\000w\000w\000.\000b\000u\000l\000l\000z\000i\000p\000.\000c\000o\000m\000 \000/\000 \000F\000r\000e\000e\000w\000a\000r\000e\000 \000E\000d\000i\000t\000i\000o\000n)
> /CreationDate(D:20120718153801+01'00')
> /ModDate(D:20120718153801+01'00')
> /Title(\376\377\000M\000i\000c\000r\000o\000s\000o\000f\000t\000 \000P\000o\000w\000e\000r\000P\000o\000i\000n\000t\000 \000-\000 \000I\000n\000t\000r\000o\000d\000u\000c\000t\000i\000o\000n\000 \000t\000o\000 \000W\000o\000r\000k\000l\000i\000g\000h\000t\000 \000\(\000S\000R\000D\000\)\000.\000p\000p\000t\000x)
> /Author(\376\377\000T\000e\000t\000t\000i)>>endobj
> {noformat}
> Presumably, embedding these UTF-16 octet sequences in the XMP RDF is an error, but the ones encoded in the actual PDF metadata fields should be extracted accurately.
> When extracted, we get:
> {noformat}
> ...
> dc:title: \376\377\000M\000i\000c\000r\000o\000s\000o\000f\000t\000 \000P\000o\000w\000e\000r\000P\000o\000i\000n\000t\000 \000-\000 \000I\000n\000t\000r\000o\000d\000u\000c\000t\000i\000o\000n\000 \000t\000o\000 \000W\000o\000r\000k\000l\000i\000g\000h\000t\000 \000\(\000S\000R\000D\000\)\000.\000p\000p\000t\000x
> title: \376\377\000M\000i\000c\000r\000o\000s\000o\000f\000t\000 \000P\000o\000w\000e\000r\000P\000o\000i\000n\000t\000 \000-\000 \000I\000n\000t\000r\000o\000d\000u\000c\000t\000i\000o\000n\000 \000t\000o\000 \000W\000o\000r\000k\000l\000i\000g\000h\000t\000 \000\(\000S\000R\000D\000\)\000.\000p\000p\000t\000x
> meta:author: \376\377\000T\000e\000t\000t\000i
> meta:author: Tetti
> ...
> {noformat}
> So, the author appears to be decoded correctly once, but the title is not. Is
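For readers unfamiliar with the octal dumps above: `\376\377` is the byte pair 0xFE 0xFF, the UTF-16 big-endian byte order mark that a PDF text string uses to signal UTF-16BE content. A minimal decoding sketch follows. It is not Tika's or PDFBox's code; the class and method names are hypothetical, the unescaper handles only octal escapes and backslash-escaped literal characters such as `\(`, and ISO-8859-1 is used as a stand-in for PDFDocEncoding, which is an assumption (the two differ for some code points).

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of BOM-based decoding for PDF text strings.
final class PdfTextStringSketch {

    // Turns an ASCII dump such as "\376\377\000T" into raw bytes.
    // Simplification: only octal escapes and escaped literal characters are handled.
    static byte[] unescape(String literal) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < literal.length()) {
            char c = literal.charAt(i);
            if (c == '\\' && i + 1 < literal.length()) {
                char next = literal.charAt(i + 1);
                if (next >= '0' && next <= '7') {
                    int value = 0;
                    int digits = 0;
                    i++; // skip the backslash
                    while (digits < 3 && i < literal.length()
                            && literal.charAt(i) >= '0' && literal.charAt(i) <= '7') {
                        value = value * 8 + (literal.charAt(i) - '0');
                        i++;
                        digits++;
                    }
                    out.write(value & 0xFF);
                } else {
                    out.write(next); // e.g. \( or \) or \\
                    i += 2;
                }
            } else {
                out.write(c);
                i++;
            }
        }
        return out.toByteArray();
    }

    // Decodes as UTF-16BE when the FE FF byte order mark is present.
    static String decode(byte[] raw) {
        if (raw.length >= 2 && (raw[0] & 0xFF) == 0xFE && (raw[1] & 0xFF) == 0xFF) {
            return new String(raw, 2, raw.length - 2, StandardCharsets.UTF_16BE);
        }
        // Assumption: ISO-8859-1 approximates PDFDocEncoding for this sketch.
        return new String(raw, StandardCharsets.ISO_8859_1);
    }
}
```

Applied to the `/Author` value from the dump, `decode(unescape(...))` yields the readable name, which is the behaviour the report expects for the title field as well.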
[jira] [Updated] (TIKA-1607) Introduce new arbitrary object key/values data structure for persitsence of Tika Metadata
[ https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison updated TIKA-1607:
------------------------------
Attachment: TIKA-1607v3.patch

This patch adds examples for a MultilingualValue and demo/hack examples of a phone number value (to meet the initial example) and a multimedia tracks example.

I've fixed the JSON serialization so that it can handle serialization of more complex objects, and MetadataValues are no longer required to parse their own string as part of serialization.

There is still some stink around requiring a string representation in the base class... perhaps move back to an abstract class for the base MetadataValue and use a StringMetadataValue for those metadata values that can reasonably be represented by a single string.

The phone number and mediatracks examples are purely for demo purposes. We should integrate/translate [~rgauss]'s [tika-ffmpeg|https://github.com/AlfrescoLabs/tika-ffmpeg] properly later.
[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persitsence of Tika Metadata
[ https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14660093#comment-14660093 ]

Tim Allison commented on TIKA-1607:
-----------------------------------

Y, I agree that we should push the parsers to do as much as possible. I think whether we push complexity into the values or into the properties, the users will still have to take the time to learn about the options.

In favor of my proposal: the values have actual Java object values with primitives, etc. The user/Metadata object is not responsible for converting those strings to actual Java values (e.g. getDate/getInt)... the knowledge for those underlying values is put into the values and the API for those values. We could have enums and other typed/checked objects. Y, the user has to learn what methods are available, but the user has to learn about the sub-properties of the properties, too.

For the record, I really don't like the doubling up of responsibility for checking whether a given property can go with a given value in my proposal. And, the patch is still quite rough.

As you suggest, it would help to see what the client code would look like for the PhoneNumber, MultiLingual and MediaTrack examples.

Would there be a way to encode a geoshape? What would that look like?
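To make the trade-off in the comment above more concrete, here is a rough sketch of what typed metadata values could look like from the client side. None of this is taken from the attached patches: the `Sketch*` class names, the `asString()` method and the `NumberType` enum are hypothetical, chosen only to illustrate values that carry real Java types instead of strings that the caller must convert.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical base type: a string form is optional rather than required.
abstract class SketchMetadataValue {
    // Returns a single-string representation, or null when none is sensible.
    String asString() {
        return null;
    }
}

// A value that is naturally a single string.
final class SketchStringValue extends SketchMetadataValue {
    private final String value;

    SketchStringValue(String value) {
        this.value = value;
    }

    @Override
    String asString() {
        return value;
    }
}

// A phone number with typed sub-values instead of string sub-properties.
final class SketchPhoneNumberValue extends SketchMetadataValue {
    enum NumberType { INTERNATIONAL, NATIONAL, UNKNOWN }

    private final String number;
    private final String countryCode;
    private final NumberType type;

    SketchPhoneNumberValue(String number, String countryCode, NumberType type) {
        this.number = number;
        this.countryCode = countryCode;
        this.type = type;
    }

    String getCountryCode() {
        return countryCode;
    }

    NumberType getType() {
        return type;
    }

    @Override
    String asString() {
        return number;
    }
}

// Text available in several languages; no single string represents it.
final class SketchMultilingualValue extends SketchMetadataValue {
    private final Map<String, String> textByLanguage = new LinkedHashMap<>();

    SketchMultilingualValue put(String languageTag, String text) {
        textByLanguage.put(languageTag, text);
        return this;
    }

    String get(String languageTag) {
        return textByLanguage.get(languageTag);
    }
}
```

With a shape like this the client calls `getType()` and gets an enum rather than parsing a `LibPN-NumberType` string, at the cost of having to learn each value class's methods, which is the trade-off the comment describes.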
[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata
[ https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14660145#comment-14660145 ]

Tim Allison commented on TIKA-1607:
-----------------------------------

Doh! A related point: binary values. At some point I think Jukka(?) suggested putting thumbnails of an embedded document in that doc's metadata.
[jira] [Updated] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata
[ https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison updated TIKA-1607:
------------------------------
Summary: Introduce new arbitrary object key/values data structure for persistence of Tika Metadata (was: Introduce new arbitrary object key/values data structure for persitsence of Tika Metadata)