[ 
https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15148914#comment-15148914
 ] 

Pascal Essiembre commented on TIKA-1607:
----------------------------------------

In the case of XFA forms, the form IS the content. 

One issue I can see with not extracting XFA text as part of the content by 
default, the generic message put by PDF/XFA editors will be extracted as if it 
was legitimate content:

{noformat}
Please wait... 
  
If this message is not eventually replaced by the proper contents of the 
document, your PDF 
viewer may not be able to display this type of document. 
  
You can upgrade to the latest version of Adobe Reader for Windows®, Mac, or 
Linux® by 
visiting  http://www.adobe.com/go/reader_download. 
  
For more assistance with Adobe Reader visit  http://www.adobe.com/go/acrreader. 
  
Windows is either a registered trademark or a trademark of Microsoft 
Corporation in the United States and/or other countries. Mac is a trademark 
of Apple Inc., registered in the United States and other countries. Linux is 
the registered trademark of Linus Torvalds in the U.S. and other 
countries.
{noformat}

That message is an example inserted by PDF/XFA editors as a workaround for PDF 
viewers not supporting XFA and is not genuine content published by the author 
(the XFA forms are).

I'll support whichever way you pick, but I personally can't see use cases where 
extracting that workaround message is the intent when using Tika.  I do see 
value in keeping the entire DOM though.  Maybe you can do as you suggest, but 
"in addition" to returning the XFA text as the content?

> Introduce new arbitrary object key/values data structure for persistence of 
> Tika Metadata
> -----------------------------------------------------------------------------------------
>
>                 Key: TIKA-1607
>                 URL: https://issues.apache.org/jira/browse/TIKA-1607
>             Project: Tika
>          Issue Type: Improvement
>          Components: core, metadata
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>            Priority: Critical
>             Fix For: 1.13
>
>         Attachments: TIKA-1607v1_rough_rough.patch, 
> TIKA-1607v2_rough_rough.patch, TIKA-1607v3.patch
>
>
> I am currently working implementing more comprehensive extraction and 
> enhancement of the Tika support for Phone number extraction and metadata 
> modeling.
> Right now we utilize the String[] multivalued support available within Tika 
> to persist phone numbers as 
> {code}
> Metadata: String: String[]
> Metadata: phonenumbers: number1, number2, number3, ...
> {code}
> I would like to propose we extend multi-valued support outside of the 
> String[] paradigm by implementing a more abstract Collection of Objects such 
> that we could consider and implement the phone number use case as follows
> {code}
> Metadata: String:  Object
> {code}
> Where Object could be a Collection<HashMap<String/Property, 
> HashMap<String/Property, String/Int/Long>> e.g.
> {code}
> Metadata: phonenumbers: [(+162648743476: (LibPN-CountryCode : US), 
> (LibPN-NumberType: International), (etc: etc)...), (+1292611054: 
> LibPN-CountryCode : UK), (LibPN-NumberType: International), (etc: etc)...) 
> (etc)] 
> {code}
> There are obvious backwards compatibility issues with this approach... 
> additionally it is a fundamental change to the code Metadata API. I hope that 
> the <String, Object> Mapping however is flexible enough to allow me to model 
> Tika Metadata the way I want.
> Any comments folks? Thanks
> Lewis



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to