On the metadata stuff, I'm coming around to Ray Gauss's proposal.  I wanted too 
much back then, and his solution is super elegant, IIRC.
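
To make the collision handling discussed in the quoted thread below concrete, 
here is a rough sketch in plain Java. It is not Tika's API, and it is not 
necessarily what Ray's proposal or the wiki page describes: 
MetadataMergeSketch, CollisionPolicy and merge are hypothetical names, and 
plain maps stand in for Tika's Metadata object. It assumes the parser outputs 
are passed in precedence order.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch only: plain maps stand in for Tika's Metadata class.
final class MetadataMergeSketch {

    // What to do when two parsers emit the same key (names are assumptions).
    enum CollisionPolicy { FIRST_WINS, LAST_WINS, KEEP_ALL }

    // Merges per-parser metadata, given in precedence order, into one
    // multi-valued map according to the chosen collision policy.
    static Map<String, List<String>> merge(
            List<Map<String, List<String>>> parserOutputs,
            CollisionPolicy policy) {
        Map<String, List<String>> merged = new LinkedHashMap<>();
        for (Map<String, List<String>> output : parserOutputs) {
            for (Map.Entry<String, List<String>> entry : output.entrySet()) {
                String key = entry.getKey();
                List<String> existing = merged.get(key);
                if (existing == null) {
                    // No collision: copy the values in.
                    merged.put(key, new ArrayList<>(entry.getValue()));
                } else if (policy == CollisionPolicy.LAST_WINS) {
                    // Later parser overwrites the earlier one.
                    merged.put(key, new ArrayList<>(entry.getValue()));
                } else if (policy == CollisionPolicy.KEEP_ALL) {
                    // Multi-valued key: keep every value, in parser order.
                    existing.addAll(entry.getValue());
                }
                // FIRST_WINS: keep the existing values and ignore the new ones.
            }
        }
        return merged;
    }
}
```

With Tim's example, one parser giving dc:creator of Tim and a second giving 
dc:creator of Chris, FIRST_WINS keeps Tim, LAST_WINS keeps Chris, and KEEP_ALL 
keeps both values in parser order.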

-----Original Message-----
From: Nick Burch [mailto:apa...@gagravarr.org] 
Sent: Monday, February 5, 2018 11:37 AM
To: dev@tika.apache.org
Subject: Re: Not-yet-broken breaking changes for Tika 2?

Ping - anyone got any thoughts on the proposed metadata parser stuff, and any 
ideas on the content part?

On Tue, 2 Jan 2018, Nick Burch wrote:
> On Thu, 26 Oct 2017, Chris Mattmann wrote:
>> On collision, the precedence order defines what key takes precedence 
>> and _overwrites_ the other. Overwrite is but one option (you could 
>> save *all* the values; it’s a multi-valued key structure, so…)
>
> OK, I think that's fine. I've had a go at updating the wiki for the 
> metadata case:
> https://wiki.apache.org/tika/CompositeParserDiscussion#Supplementary.2FAdditive
> And example Tika Config settings for it:
> https://wiki.apache.org/tika/CompositeParserDiscussion#line-20
> If people are happy with how that sounds/looks, I can have a stab at 
> implementing it, as I *think* it's quite easy
>
>
> However... that still leaves the Content (XHTML SAX events) case to solve!
>
> Anyone have any ideas on how we can append to or cancel/reset the 
> Content Handler series of SAX events when we move onto a second+ parser for a 
> file?
>
> Thanks
> Nick
>
>> On 10/26/17, 9:43 AM, "Nick Burch" <apa...@gagravarr.org> wrote:
>>
>>    On Thu, 26 Oct 2017, Chris Mattmann wrote:
>>    > My general approach to conflicting metadata is simply to define
>>    > precedence orders.
>>    >
>>    > For example here is one documented from OODT:
>>    >
>>    >
>>    > https://cwiki.apache.org/confluence/display/OODT/Understanding+CAS-PGE+Metadata+Precendence
>>    >
>>    > We can do similar things with Tika, e.g.,
>>    >
>>    > [CoreMetadata.PROPERTIES]
>>    > [ImageParser.METADATA]
>>    > [TikaOCR.METADATA]
>>
>>    What happens if two different parsers both output the same bit of metadata
>>    though? eg Tim's example of one giving dc:creator of Tim and the second
>>    giving dc:creator of Chris?
>> 
>>
>>    Secondly, what about the XHTML sax events stream? I think that's probably
>>    the harder case...
>>
>>    Nick
