Re: [MarkLogic Dev General] MD5 - Hash Question

Sujith Mon, 21 Jul 2014 13:10:40 -0700

Many thanks for this.

We will take your input into consideration.


David, we ran this across a couple of million records in ML and Hadoop,
and things look good, but we will take a closer look.




On Mon, Jul 21, 2014 at 4:40 AM, Anthony Coates <anthony.coa...@db.com>
wrote:

> Classification: Public
>
> I like hashing as a lower-cost way of doing reconciliation, but that
> usually means that you are running the hashing algorithm at both ends and
> comparing the hashes, so running something through ML to normalise it
> probably isn't the best option.
>
> That said, you do need to design your approach carefully so that what you
> hash is identical at both ends.
>
> Canonicalisation is one approach.  I will sometimes run a document through
> an XSLT stylesheet, or something like that, to get it into a particular
> consistent format.  You need to test that it reliably produces the same
> output and hashes at both ends, but once you have that working, it's
> usually quicker and cheaper than shipping all of the data from one system
> to another in order to reconcile it.
>
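A minimal Python sketch of the "hash at both ends and compare" idea described above. The URIs, the sample payloads and the `reconcile` helper are hypothetical; each system is assumed to produce its own map of URI to digest, and only those maps are exchanged.

```python
import hashlib


def md5_hex(payload: bytes) -> str:
    """Return the hex MD5 digest of an already-serialised document."""
    return hashlib.md5(payload).hexdigest()


def reconcile(source: dict, target: dict) -> dict:
    """Compare two {uri: digest} maps and report where they disagree."""
    shared = source.keys() & target.keys()
    return {
        "missing_in_target": sorted(source.keys() - target.keys()),
        "extra_in_target": sorted(target.keys() - source.keys()),
        "mismatched": sorted(u for u in shared if source[u] != target[u]),
    }


# Hypothetical digests, as each system might report them.
marklogic_side = {
    "/a.xml": md5_hex(b"<a/>"),
    "/b.xml": md5_hex(b"<b>1</b>"),
    "/c.xml": md5_hex(b"<c/>"),
}
hadoop_side = {
    "/a.xml": md5_hex(b"<a/>"),
    "/b.xml": md5_hex(b"<b>2</b>"),
    "/d.xml": md5_hex(b"<d/>"),
}

report = reconcile(marklogic_side, hadoop_side)
print(report)
```

Only the digests travel between systems, which is where the cost saving comes from; the result is only meaningful if both ends hash exactly the same byte sequence.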
> Cheers, Tony.
>
>
> From: Joe Bryan <joe.br...@marklogic.com>
> To: MarkLogic Developer Discussion <general@developer.marklogic.com>
> Date: 18/07/2014 20:47
> Subject: Re: [MarkLogic Dev General] MD5 - Hash Question
> ------------------------------
>
>
>
> Hi David,
>
> Wouldn't it be possible to hash a consistent serialization (i.e., first
> load the documents into ML, then serialize, then hash the output)? I'm sure
> there are edge cases that can't be fully resolved, but it seems like a
> reasonably complete solution would be possible.
>
> Thanks.
>
> -jb
>
> From: David Lee <david....@marklogic.com>
> Reply-To: MarkLogic Developer Discussion <general@developer.marklogic.com>
> Date: Friday, July 18, 2014 at 1:02 PM
> To: MarkLogic Developer Discussion <general@developer.marklogic.com>
> Subject: Re: [MarkLogic Dev General] MD5 - Hash Question
>
> I'm glad you came to a solution, but if you count on it, you're counting
> on good luck, not fact.
> It might well work for the document instances you have tested today, and
> it may very well work for some future documents, but it is absolutely *not
> a valid test* for what you're attempting, and that is trivial to
> demonstrate experimentally.
>
> Create this XML file (yes, include the spaces):
> <foo                       a="b"/>
>
> Run an MD5.
> Put it on ML.
> Pull it off using any serialization options you want.
> Run an MD5.
> They won't match, and that's one of many cases that won't work.
> Here's another:
> <foo b="a" a="b"></foo>
> And another:
> <foo xmlns:a="a" xmlns:b="b" />
> The set of XML instances that won't survive a round trip is infinite.
> You can get *closer* if you canonicalize your documents ...
>
> http://www.w3.org/TR/xml-c14n
>
> But you'd have to get your source to do that, because it's not reversible.
>
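A small Python sketch of hashing a canonical form rather than the raw text. It assumes Python 3.8+, whose standard library `canonicalize` implements Canonical XML 2.0, which is a later version than the specification linked above; both ends would have to run the same canonicalizer for the digests to be comparable. The `canonical_md5` helper name is hypothetical.

```python
import hashlib
from xml.etree.ElementTree import canonicalize  # available from Python 3.8


def canonical_md5(xml_text: str) -> str:
    """Hash the C14N 2.0 form of a document instead of its raw text."""
    canonical = canonicalize(xml_text)
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()


# Two serialisations of the same element: different spacing inside the tag,
# attribute order, quote style and empty-element syntax.
variant_a = '<foo                       a="b" b="a"/>'
variant_b = "<foo b='a' a='b'></foo>"

raw_a = hashlib.md5(variant_a.encode("utf-8")).hexdigest()
raw_b = hashlib.md5(variant_b.encode("utf-8")).hexdigest()

print("raw hashes equal:      ", raw_a == raw_b)
print("canonical hashes equal:", canonical_md5(variant_a) == canonical_md5(variant_b))
```

Canonicalization is one-way: the original spacing and attribute order cannot be recovered from the canonical form, which is why it has to be applied on the source side as well.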
>
> It's up to you to decide whether that matters, and how severe the
> problem will be when it fails (which it will).
> But I feel compelled to say that using hashes in this way is not a valid
> method for detecting XML document identity or equality.
> Using it as such is at your own risk.
>
>
>
>
> From: general-boun...@developer.marklogic.com
> [mailto:general-boun...@developer.marklogic.com] On Behalf Of Sujith
> Sent: Friday, July 18, 2014 11:50 AM
> To: MarkLogic Developer Discussion
> Subject: Re: [MarkLogic Dev General] MD5 - Hash Question
>
> Thanks Lee / Ennis,
>
> Many thanks for the insight. We think we have the MD5 hash fixed.
>
> Here are the requirements (compare XML documents in MarkLogic vs. the
> Hadoop system):
>
> 1. We have constant updates to the files that are in MarkLogic.
> 2. In our shop, Hadoop is the central repository / data store that
> collects the data from all the systems that support the organization, and
> MarkLogic is one of the feeders.
> 3. We use mlcp to feed the updates / data to Hadoop; we have a
> lastUpdateDateTime element that is used to capture the updates and feed the
> incrementals.
> 4. For now, all the data is XML.
> 5. MarkLogic (the feeder to Hadoop) is the source of truth, and at a given
> point Hadoop wants to reconcile the data.
> 6. To achieve this we went with MD5 as the approach (the same way XQSync
> does).
> 7. When we provided hashes of the documents, the Hadoop team told us
> that they don't match the documents they have.
>
> After some more analysis, we figured out that when we set the
> <omit-xml-declaration> option to yes in the xdmp:quote call on
> fn:doc("/sample.xml"), the hashes match.
> When mlcp loads the data into Hadoop, the XML declaration is
> missing / omitted, and this difference between the source and the target
> is what gave us the hash mismatch.
>
> xdmp:md5(xdmp:quote(fn:doc("/sample.xml"),<options xmlns="xdmp:quote">
>       <omit-xml-declaration>yes</omit-xml-declaration>
>     </options>))
>
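To illustrate why the XML declaration alone is enough to change the digest, here is a short Python sketch. The sample document and the `md5_without_declaration` helper are hypothetical, and the regular expression is a simplification that only handles a declaration at the very start of the text.

```python
import hashlib
import re

# Matches a leading XML declaration plus any whitespace that follows it.
_XML_DECL = re.compile(r"^\s*<\?xml\s[^>]*\?>\s*")


def md5_without_declaration(xml_text: str) -> str:
    """Hash the UTF-8 bytes of a document after dropping its XML declaration."""
    body = _XML_DECL.sub("", xml_text, count=1)
    return hashlib.md5(body.encode("utf-8")).hexdigest()


body_only = "<doc><title>sample</title></doc>"
with_decl = '<?xml version="1.0" encoding="UTF-8"?>\n' + body_only

raw_with = hashlib.md5(with_decl.encode("utf-8")).hexdigest()
raw_without = hashlib.md5(body_only.encode("utf-8")).hexdigest()

print("raw hashes equal:", raw_with == raw_without)
print("after stripping: ", md5_without_declaration(with_decl) == raw_without)
```

This only removes one known source of difference; it does not make the comparison robust against the other serialization differences discussed elsewhere in the thread.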
>
> Many Thanks!!!
>
> On Fri, Jul 18, 2014 at 8:13 AM, David Lee <david....@marklogic.com> wrote:
> Another suggestion (which I have used in the past) is to compute the MD5 of
> the text document *before* sending it to ML, and store it as a property.
> Then, when a new document arrives, check the MD5 of the text document (on
> disk). If it matches what I stored, I know they are the same and skip it.
> If it doesn't match, I know something has changed (maybe irrelevant
> whitespace), so I update the doc.
> This works well for 'sync'-like tools, where it is useful to assert:
>
> A)     If the MD5 of the new *file* is different from the MD5 of the last
> *file*, then the document *MAY* be different.
>
> B)      If the MD5 of the new *file* is identical to the MD5 of the last
> *file*, then the document *MUST* be the same.
>
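The skip-or-update rule in (A) and (B) can be sketched in a few lines of Python. The in-memory `stored_hashes` dictionary and the `needs_update` function are hypothetical stand-ins; in the approach described above the digest would live in a document property in MarkLogic.

```python
import hashlib

# Hypothetical stand-in for wherever the last-seen file digest is kept.
stored_hashes = {}


def needs_update(uri: str, file_bytes: bytes) -> bool:
    """Return True when the file may have changed since the last load."""
    digest = hashlib.md5(file_bytes).hexdigest()
    if stored_hashes.get(uri) == digest:
        return False  # (B) identical bytes, so the document must be the same
    stored_hashes[uri] = digest  # (A) new or changed bytes: reload and remember
    return True


first = needs_update("/sample.xml", b'<foo a="b"/>')    # never seen before
second = needs_update("/sample.xml", b'<foo a="b"/>')   # same bytes again
third = needs_update("/sample.xml", b'<foo  a="b"/>')   # only whitespace differs

print(first, second, third)
```

The third call reports an update even though the XML is equivalent, which is exactly the "MAY be different" case: the check is cheap and safe, but it answers a question about files, not about documents.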
> This, however, does not answer the question "did you store the same
> document I sent you?"
> I suggest that question is conceptually invalid or misguided in the first
> place, and that you go back to whoever is asking it and ask for
> clarification. What do they really mean?
> What is the real goal? You can satisfy the "checkbox requirement" by
> storing a "binary" version of the document alongside the XML one and
> checksumming that.
> Then you can happily say "Yes" ... but that is just giving people what
> they asked for, not what they need.
> The concept of document equality is not the same thing as file byte
> representation - period.
> You can do the deep-equal, yes. It's fairly expensive, but is that what
> they want?
> How can you prove you did it instead of just returning "true"?
> The question/requirement itself means that there is a mismatch in
> understanding of the true needs.
> Do they understand you're using a database, not a document repository?
> For example, if they sent you a CSV file to store in Oracle ... and then
> wanted you to prove later that it was stored properly by sending back a
> checksum ... of what? Same problem.
>
> If you are expected to store the "file" as a blob and that's part of the
> requirements, then you need to store the file as a "blob" (in ML that
> would be a binary, or large binary).
> But if the assumption is that the "blob" is actually what you're querying ...
> that's simply wrong ... and it is completely useless to ask for a checksum
> of the file. Even if you provided it, who's to say that you're storing a
> 1MB "blob" as a binary, but the ML XML document you store isn't just
> "<bigfile/>"?
> You've satisfied the customer by giving them what they asked for, but you
> have not solved any actual real problem, and you have just added a bunch
> of work. You might as well just store the MD5 of the original file and
> give that back.
>
> Even if you *did* ONLY store the blob (say, in a file, and didn't bother
> with ML at all), you can satisfy the customer, but unless your only
> requirement was to store the file, your app could still just produce
> random results.
>
> In general, I find this a classic case of micro-management or
> miscommunication of requirements.
> Very common ... a customer wants something, but the only way they know to
> express it is in terms of the things they know ... so instead of
> expressing the requirement like "provide a method to validate that you
> received the document correctly" and "provide a method to validate that
> the application is returning the correct results for the latest document",
> they ask "give me a checksum of the file". You can give them what they
> ask for, or you can give them what they really need. But if you just
> blindly follow what they ask for, you can end up with vastly more work, a
> bad product, and a customer who is unhappy or misled.
>
>
> When these cases come up, usually the intent is one of these:
>
> 1)      I want to make sure you *received* the file I sent correctly.
> This is very valid.
> Then store the MD5, length and timestamp as a property object with the
> document.
>
> 2)      If the core requirement is for you to store a "blob" ... that is,
> a document guaranteed to be unchanged byte-wise and retrievable in its
> original exact byte representation, then store the document as a binary.
> However, you won't be able to make much use of it besides returning it.
>
> 3)      If the requirement is to be both a "blob store" AND a "document
> database", then you need to store both the binary and the parsed document.
> But when asked for validation, the customer must understand that you're
> not querying the blob. You can return it on demand to prove you have it,
> but you need a different definition of equality to prove your database
> document is the same as the blob.
> Also, what does that prove? You can prove it by end-to-end testing of
> your application, making sure that any queries produce the expected
> results. Anything less than that doesn't prove much.
>
> But you can store both, validate that you stored the blob correctly,
> retrieve it on demand, and use the parsed document for querying. You can
> explain it as similar to loading it into memory.
> You can't prove a Java object (say, a tree structure) is exactly the same
> as the blob; that can only be done by higher-level application testing.
>
> 4)      If the requirement is that you store a blob, but are also able to
> query it, and are also able to verify that *what you are using to query*
> is byte-identical to the blob:
> that question is invalid. You can't do both; the person asking for it
> needs to understand that, or you need to not promise to do it, because it's
> either impossible or pointless.
> Any attempt to provide the customer what they asked for would simply be a
> lie.
>
> 5)      Mixed in with all these, many people misunderstand documents
> entirely, and actually believe that the file format "Is The Document" ...
> in all meanings of the word. This is debatable, but staying away from pure
> philosophy and abstract math, in practice it is generally a
> misunderstanding. If you open even a .txt file and save it without
> changes, and the result has a different checksum, is the document the
> same?
> What if the only difference is changing CR/LF to LF? Are they the same?
> What if it's changed from ASCII to UTF-8 ... are they the same?
> The answer lies in the context ... "For what purpose do you define
> equality?"
>
> Any answer that doesn't make the context explicit is going to be wrong or
> not useful.
>
> To end that long story :)
> I strongly encourage you to find out precisely what the real intent and
> requirements are.
> If you don't, then any solution to the problem as stated is very
> likely to be misguided and not solve the real problems.
>
>
>
>
>
>
>
>
>
> From: general-boun...@developer.marklogic.com
> [mailto:general-boun...@developer.marklogic.com] On Behalf Of David Ennis
> Sent: Friday, July 18, 2014 5:23 AM
> To: MarkLogic Developer Discussion
> Subject: Re: [MarkLogic Dev General] MD5 - Hash Question
>
> Hi.
>
> For what has been described, perhaps fn:deep-equal can help, as it
> more or less takes into account the 'pitfalls' listed by David Lee:
>
> - for attributes, order does not matter
> - whitespace is ignored in element definitions as well as between elements
> (essentially ignored if not an atomic value, I would assume)
> - (presumably because the comparison is done on the internal
> representation) things like single quotes versus double quotes make no
> difference (not the case if you were doing a hash check)
>
> Therefore, documents are equal if all of the nodes, child nodes, etc. are
> present with the exact same attributes and the same values in all places,
> while ignoring the order of attributes as well as unneeded whitespace.
>
> In a Nutshell:
> <foo > <bar a="b" c='d'>baz</bar></foo>
> equals
> <foo> <bar  c='d' a='b'>baz</bar>    </foo>
> (and of course this follows for deep structures as well)
>
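A rough Python sketch of a structural comparison in the spirit of the description above. It is not an implementation of fn:deep-equal: the `deep_equal` and `_significant` helpers are hypothetical, and treating whitespace-only text as empty is an assumption taken from the description rather than from the XQuery specification.

```python
import xml.etree.ElementTree as ET


def _significant(text):
    """Treat missing or whitespace-only text as empty (an assumption)."""
    return "" if text is None or text.isspace() else text


def deep_equal(a, b) -> bool:
    """Compare two elements by structure and value, ignoring attribute order."""
    return (
        a.tag == b.tag
        and a.attrib == b.attrib  # dict comparison does not depend on order
        and _significant(a.text) == _significant(b.text)
        and _significant(a.tail) == _significant(b.tail)
        and len(a) == len(b)
        and all(deep_equal(x, y) for x, y in zip(a, b))
    )


left = ET.fromstring("""<foo > <bar a="b" c='d'>baz</bar></foo>""")
right = ET.fromstring("""<foo> <bar  c='d' a='b'>baz</bar>    </foo>""")
other = ET.fromstring("""<foo><bar a="b" c="d">qux</bar></foo>""")

print(deep_equal(left, right))  # same structure and values
print(deep_equal(left, other))  # text content differs
```

Unlike a hash, this needs both documents in one place, so one side has to ship the full document to the other for every comparison.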
> So, perhaps a solution would be a check in MarkLogic:
> A = internal doc
> B = fetch the doc from Hadoop via http/odbc/whatever and do a deep-equal
> comparison against it.
> (or, expose the deep-equal check via some ML API and then just have some
> system that fetches from Hadoop and asks ML if the doc matches the one it has)
>
> Kind Regards,
> David Ennis
>
> On 17 July 2014 19:56, David Lee <david....@marklogic.com> wrote:
> I will reply with a shorter version of what came up recently on another
> list.
> ML uses UTF-8 internally for xs:string, which is what xdmp:quote() returns,
> but that's misleading. It's irrelevant, because the encoding of a string is
> not exposed at the XQuery layer.
> We could be using anything and the behavior would be the same.
> The only time encoding comes into play is when serializing to or
> deserializing from *bytes* (i.e. to/from a file or text document).
>
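A short Python illustration of the point that a hash is computed over bytes, so the encoding chosen at serialization time decides the digest. The sample string is hypothetical, and Python's `str` stands in for an xs:string here.

```python
import hashlib
import unicodedata

text = "caf\u00e9"  # one string value, independent of any byte encoding

utf8_hash = hashlib.md5(text.encode("utf-8")).hexdigest()
latin1_hash = hashlib.md5(text.encode("latin-1")).hexdigest()

# The same visible text in decomposed form (e followed by a combining accent).
decomposed = unicodedata.normalize("NFD", text)
nfd_hash = hashlib.md5(decomposed.encode("utf-8")).hexdigest()

print("utf-8 vs latin-1:", utf8_hash == latin1_hash)
print("NFC vs NFD:      ", utf8_hash == nfd_hash)
```

Both comparisons come out unequal even though a reader would call all three the same text, so the two ends have to agree on the encoding (and on any Unicode normalization) before their digests can be compared.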
> Unless you're using text or binary documents, generating a hash on them
> is pointless.
> You're not going to get the same hash as the original file. This is true
> for any processor of XML or JSON or structured documents that does any
> parsing or serializing.
>
>
> Documents stored in the database are not byte-equal to the source document.
>
> This is true at multiple levels. At the "store on the disk" level a
> "document" doesn't resemble the source document at all, any more than a
> CSV file resembles the block structure of an Oracle partition, let alone
> being byte-equal.
>
> For some primitive document types, namely binary, what you get back
> should be exactly equal to what you put in. But even text documents might
> undergo character set translation or Unicode normalization, so what you
> get back may not be byte-equal ...
>
> Structured docs are more complex. XML and JSON first undergo charset
> translation like text, then they are parsed into an internal node
> structure, and then stored in a very concise format.
>
> Even ignoring the disk format and MarkLogic ... ALL XML and JSON
> processors share this issue.
>
> The serialized text form of a document is not the same thing as its value.
> You can rarely, if ever, do a round trip on JSON or XML and get back byte
> for byte what you started with. It's *critical* to understand that the
> byte/text format of documents is the transport layer, not the document
> model itself. And with that concept, there are many equivalent ways to
> express the same document model. A very simple example: in XML, attributes
> have no ordering guarantee (nor in JSON), and spaces between attributes,
> as in <foo  a="b"     c="d"/>, are ignored, so
> <foo c="d" a="b"/> is the same document.
>
> To wrap this up, calculating a hash of a document before storing it and
> after retrieving it doesn't give you the answer you want.
> The hashes will almost certainly be different, whether or not the
> document "is the same thing" ...
>
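The round-trip effect is easy to reproduce with any XML parser. The sketch below uses Python's standard ElementTree purely as a stand-in; it says nothing about how MarkLogic itself serializes, only that parsing and re-serializing is not byte-preserving in general.

```python
import hashlib
import xml.etree.ElementTree as ET

original = '<foo                       a="b"/>'

# Parse into a node tree, then serialise again. The parser does not keep
# the run of spaces inside the start tag.
round_tripped = ET.tostring(ET.fromstring(original), encoding="unicode")

same_model = ET.fromstring(round_tripped).attrib == ET.fromstring(original).attrib
hash_before = hashlib.md5(original.encode("utf-8")).hexdigest()
hash_after = hashlib.md5(round_tripped.encode("utf-8")).hexdigest()

print("same attributes after round trip:", same_model)
print("same MD5 after round trip:       ", hash_before == hash_after)
```

The document model survives the round trip while the digest does not, which is the distinction between the transport format and the document itself.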
> To provide better advice on how to compare for document equality, I need
> more specifics, such as what format the document is in and how you want to
> define equality.
>
>
> From: general-boun...@developer.marklogic.com
> [mailto:general-boun...@developer.marklogic.com] On Behalf Of Sujith
> Sent: Thursday, July 17, 2014 11:20 AM
> To: MarkLogic Developer Discussion
> Subject: [MarkLogic Dev General] MD5 - Hash Question
>
> What is the default encoding that ML uses for xdmp:quote()?
>
> There is a daily job that loads Hadoop (Cloudera distribution) with the
> files that we have in ML, using mlcp. Now we want to check whether both of
> them are in sync, so we are using an MD5 hash for validation. Initially we
> provided Hadoop with our hash and they came back saying that it didn't
> match their data. After some analysis we figured out that we should
> explicitly specify the encoding as UTF-8 in the xdmp:quote options, as the
> Java program on their end is doing the same.
>
> (: The hash that did not match :)
> xquery version "1.0-ml";
> xdmp:md5(xdmp:quote(fn:doc("/sample.xml")))
>
>
> In other words, what is the default encoding xdmp:quote uses? (My
> assumption is that by default ML saves data as UTF-8 if no
> encoding is specified, and that the same is used when it retrieves the
> documents.)
>
>
> (: The hash that matches :)
>
> xquery version "1.0-ml";
> xdmp:md5(xdmp:quote(fn:doc("/sample.xml"),<options xmlns="xdmp:quote">
>       <output-encoding>utf-8</output-encoding>
>       <omit-xml-declaration>yes</omit-xml-declaration>
>     </options>))
>
> Any insight is very much appreciated.
>
>
> --
> Thanks & Regards
> SujithMaram
>
> _______________________________________________
> General mailing list
> *General@developer.marklogic.com* <General@developer.marklogic.com>
> *http://developer.marklogic.com/mailman/listinfo/general*
> <http://developer.marklogic.com/mailman/listinfo/general>
>
>
>
> --
> Thanks & Regards
> SujithMaram


-- 
Thanks & Regards
SujithMaram
