Many thanks for this. We will take your inputs into consideration.

David, we ran this across roughly a couple of million records in ML vs. Hadoop and things look good, but we will take a closer look.

On Mon, Jul 21, 2014 at 4:40 AM, Anthony Coates <anthony.coa...@db.com> wrote:

> Classification: Public
>
> I like hashing as a lower-cost way of doing reconciliation, but that
> usually means that you are running the hashing algorithm at both ends and
> comparing the hashes, so running something through ML to normalise it
> probably isn't the best option.
>
> That said, you do need to design your approach carefully so that what you
> hash is identical at both ends.
>
> Canonicalisation is one approach. I will sometimes run a document through
> an XSLT stylesheet, or something like that, to get it into a particular
> consistent format. You need to test that it reliably produces the same
> output and hashes at both ends, but once you have that working, it's
> usually quicker and cheaper than shipping all of the data from one system
> to another in order to reconcile it.
>
> Cheers, Tony.
>
>
> From: Joe Bryan <joe.br...@marklogic.com>
> To: MarkLogic Developer Discussion <general@developer.marklogic.com>
> Date: 18/07/2014 20:47
> Subject: Re: [MarkLogic Dev General] MD5 - Hash Question
> ------------------------------
>
> Hi David,
>
> Wouldn't it be possible to hash a consistent serialization (i.e., first
> load the documents into ML, then serialize, then hash the output)? I'm sure
> there are edge cases that can't be fully resolved, but it seems like a
> reasonably complete solution would be possible.
>
> Thanks.
>
> -jb
>
> From: David Lee <david....@marklogic.com>
> Reply-To: MarkLogic Developer Discussion <general@developer.marklogic.com>
> Date: Friday, July 18, 2014 at 1:02 PM
> To: MarkLogic Developer Discussion <general@developer.marklogic.com>
> Subject: Re: [MarkLogic Dev General] MD5 - Hash Question
>
> I'm glad you came to a solution, but if you count on it, you're counting
> on good luck, not fact.
> It might well work for the document instances you have tested today.
> It may very well work for some future documents, but it is absolutely *not
> a valid test* for what you're attempting, and it's trivial to
> demonstrate experimentally.
>
> Create this XML file (yes, include the spaces):
> <foo a="b"/>
>
> Run an MD5
> Put it on ML
> Pull it off using any serialization options you want
> Run an MD5
> They won't match, and that's one of many cases that won't work.
> Here's another:
> <foo b="a" a="b"></foo>
> -------------------------------
> And another:
> -------------------------------------------
> <foo xmlns:a="a" xmlns:a="b" />
> The set of XML instances that won't survive a round trip is infinite.
> You can get *closer* if you canonicalize your documents ...
>
> http://www.w3.org/TR/xml-c14n
>
> But you'd have to get your source to do that, because it's not reversible.
>
>
> It's up to you to decide if that matters, or what the severity of the
> problem will be when it fails (which it will).
> But I feel compelled to say that using hashes in this way is not a valid
> method for detecting XML document identity or equality.
> Using it as such is at your own risk.
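The round-trip failure described above can be reproduced outside MarkLogic. What follows is a minimal sketch, assuming Python 3.8 or later is available at the comparing end. It uses the standard library's `xml.etree.ElementTree.canonicalize`, which implements C14N 2.0 and so may not match the exact canonicalization variant linked above. The two sample strings and the helper names `md5_of_text` and `md5_of_canonical` are illustrative and do not come from the thread.

```python
import hashlib
from xml.etree.ElementTree import canonicalize


def md5_of_text(xml_text: str) -> str:
    """MD5 of the serialized form exactly as given, encoded as UTF-8."""
    return hashlib.md5(xml_text.encode("utf-8")).hexdigest()


def md5_of_canonical(xml_text: str) -> str:
    """MD5 of the C14N 2.0 form (hypothetical helper, for illustration)."""
    return md5_of_text(canonicalize(xml_text))


# Two serializations of the same document: attribute order, quote style,
# spacing inside the tag, and empty-element syntax all differ.
original = '<foo b="a" a="b"></foo>'
round_tripped = "<foo  a='b'   b='a' />"

print(md5_of_text(original) == md5_of_text(round_tripped))            # False
print(md5_of_canonical(original) == md5_of_canonical(round_tripped))  # True
```

Canonicalizing only helps if both systems apply the same canonicalization before hashing, so this would have to run on the Hadoop side as well as against whatever MarkLogic serializes.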
>
>
> From: general-boun...@developer.marklogic.com
> [mailto:general-boun...@developer.marklogic.com] On Behalf Of Sujith
> Sent: Friday, July 18, 2014 11:50 AM
> To: MarkLogic Developer Discussion
> Subject: Re: [MarkLogic Dev General] MD5 - Hash Question
>
> Thanks Lee / Ennis,
>
> Many thanks for the insight. We think we have the md5 hash fixed.
>
> Here are the requirements (compare XML documents in MarkLogic vs. the
> Hadoop system):
>
> 1. We have constant updates to the files that are in MarkLogic.
> 2. In our shop, Hadoop is the central repository / data store that
> collects the data from all the systems that support the organization, and
> MarkLogic is one of the feeders.
> 3. We use mlcp to feed the updates / data to Hadoop. We have a
> lastUpdateDateTime element that is used to capture the updates and feed the
> incrementals.
> 4. For now, all the data is XML data.
> 5. Here the source of truth is MarkLogic (the feeder to Hadoop), and at a
> given point Hadoop wants to reconcile the data.
> 6. To achieve this we went with md5 as an approach (the same way XQSync
> does).
> 7. When we provided the hashes of the documents, we were told by the Hadoop
> team that they don't match the documents they have.
>
> After doing some more analysis on this, we found that when we set the
> <omit-xml-declaration> option to yes on xdmp:quote(fn:doc("/sample.xml")),
> the hashes match.
> So when mlcp loads the data to Hadoop, the XML declaration is
> missing / omitted, and this is the difference between the source and target
> that was giving us a hash mismatch.
>
> xdmp:md5(xdmp:quote(fn:doc("/sample.xml"),
>   <options xmlns="xdmp:quote">
>     <omit-xml-declaration>yes</omit-xml-declaration>
>   </options>))
>
>
> Many Thanks!!!
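The mismatch described above comes down to MD5 being computed over bytes, so anything that changes the byte stream changes the hash. As a rough illustration only (plain Python with `hashlib`, not MarkLogic or mlcp code; the sample document and the helper name `md5_hex` are made up for this sketch), the same content hashes differently with and without an XML declaration, and under a different output encoding:

```python
import hashlib


def md5_hex(text: str, encoding: str = "utf-8") -> str:
    """Hash the bytes produced by encoding `text`; MD5 sees bytes, not characters."""
    return hashlib.md5(text.encode(encoding)).hexdigest()


body = "<doc>caf\u00e9</doc>"  # illustrative document, not from the thread
with_declaration = '<?xml version="1.0" encoding="UTF-8"?>\n' + body

# Same document content, three different byte streams, three different hashes.
hashes = [
    md5_hex(body),                         # UTF-8, no declaration
    md5_hex(with_declaration),             # UTF-8, with declaration
    md5_hex(body, encoding="iso-8859-1"),  # same characters, different bytes
]
for h in hashes:
    print(h)
```

This is why both ends have to agree on the encoding and on whether the declaration is present before their hashes can be compared.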
>
> On Fri, Jul 18, 2014 at 8:13 AM, David Lee <david....@marklogic.com> wrote:
> Another suggestion (which I have used in the past) is to do the MD5 on the
> text document *before* sending it to ML, and to store it as a property.
> Then, when a new document arrives, check the MD5 of the text document (on
> disk). If it matches what I stored, then I know they are the same and skip
> it. If they don't match, I know something has changed (maybe irrelevant
> whitespace), so I update the doc.
> This works well for the purposes of 'sync'-like tools, where it is useful
> to assert:
>
> A) If the MD5 of the new *file* is different from the MD5 of the last
> *file*, then the document *MAY* be different.
>
> B) If the MD5 of the new *file* is identical to the MD5 of the last
> *file*, then the document *MUST* be the same.
>
> This, however, does not answer the question "did you store the same
> document I sent you?"
> I suggest that question is conceptually invalid or misguided in the first
> place, and that you go back to whoever is asking it and ask for a
> clarification. What do they really mean?
> What is the real goal? You can satisfy the "checkbox requirement" by
> storing a "binary" version of the document alongside the XML one and
> checksumming that.
> Then you can happily say "Yes" ... but that is just giving people what
> they asked for, not what they need.
> The concept of document equality is not the same thing as file byte
> representation - period.
> You can do the deep-equal, yes. It's fairly expensive, but is that what
> they want?
> How can you prove you did it instead of just returning "true"?
> The question/requirement itself means that there is a mismatch in
> understanding of the true needs.
> Do they understand you're using a database, not a document repository?
> For example, if they sent you a CSV file to store in Oracle ...
> then wanted you to prove later that it was stored properly by sending
> back a checksum ... of what? Same problem.
>
> If you are expected to store the "file" as a blob and that's part of the
> requirements, then you need to store the file as a "blob" (in ML that
> would be a binary, or large binary).
> But if the assumption is that the "blob" is actually what you're querying
> ... that's simply wrong ... and it is completely useless to ask for a
> checksum of the file. Even if you provided it, who's to say you're not
> storing a 1MB "blob" as a binary while the ML XML document you store is
> just "<bigfile/>" ...
> You've satisfied the customer by giving them what they asked for, but have
> not solved any actual real problem, and just added a bunch of work. You
> might as well just store the MD5 of the original file and give that back.
>
> Even if you *did* ONLY store the blob (say in a file, and didn't bother
> with ML at all), you can satisfy the customer, but only if your sole
> requirement was to store the file. Your app could still just produce
> random results.
>
> In general, I find this a classic case of micro-management or
> miscommunication of requirements.
> Very common ... a customer wants something, but the only way they know to
> express it is in terms of the things they know ... so instead of
> expressing the requirement like "provide a method to validate that you
> received the document correctly" and "provide a method to validate that
> the application is returning the correct results for the latest document",
> they ask "give me a checksum of the file". You can give them what they ask
> for, or you can give them what they really need. But if you just blindly
> follow what they ask for, you can end up with vastly more work, a bad
> product and a customer who is unhappy, or misled.
>
>
> When these cases come up, usually the intent is this:
>
> 1) I want to make sure you *received* the file I sent correctly:
> This is very valid.
> Then store the MD5, length, and timestamp as a property object with the
> document.
>
> 2) If the core requirement is for you to store a "blob" ... that is,
> a document guaranteed to be unchanged byte-wise and retrievable in its
> original exact byte representation, then store the document as a binary.
> However, you won't be able to make much use of it besides returning it.
>
> 3) If the requirement is to be a "blob store" AND a "document
> database" both, then you need to store both the binary and the parsed
> document.
> But when asked for validation, the customer must understand that you're
> not querying the blob. You can return it on demand to prove you have it,
> but you need a different definition of equality to prove your database
> document is the same as the blob.
> Also, what does that prove? You can prove it by end-to-end testing of
> your application, making sure that any queries produce the expected
> results. Anything less than that doesn't prove much.
>
> But you can store both, validate that you stored the blob correctly,
> retrieve it on demand, and use the parsed document for querying. You can
> explain it as similar to loading it into memory.
> You can't prove a Java object (say a tree structure) is exactly the same
> as the blob - that can only be done by higher-level application testing.
>
> 4) If the requirement is that you store a blob but also be able to
> query it, and also be able to verify that *what you are using to query*
> is byte-identical to the blob:
> That question is invalid. You can't do both; the person asking for it
> needs to understand that, or you need to not promise to do it, because
> it's either impossible or pointless.
> Any attempt to provide the customer what they asked for would simply be a
> lie.
>
> 5) Mixed in with all these, many people misunderstand documents
> entirely, and actually believe that the file format "Is The Document" ...
> in all meanings of the word.
> This is debatable, but staying away from pure philosophy and abstract
> math -- in practice it is generally a misunderstanding. If you open even
> a .txt file and save it without changes, and the result has a different
> checksum, is the document the same?
> What if the only difference is changing CR/LF to LF? Are they the same?
> What if it's changed from ASCII to UTF8 ... are they the same?
> The answer lies in the context ... "For what purpose do you define
> equality?"
>
> Any answer that doesn't have the context explicit is going to be wrong or
> not useful.
>
> To end that long story :)
> I strongly encourage you to find out precisely what the real intent and
> requirements are.
> If you don't, then any solution to the problem as stated is very
> likely to be misguided and not solve the real problems.
>
>
> From: general-boun...@developer.marklogic.com
> [mailto:general-boun...@developer.marklogic.com] On Behalf Of David Ennis
> Sent: Friday, July 18, 2014 5:23 AM
> To: MarkLogic Developer Discussion
> Subject: Re: [MarkLogic Dev General] MD5 - Hash Question
>
> Hi.
>
> For what has been described, perhaps fn:deep-equal can help, as it
> more-or-less takes into account the 'pitfalls' listed by David Lee:
>
> - for attributes, order does not matter
> - whitespace is ignored in element definitions as well as between elements
> (essentially ignored if not an atomic value, I would assume)
> - (presumably because the comparison is done on the internal
> representation) things like single-quote or double-quote make no
> difference (not the case if you were doing a hash check)
>
> Therefore, documents are equal if all of the nodes, child nodes, etc. are
> present with the exact same attributes and the same values in all places,
> while ignoring order of attributes as well as unneeded whitespace.
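The comparison rules listed above can be sketched outside MarkLogic. This is a rough Python approximation of the behaviour as described in this message, not a reimplementation of fn:deep-equal, whose exact whitespace handling may differ. The function name `loosely_equal` is hypothetical, and the sketch ignores comments, processing instructions and namespace prefixes.

```python
import xml.etree.ElementTree as ET


def _text(value):
    # Treat missing text and whitespace-only text as empty. Note that this
    # also trims leading/trailing whitespace around real text values.
    return (value or "").strip()


def loosely_equal(a: ET.Element, b: ET.Element) -> bool:
    """Structural comparison: attribute order and quoting never matter,
    because attributes are compared as dictionaries of parsed values."""
    if a.tag != b.tag or a.attrib != b.attrib:
        return False
    if _text(a.text) != _text(b.text) or _text(a.tail) != _text(b.tail):
        return False
    if len(a) != len(b):
        return False
    return all(loosely_equal(x, y) for x, y in zip(a, b))


left = ET.fromstring("""<foo > <bar a="b" c='d'>baz</bar></foo>""")
right = ET.fromstring("""<foo> <bar c='d' a='b'>baz</bar> </foo>""")
changed = ET.fromstring("""<foo><bar a="b" c="d">qux</bar></foo>""")

print(loosely_equal(left, right))    # True
print(loosely_equal(left, changed))  # False
```

Unlike a hash, this needs both documents in one place, which is the cost the rest of the thread is trying to avoid.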
>
> In a nutshell:
> <foo > <bar a="b" c='d'>baz</bar></foo>
> equals
> <foo> <bar c='d' a='b'>baz</bar> </foo>
> (and of course this follows for deep structures as well)
>
> So, perhaps a solution would be a check in MarkLogic:
> A = internal doc
> B = fetch doc from Hadoop via http/odbc/whatever and do a deep-diff on it.
> (Or, expose deep-diff via some ML API and then just have some system that
> fetches from Hadoop and asks ML if the doc matches the one it has.)
>
> Kind Regards,
> David Ennis
>
> On 17 July 2014 19:56, David Lee <david....@marklogic.com> wrote:
> I will reply with a shorter version of what came up recently on another
> list.
> ML uses UTF8 internally for xs:string, which is what xdmp:quote() returns,
> but that's misleading. It's irrelevant, because the encoding of a string
> is not exposed at the XQuery layer.
> We could be using anything and the behavior would be the same.
> The only time encoding comes into play is when serializing to or
> deserializing from *bytes* (i.e. to/from a file or text document).
>
> Unless you're using text or binary documents, generating a hash on them
> is pointless.
> You're not going to get the same hash as the original file. This is true
> for any processor of XML or JSON or structured documents that does any
> parsing or serializing.
>
>
> Documents stored in the database are not byte-equal to the source document.
>
> This is true at multiple levels. At the "store on the disk" level a
> "document" doesn't resemble the source document at all, any more than a
> CSV file resembles the block structure of an Oracle partition - let alone
> being byte-equal.
>
> For some primitive document types, namely binary, what you get back
> should be equal to what you put in - exactly. But even text documents might
> undergo character set translation or Unicode normalization, so what you
> get back may not be byte-equal ...
>
> Structured docs are more complex. XML and JSON first undergo charset
> translation like text, then they are parsed into an internal node structure
> and then stored in a very concise format.
>
> Even ignoring the disk format and MarkLogic ... ALL XML and JSON
> processors share this issue.
>
> The serialized text form of a document is not the same thing as its value.
> You can rarely, if ever, do a round trip on JSON or XML and get back byte
> for byte what you started with. It's *critical* to understand that the
> byte/text format of documents is the transport layer, not the document
> model itself. And with that concept, there are many equivalent ways to
> express the same document model. A very simple example: in XML, attributes
> have no ordering guarantee (nor in JSON), and spaces between attributes
> like <foo a="b" c="d"/> are ignored, so <foo c="d" a="b"/> is the same
> document.
>
> To wrap this up, calculating a hash of a document before storing it and
> after retrieving it doesn't give you the answer you want.
> The hashes will almost certainly be different, whether or not the
> document "is the same thing" ...
>
> To provide better advice on how to compare for document equality, I need
> more specifics, such as what format the document is in and how you want to
> define equality.
>
>
> From: general-boun...@developer.marklogic.com
> [mailto:general-boun...@developer.marklogic.com] On Behalf Of Sujith
> Sent: Thursday, July 17, 2014 11:20 AM
> To: MarkLogic Developer Discussion
> Subject: [MarkLogic Dev General] MD5 - Hash Question
>
> What is the default encoding that ML uses for xdmp:quote()?
>
> There is a daily job that loads Hadoop (Cloudera distribution) with the
> files that we have in ML using mlcp. Now we want to check whether both of
> them are in sync, so we are using an md5 hash for validation.
> Initially we provided Hadoop with our hash and they came back saying that
> it didn't match their data. After doing some analysis, we figured out that
> we should explicitly specify the encoding as UTF-8 in the xdmp:quote
> options, as the Java program on their end is doing the same.
>
> (: The hash that did not match :)
> xquery version "1.0-ml";
> xdmp:md5(xdmp:quote(fn:doc("/sample.xml")))
>
>
> In other words, what would be the default encoding xdmp:quote uses? (My
> assumption is that by default ML saves data as UTF-8 if no encoding is
> specified, and that the same encoding is used when it retrieves the
> documents.)
>
>
> (: The hash that matched :)
>
> xquery version "1.0-ml";
> xdmp:md5(xdmp:quote(fn:doc("/sample.xml"),
>   <options xmlns="xdmp:quote">
>     <output-encoding>utf-8</output-encoding>
>     <omit-xml-declaration>yes</omit-xml-declaration>
>   </options>))
>
> Any insight is very much appreciated.
>
>
> --
> Thanks & Regards
> SujithMaram
>
> _______________________________________________
> General mailing list
> General@developer.marklogic.com
> http://developer.marklogic.com/mailman/listinfo/general
>
> ---
>
> This e-mail may contain confidential and/or privileged information. If you
> are not the intended recipient (or have received this e-mail in error)
> please notify the sender immediately and delete this e-mail.
> Any unauthorized copying, disclosure or distribution of the material in
> this e-mail is strictly forbidden.
>
> Please refer to http://www.db.com/en/content/eu_disclosures.htm for
> additional EU corporate and regulatory disclosures and to
> http://www.db.com/unitedkingdom/content/privacy.htm for information about
> privacy.

--
Thanks & Regards
SujithMaram
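The reconciliation approach discussed in this thread (hash a canonical form at both ends, then compare hashes rather than shipping documents) can be sketched as follows. This is only an illustration under stated assumptions: each side is assumed to be able to produce a `{uri: hash}` manifest, both sides are assumed to run the same canonicalization (here Python 3.8+ `xml.etree.ElementTree.canonicalize`), and the names `fingerprint` and `reconcile` and the sample URIs are hypothetical.

```python
import hashlib
from xml.etree.ElementTree import canonicalize


def fingerprint(xml_text: str) -> str:
    """MD5 over the canonical (C14N 2.0) UTF-8 bytes of a document."""
    return hashlib.md5(canonicalize(xml_text).encode("utf-8")).hexdigest()


def reconcile(source: dict, target: dict) -> dict:
    """Compare two {uri: fingerprint} manifests and report the differences."""
    shared = source.keys() & target.keys()
    return {
        "missing_in_target": sorted(source.keys() - target.keys()),
        "extra_in_target": sorted(target.keys() - source.keys()),
        "mismatched": sorted(u for u in shared if source[u] != target[u]),
    }


# Illustrative manifests; in practice each side would build its own.
source_side = {
    "/a.xml": fingerprint('<doc id="1" rev="2">x</doc>'),
    "/b.xml": fingerprint("<doc>new</doc>"),
    "/c.xml": fingerprint("<doc/>"),
}
target_side = {
    "/a.xml": fingerprint("<doc rev='2' id='1'>x</doc>"),  # same document
    "/b.xml": fingerprint("<doc>old</doc>"),               # stale copy
}

report = reconcile(source_side, target_side)
print(report)
```

Only the documents named in the report would then need to be fetched and compared in full, for example with a structural comparison.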