Tim,

The suggestion below from Sascha is a good one. The other approach I've taken before is to search the repo for a given document and insert it only if it does not exist; otherwise, perform an update or just log it as an "error".
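A minimal sketch of that search-then-insert flow, in Java since the migration runs on Spring Batch. Everything here is hypothetical: `InMemoryRepo` is only a stand-in for the target repository, and `Migrator`, `findByLegacyId`, and `Outcome` are illustrative names, not part of any CMIS client API. In a real migration the lookup would be a repository query on whatever property holds the legacy ID.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical stand-in for the target repository, keyed by legacy ID. */
final class InMemoryRepo {
    private final Map<String, String> docsByLegacyId = new HashMap<>();

    Optional<String> findByLegacyId(String legacyId) {
        return Optional.ofNullable(docsByLegacyId.get(legacyId));
    }

    void insert(String legacyId, String content) {
        docsByLegacyId.put(legacyId, content);
    }

    int size() {
        return docsByLegacyId.size();
    }
}

final class Migrator {
    enum Outcome { INSERTED, SKIPPED_DUPLICATE }

    private final InMemoryRepo repo;

    Migrator(InMemoryRepo repo) {
        this.repo = repo;
    }

    /** Search first; insert only when no document with this legacy ID exists. */
    Outcome migrate(String legacyId, String content) {
        if (repo.findByLegacyId(legacyId).isPresent()) {
            // An update of the existing document could go here instead of a skip.
            System.err.println("Duplicate legacy ID, skipping: " + legacyId);
            return Outcome.SKIPPED_DUPLICATE;
        }
        repo.insert(legacyId, content);
        return Outcome.INSERTED;
    }
}
```

The check and the insert are two separate steps, so this only protects a single migration process; concurrent writers would still need a constraint on the server side.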
Thanks,

Ron DiFrango
Director / Architect | CapTech

On 6/22/14, 5:37 AM, "Sascha Homeier" <[email protected]> wrote:

>Hi Tim,
>
>you said you need to migrate the documents from FileNet to a CMIS
>compliant server.
>Is the CMIS compliant server your implementation?
>If so you could calculate a hash like MD5 over the content stream and
>set it as the object ID.
>According to the CMIS spec this object ID needs to be unique. So it must
>be ensured that no two objects with the same object ID exist in the same
>CMIS repository, which is equivalent to having two objects with the same
>content stream.
>This approach would also ensure that equal documents are not added in
>the future after migration is done.
>Nevertheless, here you also need to find a performant way of determining
>if an object with an ID already exists (and find a solution if the hash
>is changed only by a timestamp inside the content stream etc.)
>With about two million objects you may need to extend the RAM on the
>migration machine to keep that many objects in memory and compare them
>using HashMaps and Hashtables with your own implementations of equals()
>and hashCode() ;)
>
>Anyway, a stimulating task. I'm curious about the ideas of others here to
>solve it in a performant way ;)
>
>Cheers
>Sascha
>
>-----Original Message-----
>From: Tim Webster [mailto:[email protected]]
>Sent: Saturday, June 21, 2014 17:55
>To: [email protected]
>Subject: Re: document 'uniqueness'
>
>Hello,
>
>yes thanks for the suggestion - it sort of does that already with the
>Spring Batch progress tracking, but it still won't prevent another
>document being added to the repository that is identical to a previous
>one if it somehow failed - like a JVM crash or power failure. Because
>there is no transaction management for the CMIS part, you can't really
>ensure this, except for a constraint in the repository itself.
>
>Anyway, yeah I think you're right and I need to look at FileNet
>specifically. I just wasn't sure if I missed something and there was
>something in the CMIS spec that I could use (e.g. some property or
>something).
>
>Thanks,
>
>Tim
>
>
>
>On Fri, Jun 20, 2014 at 10:24 PM, Lucas, Mike <[email protected]> wrote:
>
>> I'm sure you've already thought of this, but couldn't your migration
>> process just persist the legacy ids in a separate location (e.g.
>> database table, possibly cached in memory for performance)? Then you
>> would just need to check that for each document being migrated, to
>> make sure that the same doc hasn't been seen previously.
>>
>> Not a CMIS related solution, but seems like it would work fine...
>>
>> The other option, as you suggest, is to see if FileNet supports a
>> 'uniqueness' constraint for custom metadata properties. I believe
>> SharePoint does but not sure about FileNet.
>>
>> Thanks
>> michael lucas | Senior Software Developer | Great-West Life
>>
>>
>> -----Original Message-----
>> From: Tim Webster [mailto:[email protected]]
>> Sent: June 20, 2014 8:15 AM
>> To: [email protected]
>> Subject: document 'uniqueness'
>>
>> Hi,
>>
>> I am developing a migration process (using Spring Batch) to migrate
>> documents from a legacy CMS into a CMIS-compliant system, and I need
>> to ensure that duplicate documents are not created accidentally.
>>
>> However, our CMIS system (IBM FileNet) allows the addition of
>> documents with the same name. Documents with identical values for
>> cmis:name or cmis:contentStreamFilename are allowed. Even if this
>> could be disabled (I don't know if it can or cannot), it is a business
>> requirement and I wouldn't be able to.
>>
>> The only thing I can think of to prevent this is to save the 'legacy'
>> ID of the document in a new CMIS property and somehow check that it
>> doesn't already exist when adding a new document. However this will be
>> very inefficient and slow down the migration (we're talking about up
>> to 2 million documents).
>>
>> Ideally the 'uniqueness constraint' would be checked on the server and
>> would throw an exception, which I could then deal with.
>>
>> Does anyone know of an easier way to do this, or is there anything I
>> can make use of in the CMIS spec to help?
>>
>> Thanks,
>>
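
For reference, Sascha's content-hash idea from the thread above could look roughly like the sketch below. It only computes the MD5 digest of a content stream as a hex string using the JDK's `MessageDigest`; the class and method names (`ContentHash`, `md5Hex`) are made up for illustration. Whether the resulting value can actually be used as the object ID depends on the server implementation, as Sascha points out, and any timestamp embedded in the content will change the hash.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class ContentHash {
    private ContentHash() {}

    /** Streams the content through MD5 and returns the digest as lowercase hex. */
    static String md5Hex(InputStream content) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] buffer = new byte[8192];
            int read;
            // Read in chunks so large documents are never held fully in memory.
            while ((read = content.read(buffer)) != -1) {
                md.update(buffer, 0, read);
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available in this JVM", e);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Two documents with byte-identical content produce the same string, so the value can serve as a duplicate key in a lookup table even if the repository does not accept it as an object ID.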
