I have hundreds of batches of 30K records.
I set up a flow to ingest them via a CSV collector (changed from the example to 
create individual docs out of the rows).
I also add a collection to the documents.  So, each document in a batch should 
have two collections, one I assign and one that IS assigns 
"/tickets/ticket/...".
There are duplicates within and across the batches.  They are not exact 
duplicates: each differs slightly in content but shares the same unique 
identifier.
What I want to do is create a master document that has each of the duplicates 
embedded in it.  I have already written the XQuery that does this, and it works 
(when I run it by hand).
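
For illustration only, the merge described above can be sketched in Python. 
This is not the XQuery mentioned above; the record shape, the `id` field, the 
`versions` field, and the sample values are all hypothetical stand-ins.

```python
from collections import defaultdict


def build_masters(records, key="id"):
    """Group near-duplicate records under one master per unique identifier.

    `key` names the (hypothetical) field holding the unique identifier.
    """
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key]].append(rec)
    # Each master embeds every variant that shares the identifier.
    return {uid: {key: uid, "versions": versions}
            for uid, versions in groups.items()}


# Hypothetical batch: two variants of T-1 and a single T-2.
batch = [
    {"id": "T-1", "status": "open"},
    {"id": "T-2", "status": "open"},
    {"id": "T-1", "status": "closed"},
]
masters = build_masters(batch)
```

Running the same function over the masters of several batches would give the 
wider, collection-level merge.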

What I'm looking for is the mechanism to recognize when a new batch appears and 
be able to use the collection that IS assigns ("/tickets/ticket/...") to 
conglomerate the duplicates within the batch, then conglomerate them further 
within the collection I've assigned.  After I conglomerate a document into the 
master, I remove the document.
CPF seems to operate on a "document" level and I don't really want that.
I thought about the transform step in IS but that's also at a "document" level.

Incidentally, there's yet another step that I want to do.  I have set up a flow 
for each content type.  Thus, after a batch has been conglomerated amongst itself 
and then amongst its collection, I want to conglomerate it amongst 
"collection()".

I suppose I could also add these steps to the custom collector - I think I'd 
have access to the "/tickets/ticket/..." collection there...

Seems like it'd be a CPF pipeline I could set up if I could create an execute 
condition module that could recognize when a new collection appeared in the DB?

I guess I could keep a "running list" document around and write the new IS 
collection into it each time a collector completes and base the CPF on when 
that document updates?
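
As a rough sketch of the "running list" idea, here is the bookkeeping in plain 
Python.  It assumes the list only needs to remember which collections have 
been seen and which still await merging; the class name, method names, and the 
collection names are hypothetical, and nothing here is a MarkLogic or CPF API.

```python
class RunningList:
    """In-memory stand-in for the "running list" document."""

    def __init__(self):
        self.seen = []     # every collection recorded so far, in arrival order
        self.pending = []  # collections recorded but not yet merged

    def record(self, collection):
        """Call when a collector completes; return True if the collection is new."""
        if collection in self.seen:
            return False
        self.seen.append(collection)
        self.pending.append(collection)
        return True

    def drain(self):
        """Return the collections awaiting a merge and clear the queue."""
        todo, self.pending = self.pending, []
        return todo


# Hypothetical collection names, one per completed batch.
running = RunningList()
first = running.record("/example/batch-1")
repeat = running.record("/example/batch-1")
second = running.record("/example/batch-2")
todo = running.drain()
```

In the real setup, an update to the list document would play the role of 
`record`, and whatever reacts to that update would call the merge for each 
entry returned by `drain`.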

Note: because the flows for each content type are independent (because I want 
to assign a specific collection to each content type and I don't see a way to 
assign a different collection based on some criteria in the existing IS 
screens) I guess I would actually have to have separate CPFs and "running list" 
documents for each Flow.

Thanks for any advice,
David