I have hundreds of batches of 30K records.
I set up a flow to ingest them via a CSV collector (changed from the example to
create individual docs out of the rows).
I also add a collection to the documents, so each document in a batch should
have two collections: one I assign and one that Information Studio (IS) assigns
("/tickets/ticket/...").
There are duplicates within and across the batches. These are not exact
duplicates: each differs slightly in content but has the same unique
identifier.
What I want to do is create a master document that has each of the duplicates
embedded in it. I already wrote the XQuery that does this and it works (when I
run it by hand).
What I'm looking for is a mechanism that recognizes when a new batch appears and
uses the collection that IS assigns ("/tickets/ticket/...") to conglomerate the
duplicates within the batch, then conglomerates them further within the
collection I've assigned. After I conglomerate a document into the master, I
remove the document.
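
A minimal sketch of the kind of merge described above, not the actual query.
It assumes each row document has a hypothetical <ticket> root with an <id>
child holding the unique identifier, that masters live under a made-up
"/masters/" URI prefix in a "masters" collection, and that the collection to
process is passed in as an external variable (for example via xdmp:invoke).

```xquery
xquery version "1.0-ml";

(: ASSUMPTIONS (illustrative only): the <ticket>/<id> document shape, the
   "/masters/" URI prefix, and the "masters" collection are hypothetical. :)
declare variable $batch-collection as xs:string external;

(: One pass per distinct identifier found in the collection. :)
for $id in fn:distinct-values(fn:collection($batch-collection)/ticket/id)
let $dups := fn:collection($batch-collection)/ticket[id eq $id]
let $master-uri := fn:concat("/masters/", $id, ".xml")
(: Carry over duplicates already embedded by an earlier batch, if any. :)
let $existing := fn:doc($master-uri)/master/ticket
return (
  (: Write (or overwrite) the master with old and new duplicates embedded. :)
  xdmp:document-insert(
    $master-uri,
    <master><id>{$id}</id>{$existing, $dups}</master>,
    xdmp:default-permissions(),
    "masters"),
  (: Remove each source document once it has been folded into the master. :)
  for $d in $dups
  return xdmp:document-delete(xdmp:node-uri($d))
)
```

Everything here runs in a single transaction, which may be too heavy for a
full 30K-record batch; splitting the work by identifier range or spawning
tasks would be one way around that.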
CPF seems to operate on a "document" level and I don't really want that.
I thought about the transform step in IS but that's also at a "document" level.
Incidentally, there's yet another step that I want to do. I have set up a flow
for each content type. Thus, after a batch has been conglomerated amongst itself
and then amongst its collection, I want to conglomerate it amongst
"collection()".
I suppose I could also add these steps to the custom collector - I think I'd
have access to the "/tickets/ticket/..." collection there...
Seems like it'd be a CPF pipeline I could set up if I could create an execute
condition module that could recognize when a new collection appeared in the DB?
I guess I could keep a "running list" document around and write the new IS
collection into it each time a collector completes and base the CPF on when
that document updates?
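
A rough sketch of the "running list" idea, under assumptions: one list
document per flow at a hypothetical "/running-lists/{flow}.xml" URI, with a
made-up <batches>/<batch> shape. The flow name and the IS-assigned collection
are passed in as external variables by whatever runs when the collector
finishes.

```xquery
xquery version "1.0-ml";

(: ASSUMPTIONS (illustrative only): the "/running-lists/" URI prefix and the
   <batches>/<batch> element names are hypothetical. :)
declare variable $flow-name as xs:string external;
declare variable $batch-collection as xs:string external;

let $list-uri := fn:concat("/running-lists/", $flow-name, ".xml")
let $entry :=
  <batch added="{fn:current-dateTime()}">{$batch-collection}</batch>
return
  if (fn:doc-available($list-uri))
  (: Append to the existing list; this update is the event to react to. :)
  then xdmp:node-insert-child(fn:doc($list-uri)/batches, $entry)
  (: First batch for this flow: create the list document. :)
  else xdmp:document-insert($list-uri, <batches>{$entry}</batches>)
```

A CPF domain scoped to the list documents could then treat each update as the
signal that a new batch collection is ready to be conglomerated.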
Note: because the flows for each content type are independent (because I want
to assign a specific collection to each content type and I don't see a way to
assign a different collection based on some criteria in the existing IS
screens) I guess I would actually have to have separate CPFs and "running list"
documents for each Flow.
Thanks for any advice,
David