So first I'll say that I've personally used Rhino Etl at good speed on
migrations of up to 1 billion rows. I'm curious where the slowdown in
your example is coming from. You are effectively caching each "bucket"
of data and then passing it along. How many rows are you dealing with, and
what are the operations doing with them afterward? The only time I've done
this type of thing to speed things up was when I was dealing with a
serialization bottleneck, and even then the difference was fairly minimal.

Nathan Palmer


On Thu, May 23, 2013 at 11:25 AM, TJ Roche <[email protected]> wrote:

> Well, let me just say: ETL is hard ;)
> The funneling operation turned out to be a resounding failure.  Whether
> through my own lack of skill or knowledge, I just couldn't get it to be
> reasonably performant (at least what I deem reasonable) on a data set of
> any real size.
>
> So I ended up nixing the funneling operation and decided on a different
> tactic.  I created something called FetchAsCollection, which returns the
> output of a DB command through a merge-rows function that takes a Row and
> an IEnumerable<Row>, allowing you to deposit the collection and parse it
> however you need.
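The shape described above — materialize a DB command's output, then hand the whole collection to a caller-supplied merge function — can be sketched roughly like this (in Python for brevity; the names here are illustrative, not the gist's actual signatures):

```python
def fetch_as_collection(execute_command, merge):
    """Buffer the full result set of a DB command, then let the
    caller's merge function consume the whole collection at once."""
    rows = list(execute_command())  # materialize every row up front
    return merge(rows)
```

The trade-off is memory: the entire result set is held before `merge` runs, which is fine for bounded lookups but not for the billion-row case discussed above.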
>
> I may have reinvented the wheel for some pieces, but it appears to work
> fairly well for my uses.   My next task involves creating an AntiJoin for
> SQL so that I can dedupe any existing records and use the SQL bulk insert.
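An anti-join for dedupe keeps only the source rows whose key has no match on the target side, so only genuinely new records reach the bulk insert. A minimal in-memory sketch of the semantics (Python for illustration; `key` is a hypothetical key-extractor, and a real implementation would push this into SQL or a Rhino.Etl join operation):

```python
def anti_join(source_rows, existing_keys, key):
    """Yield only the source rows whose key is not already present
    in the target, i.e. the rows safe to bulk-insert."""
    existing = set(existing_keys)  # O(1) membership checks
    for row in source_rows:
        if key(row) not in existing:
            yield row
```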
>
> Here is the code for FetchAsCollection, should anyone want it, also feel
> free to critique, change, improve etc.
> https://gist.github.com/anonymous/6c30878329d4c5817731
>
>
> On Monday, May 13, 2013 12:06:24 PM UTC-7, Nathan Palmer wrote:
>
>> Awesome. I'd love to see the finished version.
>>
>> Nathan Palmer
>>
>>
>> On Mon, May 13, 2013 at 1:41 PM, TJ Roche <[email protected]> wrote:
>>
>>> I think the solution is going to be to create a "FunnelingOperation"
>>> (opposite of a branching operation), and start at the furthest leaf of the
>>> collection tree and fill up from the bottom.  I have to finish writing it
>>> and testing it but once I am happy with it I will post the code.
>>>
>>>
>>> On Friday, May 10, 2013 1:58:06 PM UTC-7, Nathan Palmer wrote:
>>>
>>>> A couple of things I can think of offhand:
>>>>
>>>> 1) Create an operation to yield out Frames. Then create an
>>>> AbstractOperation that re-queries SQL for the corresponding Stacks and
>>>> Tasks for each Frame and then wires those together into one document.
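Option 1 amounts to a per-frame lookup. A rough sketch in Python (the `fetch_stacks`/`fetch_tasks` helpers stand in for the SQL re-queries and are hypothetical):

```python
def assemble_frames(frames, fetch_stacks, fetch_tasks):
    """For each frame, re-query its stacks and each stack's tasks,
    wiring them together into one nested document."""
    for frame in frames:
        stacks = []
        for stack in fetch_stacks(frame["id"]):
            # Attach the child collection before adding the stack.
            stack["tasks"] = fetch_tasks(stack["id"])
            stacks.append(stack)
        frame["stacks"] = stacks
        yield frame
```

Note the cost: one extra round-trip per frame and per stack, so this trades code simplicity for query volume.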
>>>>
>>>> 2) Flatten the Frame/Stack/Task relationship out of SQL and order it by
>>>> Frame, Stack, Task. When the FrameId changes, create a new Frame document,
>>>> and as you loop through, append the Stacks and Tasks to their
>>>> corresponding collections. You'll need the same "StackId changed" logic
>>>> for Stacks as well, but since the data is ordered it should construct
>>>> everything correctly.
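The change-detection loop in option 2 can be sketched roughly like this (in Python for brevity; the column names `FrameId`/`StackId`/`TaskId` and the document shape are assumptions about the flattened query):

```python
def group_frames(rows):
    """Fold flattened rows, ordered by FrameId, StackId, TaskId,
    into nested frame documents."""
    frames = []
    frame = stack = None
    for row in rows:
        # A new FrameId starts a new frame document.
        if frame is None or frame["id"] != row["FrameId"]:
            frame = {"id": row["FrameId"], "stacks": []}
            frames.append(frame)
            stack = None  # force a new stack for the new frame
        # Same change-detection logic one level down, for StackId.
        if stack is None or stack["id"] != row["StackId"]:
            stack = {"id": row["StackId"], "tasks": []}
            frame["stacks"].append(stack)
        stack["tasks"].append({"id": row["TaskId"]})
    return frames
```

Because the input is ordered, a single pass suffices; the same pattern extends one more level for the Task/Answer relationship.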
>>>>
>>>> Nathan Palmer
>>>>
>>>>
>>>> On Thu, May 9, 2013 at 4:04 PM, TJ Roche <[email protected]> wrote:
>>>>
>>>>> Hello all,
>>>>> I'm using Rhino.Etl to handle a migration from SQL Server into MongoDB.
>>>>> I have a fairly complicated document that I am trying to pull out of
>>>>> several tables in SQL and put into Mongo.  I am getting hung up on the
>>>>> individual processing of the items in the pipelines.
>>>>>
>>>>> The structure looks similar to this:
>>>>>
>>>>> a Frame has a collection of Stacks,
>>>>> a Stack has a collection of Tasks,
>>>>> a Task has a collection of Answers.
>>>>>
>>>>> The whole Frame document will be inserted into Mongo with its
>>>>> sub-collections.
>>>>>
>>>>> Do I need to create an abstract operation to fetch the data and just
>>>>> yield the entire collection?  Do I use a Join operation?  How about a
>>>>> Nested Loops Join (I am not really 100% sure what this is used for)?
>>>>>
>>>>>
>>>>> The insertion into Mongo is actually fairly seamless for most of the
>>>>> standard pieces, but I am struggling a bit here.
>>>>>
>>>>>
>>>>> Any help would be greatly appreciated.
>>>>> Thanks
>>>>>
>>>>> --
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "Rhino Tools Dev" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>>> an email to rhino-tools-d...@googlegroups.com.
>>>>> To post to this group, send email to rhino-t...@googlegroups.com.
>>>>>
>>>>> Visit this group at http://groups.google.com/group/rhino-tools-dev?hl=en.
>>>>> For more options, visit https://groups.google.com/groups/opt_out.
>>>>>
>>>>>
>>>>
>>>>
