This was taken off the list, but there was a requirement for a branching operation that streams data correctly. The mail thread is below. There is a pending pull request on the main repo.
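The thread below concerns Rhino ETL, a C# library built around IEnumerable pipelines. As a minimal, language-agnostic sketch of what "a branching operation that streams data correctly" means (Python here only for brevity; names are hypothetical, not the library's API), each row is handed, as a copy, to every operation of every branch before the next row is pulled, so memory use stays flat regardless of row count:

```python
def stream_to_branches(rows, branches):
    """Execute every operation of every branch on a copy of each row
    before pulling the next row, so rows are never accumulated."""
    for row in rows:                  # pull one row at a time
        for operations in branches:   # fan the row out to each branch
            for operation in operations:
                operation(dict(row))  # pass a copy so branches can't interfere
```

Contrast this with a branching operation that hands the whole input enumerable to each branch in turn: the first branch then forces the source to be materialized so the remaining branches can replay it, which is the memory blow-up described in the thread.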
On Fri, Jan 21, 2011 at 07:56, Simone Busoli <[email protected]> wrote:

Cool, great to hear that!

On Fri, Jan 21, 2011 at 06:42, Shannon Marsh <[email protected]> wrote:

Simone,

We finally got to implement the MultiThreadedBranchingOperation in our ETL process and it works very well.

BEFORE:
2011-01-21 04:04:19,410 [1] INFO WA.LS.Migration.ETL.Runner [(null)] - Process Name PartyRegisterETLProcess. Current Memory Usage: 1495 MB.
2011-01-21 04:04:19,410 [1] INFO WA.LS.Migration.ETL.Runner [(null)] - Process Name PartyRegisterETLProcess. Peak Memory Usage: 1691 MB.

AFTER:
2011-01-20 18:54:30,500 [10] INFO WA.LS.Migration.ETL.Runner [(null)] - Process Name PartyRegisterETLProcess. Current Memory Usage: 48 MB.
2011-01-20 18:54:30,500 [10] INFO WA.LS.Migration.ETL.Runner [(null)] - Process Name PartyRegisterETLProcess. Peak Memory Usage: 59 MB.

Thanks again,

Shannon

On Mon, Jan 3, 2011 at 7:25 PM, Simone Busoli <[email protected]> wrote:

Hi Shannon, in the meanwhile I improved it a bit by removing the need for using the threaded pipeline, and also noticed that recently a non-caching single-threaded pipeline has been added, so you can use either of the two. I also forwarded a pull request to the main repository.

Regards, Simone

On Sun, Jan 2, 2011 at 23:35, Shannon Marsh <[email protected]> wrote:

Hi Simone,

Thanks, that seems like a more robust solution. I agree that there could be a problem if there were too many child operations, but in our case our ETL process is fairly simple; the problem was just the number of records we were dealing with. I guess it comes down to using the right tool for the job, and this solution gives us another tool to choose from. Can't wait to try out this solution when I'm back in the office next week.
Regards,

Shannon

From: Simone Busoli [mailto:[email protected]]
Sent: Sunday, 2 January 2011 10:51 AM
To: Shannon Marsh
Subject: Re: Rhino ETL: BranchingOperation does not stream. What else does not?

Hi Shannon, I pushed a change which includes a new operation that optimizes memory consumption in branching scenarios, although it performs worse than the existing one in terms of duration, due to thread synchronization (which, by the way, shouldn't be a big problem as long as you don't branch to too many child operations). It's called MultiThreadedBranchingOperation. Take into account that it needs the multi-threaded pipeline runner, because the single-threaded one relies on the caching enumerable, which caches all the input rows and exhibits the same issue you have described.

My branch on GitHub is here: https://github.com/simoneb/rhino-etl

On Wed, Dec 29, 2010 at 12:43, Simone Busoli <[email protected]> wrote:

Hi Shannon, thanks for the update. Unfortunately your solution won't work correctly in the general case: as you stated, the chunking you have implemented implies executing the operations more than once, which is not the desired behavior. I will look into it today to find out if there is a more general solution to the problem.

Simone

On Tue, Dec 28, 2010 at 00:49, Shannon Marsh <[email protected]> wrote:

Hi Simone,

I did manage to make some progress with this; however, I am on holiday at the moment, so I haven't been into the office to test whether my solution works on a large scale.
I modified the BranchingOperation code to "chunk" the data coming through the pipeline, and was able to work around the issue with the operations being called multiple times by adding a line to the SqlBulkInsertOperation to check the dictionary before adding the key/value pair in the "CreateInputSchema" method. See attached files for changes.

These changes seem to make the Fibonacci branching performance test pass when setting the number of rows to over a million. If I monitor the memory usage while the test is running, it seems to peak at a much lower amount.

I will be back in the office on 10th January, so I will be able to test it with our ETL application then. Rather than actually modify the Rhino ETL source as I have done in testing, I was planning on just extending these operation classes and overriding the required methods. Assuming that it all works, I would also look at making the "chunk" size configurable.

Regards,

Shannon

From: Simone Busoli [mailto:[email protected]]
Sent: Monday, 27 December 2010 5:53 PM
To: Shannon Marsh
Subject: Re: Rhino ETL: BranchingOperation does not stream. What else does not?

Hi Shannon, any news about this issue?

On Mon, Dec 13, 2010 at 21:55, Simone Busoli <[email protected]> wrote:

Sure, keep us informed of the progress. In any case, in the next few days I might find some time to look into it too.

On Sun, Dec 12, 2010 at 22:45, Shannon Marsh <[email protected]> wrote:

Thanks,

I was looking at the code to see if I could batch the rows (chunking). Mixed results so far. I seem to have problems re-iterating through the operations.
The first batch of rows goes through perfectly, but the second batch fails when trying to call the same instance of the operation again, failing in the PrepareMapping method on SqlBulkInsertOperation ("...key has already been added", etc.). I'll continue to investigate and get back to you when I find a solution. I am thinking I may need to clone the operations for each batch. In the meantime, the memory upgrade to our server should get us out of trouble.

Regards,

Shannon

On Mon, Dec 13, 2010 at 3:00 AM, Simone Busoli <[email protected]> wrote:

Hi Shannon, I see your point. The tricky part here is that we need to provide the same set of rows to each operation without iterating through them more than once. I didn't try that; maybe branching them in batches is doable. Have you looked at the code?

On Sun, Dec 12, 2010 at 01:05, Shannon Marsh <[email protected]> wrote:

Hi Simone,

Reading the entire post, zvolkov talks about the problem of a "file is so huge as to not fit into memory" and wanting "pulling and pushing one record at a time but never trying to accumulate all records in memory". That is what sparked my interest, as it sounded exactly like the problem we were experiencing.

Later, zvolkov says "maybe cache only a few rows but not all", and webpaul says "If you make an IEnumerable that copies the rows one at a time and feed that to the operation.Execute".

So my understanding of how the fix would work was that it would take a copy of the row and serve it out to each branch consuming the row, then repeat this for each row in the pipeline. Something like this:

Branch 1 – Operation 1 – Execute on Row 1.
Branch 1 – Operation 2 – Execute on Row 1.
Branch 1 – Operation n – Execute on Row 1.
Branch 2 – Operation 1 – Execute on Row 1.
Branch 2 – Operation 2 – Execute on Row 1.
Branch 2 – Operation n – Execute on Row 1.

Branch n – Operation 1 – Execute on Row 1.
Branch n – Operation 2 – Execute on Row 1.
Branch n – Operation n – Execute on Row 1.

Branch 1 – Operation 1 – Execute on Row 2.
Branch 1 – Operation 2 – Execute on Row 2.
Branch 1 – Operation n – Execute on Row 2.

Branch 2 – Operation 1 – Execute on Row 2.
Branch 2 – Operation 2 – Execute on Row 2.
Branch 2 – Operation n – Execute on Row 2.

Branch n – Operation 1 – Execute on Row 2.
Branch n – Operation 2 – Execute on Row 2.
Branch n – Operation n – Execute on Row 2.

…etc., etc., for each row.

With that approach I don't think you would ever need to accumulate rows in memory. To be honest, though, I haven't considered the technicalities of implementing this and whether it is achievable with the IEnumerable model. This was just how I imagined it would work after reading this post.

What seems to happen is that streaming does occur, but only for the first branch. The second and subsequent branches have to wait until the first branch has consumed all the rows before they start, meaning that all the rows need to be cached in RAM during the first branch to be available for the later branches. Like this:

Branch 1 – Operation 1 – Execute on Row 1.
Branch 1 – Operation 2 – Execute on Row 1.
Branch 1 – Operation n – Execute on Row 1.

Branch 1 – Operation 1 – Execute on Row 2.
Branch 1 – Operation 2 – Execute on Row 2.
Branch 1 – Operation n – Execute on Row 2.
(When branch 1 is complete, all rows are cached in RAM.)

Branch 2 – Operation 1 – Execute on Row 1.
Branch 2 – Operation 2 – Execute on Row 1.
Branch 2 – Operation n – Execute on Row 1.

Branch 2 – Operation 1 – Execute on Row 2.
Branch 2 – Operation 2 – Execute on Row 2.
Branch 2 – Operation n – Execute on Row 2.

Branch n – Operation 1 – Execute on Row 1.
Branch n – Operation 2 – Execute on Row 1.
Branch n – Operation n – Execute on Row 1.

Branch n – Operation 1 – Execute on Row 2.
Branch n – Operation 2 – Execute on Row 2.
Branch n – Operation n – Execute on Row 2.

So again, I may have completely misunderstood how it should work, or could we be doing something wrong that causes all rows to cache in memory?

Thanks again,

Shannon

From: Simone B. [mailto:[email protected]]
Sent: Friday, 10 December 2010 11:40 AM
To: ShannonM
Subject: Re: Rhino ETL: BranchingOperation does not stream. What else does not?

Hi Shannon, can you explain what behavior you would expect?

On Fri, Dec 10, 2010 at 01:19, ShannonM <[email protected]> wrote:

Hello,

I realise this is an old post, but we seem to be experiencing a similar issue with memory usage. Our ETL process works with approximately 1 million records per source table. For a standard straight-through process there are no problems: the rows just stream through and load to the database, and memory peaks at approx 110 MB. However, wherever we use a BranchingOperation in our process, the rows accumulate in memory while executing the first branch (presumably so they are available for the remaining branches).
The subsequent branches do not execute until the first one has completed.

The problem is that this usually consumes approx 1.6 GB of memory and, depending on other processes running on our server, can sometimes cause a memory exception: "System.AccessViolationException: Attempted to read or write protected memory. This is often an indication that other memory is corrupt." We can work around this issue by restarting services (e.g. SQL Server) or rebooting the server prior to running the ETL process, to ensure there are no rogue processes hogging memory. We are also considering moving to a 64-bit OS and adding more RAM.

I investigated the Fibonacci branching tests in the Rhino ETL source and it seems to behave exactly as I described. If I debug the test named CanBranchThePipelineEfficiently() I can actually duplicate the scenario, and you can see that all the rows are cached in memory after the first branch executes.

Your post seems to indicate that you fixed this issue by using the caching enumerable. Am I misunderstanding your post, or could we be doing something wrong?

While we will probably go ahead with the server upgrade anyway, it would be nice to make our ETL process more efficient and not consume so much memory if it can be avoided.

On Jul 5 2009, 9:22 pm, Simone Busoli <[email protected]> wrote:

Fixed. What it's doing now is wrap the input enumerable into a caching enumerable and then feed a clone of each row into the operations making up the branch.

On Thu, Jul 2, 2009 at 16:53, webpaul <[email protected]> wrote:

Looking at the code, I think the only reason it isn't yield returning right now is so it can copy the rows.
If you make an IEnumerable that copies the rows one at a time and feed that to the operation.Execute, I think that is all that is needed.

On Jul 2, 7:17 am, zvolkov <[email protected]> wrote:

Uhh... maybe cache only a few rows but not all? Assuming I branch in 2, at worst I will need to cache as many rows as there is the disparity between the two consumers of my two output streams... Makes sense?

--
You received this message because you are subscribed to the Google Groups "Rhino Tools Dev" group. To post to this group, send email to [email protected]. To unsubscribe from this group, send email to [email protected]. For more options, visit this group at http://groups.google.com/group/rhino-tools-dev?hl=en.
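As a closing note: the bounded-buffer idea zvolkov hints at ("cache only a few rows but not all"), which is essentially what a multi-threaded branching operation running on a threaded pipeline achieves, can be sketched as follows. This is a Python illustration with hypothetical names, not the library's actual implementation (Rhino ETL is C#): each branch consumes rows from its own small bounded queue on its own thread, so the producer can never run more than a few rows ahead of the slowest consumer.

```python
import queue
import threading

def branch_rows(source, branches, buffer_size=10):
    """Fan rows out to each branch through a bounded per-branch queue;
    at most buffer_size rows per branch are in memory at any moment."""
    queues = [queue.Queue(maxsize=buffer_size) for _ in branches]
    DONE = object()  # sentinel marking the end of the stream

    def consume(q, operation):
        while True:
            row = q.get()
            if row is DONE:
                break
            operation(row)

    threads = [threading.Thread(target=consume, args=(q, op))
               for q, op in zip(queues, branches)]
    for t in threads:
        t.start()
    for row in source:          # pull rows one at a time from the source
        for q in queues:
            q.put(dict(row))    # blocks when a branch falls too far behind
    for q in queues:
        q.put(DONE)
    for t in threads:
        t.join()
```

The blocking put is what bounds memory: a slow branch applies back-pressure to the producer instead of forcing the whole row set to be cached, at the cost of the thread-synchronization overhead Simone mentions earlier in the thread.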
