This has been reviewed and pulled into the main repo. You'll find the changes on build 24 here: http://builds.hibernatingrhinos.com/builds/Rhino-ETL
Nathan

On Fri, Jan 21, 2011 at 5:51 AM, Nathan Palmer <[email protected]> wrote:

> I did see the pull request. I'm out of town at the moment and probably
> won't get a chance to review it until the first of next week.
>
> Nathan Palmer
>
> Sent from my Phone
>
> On Jan 21, 2011, at 2:01 AM, Simone Busoli <[email protected]> wrote:
>
> It was taken off the list, but there was a requirement for a branching
> operation which streams data correctly. The mail thread is below. There is
> a pending pull request on the main repo.
>
> On Fri, Jan 21, 2011 at 07:56, Simone Busoli <[email protected]> wrote:
>
>> Cool, great to hear that!
>>
>> On Fri, Jan 21, 2011 at 06:42, Shannon Marsh <[email protected]> wrote:
>>
>>> Simone,
>>>
>>> We finally got to implement the MultiThreadedBranchingOperation in our
>>> ETL process and it works very well.
>>>
>>> BEFORE:
>>> 2011-01-21 04:04:19,410 [1] INFO WA.LS.Migration.ETL.Runner [(null)] -
>>> Process Name PartyRegisterETLProcess. Current Memory Usage: 1495 MB.
>>> 2011-01-21 04:04:19,410 [1] INFO WA.LS.Migration.ETL.Runner [(null)] -
>>> Process Name PartyRegisterETLProcess. Peak Memory Usage: 1691 MB.
>>>
>>> AFTER:
>>> 2011-01-20 18:54:30,500 [10] INFO WA.LS.Migration.ETL.Runner [(null)] -
>>> Process Name PartyRegisterETLProcess. Current Memory Usage: 48 MB.
>>> 2011-01-20 18:54:30,500 [10] INFO WA.LS.Migration.ETL.Runner [(null)] -
>>> Process Name PartyRegisterETLProcess. Peak Memory Usage: 59 MB.
>>>
>>> Thanks again,
>>>
>>> Shannon
>>>
>>> On Mon, Jan 3, 2011 at 7:25 PM, Simone Busoli <[email protected]> wrote:
>>>
>>>> Hi Shannon, in the meanwhile I improved it a bit by removing the need
>>>> for using the threaded pipeline, and I also noticed that a non-caching
>>>> single-threaded pipeline has recently been added, so you can use either
>>>> of the two. I also forwarded a pull request to the main repository.
>>>>
>>>> Regards, Simone
>>>>
>>>> On Sun, Jan 2, 2011 at 23:35, Shannon Marsh <[email protected]> wrote:
>>>>
>>>>> Hi Simone,
>>>>>
>>>>> Thanks, that seems like a more robust solution. I agree that there
>>>>> could be a problem if there were too many child operations, but in our
>>>>> case the ETL process is fairly simple; the problem was just the number
>>>>> of records we were dealing with. I guess it comes down to using the
>>>>> right tool for the job, and this solution gives us another tool to
>>>>> choose from. Can't wait to try it out when I'm back in the office next
>>>>> week.
>>>>>
>>>>> Regards,
>>>>>
>>>>> Shannon
>>>>>
>>>>> *From:* Simone Busoli [mailto:[email protected]]
>>>>> *Sent:* Sunday, 2 January 2011 10:51 AM
>>>>> *To:* Shannon Marsh
>>>>> *Subject:* Re: Rhino ETL: BranchingOperation does not stream. What
>>>>> else does not?
>>>>>
>>>>> Hi Shannon, I pushed a change which includes a new operation that
>>>>> optimizes memory consumption in branching scenarios, though it performs
>>>>> worse than the existing one in terms of duration because of thread
>>>>> synchronization (which shouldn't be a big problem as long as you don't
>>>>> branch to too many child operations). It's called
>>>>> MultiThreadedBranchingOperation. Take into account that it needs the
>>>>> multi-threaded pipeline runner, because the single-threaded one relies
>>>>> on the caching enumerable, which caches all the input rows and exhibits
>>>>> the same issue you described.
>>>>> My branch on github is here: <https://github.com/simoneb/rhino-etl>.
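The idea behind a multi-threaded branching operation as described above can
be sketched language-agnostically. Rhino ETL itself is C# and the real
implementation lives in Simone's branch; the Python below is only an
illustration of the concept — one consumer thread per branch, each fed a
copy of every row through a bounded queue, so rows flow through instead of
being cached (all names here are illustrative, not the Rhino ETL API):

```python
import queue
import threading

def multithreaded_branch(rows, branches, max_queue=100):
    """Illustrative sketch: feed a copy of each row to every branch
    concurrently via bounded queues, so no branch forces the whole
    input to be accumulated in memory."""
    SENTINEL = object()  # marks end of input for each consumer
    queues = [queue.Queue(maxsize=max_queue) for _ in branches]

    def consume(branch, q):
        # Each branch runs in its own thread, pulling rows as they arrive.
        while True:
            row = q.get()
            if row is SENTINEL:
                break
            branch(row)

    threads = [threading.Thread(target=consume, args=(b, q))
               for b, q in zip(branches, queues)]
    for t in threads:
        t.start()

    # Single pass over the source: each row is copied once per branch.
    for row in rows:
        for q in queues:
            q.put(dict(row))
    for q in queues:
        q.put(SENTINEL)
    for t in threads:
        t.join()
```

The bounded queues provide back-pressure: if one branch is slower, the
feeder blocks rather than letting rows pile up, which is consistent with the
flat peak-memory numbers Shannon reports above.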
>>>>>
>>>>> On Wed, Dec 29, 2010 at 12:43, Simone Busoli <[email protected]> wrote:
>>>>>
>>>>> Hi Shannon, thanks for the update. Unfortunately your solution won't
>>>>> work correctly in the general case: as you stated, the chunking you
>>>>> implemented implies executing the operations more than once, which is
>>>>> not the desired behavior. I will look into it today to find out whether
>>>>> there is a more general solution to the problem.
>>>>>
>>>>> Simone
>>>>>
>>>>> On Tue, Dec 28, 2010 at 00:49, Shannon Marsh <[email protected]> wrote:
>>>>>
>>>>> Hi Simone,
>>>>>
>>>>> I did manage to make some progress with this, however I am on holiday
>>>>> at the moment, so I haven't been into the office to test whether my
>>>>> solution works on a large scale.
>>>>>
>>>>> I modified the BranchingOperation code to "chunk" the data coming
>>>>> through the pipeline, and was able to work around the issue with the
>>>>> operations being called multiple times by adding a line to
>>>>> SqlBulkInsertOperation to check the dictionary before adding the
>>>>> key/value pair in the "CreateInputSchema" method. See attached files
>>>>> for changes.
>>>>>
>>>>> These changes seem to make the Fibonacci branching performance test
>>>>> pass when setting the number of rows to over a million. If I monitor
>>>>> the memory usage while the test is running, it seems to peak at a much
>>>>> lower amount.
>>>>>
>>>>> I will be back in the office on 10th January, so I will be able to
>>>>> test it with our ETL application then. Rather than actually modifying
>>>>> the Rhino ETL source as I have done in testing, I was planning on just
>>>>> extending these operation classes and overriding the required methods.
>>>>> Assuming it all works, I would also look at making the "chunk" size
>>>>> configurable.
>>>>>
>>>>> Regards,
>>>>>
>>>>> Shannon
>>>>>
>>>>> *From:* Simone Busoli [mailto:[email protected]]
>>>>> *Sent:* Monday, 27 December 2010 5:53 PM
>>>>> *To:* Shannon Marsh
>>>>> *Subject:* Re: Rhino ETL: BranchingOperation does not stream. What
>>>>> else does not?
>>>>>
>>>>> Hi Shannon, any news about this issue?
>>>>>
>>>>> On Mon, Dec 13, 2010 at 21:55, Simone Busoli <[email protected]> wrote:
>>>>>
>>>>> Sure, keep us informed of the progress. In any case, in the next few
>>>>> days I might find some time to look into it too.
>>>>>
>>>>> On Sun, Dec 12, 2010 at 22:45, Shannon Marsh <[email protected]> wrote:
>>>>>
>>>>> Thanks,
>>>>>
>>>>> I was looking at the code to see if I could batch the rows (chunking).
>>>>> Mixed results so far. I seem to have problems re-iterating through the
>>>>> operations. The first batch of rows goes through perfectly, but the
>>>>> second batch fails when trying to call the same instance of the
>>>>> operation again, failing in the PrepareMapping method on
>>>>> SqlBulkInsertOperation ("...key has already been added, etc."). I'll
>>>>> continue to investigate and get back to you when I find a solution. I
>>>>> am thinking I may need to clone the operations for each batch. In the
>>>>> meantime the memory upgrade to our server should get us out of trouble.
>>>>>
>>>>> Regards,
>>>>>
>>>>> Shannon
>>>>>
>>>>> On Mon, Dec 13, 2010 at 3:00 AM, Simone Busoli <[email protected]> wrote:
>>>>>
>>>>> Hi Shannon, I see your point. The tricky part here is that we need to
>>>>> provide the same set of rows to each operation without iterating
>>>>> through them more than once. I haven't tried that; maybe branching them
>>>>> in batches is doable. Have you looked at the code?
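The batching idea being discussed — and the workaround Shannon describes
above — amounts to splitting the input into fixed-size chunks and re-running
the branch operations once per chunk. A rough Python sketch of that idea
(function names are illustrative, not the Rhino ETL API), including the
caveat from the thread that operations get executed more than once:

```python
from itertools import islice

def chunked(rows, size):
    """Split an iterable of rows into lists of at most `size` rows,
    so a branching step only ever holds one chunk in memory."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

def branch_in_chunks(rows, branches, size=1000):
    """Feed each chunk to every branch in turn. Caveat from the thread:
    each branch is invoked once per chunk, which is why stateful steps
    (e.g. schema setup in SqlBulkInsertOperation) must be made
    idempotent, or fail with 'key has already been added'."""
    for chunk in chunked(rows, size):
        for branch in branches:
            branch(list(chunk))  # each branch gets its own copy
```

This bounds peak memory at roughly one chunk per branch, at the cost of
re-entering each operation for every chunk — the exact general-case problem
Simone points out in the reply above.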
>>>>>
>>>>> On Sun, Dec 12, 2010 at 01:05, Shannon Marsh <[email protected]> wrote:
>>>>>
>>>>> Hi Simone,
>>>>>
>>>>> Reading the entire post, zvolkov talks about the problem of "file is
>>>>> so huge as to not fit into memory" and wanting "pulling and pushing one
>>>>> record at a time but never trying to accumulate all records in memory".
>>>>> That is what sparked my interest, as it sounded exactly like the
>>>>> problem we were experiencing.
>>>>>
>>>>> Later zvolkov says "maybe cache only a few rows but not all", and
>>>>> webpaul says "If you make an IEnumerable that copies the rows one at a
>>>>> time and feed that to the operation.Execute".
>>>>>
>>>>> So my understanding of how the fix would work was that it would take
>>>>> a copy of the row and serve it out to each branch consuming the row,
>>>>> then repeat this for each row in the pipeline. Something like this:
>>>>>
>>>>> Branch 1 – Operation 1 – Execute on Row 1.
>>>>> Branch 1 – Operation 2 – Execute on Row 1.
>>>>> Branch 1 – Operation n – Execute on Row 1.
>>>>>
>>>>> Branch 2 – Operation 1 – Execute on Row 1.
>>>>> Branch 2 – Operation 2 – Execute on Row 1.
>>>>> Branch 2 – Operation n – Execute on Row 1.
>>>>>
>>>>> Branch n – Operation 1 – Execute on Row 1.
>>>>> Branch n – Operation 2 – Execute on Row 1.
>>>>> Branch n – Operation n – Execute on Row 1.
>>>>>
>>>>> Branch 1 – Operation 1 – Execute on Row 2.
>>>>> Branch 1 – Operation 2 – Execute on Row 2.
>>>>> Branch 1 – Operation n – Execute on Row 2.
>>>>>
>>>>> Branch 2 – Operation 1 – Execute on Row 2.
>>>>> Branch 2 – Operation 2 – Execute on Row 2.
>>>>> Branch 2 – Operation n – Execute on Row 2.
>>>>>
>>>>> Branch n – Operation 1 – Execute on Row 2.
>>>>> Branch n – Operation 2 – Execute on Row 2.
>>>>> Branch n – Operation n – Execute on Row 2.
>>>>>
>>>>> ...etc., for each row.
>>>>>
>>>>> With that approach, I don't think you would ever need to accumulate
>>>>> rows in memory. To be honest, though, I haven't considered the
>>>>> technicalities of implementing this and whether it is achievable with
>>>>> the IEnumerable model; this was just how I imagined it would work after
>>>>> reading the post.
>>>>>
>>>>> What seems to happen is that streaming does occur, but only for the
>>>>> first branch. The second and subsequent branches have to wait until
>>>>> the first branch has consumed all the rows before they start, meaning
>>>>> that all the rows need to be cached in RAM during the first branch to
>>>>> be available for the later branches. Like this:
>>>>>
>>>>> Branch 1 – Operation 1 – Execute on Row 1.
>>>>> Branch 1 – Operation 2 – Execute on Row 1.
>>>>> Branch 1 – Operation n – Execute on Row 1.
>>>>>
>>>>> Branch 1 – Operation 1 – Execute on Row 2.
>>>>> Branch 1 – Operation 2 – Execute on Row 2.
>>>>> Branch 1 – Operation n – Execute on Row 2.
>>>>>
>>>>> (When branch 1 is complete - all rows cached in RAM)
>>>>>
>>>>> Branch 2 – Operation 1 – Execute on Row 1.
>>>>> Branch 2 – Operation 2 – Execute on Row 1.
>>>>> Branch 2 – Operation n – Execute on Row 1.
>>>>>
>>>>> Branch 2 – Operation 1 – Execute on Row 2.
>>>>> Branch 2 – Operation 2 – Execute on Row 2.
>>>>> Branch 2 – Operation n – Execute on Row 2.
>>>>>
>>>>> Branch n – Operation 1 – Execute on Row 1.
>>>>> Branch n – Operation 2 – Execute on Row 1.
>>>>> Branch n – Operation n – Execute on Row 1.
>>>>>
>>>>> Branch n – Operation 1 – Execute on Row 2.
>>>>> Branch n – Operation 2 – Execute on Row 2.
>>>>> Branch n – Operation n – Execute on Row 2.
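The first of the two execution orders Shannon lays out — hand a copy of each
row to every branch before pulling the next row — can be pinned down with a
small sketch. This is illustrative Python, not the actual IEnumerable-based
Rhino ETL implementation; branches are modeled as lists of callables:

```python
def branch_per_row(rows, branches):
    """Row-at-a-time branching: every branch sees a copy of row 1
    before anyone sees row 2, so at most one source row is alive at
    a time (plus its per-branch copies)."""
    for row in rows:
        for branch in branches:
            for operation in branch:
                operation(dict(row))  # each operation gets its own copy
```

Under this ordering nothing forces the source rows to accumulate, which is
why memory stays flat; the branch-at-a-time ordering below, by contrast,
implies caching the full input for the later branches.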
>>>>>
>>>>> So again, I may have completely misunderstood how it should work? Or
>>>>> could we be doing something wrong that causes all rows to be cached in
>>>>> memory?
>>>>>
>>>>> Thanks again,
>>>>>
>>>>> Shannon
>>>>>
>>>>> *From:* Simone B. [mailto:[email protected]]
>>>>> *Sent:* Friday, 10 December 2010 11:40 AM
>>>>> *To:* ShannonM
>>>>> *Subject:* Re: Rhino ETL: BranchingOperation does not stream. What
>>>>> else does not?
>>>>>
>>>>> Hi Shannon, can you explain what behavior you would expect?
>>>>>
>>>>> On Fri, Dec 10, 2010 at 01:19, ShannonM <[email protected]> wrote:
>>>>>
>>>>> Hello,
>>>>>
>>>>> I realise this is an old post, but we seem to be experiencing a
>>>>> similar issue with memory usage. Our ETL process works with
>>>>> approximately 1 million records per source table. For a standard
>>>>> straight-through process there are no problems: the rows just stream
>>>>> through and load into the database, and memory peaks at approximately
>>>>> 110 MB. However, wherever we use a BranchingOperation in our process,
>>>>> the rows accumulate in memory while executing the first branch
>>>>> (presumably so they are available for the remaining branches), and the
>>>>> subsequent branches do not execute until the first one has completed.
>>>>>
>>>>> The problem is that this usually consumes approximately 1.6 GB of
>>>>> memory and, depending on other processes running on our server, can
>>>>> sometimes cause a memory exception: "System.AccessViolationException:
>>>>> Attempted to read or write protected memory. This is often an
>>>>> indication that other memory is corrupt." We can work around this
>>>>> issue by restarting services (e.g. SQL Server) or rebooting the
>>>>> server prior to running the ETL process, to ensure there are no rogue
>>>>> processes hogging memory. We are also considering moving to a 64-bit
>>>>> OS and adding more RAM.
>>>>>
>>>>> I investigated the Fibonacci branching tests in the Rhino ETL source,
>>>>> and it seems to behave exactly as I described. If I debug the test
>>>>> named CanBranchThePipelineEfficiently() I can actually reproduce the
>>>>> scenario, and you can see that all the rows are cached in memory after
>>>>> the first branch executes.
>>>>>
>>>>> Your post seems to indicate that you fixed this issue by using the
>>>>> caching enumerable. Am I misunderstanding your post, or could we be
>>>>> doing something wrong?
>>>>>
>>>>> While we will probably go ahead with the server upgrade anyway, it
>>>>> would be nice to make our ETL process more efficient and not consume
>>>>> so much memory if it can be avoided.
>>>>>
>>>>> On Jul 5 2009, 9:22 pm, Simone Busoli <[email protected]> wrote:
>>>>> > Fixed. What it's doing now is wrap the input enumerable into a
>>>>> > caching enumerable and then feed a clone of each row into the
>>>>> > operations making up the branch.
>>>>> >
>>>>> > On Thu, Jul 2, 2009 at 16:53, webpaul <[email protected]> wrote:
>>>>> >
>>>>> > > Looking at the code, I think the only reason it isn't yield
>>>>> > > returning right now is so it can copy the rows. If you make an
>>>>> > > IEnumerable that copies the rows one at a time and feed that to
>>>>> > > the operation.Execute I think that is all that is needed.
>>>>> >
>>>>> > > On Jul 2, 7:17 am, zvolkov <[email protected]> wrote:
>>>>> > > > Uhh... maybe cache only a few rows but not all? Assuming I
>>>>> > > > branch in 2, at worst I will need to cache as many rows as
>>>>> > > > there is the disparity between the two consumers of my two
>>>>> > > > output streams...
>>>>> > > > Makes sense?

--
You received this message because you are subscribed to the Google Groups
"Rhino Tools Dev" group.
To post to this group, send email to [email protected].
To unsubscribe from this group, send email to
[email protected].
For more options, visit this group at
http://groups.google.com/group/rhino-tools-dev?hl=en.
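For reference, the caching-enumerable behavior discussed throughout the
thread — wrap the input once, replay cached rows for later branches — can be
sketched as follows. This is an illustrative Python analogue of Rhino ETL's
C# caching enumerable (the class name and details here are assumptions, not
the actual source), showing both why re-iteration works and why every row
ends up held in memory:

```python
class CachingEnumerable:
    """Sketch: the first iteration pulls rows lazily from the source
    and records them, so later iterations (later branches) replay the
    cache instead of re-running the source. The cost is that the cache
    grows to hold every row -- the memory behavior Shannon observed."""

    def __init__(self, source):
        self._source = iter(source)
        self._cache = []
        self._exhausted = False

    def __iter__(self):
        i = 0
        while True:
            if i < len(self._cache):
                yield self._cache[i]      # replay an already-seen row
            elif self._exhausted:
                return
            else:
                try:
                    row = next(self._source)
                except StopIteration:
                    self._exhausted = True
                    return
                self._cache.append(row)   # remember it for later branches
                yield row
            i += 1
```

A second pass over the same instance yields the cached rows without touching
the source again, which is exactly why the first branch can stream while
subsequent branches read from RAM.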
