This was taken off the list, but there was a requirement for a branching operation that streams data correctly. The mail thread is below. There is a pending pull request on the main repo.
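The thread below concerns Rhino ETL, a C# library built around IEnumerable pipelines. As a minimal, language-agnostic sketch of what "a branching operation that streams data correctly" means (Python here only for brevity; names are hypothetical, not the library's API), each row is handed, as a copy, to every operation of every branch before the next row is pulled, so memory use stays flat regardless of row count:

```python
def stream_to_branches(rows, branches):
    """Execute every operation of every branch on a copy of each row
    before pulling the next row, so rows are never accumulated."""
    for row in rows:                  # pull one row at a time
        for operations in branches:   # fan the row out to each branch
            for operation in operations:
                operation(dict(row))  # pass a copy so branches can't interfere
```

Contrast this with a branching operation that hands the whole input enumerable to each branch in turn: the first branch then forces the source to be materialized so the remaining branches can replay it, which is the memory blow-up described in the thread.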
On Fri, Jan 21, 2011 at 07:56, Simone Busoli <[email protected]> wrote:

Cool, great to hear that!

On Fri, Jan 21, 2011 at 06:42, Shannon Marsh <[email protected]> wrote:

Simone,

We finally got to implement the MultiThreadedBranchingOperation in our ETL process and it works very well.

BEFORE:
2011-01-21 04:04:19,410 [1] INFO WA.LS.Migration.ETL.Runner [(null)] - Process Name PartyRegisterETLProcess. Current Memory Usage: 1495 MB.
2011-01-21 04:04:19,410 [1] INFO WA.LS.Migration.ETL.Runner [(null)] - Process Name PartyRegisterETLProcess. Peak Memory Usage: 1691 MB.

AFTER:
2011-01-20 18:54:30,500 [10] INFO WA.LS.Migration.ETL.Runner [(null)] - Process Name PartyRegisterETLProcess. Current Memory Usage: 48 MB.
2011-01-20 18:54:30,500 [10] INFO WA.LS.Migration.ETL.Runner [(null)] - Process Name PartyRegisterETLProcess. Peak Memory Usage: 59 MB.

Thanks again,

Shannon

On Mon, Jan 3, 2011 at 7:25 PM, Simone Busoli <[email protected]> wrote:

Hi Shannon, in the meanwhile I improved it a bit by removing the need for using the threaded pipeline, and also noticed that recently a non-caching single-threaded pipeline has been added, so you can use either of the two. I also forwarded a pull request to the main repository.

Regards, Simone

On Sun, Jan 2, 2011 at 23:35, Shannon Marsh <[email protected]> wrote:

Hi Simone,

Thanks, that seems like a more robust solution. I agree that there could be a problem if there were too many child operations, but in our case our ETL process is fairly simple; the problem was just the number of records we were dealing with. I guess it comes down to using the right tool for the job, and this solution gives us another tool to choose from. Can't wait to try out this solution when I'm back in the office next week.
Regards,

Shannon

From: Simone Busoli [mailto:[email protected]]
Sent: Sunday, 2 January 2011 10:51 AM
To: Shannon Marsh
Subject: Re: Rhino ETL: BranchingOperation does not stream. What else does not?

Hi Shannon, I pushed a change which includes a new operation that optimizes memory consumption in branching scenarios, although it performs worse than the existing one in terms of duration, due to thread synchronization (which, by the way, shouldn't be a big problem as long as you don't branch to too many child operations). It's called MultiThreadedBranchingOperation. Take into account that it needs the multi-threaded pipeline runner, because the single-threaded one relies on the caching enumerable, which caches all the input rows and exhibits the same issue you have described.

My branch on GitHub is here: https://github.com/simoneb/rhino-etl

On Wed, Dec 29, 2010 at 12:43, Simone Busoli <[email protected]> wrote:

Hi Shannon, thanks for the update. Unfortunately your solution won't work correctly in the general case: as you stated, the chunking you have implemented implies executing the operations more than once, which is not the desired behavior. I will look into it today to find out if there is a more general solution to the problem.

Simone

On Tue, Dec 28, 2010 at 00:49, Shannon Marsh <[email protected]> wrote:

Hi Simone,

I did manage to make some progress with this; however, I am on holiday at the moment, so I haven't been into the office to test whether my solution works on a large scale.
I modified the BranchingOperation code to "chunk" the data coming through the pipeline, and was able to work around the issue with the operations being called multiple times by adding a line to the SqlBulkInsertOperation to check the dictionary before adding the key/value pair in the "CreateInputSchema" method. See attached files for changes.

These changes seem to make the Fibonacci branching performance test pass when setting the number of rows to over a million. If I monitor the memory usage while the test is running, it seems to peak at a much lower amount.

I will be back in the office on 10th January, so I will be able to test it with our ETL application then. Rather than actually modify the Rhino ETL source as I have done in testing, I was planning on just extending these operation classes and overriding the required methods. Assuming that it all works, I would also look at making the "chunk" size configurable.

Regards,

Shannon

From: Simone Busoli [mailto:[email protected]]
Sent: Monday, 27 December 2010 5:53 PM
To: Shannon Marsh
Subject: Re: Rhino ETL: BranchingOperation does not stream. What else does not?

Hi Shannon, any news about this issue?

On Mon, Dec 13, 2010 at 21:55, Simone Busoli <[email protected]> wrote:

Sure, keep us informed of the progress. In any case, in the next few days I might find some time to look into it too.

On Sun, Dec 12, 2010 at 22:45, Shannon Marsh <[email protected]> wrote:

Thanks,

I was looking at the code to see if I could batch the rows (chunking). Mixed results so far. I seem to have problems re-iterating through the operations.
The first batch of rows goes through perfectly, but the second batch fails when trying to call the same instance of the operation again, failing in the PrepareMapping method on SqlBulkInsertOperation ("...key has already been added", etc.). I'll continue to investigate and get back to you when I find a solution. I am thinking I may need to clone the operations for each batch. In the meantime, the memory upgrade to our server should get us out of trouble.

Regards,

Shannon

On Mon, Dec 13, 2010 at 3:00 AM, Simone Busoli <[email protected]> wrote:

Hi Shannon, I see your point. The tricky part here is that we need to provide the same set of rows to each operation without iterating through them more than once. I didn't try that; maybe branching them in batches is doable. Have you looked at the code?

On Sun, Dec 12, 2010 at 01:05, Shannon Marsh <[email protected]> wrote:

Hi Simone,

Reading the entire post, zvolkov talks about the problem of a "file is so huge as to not fit into memory" and wanting "pulling and pushing one record at a time but never trying to accumulate all records in memory". That is what sparked my interest, as it sounded exactly like the problem we were experiencing.

Later, zvolkov says "maybe cache only a few rows but not all", and webpaul says "If you make an IEnumerable that copies the rows one at a time and feed that to the operation.Execute".

So my understanding of how the fix would work was that it would take a copy of the row and serve it out to each branch consuming the row, then repeat this for each row in the pipeline. Something like this:

Branch 1 – Operation 1 – Execute on Row 1.
Branch 1 – Operation 2 – Execute on Row 1.
Branch 1 – Operation n – Execute on Row 1.
Branch 2 – Operation 1 – Execute on Row 1.
Branch 2 – Operation 2 – Execute on Row 1.
Branch 2 – Operation n – Execute on Row 1.

Branch n – Operation 1 – Execute on Row 1.
Branch n – Operation 2 – Execute on Row 1.
Branch n – Operation n – Execute on Row 1.

Branch 1 – Operation 1 – Execute on Row 2.
Branch 1 – Operation 2 – Execute on Row 2.
Branch 1 – Operation n – Execute on Row 2.

Branch 2 – Operation 1 – Execute on Row 2.
Branch 2 – Operation 2 – Execute on Row 2.
Branch 2 – Operation n – Execute on Row 2.

Branch n – Operation 1 – Execute on Row 2.
Branch n – Operation 2 – Execute on Row 2.
Branch n – Operation n – Execute on Row 2.

…etc., etc., for each row.

With that approach I don't think you would ever need to accumulate rows in memory. To be honest, though, I haven't considered the technicalities of implementing this and whether it is achievable with the IEnumerable model. This was just how I imagined it would work after reading this post.

What seems to happen is that streaming does occur, but only for the first branch. The second and subsequent branches have to wait until the first branch has consumed all the rows before they start, meaning that all the rows need to be cached in RAM during the first branch to be available for the later branches. Like this:

Branch 1 – Operation 1 – Execute on Row 1.
Branch 1 – Operation 2 – Execute on Row 1.
Branch 1 – Operation n – Execute on Row 1.

Branch 1 – Operation 1 – Execute on Row 2.
Branch 1 – Operation 2 – Execute on Row 2.
Branch 1 – Operation n – Execute on Row 2.
(When branch 1 is complete, all rows are cached in RAM.)

Branch 2 – Operation 1 – Execute on Row 1.
Branch 2 – Operation 2 – Execute on Row 1.
Branch 2 – Operation n – Execute on Row 1.

Branch 2 – Operation 1 – Execute on Row 2.
Branch 2 – Operation 2 – Execute on Row 2.
Branch 2 – Operation n – Execute on Row 2.

Branch n – Operation 1 – Execute on Row 1.
Branch n – Operation 2 – Execute on Row 1.
Branch n – Operation n – Execute on Row 1.

Branch n – Operation 1 – Execute on Row 2.
Branch n – Operation 2 – Execute on Row 2.
Branch n – Operation n – Execute on Row 2.

So again, I may have completely misunderstood how it should work, or could we be doing something wrong that causes all rows to cache in memory?

Thanks again,

Shannon

From: Simone B. [mailto:[email protected]]
Sent: Friday, 10 December 2010 11:40 AM
To: ShannonM
Subject: Re: Rhino ETL: BranchingOperation does not stream. What else does not?

Hi Shannon, can you explain what behavior you would expect?

On Fri, Dec 10, 2010 at 01:19, ShannonM <[email protected]> wrote:

Hello,

I realise this is an old post, but we seem to be experiencing a similar issue with memory usage. Our ETL process works with approximately 1 million records per source table. For a standard straight-through process there are no problems: the rows just stream through and load to the database, and memory peaks at approx 110 MB. However, wherever we use a BranchingOperation in our process, the rows accumulate in memory while executing the first branch (presumably so they are available for the remaining branches).
The subsequent branches do not execute until the first one has completed.

The problem is that this usually consumes approx 1.6 GB of memory and, depending on other processes running on our server, can sometimes cause a memory exception: "System.AccessViolationException: Attempted to read or write protected memory. This is often an indication that other memory is corrupt." We can work around this issue by restarting services (e.g. SQL Server) or rebooting the server prior to running the ETL process, to ensure there are no rogue processes hogging memory. We are also considering moving to a 64-bit OS and adding more RAM.

I investigated the Fibonacci branching tests in the Rhino ETL source and it seems to behave exactly as I described. If I debug the test named CanBranchThePipelineEfficiently() I can actually duplicate the scenario, and you can see that all the rows are cached in memory after the first branch executes.

Your post seems to indicate that you fixed this issue by using the caching enumerable. Am I misunderstanding your post, or could we be doing something wrong?

While we will probably go ahead with the server upgrade anyway, it would be nice to make our ETL process more efficient and not consume so much memory if it can be avoided.

On Jul 5 2009, 9:22 pm, Simone Busoli <[email protected]> wrote:

Fixed. What it's doing now is wrap the input enumerable into a caching enumerable and then feed a clone of each row into the operations making up the branch.

On Thu, Jul 2, 2009 at 16:53, webpaul <[email protected]> wrote:

Looking at the code, I think the only reason it isn't yield returning right now is so it can copy the rows.
If you make an IEnumerable that copies the rows one at a time and feed that to the operation.Execute, I think that is all that is needed.

On Jul 2, 7:17 am, zvolkov <[email protected]> wrote:

Uhh... maybe cache only a few rows but not all? Assuming I branch in 2, at worst I will need to cache as many rows as there is the disparity between the two consumers of my two output streams... Makes sense?

--
You received this message because you are subscribed to the Google Groups "Rhino Tools Dev" group. To post to this group, send email to [email protected]. To unsubscribe from this group, send email to [email protected]. For more options, visit this group at http://groups.google.com/group/rhino-tools-dev?hl=en.
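As a closing note: the bounded-buffer idea zvolkov hints at ("cache only a few rows but not all"), which is essentially what a multi-threaded branching operation running on a threaded pipeline achieves, can be sketched as follows. This is a Python illustration with hypothetical names, not the library's actual implementation (Rhino ETL is C#): each branch consumes rows from its own small bounded queue on its own thread, so the producer can never run more than a few rows ahead of the slowest consumer.

```python
import queue
import threading

def branch_rows(source, branches, buffer_size=10):
    """Fan rows out to each branch through a bounded per-branch queue;
    at most buffer_size rows per branch are in memory at any moment."""
    queues = [queue.Queue(maxsize=buffer_size) for _ in branches]
    DONE = object()  # sentinel marking the end of the stream

    def consume(q, operation):
        while True:
            row = q.get()
            if row is DONE:
                break
            operation(row)

    threads = [threading.Thread(target=consume, args=(q, op))
               for q, op in zip(queues, branches)]
    for t in threads:
        t.start()
    for row in source:          # pull rows one at a time from the source
        for q in queues:
            q.put(dict(row))    # blocks when a branch falls too far behind
    for q in queues:
        q.put(DONE)
    for t in threads:
        t.join()
```

The blocking put is what bounds memory: a slow branch applies back-pressure to the producer instead of forcing the whole row set to be cached, at the cost of the thread-synchronization overhead Simone mentions earlier in the thread.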
