This has been reviewed and pulled into the main repo. You'll find the changes on build 24 here: http://builds.hibernatingrhinos.com/builds/Rhino-ETL
Nathan

On Fri, Jan 21, 2011 at 5:51 AM, Nathan Palmer <[email protected]> wrote:

> I did see the pull request. I'm out of town at the moment and probably
> won't get a chance to review it until the first of next week.
>
> Nathan Palmer
>
> Sent from my Phone
>
> On Jan 21, 2011, at 2:01 AM, Simone Busoli <[email protected]> wrote:
>
> It was taken off the list, but there was a requirement for a branching
> operation which streams data correctly. The mail thread is below. There is
> a pending pull request on the main repo.
>
> On Fri, Jan 21, 2011 at 07:56, Simone Busoli <[email protected]> wrote:
>
>> Cool, great to hear that!
>>
>> On Fri, Jan 21, 2011 at 06:42, Shannon Marsh <[email protected]> wrote:
>>
>>> Simone,
>>>
>>> We finally got to implement the MultiThreadedBranchingOperation in our
>>> ETL process and it works very well.
>>>
>>> BEFORE:
>>> 2011-01-21 04:04:19,410 [1] INFO WA.LS.Migration.ETL.Runner [(null)] -
>>> Process Name PartyRegisterETLProcess. Current Memory Usage: 1495 MB.
>>> 2011-01-21 04:04:19,410 [1] INFO WA.LS.Migration.ETL.Runner [(null)] -
>>> Process Name PartyRegisterETLProcess. Peak Memory Usage: 1691 MB.
>>>
>>> AFTER:
>>> 2011-01-20 18:54:30,500 [10] INFO WA.LS.Migration.ETL.Runner [(null)] -
>>> Process Name PartyRegisterETLProcess. Current Memory Usage: 48 MB.
>>> 2011-01-20 18:54:30,500 [10] INFO WA.LS.Migration.ETL.Runner [(null)] -
>>> Process Name PartyRegisterETLProcess. Peak Memory Usage: 59 MB.
>>>
>>> Thanks again,
>>>
>>> Shannon
>>>
>>> On Mon, Jan 3, 2011 at 7:25 PM, Simone Busoli <[email protected]> wrote:
>>>
>>>> Hi Shannon, in the meanwhile I improved it a bit by removing the need
>>>> for using the threaded pipeline, and I also noticed that a non-caching
>>>> single-threaded pipeline has recently been added, so you can use either
>>>> of the two. I also forwarded a pull request to the main repository.
>>>>
>>>> Regards, Simone
>>>>
>>>> On Sun, Jan 2, 2011 at 23:35, Shannon Marsh <[email protected]> wrote:
>>>>
>>>>> Hi Simone,
>>>>>
>>>>> Thanks, that seems like a more robust solution. I agree that there
>>>>> could be a problem if there were too many child operations, but in our
>>>>> case the ETL process is fairly simple; the problem was just the number
>>>>> of records we were dealing with. I guess it comes down to using the
>>>>> right tool for the job, and this solution gives us another tool to
>>>>> choose from. Can't wait to try it out when I'm back in the office next
>>>>> week.
>>>>>
>>>>> Regards,
>>>>>
>>>>> Shannon
>>>>>
>>>>> *From:* Simone Busoli [mailto:[email protected]]
>>>>> *Sent:* Sunday, 2 January 2011 10:51 AM
>>>>> *To:* Shannon Marsh
>>>>> *Subject:* Re: Rhino ETL: BranchingOperation does not stream. What
>>>>> else does not?
>>>>>
>>>>> Hi Shannon, I pushed a change which includes a new operation that
>>>>> optimizes memory consumption in branching scenarios, though it performs
>>>>> worse than the existing one in terms of duration because of thread
>>>>> synchronization (which shouldn't be a big problem as long as you don't
>>>>> branch to too many child operations). It's called
>>>>> MultiThreadedBranchingOperation. Take into account that it needs the
>>>>> multi-threaded pipeline runner, because the single-threaded one relies
>>>>> on the caching enumerable, which caches all the input rows and exhibits
>>>>> the same issue you described.
>>>>> My branch on github is here: <https://github.com/simoneb/rhino-etl>.
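The idea behind a multi-threaded branching operation as described above can
be sketched language-agnostically. Rhino ETL itself is C# and the real
implementation lives in Simone's branch; the Python below is only an
illustration of the concept — one consumer thread per branch, each fed a
copy of every row through a bounded queue, so rows flow through instead of
being cached (all names here are illustrative, not the Rhino ETL API):

```python
import queue
import threading

def multithreaded_branch(rows, branches, max_queue=100):
    """Illustrative sketch: feed a copy of each row to every branch
    concurrently via bounded queues, so no branch forces the whole
    input to be accumulated in memory."""
    SENTINEL = object()  # marks end of input for each consumer
    queues = [queue.Queue(maxsize=max_queue) for _ in branches]

    def consume(branch, q):
        # Each branch runs in its own thread, pulling rows as they arrive.
        while True:
            row = q.get()
            if row is SENTINEL:
                break
            branch(row)

    threads = [threading.Thread(target=consume, args=(b, q))
               for b, q in zip(branches, queues)]
    for t in threads:
        t.start()

    # Single pass over the source: each row is copied once per branch.
    for row in rows:
        for q in queues:
            q.put(dict(row))
    for q in queues:
        q.put(SENTINEL)
    for t in threads:
        t.join()
```

The bounded queues provide back-pressure: if one branch is slower, the
feeder blocks rather than letting rows pile up, which is consistent with the
flat peak-memory numbers Shannon reports above.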
>>>>>
>>>>> On Wed, Dec 29, 2010 at 12:43, Simone Busoli <[email protected]> wrote:
>>>>>
>>>>> Hi Shannon, thanks for the update. Unfortunately your solution won't
>>>>> work correctly in the general case: as you stated, the chunking you
>>>>> implemented implies executing the operations more than once, which is
>>>>> not the desired behavior. I will look into it today to find out whether
>>>>> there is a more general solution to the problem.
>>>>>
>>>>> Simone
>>>>>
>>>>> On Tue, Dec 28, 2010 at 00:49, Shannon Marsh <[email protected]> wrote:
>>>>>
>>>>> Hi Simone,
>>>>>
>>>>> I did manage to make some progress with this, however I am on holiday
>>>>> at the moment, so I haven't been into the office to test whether my
>>>>> solution works on a large scale.
>>>>>
>>>>> I modified the BranchingOperation code to "chunk" the data coming
>>>>> through the pipeline, and was able to work around the issue with the
>>>>> operations being called multiple times by adding a line to
>>>>> SqlBulkInsertOperation to check the dictionary before adding the
>>>>> key/value pair in the "CreateInputSchema" method. See attached files
>>>>> for changes.
>>>>>
>>>>> These changes seem to make the Fibonacci branching performance test
>>>>> pass when setting the number of rows to over a million. If I monitor
>>>>> the memory usage while the test is running, it seems to peak at a much
>>>>> lower amount.
>>>>>
>>>>> I will be back in the office on 10th January, so I will be able to
>>>>> test it with our ETL application then. Rather than actually modifying
>>>>> the Rhino ETL source as I have done in testing, I was planning on just
>>>>> extending these operation classes and overriding the required methods.
>>>>> Assuming it all works, I would also look at making the "chunk" size
>>>>> configurable.
>>>>>
>>>>> Regards,
>>>>>
>>>>> Shannon
>>>>>
>>>>> *From:* Simone Busoli [mailto:[email protected]]
>>>>> *Sent:* Monday, 27 December 2010 5:53 PM
>>>>> *To:* Shannon Marsh
>>>>> *Subject:* Re: Rhino ETL: BranchingOperation does not stream. What
>>>>> else does not?
>>>>>
>>>>> Hi Shannon, any news about this issue?
>>>>>
>>>>> On Mon, Dec 13, 2010 at 21:55, Simone Busoli <[email protected]> wrote:
>>>>>
>>>>> Sure, keep us informed of the progress. In any case, in the next few
>>>>> days I might find some time to look into it too.
>>>>>
>>>>> On Sun, Dec 12, 2010 at 22:45, Shannon Marsh <[email protected]> wrote:
>>>>>
>>>>> Thanks,
>>>>>
>>>>> I was looking at the code to see if I could batch the rows (chunking).
>>>>> Mixed results so far. I seem to have problems re-iterating through the
>>>>> operations. The first batch of rows goes through perfectly, but the
>>>>> second batch fails when trying to call the same instance of the
>>>>> operation again, failing in the PrepareMapping method on
>>>>> SqlBulkInsertOperation ("...key has already been added, etc."). I'll
>>>>> continue to investigate and get back to you when I find a solution. I
>>>>> am thinking I may need to clone the operations for each batch. In the
>>>>> meantime the memory upgrade to our server should get us out of trouble.
>>>>>
>>>>> Regards,
>>>>>
>>>>> Shannon
>>>>>
>>>>> On Mon, Dec 13, 2010 at 3:00 AM, Simone Busoli <[email protected]> wrote:
>>>>>
>>>>> Hi Shannon, I see your point. The tricky part here is that we need to
>>>>> provide the same set of rows to each operation without iterating
>>>>> through them more than once. I haven't tried that; maybe branching them
>>>>> in batches is doable. Have you looked at the code?
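The batching idea being discussed — and the workaround Shannon describes
above — amounts to splitting the input into fixed-size chunks and re-running
the branch operations once per chunk. A rough Python sketch of that idea
(function names are illustrative, not the Rhino ETL API), including the
caveat from the thread that operations get executed more than once:

```python
from itertools import islice

def chunked(rows, size):
    """Split an iterable of rows into lists of at most `size` rows,
    so a branching step only ever holds one chunk in memory."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

def branch_in_chunks(rows, branches, size=1000):
    """Feed each chunk to every branch in turn. Caveat from the thread:
    each branch is invoked once per chunk, which is why stateful steps
    (e.g. schema setup in SqlBulkInsertOperation) must be made
    idempotent, or fail with 'key has already been added'."""
    for chunk in chunked(rows, size):
        for branch in branches:
            branch(list(chunk))  # each branch gets its own copy
```

This bounds peak memory at roughly one chunk per branch, at the cost of
re-entering each operation for every chunk — the exact general-case problem
Simone points out in the reply above.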
>>>>>
>>>>> On Sun, Dec 12, 2010 at 01:05, Shannon Marsh <[email protected]> wrote:
>>>>>
>>>>> Hi Simone,
>>>>>
>>>>> Reading the entire post, zvolkov talks about the problem of "file is
>>>>> so huge as to not fit into memory" and wanting "pulling and pushing one
>>>>> record at a time but never trying to accumulate all records in memory".
>>>>> That is what sparked my interest, as it sounded exactly like the
>>>>> problem we were experiencing.
>>>>>
>>>>> Later zvolkov says "maybe cache only a few rows but not all", and
>>>>> webpaul says "If you make an IEnumerable that copies the rows one at a
>>>>> time and feed that to the operation.Execute".
>>>>>
>>>>> So my understanding of how the fix would work was that it would take
>>>>> a copy of the row and serve it out to each branch consuming the row,
>>>>> then repeat this for each row in the pipeline. Something like this:
>>>>>
>>>>> Branch 1 – Operation 1 – Execute on Row 1.
>>>>> Branch 1 – Operation 2 – Execute on Row 1.
>>>>> Branch 1 – Operation n – Execute on Row 1.
>>>>>
>>>>> Branch 2 – Operation 1 – Execute on Row 1.
>>>>> Branch 2 – Operation 2 – Execute on Row 1.
>>>>> Branch 2 – Operation n – Execute on Row 1.
>>>>>
>>>>> Branch n – Operation 1 – Execute on Row 1.
>>>>> Branch n – Operation 2 – Execute on Row 1.
>>>>> Branch n – Operation n – Execute on Row 1.
>>>>>
>>>>> Branch 1 – Operation 1 – Execute on Row 2.
>>>>> Branch 1 – Operation 2 – Execute on Row 2.
>>>>> Branch 1 – Operation n – Execute on Row 2.
>>>>>
>>>>> Branch 2 – Operation 1 – Execute on Row 2.
>>>>> Branch 2 – Operation 2 – Execute on Row 2.
>>>>> Branch 2 – Operation n – Execute on Row 2.
>>>>>
>>>>> Branch n – Operation 1 – Execute on Row 2.
>>>>> Branch n – Operation 2 – Execute on Row 2.
>>>>> Branch n – Operation n – Execute on Row 2.
>>>>>
>>>>> ...etc., for each row.
>>>>>
>>>>> With that approach, I don't think you would ever need to accumulate
>>>>> rows in memory. To be honest, though, I haven't considered the
>>>>> technicalities of implementing this and whether it is achievable with
>>>>> the IEnumerable model; this was just how I imagined it would work after
>>>>> reading the post.
>>>>>
>>>>> What seems to happen is that streaming does occur, but only for the
>>>>> first branch. The second and subsequent branches have to wait until
>>>>> the first branch has consumed all the rows before they start, meaning
>>>>> that all the rows need to be cached in RAM during the first branch to
>>>>> be available for the later branches. Like this:
>>>>>
>>>>> Branch 1 – Operation 1 – Execute on Row 1.
>>>>> Branch 1 – Operation 2 – Execute on Row 1.
>>>>> Branch 1 – Operation n – Execute on Row 1.
>>>>>
>>>>> Branch 1 – Operation 1 – Execute on Row 2.
>>>>> Branch 1 – Operation 2 – Execute on Row 2.
>>>>> Branch 1 – Operation n – Execute on Row 2.
>>>>>
>>>>> (When branch 1 is complete - all rows cached in RAM)
>>>>>
>>>>> Branch 2 – Operation 1 – Execute on Row 1.
>>>>> Branch 2 – Operation 2 – Execute on Row 1.
>>>>> Branch 2 – Operation n – Execute on Row 1.
>>>>>
>>>>> Branch 2 – Operation 1 – Execute on Row 2.
>>>>> Branch 2 – Operation 2 – Execute on Row 2.
>>>>> Branch 2 – Operation n – Execute on Row 2.
>>>>>
>>>>> Branch n – Operation 1 – Execute on Row 1.
>>>>> Branch n – Operation 2 – Execute on Row 1.
>>>>> Branch n – Operation n – Execute on Row 1.
>>>>>
>>>>> Branch n – Operation 1 – Execute on Row 2.
>>>>> Branch n – Operation 2 – Execute on Row 2.
>>>>> Branch n – Operation n – Execute on Row 2.
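The first of the two execution orders Shannon lays out — hand a copy of each
row to every branch before pulling the next row — can be pinned down with a
small sketch. This is illustrative Python, not the actual IEnumerable-based
Rhino ETL implementation; branches are modeled as lists of callables:

```python
def branch_per_row(rows, branches):
    """Row-at-a-time branching: every branch sees a copy of row 1
    before anyone sees row 2, so at most one source row is alive at
    a time (plus its per-branch copies)."""
    for row in rows:
        for branch in branches:
            for operation in branch:
                operation(dict(row))  # each operation gets its own copy
```

Under this ordering nothing forces the source rows to accumulate, which is
why memory stays flat; the branch-at-a-time ordering below, by contrast,
implies caching the full input for the later branches.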
>>>>>
>>>>> So again, I may have completely misunderstood how it should work? Or
>>>>> could we be doing something wrong that causes all rows to be cached in
>>>>> memory?
>>>>>
>>>>> Thanks again,
>>>>>
>>>>> Shannon
>>>>>
>>>>> *From:* Simone B. [mailto:[email protected]]
>>>>> *Sent:* Friday, 10 December 2010 11:40 AM
>>>>> *To:* ShannonM
>>>>> *Subject:* Re: Rhino ETL: BranchingOperation does not stream. What
>>>>> else does not?
>>>>>
>>>>> Hi Shannon, can you explain what behavior you would expect?
>>>>>
>>>>> On Fri, Dec 10, 2010 at 01:19, ShannonM <[email protected]> wrote:
>>>>>
>>>>> Hello,
>>>>>
>>>>> I realise this is an old post, but we seem to be experiencing a
>>>>> similar issue with memory usage. Our ETL process works with
>>>>> approximately 1 million records per source table. For a standard
>>>>> straight-through process there are no problems: the rows just stream
>>>>> through and load into the database, and memory peaks at approximately
>>>>> 110 MB. However, wherever we use a BranchingOperation in our process,
>>>>> the rows accumulate in memory while executing the first branch
>>>>> (presumably so they are available for the remaining branches), and the
>>>>> subsequent branches do not execute until the first one has completed.
>>>>>
>>>>> The problem is that this usually consumes approximately 1.6 GB of
>>>>> memory and, depending on other processes running on our server, can
>>>>> sometimes cause a memory exception: "System.AccessViolationException:
>>>>> Attempted to read or write protected memory. This is often an
>>>>> indication that other memory is corrupt." We can work around this
>>>>> issue by restarting services (e.g. SQL Server) or rebooting the
>>>>> server prior to running the ETL process, to ensure there are no rogue
>>>>> processes hogging memory. We are also considering moving to a 64-bit
>>>>> OS and adding more RAM.
>>>>>
>>>>> I investigated the Fibonacci branching tests in the Rhino ETL source,
>>>>> and it seems to behave exactly as I described. If I debug the test
>>>>> named CanBranchThePipelineEfficiently() I can actually reproduce the
>>>>> scenario, and you can see that all the rows are cached in memory after
>>>>> the first branch executes.
>>>>>
>>>>> Your post seems to indicate that you fixed this issue by using the
>>>>> caching enumerable. Am I misunderstanding your post, or could we be
>>>>> doing something wrong?
>>>>>
>>>>> While we will probably go ahead with the server upgrade anyway, it
>>>>> would be nice to make our ETL process more efficient and not consume
>>>>> so much memory if it can be avoided.
>>>>>
>>>>> On Jul 5 2009, 9:22 pm, Simone Busoli <[email protected]> wrote:
>>>>> > Fixed. What it's doing now is wrap the input enumerable into a
>>>>> > caching enumerable and then feed a clone of each row into the
>>>>> > operations making up the branch.
>>>>> >
>>>>> > On Thu, Jul 2, 2009 at 16:53, webpaul <[email protected]> wrote:
>>>>> >
>>>>> > > Looking at the code, I think the only reason it isn't yield
>>>>> > > returning right now is so it can copy the rows. If you make an
>>>>> > > IEnumerable that copies the rows one at a time and feed that to
>>>>> > > the operation.Execute I think that is all that is needed.
>>>>> >
>>>>> > > On Jul 2, 7:17 am, zvolkov <[email protected]> wrote:
>>>>> > > > Uhh... maybe cache only a few rows but not all? Assuming I
>>>>> > > > branch in 2, at worst I will need to cache as many rows as
>>>>> > > > there is the disparity between the two consumers of my two
>>>>> > > > output streams...
>>>>> > > > Makes sense?

--
You received this message because you are subscribed to the Google Groups
"Rhino Tools Dev" group.
To post to this group, send email to [email protected].
To unsubscribe from this group, send email to
[email protected].
For more options, visit this group at
http://groups.google.com/group/rhino-tools-dev?hl=en.
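For reference, the caching-enumerable behavior discussed throughout the
thread — wrap the input once, replay cached rows for later branches — can be
sketched as follows. This is an illustrative Python analogue of Rhino ETL's
C# caching enumerable (the class name and details here are assumptions, not
the actual source), showing both why re-iteration works and why every row
ends up held in memory:

```python
class CachingEnumerable:
    """Sketch: the first iteration pulls rows lazily from the source
    and records them, so later iterations (later branches) replay the
    cache instead of re-running the source. The cost is that the cache
    grows to hold every row -- the memory behavior Shannon observed."""

    def __init__(self, source):
        self._source = iter(source)
        self._cache = []
        self._exhausted = False

    def __iter__(self):
        i = 0
        while True:
            if i < len(self._cache):
                yield self._cache[i]      # replay an already-seen row
            elif self._exhausted:
                return
            else:
                try:
                    row = next(self._source)
                except StopIteration:
                    self._exhausted = True
                    return
                self._cache.append(row)   # remember it for later branches
                yield row
            i += 1
```

A second pass over the same instance yields the cached rows without touching
the source again, which is exactly why the first branch can stream while
subsequent branches read from RAM.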
