Hi,

I haven't tried the AbstractAggregationOperation before, but the way
I've done this in the past is to ensure the input source is sorted at the
outset, so the duplicates all follow each other in the rowset.  Then create
a custom Operation that always "remembers" the last yielded row (or just its
key values), and yields a subsequent row only if it differs in the key
columns, dropping it as a duplicate otherwise.  I haven't tested it, but the
memory footprint of the operation should be just the one remembered row.
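Roughly, the operation I have in mind would look like the sketch below. It's
untested; the class name and constructor are mine, and it assumes Rhino ETL's
usual AbstractOperation/Row types and an input already sorted on the key
columns:

```csharp
using System.Collections.Generic;
using System.Linq;
using Rhino.Etl.Core;
using Rhino.Etl.Core.Operations;

// Hypothetical sketch, not tested: yields a row only when its key differs
// from the previous row's key. Relies on the input rowset being sorted on
// the key columns, so duplicates are adjacent.
public class DistinctOnSortedInputOperation : AbstractOperation
{
    private readonly string[] keyColumns;

    public DistinctOnSortedInputOperation(params string[] keyColumns)
    {
        this.keyColumns = keyColumns;
    }

    public override IEnumerable<Row> Execute(IEnumerable<Row> rows)
    {
        object[] lastKey = null;
        foreach (Row row in rows)
        {
            object[] key = keyColumns.Select(c => row[c]).ToArray();
            // Yield only when the key changes; duplicates are dropped.
            if (lastKey == null || !key.SequenceEqual(lastKey))
            {
                lastKey = key;
                yield return row;
            }
        }
    }
}
```

Since the rows stream through the pipeline, only the one remembered key array
is held in memory at any time.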

You could avoid sorting the input by keeping a set of already-seen key
values inside the Operation, but that costs more memory for a large set.
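If you go that route, a HashSet of composite keys is probably the cheapest
structure. Again untested, and the column names are just the example from
your question:

```csharp
using System.Collections.Generic;
using Rhino.Etl.Core;
using Rhino.Etl.Core.Operations;

// Hypothetical sketch, not tested: no sorting required, but the set of
// seen keys grows with the number of distinct key combinations.
public class DistinctOperation : AbstractOperation
{
    public override IEnumerable<Row> Execute(IEnumerable<Row> rows)
    {
        var seen = new HashSet<string>();
        foreach (Row row in rows)
        {
            // "\0" is just an unlikely separator to keep keys unambiguous.
            string key = row["FirstName"] + "\0" + row["LastName"];
            // HashSet.Add returns false for a key we have already seen.
            if (seen.Add(key))
                yield return row;
        }
    }
}
```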

Hope that helps,

Miles


On Tue, Nov 30, 2010 at 12:00 PM, jalchr <[email protected]> wrote:

> Now, I'm using the AbstractAggregationOperation, which aggregates all
> rows based on some column(s) ... This works fine, but it results in
> huge memory consumption ...
>
> What is the best, efficient way (least memory possible) to use when
> trying to filter the rows based on some column combination (for
> example FirstName, LastName) ...
>

-- 
You received this message because you are subscribed to the Google Groups 
"Rhino Tools Dev" group.
To post to this group, send email to [email protected].
To unsubscribe from this group, send email to 
[email protected].
For more options, visit this group at 
http://groups.google.com/group/rhino-tools-dev?hl=en.
