Re: IndexedRDD

Jerry Lam Tue, 13 Jan 2015 09:07:43 -0800

Hi guys,

I'm interested in the IndexedRDD too.
How many rows in the big table match the small table in each run?
If that number stays constant, then I think Jem wants the runtime to
stay about constant too (i.e. ~0.6 seconds in all cases). However, I agree
with Andrew: the performance isn't bad at all. If the RDD were not indexed,
I would expect the join to take much longer.
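
As a rough check on the scaling, here is a small plain-Python sketch (no
Spark involved; the helper name fit_line is made up for illustration) that
fits a least-squares line through the timings from Jem's table quoted below.
It only summarizes his numbers and says nothing about how IndexedRDD works
internally.

```python
# Jem's measurements: (rows in the big IndexedRDD, join time in seconds).
# The small table had 50,000 rows in every run.
measurements = [
    (10_000_000, 0.6),
    (50_000_000, 0.8),
    (100_000_000, 1.5),
    (150_000_000, 2.1),
    (200_000_000, 2.8),
    (500_000_000, 7.2),
    (1_000_000_000, 12.2),
]


def fit_line(points):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Work in units of 100 million rows so the slope is easy to read.
points = [(rows / 1e8, secs) for rows, secs in measurements]
slope, intercept = fit_line(points)

print(f"fixed cost: {intercept:.2f} s")
print(f"cost per 100 million big-table rows: {slope:.2f} s")
```

On these numbers the fit comes out to roughly 0.4 s of fixed cost plus about
1.2 s per 100 million rows in the big table, so the join time tracks the size
of the big table rather than the size of the small one.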


Can IndexedRDD be sorted by keys as well?

Best Regards,

Jerry

On Tue, Jan 13, 2015 at 11:06 AM, Andrew Ash <and...@andrewash.com> wrote:

> Hi Jem,
>
> Join time scaling linearly with the size of the big table doesn't seem
> that surprising to me.  What were you expecting?
>
> I assume you're doing normalRDD.join(indexedRDD).  If you were to replace
> the indexedRDD with a normal RDD, what times do you get?
>
> On Tue, Jan 13, 2015 at 5:35 AM, Jem Tucker <jem.tuc...@gmail.com> wrote:
>
>> Hi,
>>
>> I have been playing around with IndexedRDD (
>> https://issues.apache.org/jira/browse/SPARK-2365,
>> https://github.com/amplab/spark-indexedrdd) and have been very impressed
>> with its performance. However, some performance testing has revealed
>> worse-than-expected scaling of the join performance*, and I was wondering
>> whether anyone else has experience using it and what they have found?
>>
>> Thanks,
>>
>> Jem
>>
>> *The table below shows some of my results when joining a small RDD to a
>> large IndexedRDD.  Each table consisted of a Long key and a 15-character
>> String value. The join time increases almost linearly with the number of
>> rows in the bigger table.
>>
>> Small Table Rows   Big Table Rows   Time (s)
>> 50000              10000000         0.6
>> 50000              50000000         0.8
>> 50000              100000000        1.5
>> 50000              150000000        2.1
>> 50000              200000000        2.8
>> 50000              500000000        7.2
>> 50000              1000000000       12.2
>>
>
>
