Hi guys,

I'm interested in the IndexedRDD too.
How many rows in the big table match the small table in each run?
If the number of matching rows stays constant, then I think Jem wants the
runtime to stay about constant too (i.e. ~0.6 seconds in all cases). However,
I agree with Andrew: the performance isn't bad at all. Without the index, I
would expect the join to take much longer.
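
To make that expectation concrete, here is a minimal Python sketch (not Spark
or IndexedRDD code) contrasting a join that probes a key index with one that
scans the whole big table. The function names and counters are hypothetical,
and a plain dict stands in for the index. The point is only that the probing
join does work proportional to the small table, while the scanning join does
work proportional to the big one.

```python
def join_by_probing(small, big_index):
    """Join by looking each small-table key up in an index (a dict here)."""
    probes = 0
    joined = []
    for key, left in small:
        probes += 1
        right = big_index.get(key)
        if right is not None:
            joined.append((key, (left, right)))
    return joined, probes


def join_by_scanning(small, big_rows):
    """Join by visiting every big-table row and checking it against the small table."""
    small_by_key = dict(small)
    visited = 0
    joined = []
    for key, right in big_rows:
        visited += 1
        if key in small_by_key:
            joined.append((key, (small_by_key[key], right)))
    return joined, visited


# Hypothetical toy data: 5 small rows joined against big tables of growing size.
small = [(k, f"s{k}") for k in range(5)]
for big_size in (1_000, 10_000, 100_000):
    big_rows = [(k, f"b{k}") for k in range(big_size)]
    big_index = dict(big_rows)
    probed, probes = join_by_probing(small, big_index)
    scanned, visited = join_by_scanning(small, big_rows)
    assert sorted(probed) == sorted(scanned)  # both strategies give the same result
    print(big_size, probes, visited)
```

Whether IndexedRDD's join actually behaves like the probing variant depends on
its implementation, which I have not checked.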

Can IndexedRDD be sorted by keys as well?

Best Regards,

Jerry

On Tue, Jan 13, 2015 at 11:06 AM, Andrew Ash <and...@andrewash.com> wrote:

> Hi Jem,
>
> Time scaling linearly with the size of the big table doesn't seem that
> surprising to me.  What were you expecting?
>
> I assume you're doing normalRDD.join(indexedRDD).  If you were to replace
> the indexedRDD with a normal RDD, what times do you get?
>
> On Tue, Jan 13, 2015 at 5:35 AM, Jem Tucker <jem.tuc...@gmail.com> wrote:
>
>> Hi,
>>
>> I have been playing around with the indexedRDD (
>> https://issues.apache.org/jira/browse/SPARK-2365,
>> https://github.com/amplab/spark-indexedrdd) and have been very impressed
>> with its performance. However, some testing has revealed worse-than-expected
>> scaling of the join time*, and I was wondering whether anyone else has
>> experience using it and what they have found.
>>
>> Thanks,
>>
>> Jem
>>
>> *The table below shows some of my results when joining a small RDD to a large
>> IndexedRDD.  Each table consisted of a Long key and a 15-character String
>> value. The join time increases almost linearly with the number of rows in the
>> bigger table.
>>
>> Small Table Rows   Big Table Rows   Time (s)
>> ----------------   --------------   --------
>>            50000         10000000        0.6
>>            50000         50000000        0.8
>>            50000        100000000        1.5
>>            50000        150000000        2.1
>>            50000        200000000        2.8
>>            50000        500000000        7.2
>>            50000       1000000000       12.2
>>
>
>
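
As a rough check on the "almost linear" observation, the figures from Jem's
table can be converted to seconds per 100 million big-table rows. This is a
small Python sketch using only the numbers quoted above; the variable and
function names are my own.

```python
# (big table rows, join time in seconds), copied from Jem's table
results = [
    (10_000_000, 0.6),
    (50_000_000, 0.8),
    (100_000_000, 1.5),
    (150_000_000, 2.1),
    (200_000_000, 2.8),
    (500_000_000, 7.2),
    (1_000_000_000, 12.2),
]


def seconds_per_100m_rows(rows, seconds):
    """Normalise a timing to the cost of joining against 100 million rows."""
    return seconds / (rows / 100_000_000)


for rows, seconds in results:
    rate = seconds_per_100m_rows(rows, seconds)
    print(f"{rows:>13,d} rows: {seconds:5.1f} s total, {rate:.2f} s per 100M rows")
```

From 100 million rows upward the normalised cost stays in a fairly narrow band
(roughly 1.2 to 1.5 s per 100M rows), which is consistent with linear scaling.
The two smallest tables cost more per row, which may reflect a fixed overhead.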
