Hi guys, I'm interested in IndexedRDD too. How many rows in the big table match the small table in each run? If that number stays constant, then I think Jem expects the runtime to stay roughly constant as well (i.e. ~0.6 seconds in all cases). That said, I agree with Andrew: the performance isn't bad at all. Without the index, I would expect the join to take much longer.
Can IndexedRDD be sorted by keys as well?

Best Regards,

Jerry

On Tue, Jan 13, 2015 at 11:06 AM, Andrew Ash <and...@andrewash.com> wrote:

> Hi Jem,
>
> Linear time in scaling on the big table doesn't seem that surprising to
> me. What were you expecting?
>
> I assume you're doing normalRDD.join(indexedRDD). If you were to replace
> the indexedRDD with a normal RDD, what times do you get?
>
> On Tue, Jan 13, 2015 at 5:35 AM, Jem Tucker <jem.tuc...@gmail.com> wrote:
>
>> Hi,
>>
>> I have been playing around with the indexedRDD (
>> https://issues.apache.org/jira/browse/SPARK-2365,
>> https://github.com/amplab/spark-indexedrdd) and have been very impressed
>> with its performance. Some performance testing has revealed worse than
>> expected scaling of the join performance*, and I was just wondering if
>> anyone else has any experience using it and what they have found?
>>
>> Thanks,
>>
>> Jem
>>
>> *Table below shows some of my results when joining a small RDD to a large
>> IndexedRDD. Each table consisted of a Long key and 15 character String
>> value. Shows an almost linear time increase with the number of rows in the
>> bigger table.
>>
>> Small Table Rows    Big Table Rows    Time (s)
>> 50000               10000000          0.6
>> 50000               50000000          0.8
>> 50000               100000000         1.5
>> 50000               150000000         2.1
>> 50000               200000000         2.8
>> 50000               500000000         7.2
>> 50000               1000000000        12.2
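
As a quick sanity check on the "almost linear" description, the sketch below fits a straight line to the seven timings from Jem's table using ordinary least squares. The helper name `fit_line` and the choice of a least-squares fit are illustrative assumptions; nobody in the thread ran this, and it says nothing about why the join scales this way.

```python
# Timings copied from the quoted table: (big table rows, join time in seconds).
# The small table had 50000 rows in every run.
timings = [
    (10_000_000, 0.6),
    (50_000_000, 0.8),
    (100_000_000, 1.5),
    (150_000_000, 2.1),
    (200_000_000, 2.8),
    (500_000_000, 7.2),
    (1_000_000_000, 12.2),
]


def fit_line(points):
    """Ordinary least-squares fit of y = intercept + slope * x.

    Hypothetical helper written for this example only.
    """
    n = len(points)
    sum_x = sum(x for x, _ in points)
    sum_y = sum(y for _, y in points)
    sum_xy = sum(x * y for x, y in points)
    sum_xx = sum(x * x for x, _ in points)
    slope = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
    intercept = (sum_y - slope * sum_x) / n
    return slope, intercept


slope, intercept = fit_line(timings)

# Express the slope per 100 million rows so it is easier to read.
slope_per_100m = slope * 100_000_000

print(f"fixed cost      : {intercept:.2f} s")
print(f"per 100M rows   : {slope_per_100m:.2f} s")

# Show how far each measurement sits from the fitted line.
for rows, seconds in timings:
    predicted = intercept + slope * rows
    print(f"{rows:>13,d} rows: measured {seconds:5.1f} s, fitted {predicted:5.1f} s")
```

A slope near zero would mean the runtime does not depend on the size of the big table; a clearly positive slope would support Jem's reading of the numbers.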