Re: Sharing Spark RDDs with Ignite
Dmitry,

I repeated your test. On my laptop it took about 2300 ms.

Bearing in mind that an RDD is lazy by nature, I supposed that a DataFrame is lazy too. So I added a df.rdd().count() call to the code before the RDD caching step in order to separate the execution times, and got about 670 ms. After that, the igniteRDD.saveValues(df.rdd()) call takes about 1500 ms.

For more accurate results I measured these operations in a loop and got about 700 ms for RDD caching on a warmed-up JVM.

I created a pull request for clarity:
https://github.com/erasmas/ignite-playground/pull/1

On Thu, Feb 11, 2016 at 3:20 PM, Dmitriy Morozov <int.2...@gmail.com> wrote:
> Hi Valentin,
>
> Sorry, I realize I didn't get it right. I'm now using IgniteRDD to save
> the RDD values and IgniteCache to cache the StructType.
> I'm using a ~1 MB Parquet file with ~75K rows for testing. I noticed
> that saving the IgniteRDD is expensive: it takes about 4 seconds on my
> laptop. I tried both client and server mode for IgniteContext but still
> couldn't make it faster.
>
> Here's the code
> <https://github.com/erasmas/ignite-playground/blob/master/src/main/java/ignite/CachedRddExample.java>
> that I tried. I'd appreciate it if somebody could give a hint on how to
> make it faster.
>
> Thanks!

--
Andrey Gura
GridGain Systems, Inc.
www.gridgain.com
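Because Spark transformations are lazy, a single-shot timing of the save call also captures DataFrame materialization and JVM/JIT warm-up, which is why the loop-based measurement above gives a much lower number. The measurement approach can be sketched as a plain-Java helper (class, method names, and the stand-in workload are illustrative, not from the thread's code; the workload stands in for the real saveValues() call):

```java
import java.util.Arrays;

public class TimingSketch {

    // Run the task a few times to warm up the JIT, then report the median
    // of several timed runs; the median is robust to GC/JIT outliers.
    static long medianMillis(Runnable task, int warmup, int runs) {
        for (int i = 0; i < warmup; i++) {
            task.run(); // warm-up iterations, not measured
        }
        long[] samples = new long[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.run();
            samples[i] = (System.nanoTime() - start) / 1_000_000;
        }
        Arrays.sort(samples);
        return samples[runs / 2];
    }

    public static void main(String[] args) {
        // Stand-in workload; in the thread this would be
        // igniteRDD.saveValues(df.rdd()) after a df.rdd().count() call
        // has already forced the DataFrame to materialize.
        Runnable work = () -> {
            long sum = 0;
            for (int i = 0; i < 1_000_000; i++) sum += i;
        };
        System.out.println("median ms: " + medianMillis(work, 5, 11));
    }
}
```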
Re: Sharing Spark RDDs with Ignite
Thanks Andrey! That makes total sense. I should have done a more accurate test. Appreciate your help!

On 12 February 2016 at 17:31, Andrey Gura <ag...@gridgain.com> wrote:
> Dmitry,
>
> I repeated your test. On my laptop it took about 2300 ms.

--
Kind regards,
Dima
Re: Sharing Spark RDDs with Ignite
Hi Valentin,

Sorry, I realize I didn't get it right. I'm now using IgniteRDD to save the RDD values and IgniteCache to cache the StructType.

I'm using a ~1 MB Parquet file with ~75K rows for testing. I noticed that saving the IgniteRDD is expensive: it takes about 4 seconds on my laptop. I tried both client and server mode for IgniteContext but still couldn't make it faster.

Here's the code
<https://github.com/erasmas/ignite-playground/blob/master/src/main/java/ignite/CachedRddExample.java>
that I tried. I'd appreciate it if somebody could give a hint on how to make it faster.

Thanks!

On 10 February 2016 at 21:55, vkulichenko <valentin.kuliche...@gmail.com> wrote:
> Hi Dmitry,
>
> What are you trying to achieve by putting the RDD into the cache as a
> single entry?

--
Kind regards,
Dima
Re: Sharing Spark RDDs with Ignite
Hi Dmitry,

What are you trying to achieve by putting the RDD into the cache as a single entry? If you want to save RDD data into the Ignite cache, it's better to create an IgniteRDD and use its savePairs() or saveValues() methods. See [1] for details.

[1] https://apacheignite-fs.readme.io/docs/ignitecontext-igniterdd#section-saving-values-to-ignite

-Val

--
View this message in context: http://apache-ignite-users.70518.x6.nabble.com/Sharing-Spark-RDDs-with-Ignite-tp2805p2941.html
Sent from the Apache Ignite Users mailing list archive at Nabble.com.
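A minimal sketch of that approach, assuming a Spark application with ignite-spark on the classpath and a running Ignite node; the cache name, config path, and input RDD variables here are illustrative, not from the thread's code:

```java
// Sketch only: requires Spark plus ignite-spark on the classpath and an
// Ignite node reachable via the given Spring config; not runnable standalone.
JavaIgniteContext<Integer, String> ic =
    new JavaIgniteContext<>(sparkContext, "config/example-shared-rdd.xml");

// fromCache() returns a Spark RDD view over the named Ignite cache.
JavaIgniteRDD<Integer, String> sharedRDD = ic.fromCache("sharedRDD");

// saveValues() stores values under auto-generated keys;
// savePairs() stores explicit key/value pairs.
sharedRDD.saveValues(someValuesRdd);      // someValuesRdd: JavaRDD<String>
sharedRDD.savePairs(someKeyValueRdd);     // someKeyValueRdd: JavaPairRDD<Integer, String>
```

This is what the thread's igniteRDD.saveValues(df.rdd()) call does in the DataFrame case: the DataFrame's underlying RDD of rows is written into the cache as values.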
Re: Sharing Spark RDDs with Ignite
Hi Dmitry,

Ignite provides better data distribution and better performance when there are more partitions than nodes in the topology. 1024 is the default number of partitions, but you can change it by providing a custom affinity function configuration:

CacheConfiguration cfg = new CacheConfiguration("hello-world-cache")
    .setAffinity(new RendezvousAffinityFunction(false, 32)); // 32 partitions instead of 1024.
final IgniteRDD igniteRDD = igniteContext.fromCache(cfg);

You can try this and see if it gets better.

Actually, I think that methods like isEmpty() should be overridden in IgniteRDD to use the native IgniteCache API; that would be much faster. I created a ticket for this task [1], feel free to leave your comments there. Are there any other methods that should be optimized?

[1] https://issues.apache.org/jira/browse/IGNITE-2538

-Val

--
View this message in context: http://apache-ignite-users.70518.x6.nabble.com/Sharing-Spark-RDDs-with-Ignite-tp2805p2808.html
Sent from the Apache Ignite Users mailing list archive at Nabble.com.
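To see why more partitions than nodes helps: keys hash into a fixed set of partitions, and the affinity function (rendezvous, i.e. highest-random-weight, in the snippet above) assigns each partition to a node, so rebalancing moves whole partitions rather than individual keys, and a surplus of partitions spreads load evenly. A rough, self-contained illustration of the idea (this is not Ignite's actual implementation; the class, methods, and hash mixing are made up for the sketch):

```java
import java.util.Objects;

public class AffinitySketch {

    // Keys hash to one of N partitions; illustrative, not Ignite's code.
    static int partitionFor(Object key, int partitions) {
        return Math.floorMod(Objects.hashCode(key), partitions);
    }

    // Rendezvous (highest-random-weight) step: every node "scores" the
    // partition and the highest score wins, so adding or removing a node
    // only reassigns the partitions that node wins or loses.
    static String nodeForPartition(int partition, String[] nodeIds) {
        String best = nodeIds[0];
        long bestScore = Long.MIN_VALUE;
        for (String node : nodeIds) {
            long score = mix(partition * 31L + node.hashCode());
            if (score > bestScore) {
                bestScore = score;
                best = node;
            }
        }
        return best;
    }

    // Cheap 64-bit finalizer-style mix; illustrative only.
    static long mix(long h) {
        h ^= h >>> 33;
        h *= 0xFF51AFD7ED558CCDL;
        h ^= h >>> 33;
        return h;
    }

    public static void main(String[] args) {
        String[] nodes = {"node-A", "node-B", "node-C"};
        int partitions = 32; // more partitions than nodes -> finer-grained distribution
        int p = partitionFor("some-key", partitions);
        System.out.println("partition " + p + " -> " + nodeForPartition(p, nodes));
    }
}
```

With only as many partitions as nodes, one hot partition would pin a whole node; with 32 (or Ignite's default 1024) partitions, each node owns many small partitions and rebalancing granularity is much finer.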