Re: [RDDs and Dataframes] Equivalent expressions for RDD API
Just as best practice, dataframe and datasets are preferred way, so try not to resort to rdd unless you absolutely have to... On Sun, 5 Mar 2017 at 7:10 pm, khwunchai jaengsawang <khwuncha...@ku.th> wrote: > Hi Old-Scool, > > > For the first question, you can specify the number of partition in any > DataFrame by using > repartition(numPartitions: Int, partitionExprs: Column*). > *Example*: > val partitioned = data.repartition(numPartitions=10).cache() > > For your second question, you can transform your RDD into PairRDD and use > reduceByKey() > *Example:* > val pairs = data.map(row => (row(1), row(2)).reduceByKey(_+_) > > > Best, > > Khwunchai Jaengsawang > *Email*: khwuncha...@ku.th > LinkedIn <https://linkedin.com/in/khwunchai> | Github > <https://github.com/khwunchai> > > > On Mar 4, 2560 BE, at 8:59 PM, Old-School <giorgos_myrianth...@outlook.com> > wrote: > > Hi, > > I want to perform some simple transformations and check the execution time, > under various configurations (e.g. number of cores being used, number of > partitions etc). Since it is not possible to set the partitions of a > dataframe , I guess that I should probably use RDDs. > > I've got a dataset with 3 columns as shown below: > > val data = file.map(line => line.split(" ")) > .filter(lines => lines.length == 3) // ignore first line > .map(row => (row(0), row(1), row(2))) > .toDF("ID", "word-ID", "count") > results in: > > +--++-+ > | ID | word-ID | count | > +--++-+ > | 15 |87 | 151| > | 20 |19 | 398| > | 15 |19 | 21 | > | 180 |90 | 190| > +---+-+ > So how can I turn the above into an RDD in order to use e.g. > sc.parallelize(data, 10) and set the number of partitions to say 10? > > Furthermore, I would also like to ask about the equivalent expression > (using > RDD API) for the following simple transformation: > > data.select("word-ID", > "count").groupBy("word-ID").agg(sum($"count").as("count")).show() > > > > Thanks in advance > > > > -- > View this message in context: > http://apache-spark-user-list.1001560.n3.nabble.com/RDDs-and-Dataframes-Equivalent-expressions-for-RDD-API-tp28455.html > Sent from the Apache Spark User List mailing list archive at Nabble.com. > > - > To unsubscribe e-mail: user-unsubscr...@spark.apache.org > > > -- Best Regards, Ayan Guha
Re: [RDDs and Dataframes] Equivalent expressions for RDD API
Hi Old-Scool, For the first question, you can specify the number of partition in any DataFrame by using repartition(numPartitions: Int, partitionExprs: Column*). Example: val partitioned = data.repartition(numPartitions=10).cache() For your second question, you can transform your RDD into PairRDD and use reduceByKey() Example: val pairs = data.map(row => (row(1), row(2)).reduceByKey(_+_) Best, Khwunchai Jaengsawang Email: khwuncha...@ku.th LinkedIn <https://linkedin.com/in/khwunchai> | Github <https://github.com/khwunchai> > On Mar 4, 2560 BE, at 8:59 PM, Old-School <giorgos_myrianth...@outlook.com> > wrote: > > Hi, > > I want to perform some simple transformations and check the execution time, > under various configurations (e.g. number of cores being used, number of > partitions etc). Since it is not possible to set the partitions of a > dataframe , I guess that I should probably use RDDs. > > I've got a dataset with 3 columns as shown below: > > val data = file.map(line => line.split(" ")) > .filter(lines => lines.length == 3) // ignore first line > .map(row => (row(0), row(1), row(2))) > .toDF("ID", "word-ID", "count") > results in: > > +--++-+ > | ID | word-ID | count | > +--++-+ > | 15 |87 | 151| > | 20 |19 | 398| > | 15 |19 | 21 | > | 180 |90 | 190| > +---+-+ > So how can I turn the above into an RDD in order to use e.g. > sc.parallelize(data, 10) and set the number of partitions to say 10? > > Furthermore, I would also like to ask about the equivalent expression (using > RDD API) for the following simple transformation: > > data.select("word-ID", > "count").groupBy("word-ID").agg(sum($"count").as("count")).show() > > > > Thanks in advance > > > > -- > View this message in context: > http://apache-spark-user-list.1001560.n3.nabble.com/RDDs-and-Dataframes-Equivalent-expressions-for-RDD-API-tp28455.html > Sent from the Apache Spark User List mailing list archive at Nabble.com. > > - > To unsubscribe e-mail: user-unsubscr...@spark.apache.org >
Re: [RDDs and Dataframes] Equivalent expressions for RDD API
Rdd operation: rdd.map(x => (word, count)).reduceByKey(_+_) Get Outlook for Android On Sat, Mar 4, 2017 at 8:59 AM -0500, "Old-School" <giorgos_myrianth...@outlook.com> wrote: Hi, I want to perform some simple transformations and check the execution time, under various configurations (e.g. number of cores being used, number of partitions etc). Since it is not possible to set the partitions of a dataframe , I guess that I should probably use RDDs. I've got a dataset with 3 columns as shown below: val data = file.map(line => line.split(" ")) .filter(lines => lines.length == 3) // ignore first line .map(row => (row(0), row(1), row(2))) .toDF("ID", "word-ID", "count") results in: +--++-+ | ID | word-ID | count | +--++-+ | 15 |87 | 151| | 20 |19 | 398| | 15 |19 | 21 | | 180 |90 | 190| +---+-+ So how can I turn the above into an RDD in order to use e.g. sc.parallelize(data, 10) and set the number of partitions to say 10? Furthermore, I would also like to ask about the equivalent expression (using RDD API) for the following simple transformation: data.select("word-ID", "count").groupBy("word-ID").agg(sum($"count").as("count")).show() Thanks in advance -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/RDDs-and-Dataframes-Equivalent-expressions-for-RDD-API-tp28455.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe e-mail: user-unsubscr...@spark.apache.org
[RDDs and Dataframes] Equivalent expressions for RDD API
Hi, I want to perform some simple transformations and check the execution time, under various configurations (e.g. number of cores being used, number of partitions etc). Since it is not possible to set the partitions of a dataframe , I guess that I should probably use RDDs. I've got a dataset with 3 columns as shown below: val data = file.map(line => line.split(" ")) .filter(lines => lines.length == 3) // ignore first line .map(row => (row(0), row(1), row(2))) .toDF("ID", "word-ID", "count") results in: +--++-+ | ID | word-ID | count | +--++-+ | 15 |87 | 151| | 20 |19 | 398| | 15 |19 | 21 | | 180 |90 | 190| +---+-+ So how can I turn the above into an RDD in order to use e.g. sc.parallelize(data, 10) and set the number of partitions to say 10? Furthermore, I would also like to ask about the equivalent expression (using RDD API) for the following simple transformation: data.select("word-ID", "count").groupBy("word-ID").agg(sum($"count").as("count")).show() Thanks in advance -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/RDDs-and-Dataframes-Equivalent-expressions-for-RDD-API-tp28455.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe e-mail: user-unsubscr...@spark.apache.org