Re: Column operation on Spark RDDs.

2015-06-08 Thread lonikar
Two simple suggestions:
1. No need to call zipWithIndex twice. Use the earlier RDD dt.
2. Replace zipWithIndex with zipWithUniqueId, which does not trigger a Spark
job.

Below is your code with the above changes:

var dataRDD = sc.textFile("/test.csv").map(_.split(","))
val dt = dataRDD.zipWithUniqueId.map(_.swap)
val newCol1 = dt.map { case (i, x) => (i, x(1)+x(18)) }
val newCol2 = newCol1.join(dt).map(x => function(.))
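
To illustrate why zipWithUniqueId needs no extra job: as documented in the
Spark API, with n partitions the items in the k-th partition get the ids
k, n+k, 2n+k, and so on. Each partition can therefore number its own rows
without knowing the sizes of the others, whereas zipWithIndex first has to
count the rows in the earlier partitions. The ids are unique but not
consecutive. Below is a minimal sketch of that numbering in plain Scala, with
no Spark dependency; uniqueIds and the sample partitions are hypothetical
names used only for illustration.

```scala
// Plain-Scala model of the id scheme documented for RDD.zipWithUniqueId.
// `uniqueIds` and `partitions` are illustrative names, not Spark API.
object UniqueIdSketch {
  // For n partitions, the j-th item (0-based) of partition k gets id j * n + k.
  def uniqueIds[T](partitions: Seq[Seq[T]]): Seq[(Long, T)] = {
    val n = partitions.length.toLong
    partitions.zipWithIndex.flatMap { case (part, k) =>
      part.zipWithIndex.map { case (item, j) => (j * n + k, item) }
    }
  }

  def main(args: Array[String]): Unit = {
    // Two partitions of unequal size, standing in for rows of the CSV.
    val partitions = Seq(Seq("a", "b", "c"), Seq("d"))
    uniqueIds(partitions).foreach(println)
  }
}
```

With these two partitions, the first one gets ids 0, 2 and 4 and the second
gets id 1. The gap at 3 is harmless for the join, which only needs the keys
to be unique and identical on both sides, and they are because newCol1 is
derived from dt.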

Hope this helps.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Column-operation-on-Spark-RDDs-tp23165p23203.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Column operation on Spark RDDs.

2015-06-08 Thread kiran lonikar
Two simple suggestions:
1. No need to call zipWithIndex twice. Use the earlier RDD dt.
2. Replace zipWithIndex with zipWithUniqueId, which does not trigger a Spark
job.

Below is your code with the above changes:

var dataRDD = sc.textFile("/test.csv").map(_.split(","))
val dt = dataRDD.zipWithUniqueId.map(_.swap)
val newCol1 = dt.map { case (i, x) => (i, x(1)+x(18)) }
val newCol2 = newCol1.join(dt).map(x => function(.))
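
One thing to watch in the snippet above: each row is an Array[String], so
x(1)+x(18) concatenates the two fields as text rather than adding them. Below
is a minimal sketch in plain Scala (no Spark needed), assuming the columns
hold numeric text with stray spaces as in the sample rows; sumCols is a
hypothetical helper name, not part of the original code.

```scala
object ColumnSumSketch {
  // Parse two columns of a split CSV row as numbers and add them.
  // Assumes both fields contain valid numeric text; toDouble throws otherwise.
  def sumCols(row: Array[String], a: Int, b: Int): Double =
    row(a).trim.toDouble + row(b).trim.toDouble

  def main(args: Array[String]): Unit = {
    val row = "123, 523, 534".split(",")
    println(row(0) + row(1))     // string concatenation of the two fields
    println(sumCols(row, 0, 1))  // numeric sum of the two fields
  }
}
```

If a numeric sum is what is wanted, the same helper could be used inside the
map, e.g. dt.map { case (i, x) => (i, sumCols(x, 1, 18)) }.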

Hope this helps.
Kiran



Column operation on Spark RDDs.

2015-06-04 Thread Carter
Hi, I have an RDD with MANY columns (e.g., hundreds), and most of my operations
are on columns: I need to create many intermediate variables from
different columns. What is the most efficient way to do this?

For example, suppose my dataRDD (an RDD[Array[String]]) looks like this:

123, 523, 534, ..., 893 
536, 98, 1623, ..., 98472 
537, 89, 83640, ..., 9265 
7297, 98364, 9, ..., 735 
.. 
29, 94, 956, ..., 758 

I need to create a new column or variable, newCol1 = 2ndCol+19thCol, and
another new column based on newCol1 and the existing columns:
newCol2 = function(newCol1, 34thCol). What is the best way of doing this?

I have been thinking of building an index for the intermediate variables and
the dataRDD, and then joining them on the index to do my calculation:
var dataRDD = sc.textFile("/test.csv").map(_.split(","))
val dt = dataRDD.zipWithIndex.map(_.swap)
val newCol1 = dataRDD.map(x => x(1)+x(18)).zipWithIndex.map(_.swap)
val newCol2 = newCol1.join(dt).map(x => function(.))

Is there a better way of doing this?

Thank you very much!

--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Column-operation-on-Spark-RDDs-tp23165.html