Hi Spark community,
I learned a lot the last time I posted some elementary Spark code here.
So, I thought I would do it again. Someone politely tell me offline if
this is noise or unfair use of the list! I acknowledge that this
borders on asking Scala 101 questions....
I have an RDD[List[String]] corresponding to columns of data and I want
to split one of the columns using some arbitrary function and return an
RDD updated with the new columns. Here is the code I came up with.
def splitColumn(columnsRDD: RDD[List[String]], columnIndex: Int,
numSplits: Int, splitFx: String => List[String]): RDD[List[String]] = {
def insertColumns(columns: List[String]) : List[String] = {
val split = columns.splitAt(columnIndex)
val left = split._1
val splitColumn = split._2.head
val splitColumns = splitFx(splitColumn).padTo(numSplits,
"").take(numSplits)
val right = split._2.tail
left ++ splitColumns ++ right
}
columnsRDD.map(columns => insertColumns(columns))
}
Here is a simple test that demonstrates the behavior:
val spark = new SparkContext("local", "test spark")
val testStrings = List(List("1.2", "a b"), List("3.4", "c d e"),
List("5.6", "f"))
var testRDD: RDD[List[String]] = spark.parallelize(testStrings)
testRDD = splitColumn(testRDD, 0, 2, _.split("\\.").toList)
testRDD = splitColumn(testRDD, 2, 2, _.split(" ").toList) //Line 5
val actualStrings = testRDD.collect.toList
assertEquals(4, actualStrings(0).length)
assertEquals("1, 2, a, b", actualStrings(0).mkString(", "))
assertEquals(4, actualStrings(1).length)
assertEquals("3, 4, c, d", actualStrings(1).mkString(", "))
assertEquals(4, actualStrings(2).length)
assertEquals("5, 6, f, ", actualStrings(2).mkString(", "))
My first concern about this code is that I'm missing out on something
that does exactly this in the API. This seems like such a common use
case that I would not be surprised if there's a readily available way to
do this.
I'm a little uncertain about the typing of splitColumn - i.e. the first
parameter and the return value. It seems like a general solution
wouldn't require every column to be a String value. I'm also annoyed
that line 5 in the test code requires that I use an updated index to
split what was originally the second column. This suggests that perhaps
I should split all the columns that need splitting in one function call
- but it seems like doing that would require an unwieldy function signature.
Any advice or insight is appreciated!
Thanks,
Philip