Hi Spark community,

I learned a lot the last time I posted some elementary Spark code here, so I thought I would do it again. Please tell me offline if this is noise or an unfair use of the list! I acknowledge that this borders on asking Scala 101 questions.

I have an RDD[List[String]] in which each list holds the column values for one row of data. I want to split one of the columns using some arbitrary function and return an RDD updated with the new columns. Here is the code I came up with.

    import org.apache.spark.rdd.RDD

    def splitColumn(columnsRDD: RDD[List[String]], columnIndex: Int, numSplits: Int, splitFx: String => List[String]): RDD[List[String]] = {

      // Replace the column at columnIndex with exactly numSplits columns,
      // padding with "" or truncating as needed.
      // Assumes every row actually has a column at columnIndex.
      def insertColumns(columns: List[String]): List[String] = {
        val (left, rest) = columns.splitAt(columnIndex)
        val splitColumns = splitFx(rest.head).padTo(numSplits, "").take(numSplits)
        left ++ splitColumns ++ rest.tail
      }

      columnsRDD.map(insertColumns)
    }

Here is a simple test that demonstrates the behavior:

      val spark = new SparkContext("local", "test spark")
      val testStrings = List(List("1.2", "a b"), List("3.4", "c d e"), List("5.6", "f"))
      var testRDD: RDD[List[String]] = spark.parallelize(testStrings)
      testRDD = splitColumn(testRDD, 0, 2, _.split("\\.").toList)
      testRDD = splitColumn(testRDD, 2, 2, _.split(" ").toList) //Line 5
      val actualStrings = testRDD.collect.toList
      assertEquals(4, actualStrings(0).length)
      assertEquals("1, 2, a, b", actualStrings(0).mkString(", "))
      assertEquals(4, actualStrings(1).length)
      assertEquals("3, 4, c, d", actualStrings(1).mkString(", "))
      assertEquals(4, actualStrings(2).length)
      assertEquals("5, 6, f, ", actualStrings(2).mkString(", "))


My first concern about this code is that I'm missing out on something that does exactly this in the API. This seems like such a common use case that I would not be surprised if there's a readily available way to do this.

I'm a little uncertain about the typing of splitColumn - i.e. the first parameter and the return value. It seems like a general solution wouldn't require every column to be a String value. I'm also annoyed that line 5 in the test code requires that I use an updated index to split what was originally the second column. This suggests that perhaps I should split all the columns that need splitting in one function call - but it seems like doing that would require an unwieldy function signature.
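To make the "split everything in one call" idea concrete, here is a rough sketch of what I have in mind. It is plain Scala that works on a single row, so it runs without Spark. The names SplitSpec, splitRow and ColumnSplitting are made up for illustration, and it still assumes String columns. Each split is keyed by the original column index, so an earlier split never shifts the index of a later one.

```scala
object ColumnSplitting {

  // Hypothetical helper: how to split one column.
  // numSplits is the exact number of output columns; splitFx does the splitting.
  final case class SplitSpec(numSplits: Int, splitFx: String => List[String])

  // Split several columns of one row in a single pass.
  // specs is keyed by the ORIGINAL column index of each column to split.
  def splitRow(columns: List[String], specs: Map[Int, SplitSpec]): List[String] =
    columns.zipWithIndex.flatMap { case (value, index) =>
      specs.get(index) match {
        // Pad with "" or truncate so the column always yields numSplits values.
        case Some(SplitSpec(numSplits, splitFx)) =>
          splitFx(value).padTo(numSplits, "").take(numSplits)
        // Columns without a spec pass through unchanged.
        case None => List(value)
      }
    }
}
```

With this, the two calls in my test would presumably collapse into something like columnsRDD.map(row => ColumnSplitting.splitRow(row, specs)), where specs maps 0 and 1 to their respective SplitSpec values. I have not tried this against a real RDD.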

Any advice or insight is appreciated!

Thanks,
Philip
