This is in response to your question about whether something in the API
already does this.  You might want to keep an eye on MLI
(http://www.mlbase.org), which is a columnar table abstraction written for
machine learning but applicable to a lot of other problems.  It's not perfect
right now.
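
I don't think there's anything in the core RDD API that does exactly this,
but for the single-column case plain List.patch gets you most of the way.  A
rough, untested sketch with the same signature as your splitColumn:

    def splitColumn(columnsRDD: RDD[List[String]], columnIndex: Int,
        numSplits: Int, splitFx: String => List[String]): RDD[List[String]] =
      columnsRDD.map { columns =>
        // patch replaces 1 element at columnIndex with the padded/truncated split
        columns.patch(columnIndex,
          splitFx(columns(columnIndex)).padTo(numSplits, "").take(numSplits), 1)
      }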


On Fri, Nov 15, 2013 at 7:56 PM, Aaron Davidson <ilike...@gmail.com> wrote:

> Regarding only your last point, you could always split backwards to avoid
> having to worry about updated indices (i.e., split the highest-index column
> first). But if you're additionally worried about efficiency, a combined
> approach could make more sense to avoid making two full passes over the data.
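>
> For example, with your test data, splitting column 1 before column 0 lets
> both calls keep their original indices (an untested sketch, reusing your
> splitColumn and testRDD):
>
>   testRDD = splitColumn(testRDD, 1, 2, _.split(" ").toList)    // "a b" -> "a", "b"
>   testRDD = splitColumn(testRDD, 0, 2, _.split("\\.").toList)  // "1.2" -> "1", "2"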
>
> Otherwise, I don't see anything particularly amiss here, but I'm no expert.
>
>
> On Wed, Nov 13, 2013 at 3:00 PM, Philip Ogren <philip.og...@oracle.com> wrote:
>
>> Hi Spark community,
>>
>> I learned a lot the last time I posted some elementary Spark code here.
>>  So, I thought I would do it again.  Someone politely tell me offline if
>> this is noise or unfair use of the list!  I acknowledge that this borders
>> on asking Scala 101 questions....
>>
>> I have an RDD[List[String]] in which each element holds the column values
>> for one row, and I want to split one of the columns using some arbitrary
>> function and return an RDD updated with the new columns.  Here is the code
>> I came up with.
>>
>> def splitColumn(columnsRDD: RDD[List[String]], columnIndex: Int,
>>     numSplits: Int, splitFx: String => List[String]): RDD[List[String]] = {
>>
>>   // Replace the column at columnIndex with exactly numSplits columns,
>>   // padding with "" (or truncating) whatever splitFx returns.
>>   def insertColumns(columns: List[String]): List[String] = {
>>     val (left, right) = columns.splitAt(columnIndex)
>>     val splitColumns = splitFx(right.head).padTo(numSplits, "").take(numSplits)
>>     left ++ splitColumns ++ right.tail
>>   }
>>
>>   columnsRDD.map(insertColumns)
>> }
>>
>> Here is a simple test that demonstrates the behavior:
>>
>>       val spark = new SparkContext("local", "test spark")
>>       val testStrings = List(List("1.2", "a b"), List("3.4", "c d e"), List("5.6", "f"))
>>       var testRDD: RDD[List[String]] = spark.parallelize(testStrings)
>>       testRDD = splitColumn(testRDD, 0, 2, _.split("\\.").toList)
>>       testRDD = splitColumn(testRDD, 2, 2, _.split(" ").toList) //Line 5
>>       val actualStrings = testRDD.collect.toList
>>       assertEquals(4, actualStrings(0).length)
>>       assertEquals("1, 2, a, b", actualStrings(0).mkString(", "))
>>       assertEquals(4, actualStrings(1).length)
>>       assertEquals("3, 4, c, d", actualStrings(1).mkString(", "))
>>       assertEquals(4, actualStrings(2).length)
>>       assertEquals("5, 6, f, ", actualStrings(2).mkString(", "))
>>
>>
>> My first concern about this code is that I'm missing something in the API
>> that already does exactly this.  This seems like such a common use case
>> that I would not be surprised if there's a readily available way to do it.
>>
>> I'm a little uncertain about the typing of splitColumn - i.e. the first
>> parameter and the return value.  It seems like a general solution wouldn't
>> require every column to be a String value.  I'm also annoyed that line 5 in
>> the test code requires that I use an updated index to split what was
>> originally the second column.  This suggests that perhaps I should split
>> all the columns that need splitting in one function call - but it seems
>> like doing that would require an unwieldy function signature.
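>>
>> Just to make that last point concrete, the kind of signature I have in mind
>> (only a sketch, not tested) would be something like:
>>
>>     def splitColumns(columnsRDD: RDD[List[String]],
>>         splits: Map[Int, (Int, String => List[String])]): RDD[List[String]]
>>
>> where the keys are column indices in the original layout and the values are
>> (numSplits, splitFx) pairs.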
>>
>> Any advice or insight is appreciated!
>>
>> Thanks,
>> Philip
>>
>
>
