Thanks for your response, Jean. I managed to figure this out in the end, but it's an extremely slow solution and not tenable for my use-case:
val rddX = dfWithSchema.select("event_name").rdd
  .map(_.getString(0).split(",").map(_.trim.replaceAll("[\\[\\]\"]", "")).toList)

//val oneRow = Converted(eventIndexer.transform(sqlContext.sparkContext.parallelize(Seq("CCB")).toDF("event_name")).select("eventIndex").first().getDouble(0))

rddX.take(5).foreach(println)

val severalRows = rddX.collect().map(row =>
  if (row.length == 1) {
    eventIndexer.transform(sqlContext.sparkContext.parallelize(Seq(row(0).toString))
      .toDF("event_name")).select("eventIndex").first().getDouble(0)
  } else {
    row.map { tool =>
      eventIndexer.transform(sqlContext.sparkContext.parallelize(Seq(tool.toString))
        .toDF("event_name")).select("eventIndex").first().getDouble(0)
    }
  })

Wondering if there is any better/faster way to do this? Thanks.

On Fri, 15 Sep 2017 at 13:31 Jean Georges Perrin <j...@jgp.net> wrote:

> Hey Daniel, not sure this will help, but... I had a similar need where I
> wanted the content of a dataframe to become a "cell" or a row in the parent
> dataframe. I grouped the child dataframe, then collected it as a list in
> the parent dataframe after a join operation. As I said, not sure it matches
> your use case, but HIH...
>
> jg
>
> On Sep 15, 2017, at 5:42 AM, Daniel O' Shaughnessy <danieljamesda...@gmail.com> wrote:
>
> Hi guys,
>
> I'm having trouble implementing this scenario:
>
> I have a column with a typical entry being: ['apple', 'orange', 'apple', 'pear', 'pear']
>
> I need to use a StringIndexer to transform this to: [0, 2, 0, 1, 1]
>
> I'm attempting to do this, but because of the nested operation on another
> RDD I get the NPE.
> Here's my code so far, thanks:
>
> val dfWithSchema = sqlContext.createDataFrame(eventFeaturesRDD)
>   .toDF("email", "event_name")
>
> // attempting
> import sqlContext.implicits._
> val event_list = dfWithSchema.select("event_name").distinct
> val event_listDF = event_list.toDF()
> val eventIndexer = new StringIndexer()
>   .setInputCol("event_name")
>   .setOutputCol("eventIndex")
>   .fit(event_listDF)
>
> val eventIndexed = eventIndexer.transform(event_listDF)
>
> val converter = new IndexToString()
>   .setInputCol("eventIndex")
>   .setOutputCol("originalCategory")
>
> val convertedEvents = converter.transform(eventIndexed)
>
> val rddX = dfWithSchema.select("event_name").rdd
>   .map(_.getString(0).split(",").map(_.trim.replaceAll("[\\[\\]\"]", "")).toList)
>
> //val oneRow = Converted(eventIndexer.transform(sqlContext.sparkContext.parallelize(Seq("CCB")).toDF("event_name")).select("eventIndex").first().getDouble(0))
>
> val severalRows = rddX.map { row =>
>   // Split array into n tools
>   println("ROW: " + row(0).toString)
>   println(row(0).getClass)
>   println("PRINT: " +
>     eventIndexer.transform(sqlContext.sparkContext.parallelize(Seq(row(0)))
>       .toDF("event_name")).select("eventIndex").first().getDouble(0))
>   (eventIndexer.transform(sqlContext.sparkContext.parallelize(Seq(row))
>     .toDF("event_name")).select("eventIndex").first().getDouble(0),
>    Seq(row).toString)
> }
> // attempting
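[Editor's note] The slow part of the workaround above is that it launches a full Spark job (parallelize + transform + first) for every single element. A possible alternative, sketched here and untested: a fitted `org.apache.spark.ml.feature.StringIndexerModel` (which is what `eventIndexer` is after `.fit(...)`) exposes its `labels` array, where `labels(i)` is the string assigned index `i`. That allows building a plain in-memory `Map` once and applying it per row with a UDF. The `-1.0` fallback for unseen labels and the column name `event_indexes` are arbitrary choices for illustration, not Spark behaviour.

```scala
import org.apache.spark.sql.functions.{col, udf}

// Build the string -> index mapping once, driver-side, from the fitted model.
// labels(i) is the label that the StringIndexer assigned index i.
val labelToIndex: Map[String, Double] =
  eventIndexer.labels.zipWithIndex.map { case (label, i) => label -> i.toDouble }.toMap

// Strip the brackets/quotes, split on commas, and look each event up in the map.
// -1.0 marks labels the indexer never saw (placeholder choice, adjust as needed).
val indexEvents = udf { (raw: String) =>
  raw.replaceAll("[\\[\\]\"]", "")
    .split(",")
    .map(s => labelToIndex.getOrElse(s.trim, -1.0))
    .toSeq
}

val indexed = dfWithSchema.withColumn("event_indexes", indexEvents(col("event_name")))
```

This keeps everything in one distributed pass over `dfWithSchema` instead of one Spark job per element, since the small lookup map is simply captured by the UDF's closure and shipped to the executors.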