Re: Nested RDD operation

2017-09-19 Thread Jean Georges Perrin
Have you tried to cache? Maybe after the collect() and before the map?
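One way to read that suggestion, as a minimal sketch against the rddX from Daniel's code further down the thread (the cache() placement is an assumption, not something from the thread):

  // Sketch only: persist rddX once so take() and collect() don't both
  // recompute the whole lineage from the source dataframe.
  val rddX = dfWithSchema.select("event_name").rdd
    .map(_.getString(0).split(",").map(_.trim.replaceAll("[\\[\\]\"]", "")).toList)
    .cache()

  rddX.take(5).foreach(println)   // first action materializes the cache
  val rows = rddX.collect()       // served from the cached partitions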




Re: Nested RDD operation

2017-09-19 Thread ayan guha
How big is the list of fruits in your example? Can you broadcast it?
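As a sketch of what the broadcast route could look like (not from the thread; it assumes the fitted eventIndexer and rddX from Daniel's code are in scope, and that every event value was seen when the indexer was fit):

  // StringIndexerModel.labels is ordered so that labels(i) encodes to i,
  // so a plain Scala map reproduces the transform without any Spark job.
  val labelToIndex: Map[String, Double] =
    eventIndexer.labels.zipWithIndex.map { case (label, i) => (label, i.toDouble) }.toMap
  val bcIndex = sqlContext.sparkContext.broadcast(labelToIndex)

  // One distributed pass; no nested RDD use, no job per element.
  val indexed = rddX.map(_.map(bcIndex.value))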

Best Regards,
Ayan Guha


Re: Nested RDD operation

2017-09-19 Thread Daniel O' Shaughnessy
Thanks for your response, Jean.

I managed to figure this out in the end, but it's an extremely slow solution
and not tenable for my use-case:

val rddX = dfWithSchema.select("event_name").rdd
  .map(_.getString(0).split(",").map(_.trim.replaceAll("[\\[\\]\"]", "")).toList)

//val oneRow = Converted(eventIndexer.transform(sqlContext.sparkContext
//  .parallelize(Seq("CCB")).toDF("event_name")).select("eventIndex").first().getDouble(0))

rddX.take(5).foreach(println)

val severalRows = rddX.collect().map { row =>
  if (row.length == 1) {
    eventIndexer.transform(sqlContext.sparkContext.parallelize(Seq(row(0).toString))
      .toDF("event_name")).select("eventIndex").first().getDouble(0)
  } else {
    row.map { tool =>
      eventIndexer.transform(sqlContext.sparkContext.parallelize(Seq(tool.toString))
        .toDF("event_name")).select("eventIndex").first().getDouble(0)
    }
  }
}

Wondering if there is any better/faster way to do this?

Thanks.





Re: Nested RDD operation

2017-09-15 Thread Jean Georges Perrin
Hey Daniel, not sure this will help, but... I had a similar need where I
wanted the content of a dataframe to become a "cell" or a row in the parent
dataframe. I grouped the child dataframe, then collected it as a list in the
parent dataframe after a join operation. As I said, not sure it matches your
use case, but HIH...
jg
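A rough sketch of that idea (every name here is hypothetical, not from the thread):

  import org.apache.spark.sql.functions.collect_list

  // Collapse the child dataframe to one list "cell" per key...
  val childLists = childDF
    .groupBy("parent_id")
    .agg(collect_list("value").as("values"))

  // ...then attach the list column to the parent via a join.
  val enriched = parentDF.join(childLists, Seq("parent_id"), "left")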




Nested RDD operation

2017-09-15 Thread Daniel O' Shaughnessy
Hi guys,

I'm having trouble implementing this scenario:

I have a column with a typical entry being: ['apple', 'orange', 'apple',
'pear', 'pear']

I need to use a StringIndexer to transform this to: [0, 2, 0, 1, 1]

I'm attempting to do this, but because of the nested operation on another
RDD I get an NPE.

Here's my code so far, thanks:

val dfWithSchema = sqlContext.createDataFrame(eventFeaturesRDD).toDF("email", "event_name")

// attempting
import sqlContext.implicits._
val event_list = dfWithSchema.select("event_name").distinct
val event_listDF = event_list.toDF()
val eventIndexer = new StringIndexer()
  .setInputCol("event_name")
  .setOutputCol("eventIndex")
  .fit(event_listDF)

val eventIndexed = eventIndexer.transform(event_listDF)

val converter = new IndexToString()
  .setInputCol("eventIndex")
  .setOutputCol("originalCategory")

val convertedEvents = converter.transform(eventIndexed)
val rddX = dfWithSchema.select("event_name").rdd
  .map(_.getString(0).split(",").map(_.trim.replaceAll("[\\[\\]\"]", "")).toList)

//val oneRow = Converted(eventIndexer.transform(sqlContext.sparkContext
//  .parallelize(Seq("CCB")).toDF("event_name")).select("eventIndex").first().getDouble(0))

val severalRows = rddX.map { row =>
  // Split array into n tools
  println("ROW: " + row(0).toString)
  println(row(0).getClass)
  println("PRINT: " +
    eventIndexer.transform(sqlContext.sparkContext.parallelize(Seq(row(0)))
      .toDF("event_name")).select("eventIndex").first().getDouble(0))
  (eventIndexer.transform(sqlContext.sparkContext.parallelize(Seq(row))
    .toDF("event_name")).select("eventIndex").first().getDouble(0), Seq(row).toString)
}
// attempting


nested rdd operation

2014-09-10 Thread Pavlos Katsogridakis

Hi,

I have a question on Spark. This program, on spark-shell,

val filerdd = sc.textFile("NOTICE", 2)
val maprdd = filerdd.map( word => filerdd.map( word2 => (word2 + word) ) )
maprdd.collect()

throws a NullPointerException.
Can somebody explain why I cannot have a nested RDD operation?

--pavlos

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: nested rdd operation

2014-09-10 Thread Sean Owen
You can't use an RDD inside an operation on an RDD. Here you have
filerdd in your map function. It sort of looks like you want a
cartesian product of the RDD with itself, so look at the cartesian()
method. It may not be a good idea to compute such a thing.
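For illustration, a sketch of the cartesian() route applied to the snippet above (quotes around NOTICE restored; the output has n^2 elements, hence the caveat):

  val filerdd = sc.textFile("NOTICE", 2)
  // Every (word, word2) pairing in one flat RDD instead of an RDD of RDDs.
  val concat = filerdd.cartesian(filerdd).map { case (word, word2) => word2 + word }
  concat.collect()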
