RE: RDD order preservation through transformations

2017-09-15 Thread johan.grande.ext
Thanks all for your answers. After reading the provided links I am still 
uncertain of the details of what I'd need to do to get my calculations right 
with RDDs. However I discovered DataFrames and Pipelines on the "ML" side of 
the libs and I think they'll be better suited to my needs.

Best,
Johan Grande





Nested RDD operation

2017-09-15 Thread Daniel O' Shaughnessy
Hi guys,

I'm having trouble implementing this scenario:

I have a column whose typical entry is: ['apple', 'orange', 'apple', 'pear', 'pear']

I need to use a StringIndexer to transform this to: [0, 2, 0, 1, 1]

I'm attempting to do this, but because of the nested operation on another
RDD I get a NullPointerException.

Here's my code so far, thanks:

val dfWithSchema = sqlContext.createDataFrame(eventFeaturesRDD).toDF("email", "event_name")

// attempting
import sqlContext.implicits._

val event_list = dfWithSchema.select("event_name").distinct
val event_listDF = event_list.toDF()

val eventIndexer = new StringIndexer()
  .setInputCol("event_name")
  .setOutputCol("eventIndex")
  .fit(event_listDF)

val eventIndexed = eventIndexer.transform(event_listDF)

val converter = new IndexToString()
  .setInputCol("eventIndex")
  .setOutputCol("originalCategory")

val convertedEvents = converter.transform(eventIndexed)

val rddX = dfWithSchema.select("event_name").rdd
  .map(_.getString(0).split(",").map(_.trim.replaceAll("[\\[\\]\"]", "")).toList)

// val oneRow = Converted(eventIndexer.transform(sqlContext.sparkContext.parallelize(Seq("CCB")).toDF("event_name")).select("eventIndex").first().getDouble(0))

val severalRows = rddX.map(row => {
  // Split array into n tools
  println("ROW: " + row(0).toString)
  println(row(0).getClass)
  println("PRINT: " + eventIndexer.transform(sqlContext.sparkContext.parallelize(Seq(row(0))).toDF("event_name")).select("eventIndex").first().getDouble(0))
  (eventIndexer.transform(sqlContext.sparkContext.parallelize(Seq(row)).toDF("event_name")).select("eventIndex").first().getDouble(0), Seq(row).toString)
})
// attempting
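
One possible way around the NPE (a sketch only, assuming eventIndexer is the fitted StringIndexerModel from above and rddX is the RDD[List[String]] built above): transformers and the SQLContext cannot be used from inside another RDD's map on the executors, which is what triggers the NPE here, but the fitted labels can be pulled out once on the driver, broadcast as a plain Map, and looked up locally inside the map.

// Build the label -> index mapping once on the driver from the fitted model.
val labelToIndex: Map[String, Double] =
  eventIndexer.labels.zipWithIndex.map { case (label, idx) => label -> idx.toDouble }.toMap
val labelToIndexBc = sqlContext.sparkContext.broadcast(labelToIndex)

// No Spark calls inside this map, just a lookup in the broadcast Map,
// so each List of event names becomes the corresponding List of indices.
val indexedRows = rddX.map(row => row.map(event => labelToIndexBc.value.getOrElse(event, -1.0)))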


Re: RDD order preservation through transformations

2017-09-15 Thread Suzen, Mehmet
Hi Johan,
DataFrames are built on top of RDDs, so I'm not sure whether the ordering
issues are any different there. Maybe you could create a minimal simulated
dataset, just large enough to be representative, plus an example series of
transformations to experiment with.
Best,
-m

Mehmet Süzen, MSc, PhD
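
For concreteness, a minimal sketch of such an experiment (assuming a local SparkSession; the filter/select chain is an arbitrary stand-in for the real transformations): tag each value with its original position, transform, and check whether the positions come back in order.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("order-check").master("local[4]").getOrCreate()
import spark.implicits._

// Tag each value with its original position across several partitions.
val df = spark.sparkContext.parallelize(1 to 100000, 16).zipWithIndex().toDF("value", "pos")

// An arbitrary transformation chain.
val transformed = df.filter($"value" % 3 === 0).select($"pos", ($"value" * 2).as("doubled"))

// If the positions collected on the driver are still sorted, order was preserved in this run.
val positions = transformed.select("pos").as[Long].collect()
println(s"order preserved: ${positions.sameElements(positions.sorted)}")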







Re: Nested RDD operation

2017-09-15 Thread Jean Georges Perrin
Hey Daniel, not sure this will help, but... I had a similar need where I wanted 
the content of a dataframe to become a "cell" or a row in the parent dataframe. 
I grouped the child dataframe, then collected it as a list into the parent 
dataframe after a join operation. As I said, not sure it matches your use case, 
but HIH...
jg
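
For what it's worth, a minimal sketch of that approach (made-up data and column names, assuming the Spark 2.x DataFrame API):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.collect_list

val spark = SparkSession.builder().appName("collect-list-sketch").master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical parent/child data standing in for the real DataFrames.
val parentDF = Seq((1, "a@example.com"), (2, "b@example.com")).toDF("parent_id", "email")
val childDF  = Seq((1, "apple"), (1, "orange"), (2, "pear")).toDF("parent_id", "event_name")

// Group the child rows into a list per parent key, then join that list back
// onto the parent so it shows up as a single array-typed column per row.
val childGrouped = childDF.groupBy("parent_id").agg(collect_list("event_name").as("event_names"))
val parentWithChildren = parentDF.join(childGrouped, Seq("parent_id"), "left")

parentWithChildren.show(false)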




Size exceeds Integer.MAX_VALUE issue with RandomForest

2017-09-15 Thread rpulluru
Hi,

I am using the SparkR randomForest function and running into a
java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE.
It looks like I am hitting https://issues.apache.org/jira/browse/SPARK-1476;
I set spark.default.parallelism=1000 but am still facing the same issue.

Thanks
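
The limit in SPARK-1476 is per block (partition), so the usual workaround is to split the training data into more, smaller partitions; spark.default.parallelism does not necessarily change how an existing DataFrame is partitioned. A rough sketch of the idea, written against the Scala ML API rather than SparkR (the SparkR equivalent would presumably be a repartition() of the input before spark.randomForest()); the data here is a tiny stand-in:

import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("rf-repartition-sketch").master("local[*]").getOrCreate()
import spark.implicits._

// Tiny stand-in for the real (large) training data: (label, features) rows.
val trainingDF = Seq(
  (0.0, Vectors.dense(0.1, 1.2)),
  (1.0, Vectors.dense(0.9, 0.3)),
  (0.0, Vectors.dense(0.2, 1.1))
).toDF("label", "features")

// On a dataset large enough to hit the 2 GB block limit, more partitions
// keep each shuffled or cached block well under Integer.MAX_VALUE bytes.
val repartitioned = trainingDF.repartition(2000)

val model = new RandomForestClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .setNumTrees(100)
  .fit(repartitioned)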






RE: RDD order preservation through transformations

2017-09-15 Thread johan.grande.ext
Well, the DataFrames make it easier to work on only some columns of the data 
and to store results in new columns, which removes the need to zip everything 
back together and therefore to preserve order across separate RDDs.
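
For example (a sketch with made-up column names), a derived value lands in a new column of the same rows, so there is nothing to zip back together:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("withcolumn-sketch").master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq((1, 2.0), (2, 3.5), (3, 5.0)).toDF("id", "x")

// The result is attached row by row as a new column, so no separate RDD
// has to be computed and zipped back in the original order.
val withResult = df.withColumn("x_squared", col("x") * col("x"))
withResult.show()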






[SPARK-SQL] Does spark-sql have Authorization built in?

2017-09-15 Thread Arun Khetarpal
Hi,

I wanted to understand whether Spark SQL has GRANT and REVOKE statements available.
Is anyone working on making them available?

Regards,
Arun




Re: spark.streaming.receiver.maxRate

2017-09-15 Thread Margus Roo

Hi

I tested the spark.streaming.receiver.maxRate and 
spark.streaming.backpressure.enabled settings using socketStream and 
they work.

But if I am using nifi-spark-receiver 
(https://mvnrepository.com/artifact/org.apache.nifi/nifi-spark-receiver), 
then it does not honour spark.streaming.receiver.maxRate.

Any workaround?
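
For reference, a minimal sketch of how those two settings would be applied in the socketStream case where they did work (host and port are placeholders); they have to be set on the SparkConf before the StreamingContext is created:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("maxrate-check")
  .set("spark.streaming.receiver.maxRate", "1")        // max records per second, per receiver
  .set("spark.streaming.backpressure.enabled", "true")

val ssc = new StreamingContext(conf, Seconds(1))
val lines = ssc.socketTextStream("localhost", 9999)    // placeholder host/port
lines.print()
ssc.start()
ssc.awaitTermination()

One thing that may be worth checking in the NiFi receiver source: the per-record rate limiter sits in Spark's block generator, so a receiver that hands Spark whole buffers of records at once (store(ArrayBuffer) or store(iterator)) can bypass spark.streaming.receiver.maxRate.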

Margus (margusja) Roo
http://margus.roo.ee
skype: margusja
https://www.facebook.com/allan.tuuring
+372 51 48 780

On 14/09/2017 09:57, Margus Roo wrote:


Hi

Using Spark 2.1.1.2.6-1.0-129 (from Hortonworks distro) and Scala 
2.11.8 and Java 1.8.0_60


I have a NiFi flow that produces more records than the Spark stream can 
process within the batch time. To avoid overflowing the Spark queue I wanted 
to try Spark Streaming backpressure (it did not work for me), so I went back 
to the simpler but static solution and tried spark.streaming.receiver.maxRate.


I set spark.streaming.receiver.maxRate=1. As I understand it from the 
Spark manual: "If the batch processing time is more than batch interval 
then obviously the receiver's memory will start filling up and will 
end up in throwing exceptions (most probably BlockNotFoundException). 
Currently there is no way to pause the receiver. Using SparkConf 
configuration spark.streaming.receiver.maxRate, rate of receiver can 
be limited." Does that mean 1 record per second?


I have very simple code:

val conf = new SiteToSiteClient.Builder()
  .url("http://192.168.80.120:9090/nifi")
  .portName("testing")
  .buildConfig()

val ssc = new StreamingContext(sc, Seconds(1))

val lines = ssc.receiverStream(new NiFiReceiver(conf, StorageLevel.MEMORY_AND_DISK))
lines.print()

ssc.start()


I have loads of records waiting on the NiFi "testing" port. After I call 
ssc.start() I get 7-9 records per batch (the batch time is 1 s). Am I 
misunderstanding spark.streaming.receiver.maxRate?


--
Margus (margusja) Roo
http://margus.roo.ee
skype: margusja
https://www.facebook.com/allan.tuuring
+372 51 48 780




Re: spark.streaming.receiver.maxRate

2017-09-15 Thread Margus Roo

Some more info

val lines = ssc.socketStream() // works
val lines = ssc.receiverStream(new NiFiReceiver(conf, StorageLevel.MEMORY_AND_DISK_SER_2)) // does not work

Margus (margusja) Roo
http://margus.roo.ee
skype: margusja
https://www.facebook.com/allan.tuuring
+372 51 48 780
