RE: RDD order preservation through transformations
Thanks, all, for your answers. After reading the links provided I am still uncertain of the details of what I'd need to do to get my calculations right with RDDs. However, I discovered DataFrames and Pipelines on the "ML" side of the libs, and I think they'll be better suited to my needs.

Best,
Johan Grande

_
This message and its attachments may contain confidential or privileged information that may be protected by law; they should not be distributed, used or copied without authorisation. If you have received this email in error, please notify the sender and delete this message and its attachments. As emails may be altered, Orange is not liable for messages that have been modified, changed or falsified. Thank you.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
Nested RDD operation
Hi guys,

I'm having trouble implementing this scenario:

I have a column with a typical entry being: ['apple', 'orange', 'apple', 'pear', 'pear']

I need to use a StringIndexer to transform this to: [0, 2, 0, 1, 1]

I'm attempting to do this, but because of the nested operation on another RDD I get a NullPointerException. Here's my code so far, thanks:

val dfWithSchema = sqlContext.createDataFrame(eventFeaturesRDD).toDF("email", "event_name")

// attempting
import sqlContext.implicits._
val event_list = dfWithSchema.select("event_name").distinct
val event_listDF = event_list.toDF()
val eventIndexer = new StringIndexer()
  .setInputCol("event_name")
  .setOutputCol("eventIndex")
  .fit(event_listDF)

val eventIndexed = eventIndexer.transform(event_listDF)

val converter = new IndexToString()
  .setInputCol("eventIndex")
  .setOutputCol("originalCategory")

val convertedEvents = converter.transform(eventIndexed)
val rddX = dfWithSchema.select("event_name").rdd
  .map(_.getString(0).split(",").map(_.trim.replaceAll("[\\[\\]\"]", "")).toList)
//val oneRow = Converted(eventIndexer.transform(sqlContext.sparkContext.parallelize(Seq("CCB")).toDF("event_name")).select("eventIndex").first().getDouble(0))

val severalRows = rddX.map(row => {
  // Split array into n tools
  println("ROW: " + row(0).toString)
  println(row(0).getClass)
  println("PRINT: " + eventIndexer.transform(sqlContext.sparkContext.parallelize(Seq(row(0))).toDF("event_name")).select("eventIndex").first().getDouble(0))
  (eventIndexer.transform(sqlContext.sparkContext.parallelize(Seq(row)).toDF("event_name")).select("eventIndex").first().getDouble(0), Seq(row).toString)
})
// attempting
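An NPE in code like this typically comes from using sqlContext / sparkContext (and transformer.transform) inside rddX.map: that closure runs on executors, where those driver-side objects are not available. One common workaround is to pull the fitted labels out of the StringIndexerModel on the driver (eventIndexer.labels), turn them into a plain Map, and look each row's values up in that map (broadcasting it in real Spark code). Below is a minimal non-Spark sketch of that lookup logic, assuming StringIndexer's default ordering (descending frequency, with ties broken alphabetically here for determinism); all names are made up for illustration:

```scala
object IndexSketch {
  // Build a label -> index map from a training column, mimicking
  // StringIndexer's default: most frequent label gets index 0.
  def fitLabels(values: Seq[String]): Map[String, Double] = {
    val ordered = values
      .groupBy(identity)
      .map { case (v, occs) => (v, occs.size) }
      .toSeq
      .sortBy { case (v, n) => (-n, v) } // most frequent first, ties alphabetical
      .map(_._1)
    ordered.zipWithIndex.map { case (v, i) => (v, i.toDouble) }.toMap
  }

  // Instead of calling the transformer inside rdd.map, look values up in a
  // plain (broadcastable) Map. Unseen labels would throw here; real code
  // should handle them explicitly.
  def transformRow(labelMap: Map[String, Double], row: Seq[String]): Seq[Double] =
    row.map(labelMap)

  def main(args: Array[String]): Unit = {
    val column = Seq("apple", "orange", "apple", "pear", "pear")
    val labelMap = fitLabels(column)
    println(transformRow(labelMap, column))
  }
}
```

On the example column this reproduces the desired [0, 2, 0, 1, 1]. In actual Spark code, labelMap would come from eventIndexer.labels (zipped with indices) and be shipped with sc.broadcast, so the rdd.map closure touches only a local Map. Note also that fitting on the distinct values (event_listDF) makes every frequency 1, so the index order would not reflect the real frequencies.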
Re: RDD order preservation through transformations
Hi Johan,

DataFrames are built on top of RDDs; I'm not sure whether the ordering issues are any different there. Maybe you could create a minimal (but large enough) simulated data set and an example series of transformations to experiment on.

Best,
-m

Mehmet Süzen, MSc, PhD | PRIVILEGED AND CONFIDENTIAL COMMUNICATION This e-mail transmission, and any documents, files or previous e-mail messages attached to it, may contain confidential information that is legally privileged. If you are not the intended recipient or a person responsible for delivering it to the intended recipient, you are hereby notified that any disclosure, copying, distribution or use of any of the information contained in or attached to this transmission is STRICTLY PROHIBITED within the applicable law. If you have received this transmission in error, please: (1) immediately notify me by reply e-mail to su...@acm.org, and (2) destroy the original transmission and its attachments without reading or saving in any manner.

On 15 September 2017 at 09:44, wrote:
> Thanks all for your answers. After reading the provided links I am still
> uncertain of the details of what I'd need to do to get my calculations right
> with RDDs. However I discovered DataFrames and Pipelines on the "ML" side of
> the libs and I think they'll be better suited to my needs.
>
> Best,
> Johan Grande
Re: Nested RDD operation
Hey Daniel, not sure this will help, but... I had a similar need where I wanted the content of a DataFrame to become a "cell" or a row in the parent DataFrame. I grouped the child DataFrame, then collected it as a list in the parent DataFrame after a join operation. As I said, not sure it matches your use case, but hope it helps...

jg

> On Sep 15, 2017, at 5:42 AM, Daniel O' Shaughnessy wrote:
>
> Hi guys,
>
> I'm having trouble implementing this scenario:
>
> I have a column with a typical entry being: ['apple', 'orange', 'apple', 'pear', 'pear']
>
> I need to use a StringIndexer to transform this to: [0, 2, 0, 1, 1]
>
> I'm attempting to do this but because of the nested operation on another RDD
> I get the NPE.
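jg's approach (group the child data, collect each group as a list, then attach it to the parent) can be sketched with plain Scala collections. In real Spark this would be a groupBy with collect_list followed by a join; the emails and events below are made-up stand-ins:

```scala
object GroupCollectSketch {
  // Group child rows by the join key, collecting values into a list --
  // a plain-Scala analogue of Spark's groupBy + collect_list.
  def collectByKey(child: Seq[(String, String)]): Map[String, Seq[String]] =
    child.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2)) }

  // "Join" each parent key with its collected list (empty if no child rows),
  // so a whole child group becomes one cell of the parent row.
  def joinParent(parent: Seq[String], grouped: Map[String, Seq[String]]): Seq[(String, Seq[String])] =
    parent.map(k => (k, grouped.getOrElse(k, Seq.empty)))

  def main(args: Array[String]): Unit = {
    val parent = Seq("a@x.com", "b@x.com", "c@x.com")
    val child  = Seq(("a@x.com", "open"), ("a@x.com", "click"), ("b@x.com", "open"))
    println(joinParent(parent, collectByKey(child)))
  }
}
```

The point of the pattern is that the nested data ends up as a list-valued column of the parent, so no per-row operation ever needs to touch a second distributed collection.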
Size exceeds Integer.MAX_VALUE issue with RandomForest
Hi, I am using the SparkR randomForest function and running into a java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE. It looks like I am hitting this issue: https://issues.apache.org/jira/browse/SPARK-1476. I set spark.default.parallelism=1000 but am still facing the same problem.

Thanks

-- Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
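For context on SPARK-1476: the ~2 GB limit comes from Spark addressing a partition's bytes with a Java Int (Integer.MAX_VALUE = 2^31 - 1), so no single partition/block may exceed that size. The usual workaround is raising the partition count so each partition's data stays well under 2 GB; note that spark.default.parallelism is not honoured by every code path (explicit repartitioning may be needed). A back-of-the-envelope sketch for choosing a partition count, with all sizes hypothetical:

```scala
object PartitionSizing {
  // Spark stores a partition's block in structures indexed by a Java Int,
  // so a single block must stay below Integer.MAX_VALUE (~2 GB).
  val MaxBlockBytes: Long = Int.MaxValue.toLong

  // Minimum partition count keeping every (evenly sized) block under the limit.
  // The headroom factor accounts for partitions rarely being perfectly balanced.
  def minPartitions(totalBytes: Long, headroom: Double = 2.0): Int =
    math.ceil(totalBytes * headroom / MaxBlockBytes).toInt

  def main(args: Array[String]): Unit = {
    val datasetBytes = 500L * 1024 * 1024 * 1024 // hypothetical 500 GB input
    println(minPartitions(datasetBytes))
  }
}
```

This is only a sizing heuristic, not Spark API; the resulting number would be passed to something like repartition() on the training data before fitting the model.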
RE: RDD order preservation through transformations
Well, DataFrames make it easier to work on only some columns of the data and to store results in new columns, which removes the need to zip everything back together and thus to preserve order.

On 2017-09-05 14:04 CEST, mehmet.su...@gmail.com wrote:
> Hi Johan, DataFrames are building on top of RDDs, not sure if the ordering
> issues are different there. Maybe you could create minimally large enough
> simulated data and example series of transformations as an example to
> experiment on. Best, -m Mehmet Süzen, MSc, PhD
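The zip-avoidance point can be illustrated with plain Scala collections: with RDDs, a column derived separately has to be zipped back on (which silently depends on both collections keeping the same order), whereas the DataFrame withColumn style keeps each result attached to its own row. All names below are made up:

```scala
object ColumnVsZip {
  case class Row(email: String, event: String)

  // RDD style: compute derived values separately, then zip them back on.
  // Correct only if both sequences have exactly the same order and length.
  def zipStyle(rows: Seq[Row]): Seq[(Row, Int)] =
    rows.zip(rows.map(_.event.length))

  // DataFrame style (analogue of df.withColumn): derive the value per record,
  // so the result stays attached to its row and no ordering assumption exists.
  def withColumnStyle(rows: Seq[Row]): Seq[(Row, Int)] =
    rows.map(r => (r, r.event.length))

  def main(args: Array[String]): Unit = {
    val rows = Seq(Row("a@x.com", "open"), Row("b@x.com", "click"))
    println(withColumnStyle(rows))
  }
}
```

Both give the same result on ordered local collections, but only the per-record style survives a framework that may reorder partitions between the two computations.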
[SPARK-SQL] Does spark-sql have Authorization built in?
Hi - I wanted to understand whether Spark SQL has GRANT and REVOKE statements available. Is anyone working on making that available?

Regards,
Arun
Re: spark.streaming.receiver.maxRate
Hi

I tested spark.streaming.receiver.maxRate and spark.streaming.backpressure.enabled using socketStream, and they work. But if I am using nifi-spark-receiver (https://mvnrepository.com/artifact/org.apache.nifi/nifi-spark-receiver), then it does not honour spark.streaming.receiver.maxRate. Any workaround?

Margus (margusja) Roo
http://margus.roo.ee
skype: margusja
https://www.facebook.com/allan.tuuring
+372 51 48 780

On 14/09/2017 09:57, Margus Roo wrote:
Hi

Using Spark 2.1.1.2.6-1.0-129 (from the Hortonworks distro), Scala 2.11.8 and Java 1.8.0_60. I have a NiFi flow that produces more records than the Spark stream can process within the batch time. To avoid the Spark queue overflowing I wanted to try Spark Streaming backpressure (which did not work for me), so back to the simpler but static solution: I tried spark.streaming.receiver.maxRate, setting spark.streaming.receiver.maxRate=1. As I understand it from the Spark manual: "If the batch processing time is more than batchinterval then obviously the receiver's memory will start filling up and will end up in throwing exceptions (most probably BlockNotFoundException). Currently there is no way to pause the receiver. Using SparkConf configuration spark.streaming.receiver.maxRate, rate of receiver can be limited." - does that mean 1 record per second? I have very simple code:

val conf = new SiteToSiteClient.Builder().url("http://192.168.80.120:9090/nifi").portName("testing").buildConfig()
val ssc = new StreamingContext(sc, Seconds(1))
val lines = ssc.receiverStream(new NiFiReceiver(conf, StorageLevel.MEMORY_AND_DISK))
lines.print()
ssc.start()

I have loads of records waiting in the NiFi testing port. After I call ssc.start() I get 7-9 records per batch (batch time is 1 s). Do I understand spark.streaming.receiver.maxRate wrong?

--
Margus (margusja) Roo
http://margus.roo.ee
skype: margusja
https://www.facebook.com/allan.tuuring
+372 51 48 780
Re: spark.streaming.receiver.maxRate
Some more info:

val lines = ssc.socketStream() // works
val lines = ssc.receiverStream(new NiFiReceiver(conf, StorageLevel.MEMORY_AND_DISK_SER_2)) // does not work

Margus (margusja) Roo
http://margus.roo.ee
skype: margusja
https://www.facebook.com/allan.tuuring
+372 51 48 780

On 15/09/2017 21:50, Margus Roo wrote:
> I tested spark.streaming.receiver.maxRate and spark.streaming.backpressure.enabled
> using socketStream and it works. But if I am using nifi-spark-receiver
> (https://mvnrepository.com/artifact/org.apache.nifi/nifi-spark-receiver)
> then it does not use spark.streaming.receiver.maxRate. Any workaround?
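For reference, spark.streaming.receiver.maxRate=1 is indeed meant as "at most one record per second per receiver"; however, the limit is enforced in the receiver machinery that handles records stored one at a time, so a receiver that pushes data in larger units may effectively bypass it (which could explain the NiFi behaviour; that part is an educated guess, not confirmed from the NiFiReceiver source). The intended semantics can be sketched with a simple fixed-window limiter in plain Scala; this models the behaviour, it is not Spark's implementation:

```scala
object RateLimitSketch {
  // Minimal fixed-window limiter: allow at most maxPerSecond records in any
  // one-second window, as a model of what spark.streaming.receiver.maxRate
  // is meant to enforce per receiver.
  final class Limiter(maxPerSecond: Int) {
    private var windowStartMs = 0L
    private var countInWindow = 0

    // Returns true if a record arriving at `nowMs` may be accepted.
    def tryAcquire(nowMs: Long): Boolean = {
      if (nowMs - windowStartMs >= 1000L) { windowStartMs = nowMs; countInWindow = 0 }
      if (countInWindow < maxPerSecond) { countInWindow += 1; true } else false
    }
  }

  def main(args: Array[String]): Unit = {
    val limiter = new Limiter(maxPerSecond = 1)
    // Five records arriving 300 ms apart: only one per one-second window passes.
    val accepted = (0 until 5).map(i => limiter.tryAcquire(i * 300L)).count(identity)
    println(accepted)
  }
}
```

Under this model, getting 7-9 records per 1 s batch with maxRate=1 means the limit is simply not being applied to that receiver, not that maxRate means something else.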