Re: PLs assist: trying to FlatMap a DataSet / partially OT

2017-09-16 Thread Marco Mistroni
Not exactly... I was not going to flatMap the RDD.
In the end I amended my approach to the problem and managed to get the
flatMap working on the Dataset directly (a rough sketch of that kind of
approach is below).
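
A minimal sketch, assuming spark.implicits._ is in scope so the result
encoder is derived implicitly (the session, names and data here are
illustrative, not the original code):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[2]").appName("flatmap-sketch").getOrCreate()
import spark.implicits._

// hypothetical input Dataset[(String, String)]
val tplDataSet = Seq(("k1", "Apple"), ("k2", "Pear")).toDS()

// flatMap the second tuple element into one-character Strings; mapping to
// String rather than Char because, as far as I know, there is no built-in
// encoder for Char in Spark 2.x
val expanded = tplDataSet.flatMap(tpl => tpl._2.map(_.toString))
expanded.show()
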
Thx for answering
Kr

On Sep 16, 2017 4:53 PM, "Akhil Das"  wrote:

> scala> case class Fruit(price: Double, name: String)
> defined class Fruit
>
> scala> val ds = Seq(Fruit(10.0,"Apple")).toDS()
> ds: org.apache.spark.sql.Dataset[Fruit] = [price: double, name: string]
>
> scala> ds.rdd.flatMap(f => f.name.toList).collect
> res8: Array[Char] = Array(A, p, p, l, e)
>
>
> Is this what you want to do?
>
> On Fri, Sep 15, 2017 at 4:21 AM, Marco Mistroni 
> wrote:
>
>> Hi all,
>>  could anyone assist please?
>> I am trying to flatMap a DataSet[(String, String)] and I am getting
>> errors in Eclipse.
>> The errors are more Scala-related than Spark-related, but I was
>> wondering if someone came across a similar situation.
>>
>> Here's what I got: a DS of (String, String), out of which I am using
>> flatMap to get a List[Char] for the second element in the tuple.
>>
>> val tplDataSet = < DataSet[(String, String)] >
>>
>> val expanded = tplDataSet.flatMap(tpl  => tpl._2.toList,
>> Encoders.product[(String, String)])
>>
>>
>> Eclipse complains that 'tpl' in the above function is missing a parameter
>> type.
>>
>> What am I missing? Or perhaps I am using the wrong approach?
>>
>> w/kindest regards
>>  Marco
>>
>
>
>
> --
> Cheers!
>
>


Re: Configuration for unit testing and sql.shuffle.partitions

2017-09-16 Thread Femi Anthony
How are you specifying it, as an option to spark-submit?

On Sat, Sep 16, 2017 at 12:26 PM, Akhil Das  wrote:

> spark.sql.shuffle.partitions is still used I believe. I can see it in the
> code and in the documentation page.
>
> On Wed, Sep 13, 2017 at 4:46 AM, peay  wrote:
>
>> Hello,
>>
>> I am running unit tests with Spark DataFrames, and I am looking for
>> configuration tweaks that would make tests faster. Usually, I use a
>> local[2] or local[4] master.
>>
>> Something that has been bothering me is that most of my stages end up
>> using 200 partitions, independently of whether I repartition the input.
>> This seems a bit overkill for small unit tests that barely have 200 rows
>> per DataFrame.
>>
>> spark.sql.shuffle.partitions used to control this I believe, but it seems
>> to be gone and I could not find any information on what mechanism/setting
>> replaces it or the corresponding JIRA.
>>
>> Has anyone experience to share on how to tune Spark best for very small
>> local runs like that?
>>
>> Thanks!
>>
>>
>
>
> --
> Cheers!
>
>


-- 
http://www.femibyte.com/twiki5/bin/view/Tech/
http://www.nextmatrix.com
"Great spirits have always encountered violent opposition from mediocre
minds." - Albert Einstein.


Re: Configuration for unit testing and sql.shuffle.partitions

2017-09-16 Thread Akhil Das
spark.sql.shuffle.partitions is still used I believe. I can see it in the
code and in the documentation page.
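
For local unit tests, a minimal sketch of lowering it when building the test
session (the values and app name are illustrative):

import org.apache.spark.sql.SparkSession

// small local session for tests: lower the shuffle parallelism so tiny
// DataFrames don't fan out into the default 200 partitions
val spark = SparkSession.builder()
  .master("local[2]")
  .appName("unit-tests")
  .config("spark.sql.shuffle.partitions", "4")  // default is 200
  .config("spark.default.parallelism", "4")     // affects RDD shuffles
  .getOrCreate()

It can also be changed at runtime with spark.conf.set("spark.sql.shuffle.partitions", "4"),
or passed on the command line via --conf spark.sql.shuffle.partitions=4.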

On Wed, Sep 13, 2017 at 4:46 AM, peay  wrote:

> Hello,
>
> I am running unit tests with Spark DataFrames, and I am looking for
> configuration tweaks that would make tests faster. Usually, I use a
> local[2] or local[4] master.
>
> Something that has been bothering me is that most of my stages end up
> using 200 partitions, independently of whether I repartition the input.
> This seems a bit overkill for small unit tests that barely have 200 rows
> per DataFrame.
>
> spark.sql.shuffle.partitions used to control this I believe, but it seems
> to be gone and I could not find any information on what mechanism/setting
> replaces it or the corresponding JIRA.
>
> Has anyone experience to share on how to tune Spark best for very small
> local runs like that?
>
> Thanks!
>
>


-- 
Cheers!


Re: PLs assist: trying to FlatMap a DataSet / partially OT

2017-09-16 Thread Akhil Das
scala> case class Fruit(price: Double, name: String)
defined class Fruit

scala> val ds = Seq(Fruit(10.0,"Apple")).toDS()
ds: org.apache.spark.sql.Dataset[Fruit] = [price: double, name: string]

scala> ds.rdd.flatMap(f => f.name.toList).collect
res8: Array[Char] = Array(A, p, p, l, e)


Is this what you want to do?
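
For completeness, a hedged sketch of staying in the Dataset API instead of
dropping to the RDD (assuming Spark 2.x): the encoder goes in flatMap's
second, implicit parameter list and has to describe the result element type,
and since there is no built-in encoder for Char this maps to one-character
Strings instead.

import org.apache.spark.sql.Encoders

// explicit encoder for the *result* element type, in the second parameter list
val expanded = ds.flatMap(f => f.name.map(_.toString))(Encoders.STRING)
expanded.collect()  // Array(A, p, p, l, e) as Strings

// or rely on the implicits instead of naming the encoder:
// import spark.implicits._
// val expanded = ds.flatMap(f => f.name.map(_.toString))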

On Fri, Sep 15, 2017 at 4:21 AM, Marco Mistroni  wrote:

> Hi all,
>  could anyone assist please?
> I am trying to flatMap a DataSet[(String, String)] and I am getting errors
> in Eclipse.
> The errors are more Scala-related than Spark-related, but I was wondering
> if someone came across a similar situation.
>
> Here's what I got: a DS of (String, String), out of which I am using
> flatMap to get a List[Char] for the second element in the tuple.
>
> val tplDataSet = < DataSet[(String, String)] >
>
> val expanded = tplDataSet.flatMap(tpl  => tpl._2.toList,
> Encoders.product[(String, String)])
>
>
> Eclipse complains that 'tpl' in the above function is missing a parameter
> type.
>
> What am I missing? Or perhaps I am using the wrong approach?
>
> w/kindest regards
>  Marco
>



-- 
Cheers!


Re: Size exceeds Integer.MAX_VALUE issue with RandomForest

2017-09-16 Thread Akhil Das
What are the parameters you passed to the classifier and what is the size
of your training data? You are hitting that issue because one of the blocks
is over 2 GB; repartitioning the data will help.
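
A minimal Scala sketch of that (the SparkR call is analogous; the input
path, column names and parameter values are illustrative assumptions):

import org.apache.spark.ml.classification.RandomForestClassifier

// explicitly spread the training data over more partitions so no single
// shuffled/cached block exceeds the 2 GB limit, then train
val train = spark.read.parquet("/path/to/train")  // hypothetical input with "label" and "features" columns
  .repartition(1000)

val rf = new RandomForestClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .setNumTrees(100)
  .setMaxDepth(5)

val model = rf.fit(train)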

On Fri, Sep 15, 2017 at 7:55 PM, rpulluru  wrote:

> Hi,
>
> I am using the SparkR randomForest function and running into a
> java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE issue.
> It looks like I am hitting this issue:
> https://issues.apache.org/jira/browse/SPARK-1476. I used
> spark.default.parallelism=1000 but am still facing the same problem.
>
> Thanks
>
>
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>
>
>


-- 
Cheers!


Re: [SPARK-SQL] Does spark-sql have Authorization built in?

2017-09-16 Thread Jörn Franke
It depends on the permissions the user has on the local file system or HDFS, so
there is no need for GRANT/REVOKE.

> On 15. Sep 2017, at 17:13, Arun Khetarpal  wrote:
> 
> Hi - 
> 
> I wanted to understand whether Spark SQL has GRANT and REVOKE statements available.
> Is anyone working on making that available?
> 
> Regards,
> Arun
> 



Re: [SPARK-SQL] Does spark-sql have Authorization built in?

2017-09-16 Thread Akhil Das
I guess no. I came across a test case where they are marked as Unsupported,
you can see it here. However, the one running inside Databricks has support
for this:
https://docs.databricks.com/spark/latest/spark-sql/structured-data-access-controls.html

On Fri, Sep 15, 2017 at 10:13 PM, Arun Khetarpal 
wrote:

> Hi -
>
> I wanted to understand whether Spark SQL has GRANT and REVOKE statements
> available.
> Is anyone working on making that available?
>
> Regards,
> Arun
>
>
>


-- 
Cheers!


Re: spark.streaming.receiver.maxRate

2017-09-16 Thread Akhil Das
I believe that's a question for the NiFi list. As you can see, the code base
is quite old
(https://github.com/apache/nifi/tree/master/nifi-external/nifi-spark-receiver/src/main/java/org/apache/nifi/spark)
and it doesn't make use of Spark's RateLimiter
(https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/receiver/RateLimiter.scala).
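
For what it's worth, my understanding is that spark.streaming.receiver.maxRate
is enforced by the BlockGenerator behind Receiver.store() when records are
stored one at a time, so whether a receiver honours it depends on how it calls
store(). A hedged, illustrative sketch of a receiver that goes through that
path (not the NiFi receiver's actual code):

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// Illustrative custom receiver: each record handed to store() individually
// goes through the BlockGenerator, which is where maxRate/backpressure
// throttle the intake; storing whole buffers or iterators bypasses that path.
class CountingReceiver extends Receiver[String](StorageLevel.MEMORY_AND_DISK_SER_2) {

  override def onStart(): Unit = {
    new Thread("counting-receiver") {
      override def run(): Unit = {
        var i = 0L
        while (!isStopped()) {
          store("record-" + i)  // single-record store => rate-limited
          i += 1
        }
      }
    }.start()
  }

  override def onStop(): Unit = {}
}

With spark.streaming.receiver.maxRate=1 and a 1 s batch interval, such a
receiver should produce roughly one record per batch.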


On Sat, Sep 16, 2017 at 1:59 AM, Margus Roo  wrote:

> Some more info
>
> val lines = ssc.socketStream() // works
> val lines = ssc.receiverStream(new NiFiReceiver(conf, StorageLevel.MEMORY_AND_DISK_SER_2)) // does not work
>
> Margus (margusja) Roo
> http://margus.roo.ee
> skype: margusja
> https://www.facebook.com/allan.tuuring
> +372 51 48 780
>
> On 15/09/2017 21:50, Margus Roo wrote:
>
> Hi
>
> I tested spark.streaming.receiver.maxRate and
> spark.streaming.backpressure.enabled settings using socketStream and it
> works.
>
> But if I am using nifi-spark-receiver (https://mvnrepository.com/artifact/org.apache.nifi/nifi-spark-receiver)
> then it does not honour spark.streaming.receiver.maxRate.
>
> any workaround?
>
> Margus (margusja) Roo
> http://margus.roo.ee
> skype: margusja
> https://www.facebook.com/allan.tuuring
> +372 51 48 780
>
> On 14/09/2017 09:57, Margus Roo wrote:
>
> Hi
>
> Using Spark 2.1.1.2.6-1.0-129 (from Hortonworks distro) and Scala 2.11.8
> and Java 1.8.0_60
>
> I have a NiFi flow that produces more records than the Spark stream can
> process within a batch interval. To avoid overflowing the Spark queue I wanted
> to try Spark Streaming backpressure (it did not work for me), so I went back
> to the simpler but static solution and tried spark.streaming.receiver.maxRate.
>
> I set spark.streaming.receiver.maxRate=1. As I understand it from the
> Spark manual: "If the batch processing time is more than batchinterval
> then obviously the receiver’s memory will start filling up and will end up
> in throwing exceptions (most probably BlockNotFoundException). Currently
> there is no way to pause the receiver. Using SparkConf configuration
> spark.streaming.receiver.maxRate, rate of receiver can be limited." - does
> that mean 1 record per second?
>
> I have very simple code:
>
> val conf = new SiteToSiteClient.Builder().url("http://192.168.80.120:9090/nifi").portName("testing").buildConfig()
> val ssc = new StreamingContext(sc, Seconds(1))
> val lines = ssc.receiverStream(new NiFiReceiver(conf, StorageLevel.MEMORY_AND_DISK))
> lines.print()
>
> ssc.start()
>
>
> I have loads of records waiting on the NiFi "testing" port. After I call
> ssc.start() I get 7-9 records per batch (batch interval is 1 s). Am I
> misunderstanding spark.streaming.receiver.maxRate?
>
> --
> Margus (margusja) Roo
> http://margus.roo.ee
> skype: margusja
> https://www.facebook.com/allan.tuuring
> +372 51 48 780
>
>
>
>


-- 
Cheers!