Re: Difference between 'cores' config params: spark submit on k8s

2019-04-20 Thread Li Gao
hi Battini,

The limit is a k8s construct that tells k8s how much cpu/cores your driver
*can* consume.

When you set the same value for 'spark.driver.cores' and
'spark.kubernetes.driver.limit.cores', your driver runs in the
'Guaranteed' k8s Quality of Service class, which makes it less likely to
be evicted by the scheduler.

The same goes with the executor settings.

https://kubernetes.io/docs/tasks/configure-pod-container/quality-service-pod/

The 'Guaranteed' QoS class is important when you are running a multi-tenant
k8s cluster in production.
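
For illustration, a sketch of a submit where the request and limit match (the
API server address, image name, and jar path below are placeholders):

    # matching spark.driver.cores (request) and spark.kubernetes.driver.limit.cores
    # (limit) makes the driver pod eligible for the Guaranteed QoS class
    spark-submit \
      --master k8s://https://<k8s-apiserver>:6443 \
      --deploy-mode cluster \
      --conf spark.driver.cores=2 \
      --conf spark.kubernetes.driver.limit.cores=2 \
      --conf spark.executor.cores=4 \
      --conf spark.kubernetes.executor.limit.cores=4 \
      --conf spark.kubernetes.container.image=<your-spark-image> \
      local:///path/to/your-app.jar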

Cheers,
Li


On Thu, Mar 7, 2019, 1:53 PM Battini Lakshman 
wrote:

> Hello,
>
> I understand we need to specify the 'spark.kubernetes.driver.limit.cores'
> and 'spark.kubernetes.executor.limit.cores' config parameters while
> submitting spark on k8s namespace with resource quota applied.
>
> There are also other config parameters 'spark.driver.cores' and
> 'spark.executor.cores' mentioned in documentation. What is the difference
> between 'spark.driver.cores' and 'spark.kubernetes.driver.limit.cores' please.
>
> Thanks!
>
> Best Regards,
> Lakshman B.
>


How to execute non-timestamp-based aggregations in spark structured streaming?

2019-04-20 Thread Stephen Boesch
Consider the following *intended* sql:

select row_number()
  over (partition by Origin order by OnTimeDepPct desc) OnTimeDepRank,*
  from flights

This will *not* work in *structured streaming*: the culprit is:

 partition by Origin

The requirement is to use a timestamp-typed field such as

 partition by flightTime

Tathagata Das (core committer for *Spark Streaming*) replied to this in a
Nabble thread:

 The traditional SQL windows with `over` is not supported in streaming.
Only time-based windows, that is, `window("timestamp", "10 minutes")` is
supported in streaming
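
For reference, a minimal PySpark sketch of the only supported form, assuming
`flights` is a streaming DataFrame with an event-time column `flightTime`:

    from pyspark.sql import functions as F

    # a time-based window plus ordinary grouping columns is allowed in streaming
    windowed = (flights
        .groupBy(F.window("flightTime", "10 minutes"), "Origin")
        .agg(F.avg("OnTimeDepPct").alias("avgOnTimeDepPct")))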

*What then* for my query above, which *must* be based on the *Origin* field?
What is the closest equivalent to that query? Or what would be a workaround
or different approach to achieve the same results?


repartition in df vs partitionBy in df

2019-04-20 Thread kumar.rajat20del
Hi Spark Users,

repartition and partitionBy seem to be very similar on a DataFrame.
In which scenario do we use one over the other?

As per my understanding, repartition is a very expensive operation since it
needs a full shuffle, so when do we use repartition?
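
For context, this is roughly how I understand the two are used (assuming an
existing DataFrame `df`; the column names and output path are made up):

    # repartition: a DataFrame transformation; full shuffle that redistributes
    # rows into 200 in-memory partitions, hashed by customer_id
    df2 = df.repartition(200, "customer_id")

    # partitionBy: an option on the DataFrameWriter; no shuffle by itself, it
    # just lays the output files out in one directory per event_date value
    df.write.partitionBy("event_date").parquet("/tmp/output")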

Thanks
Rajat






Re: toDebugString - RDD Logical Plan

2019-04-20 Thread Dylan Guedes
Kanchan,
the `toDebugString` output looks unformatted because in PySpark it is
returned as a bytes object, so the newlines come out escaped when you print
it directly. I suggest printing the RDD lineage with
`print(rdd.toDebugString().decode("utf-8"))` instead (note: this only
happens in PySpark).

About the other question, you may use `getNumPartitions()`.
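
A quick sketch of both suggestions (assuming `sc` is an existing SparkContext):

    rdd = sc.parallelize(range(100), 4).map(lambda x: x * 2)

    # decode the bytes PySpark returns so the newlines render properly
    print(rdd.toDebugString().decode("utf-8"))

    # query the partition count directly
    print(rdd.getNumPartitions())   # -> 4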

On Sat, Apr 20, 2019 at 2:40 PM kanchan tewary 
wrote:

> Dear All,
>
> Greetings!
>
> I am new to Apache Spark and working on RDDs using pyspark. I am trying to
> understand the logical plan provided by the toDebugString function, but I find
> two issues: a) the output is not formatted when I print the result, and
> b) I do not see the number of partitions shown.
>
> Can anyone direct me to any reference documentation to understand the
> logical plan better? Or do you suggest using the DAG from the Spark UI instead?
>
>
> Thanks & Best Regards,
> Kanchan
> Data Engineer, IBM
>


toDebugString - RDD Logical Plan

2019-04-20 Thread kanchan tewary
Dear All,

Greetings!

I am new to Apache Spark and working on RDDs using pyspark. I am trying to
understand the logical plan provided by the toDebugString function, but I find
two issues: a) the output is not formatted when I print the result, and
b) I do not see the number of partitions shown.

Can anyone direct me to any reference documentation to understand the
logical plan better? Or do you suggest using the DAG from the Spark UI instead?


Thanks & Best Regards,
Kanchan
Data Engineer, IBM


Feature engineering ETL for machine learning

2019-04-20 Thread Subash Prabakar
Hi,

I have a series of queries that extract data from multiple tables in Hive,
followed by feature engineering on the extracted data. I can run the queries
using Spark SQL and use MLlib to perform the feature transformations I need.
The question is: do you use any kind of tool to orchestrate this workflow
(running the queries), or is there a tool that, given a template/JSON, will
construct the Spark SQL queries for me?
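
To make the question concrete, a rough sketch of the kind of workflow I mean
today (assuming a Hive-enabled `spark` session; table, column, and feature
names are made up):

    from pyspark.ml.feature import VectorAssembler

    # extraction: plain Spark SQL over Hive tables
    raw = spark.sql("""
        SELECT u.user_id, u.age, t.txn_count
        FROM users u
        JOIN txn_summary t ON u.user_id = t.user_id
    """)

    # feature engineering: MLlib transformers on the extracted data
    assembler = VectorAssembler(inputCols=["age", "txn_count"], outputCol="features")
    features = assembler.transform(raw)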


Thanks
Subash


Re: --jars vs --spark.executor.extraClassPath vs --spark.driver.extraClassPath

2019-04-20 Thread Jason Nerothin
Hi Rajat,

A little more color:

The executor classpath will be used by the spark workers/slaves. For
example, all JVMs that are started with $SPARK_HOME/sbin/start-slave.sh. If
you run with --deploy-mode cluster, then the driver itself will be run
on the cluster (with the executor classpath).

If you run with --deploy-mode client, then the Driver will have its own
classpath (and JVM) on the host that you start it from (similar to running
from an IDE).

If you are NOT running from the shell, then it's usually best to build an
uber-jar containing all required jars and use spark-submit to send the
entire classpath to the cluster. Using --packages is also a good option for
jars that are in a repository (it also resolves local paths during
development).

To run driver code from an IDE, make sure your run/debug configuration is
picking up the spark libs you need as "provided" dependencies (sbt or
maven). This simulates the classpath that's provided by the Spark runtime.
I say 'simulates' because Spark (2.4.1) has access to 226 jar files in
$SPARK_HOME/jars and usually you're implementing against just a few of the
essential ones like spark-sql.

--jars is what to use for spark-shell.
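
For example, a sketch of how the three typically get passed (the paths and
class name below are made up; note that extraClassPath entries are not shipped
for you, they must already exist at those locations on the relevant machines):

    # --jars: distributed with the application and added to both driver and
    # executor classpaths; extraClassPath entries are only prepended to the JVM
    # classpath on machines where those files already exist
    spark-submit \
      --master yarn --deploy-mode cluster \
      --jars /path/to/extra-lib.jar \
      --conf spark.driver.extraClassPath=/opt/libs/driver-side.jar \
      --conf spark.executor.extraClassPath=/opt/libs/executor-side.jar \
      --class com.example.Main \
      my-uber-assembly.jar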

Final related tidbit: if you're implementing in Scala, make sure your jars
are version-compatible with the Scala version Spark was built against (2.11
as of Spark 2.4.1).

HTH

Jason

On Sat, Apr 20, 2019 at 4:34 AM Subash Prabakar 
wrote:

> Hey Rajat,
>
> The documentation page is self-explanatory.
>
> You can refer to this for more configs:
>
> https://spark.apache.org/docs/2.0.0/configuration.html
> (or the corresponding page for any other Spark version)
>
> Thanks.
> Subash
>
> On Sat, 20 Apr 2019 at 16:04, rajat kumar 
> wrote:
>
>> Hi,
>>
>> Can anyone pls explain ?
>>
>>
>> On Mon, 15 Apr 2019, 09:31 rajat kumar wrote:
>>> Hi All,
>>>
>>> I came across different parameters in spark submit
>>>
>>> --jars , --spark.executor.extraClassPath , --spark.driver.extraClassPath
>>>
>>> What are the differences between them? When to use which one? Will it
>>> differ
>>> if I use following:
>>>
>>> --master yarn --deploy-mode client
>>> --master yarn --deploy-mode cluster
>>>
>>>
>>> Thanks
>>> Rajat
>>>
>>

-- 
Thanks,
Jason


Re: --jars vs --spark.executor.extraClassPath vs --spark.driver.extraClassPath

2019-04-20 Thread Subash Prabakar
Hey Rajat,

The documentation page is self-explanatory.

You can refer to this for more configs:

https://spark.apache.org/docs/2.0.0/configuration.html
(or the corresponding page for any other Spark version)

Thanks.
Subash

On Sat, 20 Apr 2019 at 16:04, rajat kumar 
wrote:

> Hi,
>
> Can anyone pls explain ?
>
>
> On Mon, 15 Apr 2019, 09:31 rajat kumar wrote:
>> Hi All,
>>
>> I came across different parameters in spark submit
>>
>> --jars , --spark.executor.extraClassPath , --spark.driver.extraClassPath
>>
>> What are the differences between them? When to use which one? Will it
>> differ
>> if I use following:
>>
>> --master yarn --deploy-mode client
>> --master yarn --deploy-mode cluster
>>
>>
>> Thanks
>> Rajat
>>
>


Re: --jars vs --spark.executor.extraClassPath vs --spark.driver.extraClassPath

2019-04-20 Thread rajat kumar
Hi,

Can anyone pls explain ?

On Mon, 15 Apr 2019, 09:31 rajat kumar wrote:

> Hi All,
>
> I came across different parameters in spark submit
>
> --jars , --spark.executor.extraClassPath , --spark.driver.extraClassPath
>
> What are the differences between them? When to use which one? Will it
> differ
> if I use following:
>
> --master yarn --deploy-mode client
> --master yarn --deploy-mode cluster
>
>
> Thanks
> Rajat
>