Filtering based on a float value with more than one decimal place not working correctly in Pyspark dataframe

2018-09-25 Thread Meethu Mathew
Hi all, I tried the following code and the output was not as expected. schema = StructType([StructField('Id', StringType(), False), StructField('Value', FloatType(), False)]) df_test = spark.createDataFrame([('a',5.0),('b',1.236),('c',-0.31)],schema) df_test Output :
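The symptom described here is consistent with 32-bit float precision: `FloatType` stores a 4-byte IEEE 754 float, so a literal such as 1.236 cannot be represented exactly, and an equality filter against the Python double 1.236 will miss the stored value. A minimal pure-Python sketch of the underlying precision loss (no Spark required):

```python
import struct

def to_float32(x):
    """Round-trip a Python double through a 4-byte IEEE 754 float,
    mimicking what storing the value in a FloatType column does."""
    return struct.unpack('f', struct.pack('f', x))[0]

stored = to_float32(1.236)
print(stored)                       # close to, but not exactly, 1.236
print(stored == 1.236)              # False: float32 value != double literal
print(abs(stored - 1.236) < 1e-6)   # True: compare with a tolerance instead
```

Two common fixes are to declare the column as `DoubleType` (Python floats are doubles, so no narrowing occurs) or to filter with a tolerance rather than exact equality.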

RE: Python kubernetes spark 2.4 branch

2018-09-25 Thread Garlapati, Suryanarayana (Nokia - IN/Bangalore)
Hi Ilan/ Yinan, Yes my test case is also similar to the one described in https://issues.apache.org/jira/browse/SPARK-24736 My spark-submit is as follows: ./spark-submit --deploy-mode cluster --master k8s://https://10.75.145.23:8443 --conf spark.app.name=spark-py

Re: [DISCUSS] Cascades style CBO for Spark SQL

2018-09-25 Thread Xiao Li
Hi, Xiaoju, Thanks for sending this to the dev list. The current join reordering rule is just a stats based optimizer rule. Either top-down or bottom-up optimization can achieve the same-level optimized plans. DB2 is using bottom up. In the future, we plan to move the stats based join reordering

[Discuss] Language Interop for Apache Spark

2018-09-25 Thread tcondie
There seems to be some desire for third party language extensions for Apache Spark. Some notable examples include: * C#/F# from project Mobius https://github.com/Microsoft/Mobius * Haskell from project sparkle https://github.com/tweag/sparkle * Julia from project Spark.jl

Re: Python kubernetes spark 2.4 branch

2018-09-25 Thread Ilan Filonenko
Is this in reference to: https://issues.apache.org/jira/browse/SPARK-24736 ? On Tue, Sep 25, 2018 at 12:38 PM Yinan Li wrote: > Can you give more details on how you ran your app, did you build your own > image, and which image are you using? > > On Tue, Sep 25, 2018 at 10:23 AM Garlapati,

Accumulator issues in PySpark

2018-09-25 Thread Abdeali Kothari
I was trying to check out accumulators and see if I could use them for anything. I made a demo program and could not figure out how to add them up. I found that I need to do a shuffle between all my python UDFs that I am running for the accumulators to be run. Basically, if I do 5 withColumn()

Re: [Discuss] Datasource v2 support for Kerberos

2018-09-25 Thread Ryan Blue
I agree with Wenchen that we'd remove the prefix when passing to a source, so you could use the same "spark.yarn.keytab" option in both places. But I think the problem is that "spark.yarn.keytab" still needs to be set, and it clearly isn't in a shared namespace for catalog options. So I think we
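As a sketch of the prefix convention under discussion (the namespace and helper name here are hypothetical, not actual Spark API), the idea is that options set under a per-source namespace have the prefix stripped before being handed to the source, so the source still sees plain keys like `spark.yarn.keytab`:

```python
def options_for_source(conf, source_name):
    """Hypothetical helper: strip the per-source namespace prefix
    from matching config keys before passing them to the source."""
    prefix = "spark.datasource.%s." % source_name
    return {k[len(prefix):]: v
            for k, v in conf.items()
            if k.startswith(prefix)}

conf = {
    "spark.datasource.mysource.spark.yarn.keytab": "/path/to/keytab",
    "spark.app.name": "demo",  # not in the source's namespace, ignored
}
print(options_for_source(conf, "mysource"))
# {'spark.yarn.keytab': '/path/to/keytab'}
```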

Re: Python kubernetes spark 2.4 branch

2018-09-25 Thread Yinan Li
Can you give more details on how you ran your app, did you build your own image, and which image are you using? On Tue, Sep 25, 2018 at 10:23 AM Garlapati, Suryanarayana (Nokia - IN/Bangalore) wrote: > Hi, > > I am trying to run spark python testcases on k8s based on tag > spark-2.4-rc1. When

Re: Support for Second level of concurrency

2018-09-25 Thread Sandeep Mahendru
Hey Jörn, Appreciate the prompt reply. Yeah that would surely work, we have tried a similar approach. The only concern here is that to make the solution low latency, we want to avoid routing through a message broker. Regards, Sandeep. On Tue, Sep 25, 2018 at 12:53 PM Jörn Franke wrote: >

Python kubernetes spark 2.4 branch

2018-09-25 Thread Garlapati, Suryanarayana (Nokia - IN/Bangalore)
Hi, I am trying to run spark python testcases on k8s based on tag spark-2.4-rc1. When the dependent files are passed through the --py-files option, they are not getting resolved by the main python script. Please let me know, is this a known issue? Regards Surya

Re: Support for Second level of concurrency

2018-09-25 Thread Jörn Franke
What is the ultimate goal of this algorithm? There could be already algorithms that can do this within Spark. You could also put a message on Kafka (or another broker) and have spark applications listen to them to trigger further computation. This would be also more controlled and can be done
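The broker-based pattern Jörn describes can be sketched in-process, with a queue standing in for the Kafka topic (all names here are illustrative): one side publishes a "work ready" message, and a listening worker picks it up and triggers the follow-on computation:

```python
import queue
import threading

broker = queue.Queue()  # stands in for a Kafka topic
results = []

def listener():
    # Stands in for a Spark application consuming the topic and
    # triggering further computation for each message.
    while True:
        msg = broker.get()
        if msg is None:                # shutdown sentinel
            break
        results.append(msg * msg)      # the triggered computation

t = threading.Thread(target=listener)
t.start()
for payload in [2, 3, 4]:
    broker.put(payload)                # "publish" a trigger message
broker.put(None)
t.join()
print(results)  # [4, 9, 16]
```

As the thread notes, a real broker adds latency but gives you controlled, decoupled triggering between applications.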

Re: Support for Second level of concurrency

2018-09-25 Thread Reynold Xin
That’s a pretty major architectural change and would be extremely difficult to do at this stage. On Tue, Sep 25, 2018 at 9:31 AM sandeep mehandru wrote: > Hi Folks, > > There is a use-case, where we are doing large computation on two large > vectors. It is basically a scenario, where we run

Support for Second level of concurrency

2018-09-25 Thread sandeep mehandru
Hi Folks, There is a use-case, where we are doing large computation on two large vectors. It is basically a scenario, where we run a flatmap operation on the left vector and run correlation logic by comparing it with all the rows of the second vector. When this flatmap operation is running on
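The access pattern described (every row of the left vector compared against all rows of the right vector) is commonly handled in Spark by broadcasting the smaller side rather than nesting a second level of parallelism. A minimal pure-Python sketch of the per-row correlation step inside such a flatmap (names and data are illustrative):

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

left = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]]   # the distributed side
right = [[2.0, 4.0, 6.0]]                    # in Spark, broadcast this side

# The flatmap body: each left row is scored against every right row.
scores = [pearson(l, r) for l in left for r in right]
print(scores)
```

With the right side broadcast, each executor holds a full local copy, so the comparison stays a plain loop inside one task and no second level of Spark concurrency is needed.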

Re: [Discuss] Datasource v2 support for Kerberos

2018-09-25 Thread tigerquoll
To give some Kerberos specific examples, the spark-submit args: --conf spark.yarn.keytab=path_to_keytab --conf spark.yarn.principal=princi...@realm.com are currently not passed through to the data sources.