Re: [pyspark 2.3+] CountDistinct

2019-07-01 Thread Abdeali Kothari
I can't exactly reproduce this. Here is what I tried quickly: import uuid import findspark findspark.init() # noqa import pyspark from pyspark.sql import functions as F # noqa: N812 spark = pyspark.sql.SparkSession.builder.getOrCreate() df = spark.createDataFrame([ [str(uuid.uuid4()) for
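The code in the preview above is cut off; the following is a minimal sketch of what such a reproduction attempt might look like (the column name, row count, and countDistinct call are assumptions, not the original code):

    import uuid

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F  # noqa: N812

    spark = SparkSession.builder.getOrCreate()

    # Build a small DataFrame of random UUID strings (100 rows, one column).
    df = spark.createDataFrame(
        [[str(uuid.uuid4())] for _ in range(100)],
        schema=["uid"],
    )

    # countDistinct should report 100 distinct values here.
    df.select(F.countDistinct("uid")).show()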

Re: k8s orchestrating Spark service

2019-07-01 Thread Matt Cheah
> We’d like to deploy Spark Workers/Executors and Master (whatever master is easiest to talk about since we really don’t care) in pods as we do with the other services we use. Replace Spark Master with k8s if you insist. How do the executors get deployed? When running Spark against

Re: k8s orchestrating Spark service

2019-07-01 Thread Pat Ferrel
We have a Machine Learning Server. It submits various jobs through the Spark Scala API. The Server runs in a pod deployed from a chart by k8s and later uses the Spark API to submit jobs. I guess we find spark-submit to be a roadblock to our use of Spark, and the k8s support is fine, but how do you

Re: k8s orchestrating Spark service

2019-07-01 Thread Pat Ferrel
k8s as master would be nice but doesn’t solve the problem of running the full cluster and is an orthogonal issue. We’d like to deploy Spark Workers/Executors and Master (whatever master is easiest to talk about since we really don’t care) in pods as we do with the other services we use. Replace

Re: k8s orchestrating Spark service

2019-07-01 Thread Matt Cheah
Sorry, I don’t quite follow – why use the Spark standalone cluster as an in-between layer when one can just deploy the Spark application directly inside the Helm chart? I’m curious as to what the use case is, since I’m wondering if there’s something we can improve with respect to the native

Re: k8s orchestrating Spark service

2019-07-01 Thread Pat Ferrel
Thanks Matt, Actually I can’t use spark-submit. We submit the Driver programmatically through the API. But this is not the issue, and using k8s as the master is also not the issue; though you may be right about it being easier, it doesn’t quite get to the heart of it. We want to orchestrate a bunch of

Re: k8s orchestrating Spark service

2019-07-01 Thread Matt Cheah
I would recommend looking into Spark’s native support for running on Kubernetes. One can just start the application against Kubernetes directly using spark-submit in cluster mode or starting the Spark context with the right parameters in client mode. See
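As a rough illustration of the client-mode route described above, a Spark context can be pointed at the Kubernetes API server directly; the master URL, container image, namespace, and executor count below are placeholders, not values from the thread:

    from pyspark.sql import SparkSession

    # Sketch only: start a SparkSession against Kubernetes in client mode.
    # The API server address, image, namespace, and instance count are
    # placeholders and must match your cluster.
    spark = (
        SparkSession.builder
        .master("k8s://https://kubernetes.example.com:6443")
        .appName("ml-server-job")
        .config("spark.kubernetes.container.image", "my-registry/spark:2.4.0")
        .config("spark.kubernetes.namespace", "spark-jobs")
        .config("spark.executor.instances", "3")
        .getOrCreate()
    )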

Seeking help of UDF number-float converting

2019-07-01 Thread Danni Wu
Hello: I am using a UDF to convert a schema to JSON, and based on the JSON schema, when a field’s “type” key is “number”, I need to convert the input data to float. For example, if an “income” field’s type is number and the input data is “100”, the output should be “100.0”. But the problem is if an original
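A minimal sketch of the kind of conversion being described, assuming a hypothetical schema dict and column name (neither is from the original message):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F  # noqa: N812
    from pyspark.sql.types import FloatType

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical JSON-schema fragment: "income" is declared as a number.
    field_types = {"income": "number"}

    # Cast string input to float when the declared type is "number",
    # so "100" becomes 100.0.
    to_number = F.udf(lambda v: float(v) if v is not None else None, FloatType())

    df = spark.createDataFrame([("100",), ("250.5",)], ["income"])
    if field_types.get("income") == "number":
        df = df.withColumn("income", to_number(F.col("income")))
    df.show()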

Re: Map side join without broadcast

2019-07-01 Thread Chris Teoh
Hey there, I think explicitly specifying the partitioning overcomplicates things, since hash partitioning is the default behaviour of Spark's partitioner. You could simply do a partitionBy and it would use the hash partitioner by default. Let me know if I've
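To illustrate the point about the default hash partitioner (the key/value data and partition count here are made up):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    sc = spark.sparkContext

    left = sc.parallelize([("a", 1), ("b", 2), ("c", 3)])
    right = sc.parallelize([("a", "x"), ("b", "y")])

    # partitionBy hashes the key by default, so no explicit partitioning
    # function is needed; once both RDDs share the same partitioner and
    # partition count, the join does not need to reshuffle them.
    left_p = left.partitionBy(8)
    right_p = right.partitionBy(8)

    joined = left_p.join(right_p)
    print(joined.collect())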

[pyspark 2.4.0] write with partitionBy fails due to file already exists

2019-07-01 Thread Rishi Shah
Hi All, I have a simple partition write like below: df = spark.read.parquet('read-location') df.write.partitionBy('col1').mode('overwrite').parquet('write-location') this fails after an hour with a "file already exists (in .staging directory)" error. Not sure what I am doing wrong here.. --

Re: Implementing Upsert logic Through Streaming

2019-07-01 Thread Chris Teoh
Use a windowing function to get the "latest" version of the records from your incoming dataset and then update Oracle with the values, presumably via a JDBC connector. I hope that helps. On Mon, 1 Jul 2019 at 14:04, Sachit Murarka wrote: > Hi Chris, I have to make sure my DB has updated
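A sketch of the windowing approach described above, assuming an id column and an update timestamp (both column names, and the JDBC details, are assumptions):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F  # noqa: N812
    from pyspark.sql.window import Window

    spark = SparkSession.builder.getOrCreate()

    df = spark.createDataFrame(
        [(1, "old", "2019-06-30 10:00:00"),
         (1, "new", "2019-07-01 09:00:00"),
         (2, "only", "2019-07-01 08:00:00")],
        ["id", "value", "updated_at"],
    )

    # Keep only the most recent record per id.
    w = Window.partitionBy("id").orderBy(F.col("updated_at").desc())
    latest = (
        df.withColumn("rn", F.row_number().over(w))
          .filter(F.col("rn") == 1)
          .drop("rn")
    )

    # The deduplicated rows could then be pushed to Oracle via JDBC;
    # URL, table, and credentials below are placeholders.
    # latest.write.format("jdbc") \
    #     .option("url", "jdbc:oracle:thin:@//dbhost:1521/service") \
    #     .option("dbtable", "target_table") \
    #     .option("user", "...").option("password", "...") \
    #     .mode("append").save()
    latest.show()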

State of support for dynamic allocation on K8s and possible CMs

2019-07-01 Thread Federico D'Ambrosio
Hello everyone, I wanted to ask what the current state of support for Spark dynamic allocation is, and whether there is an issue where I could track its progress and missing features. We've just started evaluating possible alternatives for a production architectural setup for our use case, and dynamic