Re: Spark sql slowness in Spark 3.0.1

2022-04-14 Thread wilson
Just curious: where do you write it? Anil Dasari wrote: We are upgrading Spark from 2.4.7 to 3.0.1. We use Spark SQL (Hive) to checkpoint data frames (intermediate data). DF write is very slow in 3.0.1 compared to 2.4.7.

Re: Monitoring with elastic search in spark job

2022-04-14 Thread wilson
Maybe you can take a look at this? https://github.com/banzaicloud/spark-metrics Regards. Xinyu Luan wrote: Can I get any suggestions or some examples for how to get the metrics correctly?

Re: Reverse proxy for Spark UI on Kubernetes

2022-05-17 Thread wilson
What's the advantage of using a reverse proxy for the Spark UI? Thanks. On Tue, May 17, 2022 at 1:47 PM bo yang wrote: > Hi Spark Folks, > > I built a web reverse proxy to access Spark UI on Kubernetes (working > together with https://github.com/GoogleCloudPlatform/spark-on-k8s-operator). > Want to

how to add a column for percent

2022-05-23 Thread wilson
Hello, how do I add a column with the percentage for the current row of counted data? scala> df2.groupBy("_c1").count.withColumn("percent",f"${col(count)/df2.count}%.2f").show :30: error: type mismatch; This doesn't work, so please help. Thanks.
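A working sketch of what the question seems to be after (only the `_c1` column name is taken from the snippet; the sample data and app name are illustrative). The type mismatch comes from mixing driver-side string interpolation (`f"..."`) with Column expressions; `format_string` keeps the formatting inside the query:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, format_string}

object PercentExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("percent").master("local[*]").getOrCreate()
    import spark.implicits._

    val df2 = Seq("a", "a", "b", "c", "a").toDF("_c1")
    val total = df2.count() // total row count, computed once on the driver

    // format_string is a Column-level printf, so "%.2f" is applied per row inside Spark
    df2.groupBy("_c1").count()
      .withColumn("percent", format_string("%.2f", col("count") / total))
      .show()
  }
}
```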

Re: spark can't connect to kafka via sasl_ssl

2022-07-28 Thread wilson
Updated: I have now resolved the connection issue (it was due to wrong arguments passed to SASL), but I met another problem: 22/07/28 20:17:48 ERROR MicroBatchExecution: Query [id = 2a3bd87a-3a9f-4e54-a697-3d67cef77230, runId = 11c7ca0d-1bd9-4499-a613-6b6e8e8735ca] terminated with error

spark can't connect to kafka via sasl_ssl

2022-07-27 Thread wilson
Hello, my Spark client program is as follows: import org.apache.spark.sql.SparkSession object Sparkafka { def main(args:Array[String]):Unit = { val spark = SparkSession.builder.appName("Mykafka").getOrCreate() val df = spark .readStream .format("kafka")
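For reference, a hedged sketch of the SASL_SSL options such a reader usually needs (broker address, topic, credentials, and truststore path below are all placeholders; the option names are the standard Kafka consumer settings prefixed with "kafka." as required by the Structured Streaming Kafka source):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("Mykafka").getOrCreate()

val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9093")   // placeholder broker
  .option("subscribe", "mytopic")                     // placeholder topic
  .option("kafka.security.protocol", "SASL_SSL")
  .option("kafka.sasl.mechanism", "PLAIN")            // or SCRAM-SHA-256/512, per the cluster
  .option("kafka.sasl.jaas.config",
    """org.apache.kafka.common.security.plain.PlainLoginModule required username="user" password="pass";""")
  .option("kafka.ssl.truststore.location", "/path/to/truststore.jks") // placeholder path
  .load()
```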

spark null values calculation

2022-04-30 Thread wilson
My dataset has NULL included in the columns. Do you know why the select results below don't behave consistently? scala> dfs.select("cand_status").count() val res37: Long = 881793 scala> dfs.select("cand_status").where($"cand_status" =!= "NULL").count() val res38: Long = 383717 scala>
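The behavior above can be reproduced on a toy frame (column name taken from the post; data invented). A comparison against a null value yields null, not false, so null rows never satisfy `=!=`; `isNull`/`isNotNull` make the intent explicit:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object NullDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("nulls").master("local[*]").getOrCreate()
    import spark.implicits._

    val dfs = Seq(Some("granted"), None, Some("denied"), None).toDF("cand_status")

    dfs.count()                                      // 4: count() includes null rows
    dfs.where(col("cand_status") =!= "NULL").count() // 2: comparing null yields null, so null rows never match
    dfs.where(col("cand_status").isNotNull).count()  // 2: the explicit way to drop nulls
    dfs.where(col("cand_status").isNull).count()     // 2: and to keep only nulls
  }
}
```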

how spark handle the abnormal values

2022-05-01 Thread wilson
Hello, my dataset has abnormal values in a column whose normal values are numeric. I can select them as: scala> df.select("up_votes").filter($"up_votes".rlike(regex)).show() +---+ | up_votes| +---+ | <| | <| |fx-| |

Re: how spark handle the abnormal values

2022-05-01 Thread wilson
pretty well. So I guess: 1) Spark can do some automatic translation from string to numeric when aggregating. 2) Spark ignores those abnormal values automatically when calculating the relevant stuff. Am I right? Thank you. wilson wilson wrote: my dataset has abnormal values in the column whose
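The guessed behavior can be checked directly (sample values taken from the earlier show() output; everything else invented): `cast` turns non-numeric strings into null rather than failing, and aggregate functions skip nulls, so the abnormal rows drop out of the result:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, col}

object CastDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("cast").master("local[*]").getOrCreate()
    import spark.implicits._

    val df = Seq("10", "20", "<", "fx-").toDF("up_votes")

    // cast maps the non-numeric strings to null instead of raising an error
    df.select(col("up_votes").cast("double")).show() // 10.0, 20.0, null, null

    // aggregates ignore nulls, so only 10 and 20 contribute to the average
    df.select(avg(col("up_votes").cast("double"))).show() // 15.0
  }
}
```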

Re: spark null values calculation

2022-05-01 Thread wilson
Sorry, I have found the reason: a null can not be compared directly. I have written a note on this: https://bigcount.xyz/how-spark-handles-null-and-abnormal-values.html Thanks. wilson wrote: do you know why the select results below don't behave consistently

Re: how spark handle the abnormal values

2022-05-02 Thread wilson
Thanks Mich. But in my experience many original data sources have abnormal values included. I have already used rlike and filter to implement the data cleaning, as in my write-up: https://bigcount.xyz/calculate-urban-words-vote-in-spark.html What I am surprised by is that spark does the string to

Re: Disable/Remove datasources in Spark

2022-05-05 Thread wilson
It's probably impossible to disable that: a user can run spark.read... to read any datasource he can reach. Aditya wrote: 2. But I am not able to figure out how to "disable" all other data sources

Re: Disable/Remove datasources in Spark

2022-05-05 Thread wilson
Though this is off-topic, Apache Drill can do that. For instance, you can keep only the CSV storage plugin in the configuration and remove all other storage plugins; then users on Drill can query CSV only. Regards. Aditya wrote: So, is there a way for me to get a list of "leaf"

Re: Disable/Remove datasources in Spark

2022-05-05 Thread wilson
By the way, I use Drill to query webserver logs only, because Drill has a storage plugin for the httpd server log. But I found Spark is also convenient for querying webserver logs, for which I wrote a note: https://notes.4shield.net/how-to-query-webserver-log-with-spark.html Thanks. wilson wrote: though

Re: groupby question

2022-05-05 Thread wilson
I don't know what you were trying to express. It's better if you can give a sample dataset and the result you want to achieve; then we may give the right solution. Thanks. Irene Markelic wrote: I have an rdd that I want to group according to some key, but it just doesn't work. I am a Scala

Re: Unsubscribe

2022-04-28 Thread wilson
Please send a message to user-unsubscr...@spark.apache.org to unsubscribe. Ajay Thompson wrote: Unsubscribe

Re: java.lang.NoSuchMethodError - GraphX

2016-10-25 Thread Brian Wilson
; Chapter 6 of my book implements Dijkstra's Algorithm. The source code is > available to download for free. > https://www.manning.com/books/spark-graphx-in-action > > > > > From: Brian Wilson <bri

Re: java.lang.NoSuchMethodError - GraphX

2016-10-25 Thread Brian Wilson
correctly. What else could this be? Thanks Brian > On 25 Oct 2016, at 08:47, Brian Wilson <brian.wilson@gmail.com> wrote: > > Thank you Michael! This looks perfect but I have a `NoSuchMethodError` that I > cannot understand. > > I am trying to implement a weighted

Shortest path with directed and weighted graphs

2016-10-24 Thread Brian Wilson
I have been looking at the ShortestPaths function built into Spark. Am I correct in saying there is no support for weighted graphs with this function? By that I mean that
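As far as I know, GraphX's built-in ShortestPaths does count hops only; for edge weights the usual approach is the Pregel API. A sketch, close to the single-source shortest-paths example in the GraphX documentation (`graph` and `sourceId` are assumed inputs; edge attributes are Double weights):

```scala
import org.apache.spark.graphx.{Graph, VertexId}

// graph: edges carry Double weights; sourceId: the start vertex (assumed inputs)
def weightedShortestPaths(graph: Graph[Long, Double], sourceId: VertexId): Graph[Double, Double] = {
  // Initialize every vertex distance to +inf, except the source at 0
  val init = graph.mapVertices((id, _) =>
    if (id == sourceId) 0.0 else Double.PositiveInfinity)

  init.pregel(Double.PositiveInfinity)(
    (_, dist, newDist) => math.min(dist, newDist),   // vertex program: keep the shorter distance
    triplet =>                                       // send a message only along improving edges
      if (triplet.srcAttr + triplet.attr < triplet.dstAttr)
        Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
      else Iterator.empty,
    (a, b) => math.min(a, b)                         // merge messages: minimum incoming distance
  )
}
```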

Structured Streaming: stream-stream join with several equality conditions in a disjunction

2018-10-22 Thread WILSON Frank
Hi, I've just tried to do a stream-stream join with several equality conditions in a disjunction and got the following error: "terminated with exception: Stream stream joins without equality predicate is not supported;;" The query was in this sort of form:
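Structured Streaming needs at least one equality predicate to key the join state, which is why an OR of equalities is rejected. One possible workaround (a sketch only; `left`, `right`, `x`, and `y` are placeholder names, not from the original query) is to rewrite the disjunction as a union of two equi-joins, excluding from the second join the rows the first already matched:

```scala
// left and right are streaming DataFrames; x and y are placeholder join columns.
// a.x = b.x OR a.y = b.y, rewritten as two equi-joins unioned together:
val byX = left.join(right, left("x") === right("x"))
val byY = left.join(right,
  left("y") === right("y") && !(left("x") === right("x"))) // skip pairs byX already matched
val joined = byX.union(byY)
```

Note the null semantics differ slightly from a plain OR: rows where `x` is null on either side fall through to the second join only if the `y` equality holds.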

Spark Kubernetes Architecture: Deployments vs Pods that create Pods

2019-01-29 Thread WILSON Frank
Hi, I've been playing around with Spark Kubernetes deployments over the past week and I'm curious to know why Spark deploys as a driver pod that creates more worker pods. I've read that it's normal to use Kubernetes Deployments to create a distributed service, so I am wondering why Spark just

Pyspark 2.4.4 window functions inconsistent

2021-11-04 Thread van wilson
I am using pyspark sql to run a SQL-script window function that pulls in (lead) data from the next row to populate the first row. It works reliably in Jupyter in VS Code using Anaconda pyspark 3.0.0. It produces different data results every time on AWS EMR using Spark 2.4.4. Why? Is there any known
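A frequent cause of run-to-run differences is a window ORDER BY that does not uniquely order the rows: when two rows tie, `lead()` is free to return either neighbor, and the choice can vary with partitioning. A sketch (all column names are placeholders) that makes the ordering total with an existing unique key:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, lead}

object LeadDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("lead").master("local[*]").getOrCreate()
    import spark.implicits._

    // (group_col, ts, id, value): id is a unique key that breaks ordering ties
    val df = Seq(("a", 1, 1, 10), ("a", 1, 2, 20), ("a", 2, 3, 30))
      .toDF("group_col", "ts", "id", "value")

    // Without the id tiebreaker the two ts=1 rows have no defined order,
    // so lead() could legally pair them differently on different runs
    val w = Window.partitionBy("group_col").orderBy(col("ts"), col("id"))
    df.withColumn("next_value", lead(col("value"), 1).over(w)).show()
  }
}
```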

Unsubscribe

2022-03-16 Thread van wilson
> On Mar 16, 2022, at 7:38 AM, wrote: > > Thanks, Jayesh and all. I finally got the correlation data frame using agg > with a list of functions. > I think the list of functions which generate a column should have a more > detailed description. > > Liang > > - Original Message - > From: "Lalwani,

Re: Eclipse on spark

2015-01-26 Thread Luke Wilson-Mawer
I use this: http://scala-ide.org/ I also use Maven with this archetype: https://github.com/davidB/scala-archetype-simple. To be frank though, you should be fine using SBT. On Sat, Jan 24, 2015 at 6:33 PM, riginos samarasrigi...@gmail.com wrote: How to compile a Spark project in Scala IDE for