Re: Why my spark job STATE--> Running FINALSTATE --> Undefined.

2019-06-11 Thread Akshay Bhardwaj
Hi Shyam, It would help if you mention which --master URL you are using. Is it running on YARN, Mesos, or a standalone Spark cluster? That said, I faced such an issue in my earlier trials with Spark, in which I created connections to a lot of external databases, like Cassandra, within the Driver (or
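
As a rough illustration of the --master question (the app name and master value below are placeholders, not from the thread): the master URL decides whether YARN, Mesos, or a standalone Spark master runs the job, and on YARN the final status is normally reported as UNDEFINED while the application is still RUNNING.

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("state-undefined-demo")   # placeholder name
             .master("local[*]")                # or "yarn", "mesos://...", "spark://host:7077"
             .getOrCreate())

    # On YARN, the ResourceManager reports FinalStatus as UNDEFINED while the
    # application is still RUNNING; it only flips to SUCCEEDED/FAILED on exit.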

What is the compatibility between releases?

2019-06-11 Thread email
Dear Community, From what I understand, Spark uses a variation of Semantic Versioning [1], but this information is not enough for me to determine whether versions are compatible with each other. For example, if my cluster is running Spark 2.3.1, can I develop using API additions in Spark

[pyspark 2.3+] count distinct returns different value every time it is run on the same dataset

2019-06-11 Thread Rishi Shah
Hi All, countDistinct on a dataframe returns different results every time it is run. I would expect that when approxCountDistinct is used, but even for countDistinct()? Is there a way to get an accurate, deterministic count using pyspark? -- Regards, Rishi Shah
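
As a rough, self-contained sketch of the two aggregates being compared (the toy data below is made up): countDistinct is an exact aggregate, while approx_count_distinct uses HyperLogLog and is the one expected to vary within its relative error (rsd). If countDistinct itself varies, the usual suspects are the input data or a non-deterministic upstream transformation changing between runs.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import countDistinct, approx_count_distinct

    spark = SparkSession.builder.appName("distinct-count-demo").getOrCreate()

    # Toy data standing in for the real dataframe.
    df = spark.createDataFrame(
        [(1, "u1"), (2, "u1"), (3, "u2"), (4, "u3")],
        ["event_id", "user_id"],
    )

    # Exact distinct count: deterministic for a fixed input.
    df.agg(countDistinct("user_id").alias("exact_distinct")).show()

    # Approximate distinct count: may differ from the exact value within rsd.
    df.agg(approx_count_distinct("user_id", rsd=0.01).alias("approx_distinct")).show()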

Re: [External Sender] Re: Spark 2.4.1 on Kubernetes - DNS resolution of driver fails

2019-06-11 Thread Prudhvi Chennuru (CONT)
Hey Olivier, I am also facing the same issue on my Kubernetes cluster (v1.11.5) on AWS with Spark version 2.3.3. Any luck in figuring out the root cause? On Fri, May 3, 2019 at 5:37 AM Olivier Girardot < o.girar...@lateral-thoughts.com> wrote: > Hi, > I did not try on

RE: Spark on Kubernetes - log4j.properties not read

2019-06-11 Thread Dave Jaffe
That did the trick, Abhishek! Thanks for the explanation, that answered a lot of questions I had. Dave

Why my spark job STATE--> Running FINALSTATE --> Undefined.

2019-06-11 Thread Shyam P
Hi, Any clue why a Spark job goes into the UNDEFINED state? More details are in the URL below. https://stackoverflow.com/questions/56545644/why-my-spark-sql-job-stays-in-state-runningfinalstatus-undefined Appreciate your help. Regards, Shyam

Re: Fwd: [Spark SQL Thrift Server] Persistence errors with PostgreSQL and MySQL in 2.4.3

2019-06-11 Thread rmartine
Hi folks, Does anyone know what is happening in this case? I tried both MySQL and PostgreSQL, and neither of them finishes schema creation without errors. It seems something has changed from 2.2 to 2.4 that broke schema generation for the Hive Metastore.

AWS EMR slow write to HDFS

2019-06-11 Thread Femi Anthony
I'm writing a large dataset in Parquet format to HDFS using Spark, and it runs rather slowly on EMR vs., say, Databricks. I realize that if I were able to use Hadoop 3.1, it would be much more performant because it has a high-performance output committer. Is this the case, and if so, when will
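
The committer Femi refers to ships with Hadoop 3.1; as a hedged sketch of a knob that is often mentioned in the meantime (the output path below is a placeholder), the v2 file output committer algorithm reduces the serial renames done at job commit:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("parquet-write-demo")
             # v2 committer moves task output straight into the final directory,
             # cutting the serial renames done at job commit time.
             .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
             .getOrCreate())

    df = spark.range(0, 1000000).withColumnRenamed("id", "record_id")
    df.write.mode("overwrite").parquet("hdfs:///tmp/parquet_write_demo")  # placeholder path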

Re: Spark kafka streaming job stopped

2019-06-11 Thread Amit Sharma
Please provide an update if anyone knows. On Monday, June 10, 2019, Amit Sharma wrote: > > We have a Spark Kafka streaming job running on a standalone Spark cluster. We > have the Kafka architecture below: > > 1. Two clusters running in two data centers. > 2. There is an LTM on top of each data center (load

Re: best docker image to use

2019-06-11 Thread Riccardo Ferrari
Hi Marcelo, I'm used to working with https://github.com/jupyter/docker-stacks. There's a Scala+Jupyter option too, though there might be a better option with Zeppelin as well. HTH. On Tue, 11 Jun 2019, 11:52 Marcelo Valle, wrote: > Hi, > > I would like to run spark shell + scala on a docker

Re: Read hdfs files in spark streaming

2019-06-11 Thread nitin jain
Hi Deepak, Please let us know how you managed it. Thanks, NJ On Mon, Jun 10, 2019 at 4:42 PM Deepak Sharma wrote: > Thanks All. > I managed to get this working. > Marking this thread as closed. > > On Mon, Jun 10, 2019 at 4:14 PM Deepak Sharma > wrote: > >> This is the project requirement
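
The thread closes without saying what the fix actually was; purely as a hedged sketch of one common pattern for reading HDFS files as a stream (the directory below is a placeholder), the structured streaming file source watches a directory for new files:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("hdfs-file-stream-demo").getOrCreate()

    # Each new file landing in the directory becomes streaming input.
    lines = spark.readStream.text("hdfs:///incoming/data")   # placeholder directory

    query = (lines.writeStream
             .format("console")
             .outputMode("append")
             .start())
    query.awaitTermination()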

Re: Spark structured streaming leftOuter join not working as I expect

2019-06-11 Thread Jungtaek Lim
Got the point. If you would like to get "correct" output, you may need to set the global watermark to "min", because the watermark is not only used for evicting rows from state but also for discarding input rows that arrive later than the watermark. Here you may want to be aware that there are two stateful operators which will
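
As a hedged sketch of what setting the global watermark to "min" refers to (all stream, column, and interval values below are made up): since Spark 2.4 the policy for combining multiple per-stream watermarks is spark.sql.streaming.multipleWatermarkPolicy, and a stream-stream left outer join additionally needs a time-range join condition:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, expr

    spark = SparkSession.builder.appName("watermark-join-demo").getOrCreate()

    # "min" (the default) holds the global watermark at the slowest stream;
    # "max" advances it faster but drops more late input.
    spark.conf.set("spark.sql.streaming.multipleWatermarkPolicy", "min")

    # Two toy rate-source streams standing in for the real inputs.
    impressions = (spark.readStream.format("rate").option("rowsPerSecond", 5).load()
                   .select(col("value").alias("ad_id"), col("timestamp").alias("imp_ts")))
    clicks = (spark.readStream.format("rate").option("rowsPerSecond", 5).load()
              .select(col("value").alias("ad_id"), col("timestamp").alias("click_ts")))

    imp = impressions.withWatermark("imp_ts", "10 minutes").alias("imp")
    clk = clicks.withWatermark("click_ts", "15 minutes").alias("clk")

    # Left outer stream-stream joins need watermarks plus a time-range condition.
    joined = imp.join(
        clk,
        expr("imp.ad_id = clk.ad_id AND "
             "clk.click_ts BETWEEN imp.imp_ts AND imp.imp_ts + interval 20 minutes"),
        "leftOuter")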

best docker image to use

2019-06-11 Thread Marcelo Valle
Hi, I would like to run spark shell + scala in a docker environment, just to play with it on my development machine without having to install a JVM + a lot of other things. Is there an "official docker image" I am recommended to use? I saw some on Docker Hub, but it seems they are all

Re: Getting driver logs in Standalone Cluster

2019-06-11 Thread Lourier, Jean-Michel (FIX1)
Hi Patrick, I guess the easiest way is to use log aggregation: https://spark.apache.org/docs/latest/running-on-yarn.html#debugging-your-application BR Jean-Michel

Re: [Pyspark 2.4] Best way to define activity within different time window

2019-06-11 Thread Georg Heiler
For grouping with each: look into grouping sets https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-multi-dimensional-aggregation.html On Tue, Jun 11, 2019 at 06:09, Rishi Shah < rishishah.s...@gmail.com> wrote: > Thank you both for your input! > > To calculate moving average
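
As a rough sketch of the grouping sets suggestion (the table and columns below are made up), Spark SQL exposes GROUPING SETS directly, computing several group-bys in one pass:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("grouping-sets-demo").getOrCreate()

    activity = spark.createDataFrame(
        [("u1", "2019-06-01", 3), ("u1", "2019-06-02", 5), ("u2", "2019-06-01", 1)],
        ["user_id", "day", "events"],
    )
    activity.createOrReplaceTempView("activity")

    # One pass computes both the per-user and the per-user-per-day aggregates.
    spark.sql("""
        SELECT user_id, day, SUM(events) AS total_events
        FROM activity
        GROUP BY user_id, day
        GROUPING SETS ((user_id), (user_id, day))
    """).show()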

Re: Spark 2.2 With Column usage

2019-06-11 Thread Jacek Laskowski
Hi, Why are you doing the following two lines? .select("id",lit(referenceFiltered)) .selectexpr( "id" ) What are you trying to achieve? What's lit and what's referenceFiltered? What's the difference between select and selectexpr? Please start at
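
For readers landing on this thread, a small aside on the two pieces Jacek asks about (the data below is made up, and whatever referenceFiltered is in the original post is not shown here): lit() wraps a plain literal into a Column, and selectExpr() parses SQL expression strings instead of taking Column objects.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, lit

    spark = SparkSession.builder.appName("select-vs-selectexpr-demo").getOrCreate()
    df = spark.createDataFrame([(1,), (2,)], ["id"])

    df.select(col("id"), lit(42).alias("constant")).show()   # Column-based API
    df.selectExpr("id", "42 AS constant").show()             # SQL-expression strings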