Re: Why my spark job STATE--> Running FINALSTATE --> Undefined.
Hi Shyam,

It would help if you mentioned what you are passing as the --master URL. Is the job running on YARN, Mesos, or a standalone Spark cluster?

That said, I hit a similar issue in my early trials with Spark, where I created connections to several external databases (such as Cassandra) inside the driver (the main program of my app). After the job completed, my main program/driver task never finished; after debugging, I found the reason to be open sessions with Cassandra. Closing those connections at the end of the main program resolved the problem. As you can guess, the issue occurred regardless of the cluster manager used.

Akshay Bhardwaj
+91-97111-33849

On Tue, Jun 11, 2019 at 7:41 PM Shyam P wrote:
> Hi,
> Any clue why a Spark job goes into the UNDEFINED state?
>
> More details are in this URL:
> https://stackoverflow.com/questions/56545644/why-my-spark-sql-job-stays-in-state-runningfinalstatus-undefined
>
> Appreciate your help.
>
> Regards,
> Shyam
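The fix Akshay describes — closing external sessions before the driver's main program returns — is a generic resource-cleanup pattern. Below is a minimal stdlib-only sketch of that pattern; `DummySession` is a hypothetical stand-in for a real Cassandra (or JDBC) session, which in practice may hold non-daemon threads that keep the driver process alive.

```python
# Sketch of "close external sessions before the driver exits".
# DummySession is a hypothetical stand-in for e.g. a Cassandra session.

class DummySession:
    """Stands in for an external database session opened in the driver."""
    def __init__(self, name):
        self.name = name
        self.closed = False

    def close(self):
        self.closed = True


def run_driver(sessions):
    try:
        # ... the actual Spark job logic would run here ...
        return "job done"
    finally:
        # Close every external session even if the job fails; otherwise
        # the driver process may never terminate.
        for s in sessions:
            s.close()


sessions = [DummySession("cassandra"), DummySession("jdbc")]
result = run_driver(sessions)
print(result, all(s.closed for s in sessions))
```

The try/finally (or a context manager) is the important part: cleanup runs on both the success and failure paths, so no open session can keep the driver hanging after completion.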
What is the compatibility between releases?
Dear Community,

From what I understand, Spark uses a variation of Semantic Versioning [1], but that information is not enough for me to work out compatibility across versions. For example, if my cluster is running Spark 2.3.1, can I develop using API additions from Spark 2.4 (higher-order functions, to give an example)? What about the other way around? Typically, I assume that a job built against Spark 1.x will fail on Spark 2.x, but that is also something I would like to get confirmed.

Thank you for your help!

[1] https://spark.apache.org/versioning-policy.html
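As a rule of thumb from the versioning policy: an API introduced in a later minor release (e.g. higher-order SQL functions, added in Spark 2.4) cannot be used on an older runtime like 2.3.1, while code built against 2.3 APIs should generally still run on 2.4; major-version jumps (1.x to 2.x) may break. A tiny sketch of that availability check — my own encoding of the policy, not official tooling:

```python
def parse_version(v):
    """Parse 'major.minor.patch' into a comparable (major, minor) tuple."""
    major, minor, *_ = (int(x) for x in v.split("."))
    return major, minor

def api_available(cluster_version, api_introduced_in):
    """True if an API introduced in `api_introduced_in` exists on the cluster."""
    return parse_version(cluster_version) >= parse_version(api_introduced_in)

# Higher-order SQL functions appeared in Spark 2.4:
print(api_available("2.3.1", "2.4.0"))  # False: not usable on a 2.3.1 cluster
print(api_available("2.4.3", "2.4.0"))  # True
```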
[pyspark 2.3+] count distinct returns different value every time it is run on the same dataset
Hi All,

countDistinct on a dataframe returns different results every time it is run. I would expect that with approxCountDistinct, but with countDistinct()? Is there a way to get an accurate, deterministic count using pyspark?

--
Regards,
Rishi Shah
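countDistinct itself computes an exact count; a common cause of varying results is that the *input* is non-deterministic — e.g. a column derived from rand() without a seed, or a source that changes between actions — so each run recomputes different rows. A stdlib-only illustration of that effect (plain Python lists standing in for the DataFrame, purely for illustration):

```python
import random

def build_dataset(deterministic):
    """Simulate a data source that is (or is not) stable across recomputations."""
    if deterministic:
        rng = random.Random(42)   # fixed seed: same rows every "recomputation"
    else:
        rng = random.Random()     # unseeded: different rows each time
    return [rng.randrange(1000) for _ in range(100)]

def count_distinct(rows):
    return len(set(rows))  # exact distinct count; deterministic for fixed input

a = count_distinct(build_dataset(deterministic=True))
b = count_distinct(build_dataset(deterministic=True))
print(a == b)  # True: a stable input gives the same exact count every run
```

So the first thing to check is whether the dataframe being counted is itself deterministic (seeded random columns, stable source files, no re-read of changing data) before suspecting the aggregate.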
Re: [External Sender] Re: Spark 2.4.1 on Kubernetes - DNS resolution of driver fails
Hey Olivier,

I am also facing the same issue on my Kubernetes cluster (v1.11.5) on AWS with Spark 2.3.3. Any luck in figuring out the root cause?

On Fri, May 3, 2019 at 5:37 AM Olivier Girardot <o.girar...@lateral-thoughts.com> wrote:
> Hi,
> I did not try on another vendor, so I can't say whether it's only related to GKE, and no, I did not notice anything on the kubelet or kube-dns processes...
>
> Regards
>
> Le ven. 3 mai 2019 à 03:05, Li Gao a écrit :
>> Hi Olivier,
>>
>> This seems like a GKE-specific issue? Have you tried other vendors? Also, on the kubelet nodes, did you notice any pressure on the DNS side?
>>
>> Li
>>
>> On Mon, Apr 29, 2019, 5:43 AM Olivier Girardot <o.girar...@lateral-thoughts.com> wrote:
>>> Hi everyone,
>>> I have ~300 Spark jobs on Kubernetes (GKE) using the cluster auto-scaler, and sometimes while running these jobs a pretty bad thing happens: the driver (in cluster mode) gets scheduled on Kubernetes and launches many executor pods.
>>> So far so good, but the k8s "Service" associated with the driver does not seem to be propagated in terms of DNS resolution, so all the executors fail with "spark-application-..cluster.svc.local does not exist".
>>>
>>> With all executors failing, the driver should fail too, but it considers this a "pending" initial allocation and stays stuck forever in a loop of "Initial job has not accepted any resources, please check Cluster UI".
>>>
>>> Has anyone else observed this kind of behaviour?
>>> We had it on 2.3.1, and I upgraded to 2.4.1, but the issue still seems to exist even after the "big refactoring" of the Kubernetes cluster scheduler backend.
>>>
>>> I can work on a fix / workaround, but I'd like to check with you on the proper way forward:
>>>
>>> - Some processes (like the Airflow Helm recipe) rely on a "sleep 30s" before launching the dependent pods (that could be added to /opt/entrypoint.sh used in the Kubernetes packaging)
>>> - We could add a simple step to the init container that tries the DNS resolution and fails after 60s if it does not succeed
>>>
>>> But these steps won't change the fact that the driver will stay stuck, thinking we're still within the initial allocation delay.
>>>
>>> Thoughts?
>>>
>>> --
>>> *Olivier Girardot*
>>> o.girar...@lateral-thoughts.com

--
*Thanks,*
*Prudhvi Chennuru.*
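The second workaround proposed above (an init step that retries DNS resolution and gives up after a deadline) could look roughly like this; the hostname and timeouts here are placeholders:

```python
import socket
import time

def wait_for_dns(hostname, timeout_s=60.0, interval_s=1.0):
    """Retry DNS resolution of `hostname` until it succeeds or the
    deadline passes. Returns True on success, False on timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            socket.getaddrinfo(hostname, None)
            return True
        except socket.gaierror:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval_s)

# In an init container you would pass the driver Service's DNS name;
# "localhost" is used here only so the sketch is runnable anywhere.
print(wait_for_dns("localhost", timeout_s=5))
```

Run as an init container (or at the top of the entrypoint), this makes the executor wait until the driver Service name is resolvable instead of relying on a fixed "sleep 30s" — though, as noted above, it does not fix the driver being stuck in the initial-allocation loop.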
RE: Spark on Kubernetes - log4j.properties not read
That did the trick, Abhishek! Thanks for the explanation; that answered a lot of questions I had.

Dave

--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
Why my spark job STATE--> Running FINALSTATE --> Undefined.
Hi,

Any clue why a Spark job goes into the UNDEFINED state?

More details are in this URL:
https://stackoverflow.com/questions/56545644/why-my-spark-sql-job-stays-in-state-runningfinalstatus-undefined

Appreciate your help.

Regards,
Shyam
Re: Fwd: [Spark SQL Thrift Server] Persistence errors with PostgreSQL and MySQL in 2.4.3
Hi folks,

Does anyone know what is happening in this case? I tried with both MySQL and PostgreSQL, and neither of them finishes schema creation without errors. It seems something changed from 2.2 to 2.4 that broke schema generation for the Hive Metastore.
AWS EMR slow write to HDFS
I'm writing a large dataset in Parquet format to HDFS using Spark, and it runs rather slowly on EMR versus, say, Databricks. I understand that if I were able to use Hadoop 3.1, it would be much more performant, because it has a high-performance output committer. Is this the case, and if so, when will there be a version of EMR that uses Hadoop 3.1? The current version I'm using is 5.21.

Sent from my iPhone
Re: Spark kafka streaming job stopped
Please provide an update if anyone knows.

On Monday, June 10, 2019, Amit Sharma wrote:
>
> We have a Spark Kafka streaming job running on a standalone Spark cluster. We have the following Kafka architecture:
>
> 1. Two clusters running in two data centers.
> 2. There is an LTM (load balancer) on top of each data center.
> 3. There is a GSLB on top of the LTMs.
>
> I observed that whenever any node in the Kafka cluster is down, our Spark streaming job stops. We are using the GSLB URL in our code to connect to Kafka, not the IP addresses. Please let me know if this is expected behavior; if not, what config do we need to change?
>
> Thanks
> Amit
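One thing worth knowing here: Kafka clients use the bootstrap servers only for the initial metadata fetch, and afterwards connect directly to individual brokers via their advertised listeners, so a single load-balancer URL does not by itself give failover (and surviving a broker outage also depends on topic replication). The usual client-side guard is to list several brokers in `bootstrap.servers`. A sketch of such a config — all hostnames and the group id are placeholders:

```python
# Hypothetical Kafka consumer config: list several brokers directly in
# bootstrap.servers so the initial metadata fetch survives one broker
# (or one side of the load balancer) being down. After bootstrap, the
# client talks to brokers via their advertised listeners anyway.
kafka_params = {
    "bootstrap.servers": ",".join([
        "kafka-dc1-1.example.com:9092",   # placeholder hostnames
        "kafka-dc1-2.example.com:9092",
        "kafka-dc2-1.example.com:9092",
    ]),
    "group.id": "spark-streaming-job",    # placeholder group id
}
print(kafka_params["bootstrap.servers"])
```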
Re: best docker image to use
Hi Marcelo,

I'm used to working with https://github.com/jupyter/docker-stacks. There's a Scala+Jupyter option too, though there might be a better option with Zeppelin.

Hth

On Tue, 11 Jun 2019, 11:52 Marcelo Valle, wrote:
> Hi,
>
> I would like to run spark shell + scala in a docker environment, just to play with it on a development machine without having to install a JVM + a lot of things.
>
> Is there something like an "official docker image" I am recommended to use? I saw some on Docker Hub, but it seems they are all contributions from pro-active individuals. I wonder whether the group maintaining Apache Spark also maintains docker images for use cases like this?
>
> Thanks,
> Marcelo.
Re: Read hdfs files in spark streaming
Hi Deepak,

Please let us know how you managed it.

Thanks,
NJ

On Mon, Jun 10, 2019 at 4:42 PM Deepak Sharma wrote:
> Thanks all.
> I managed to get this working.
> Marking this thread as closed.
>
> On Mon, Jun 10, 2019 at 4:14 PM Deepak Sharma wrote:
>> This is the project requirement, where paths are being streamed in a kafka topic.
>> It seems it's not possible using spark structured streaming.
>>
>> On Mon, Jun 10, 2019 at 3:59 PM Shyam P wrote:
>>> Hi Deepak,
>>> Why are you getting paths from a kafka topic? Any specific reason to do so?
>>>
>>> Regards,
>>> Shyam
>>>
>>> On Mon, Jun 10, 2019 at 10:44 AM Deepak Sharma wrote:
>>>> The context is different here.
>>>> The file paths are coming as messages in a kafka topic. Spark streaming (structured) consumes from this topic. Now it has to get the value from each message — the path to the file — and read the JSON stored at that file location into another df.
>>>>
>>>> Thanks
>>>> Deepak
>>>>
>>>> On Sun, Jun 9, 2019 at 11:03 PM vaquar khan wrote:
>>>>> Hi Deepak,
>>>>>
>>>>> You can use textFileStream.
>>>>>
>>>>> https://spark.apache.org/docs/2.2.0/streaming-programming-guide.html
>>>>>
>>>>> Please start using Stack Overflow to ask questions, so other people get the benefit of the answers.
>>>>>
>>>>> Regards,
>>>>> Vaquar khan
>>>>>
>>>>> On Sun, Jun 9, 2019, 8:08 AM Deepak Sharma wrote:
>>>>>> I am using a spark streaming application to read from kafka.
>>>>>> The value coming in each kafka message is a path to an hdfs file.
>>>>>> I am using spark 2.x, spark.read.stream.
>>>>>> What is the best way to read this path in spark streaming and then read the json stored at the hdfs path (maybe using spark.read.json) into a df inside the spark streaming app?
>>>>>> Thanks a lot in advance
>>>>>>
>>>>>> --
>>>>>> Thanks
>>>>>> Deepak

--
Thanks
Deepak
www.bigdatabig.com
www.keosha.net
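For readers finding this thread later: in Spark 2.4+, the usual way to do a per-batch side read like this is `foreachBatch` on the structured-streaming writer — collect the paths from each micro-batch, then call `spark.read.json(path)` for each. Since that cannot run without a cluster, here is a stdlib-only mock of the control flow only (all names hypothetical; `json.load` stands in for `spark.read.json`):

```python
import json
import os
import tempfile

def process_batch(paths):
    """For each path received in the micro-batch, load the JSON stored there.
    In real Spark this logic would live inside a foreachBatch callback and
    use spark.read.json(path) instead of json.load."""
    records = []
    for path in paths:
        with open(path) as f:
            records.append(json.load(f))
    return records

# Simulate "paths arriving as Kafka messages": write two JSON files and
# hand their paths to the batch processor.
tmp = tempfile.mkdtemp()
paths = []
for i, payload in enumerate([{"id": 1}, {"id": 2}]):
    p = os.path.join(tmp, f"msg{i}.json")
    with open(p, "w") as f:
        json.dump(payload, f)
    paths.append(p)

out = process_batch(paths)
print(out)  # [{'id': 1}, {'id': 2}]
```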
Re: Spark structured streaming leftOuter join not working as I expect
Got the point. If you would like to get "correct" output, you may need to set the global watermark to "min", because the watermark is not only used for evicting rows from state, but also for discarding input rows that arrive later than the watermark. Here you should be aware that there are two stateful operators which will receive inputs from the previous stage and discard them via the watermark before processing.

Btw, you may also need to consider the difference in the concept of watermark between Spark and other frameworks:

1. Spark uses a high watermark (picks the highest event timestamp of the input rows) even for a single watermark, whereas other frameworks use a low watermark (picks the lowest event timestamp of the input rows). So you may always need to set enough delay on the watermark.

2. Spark uses a global watermark, whereas other frameworks normally use operator-wise watermarks. This is a limitation of Spark (given that outputs of a previous stateful operator become inputs of the next stateful operator, they should have different watermarks), and one contributor has proposed an approach [1] which would fit Spark (unfortunately it hasn't been reviewed by committers for a long time).

Thanks,
Jungtaek Lim (HeartSaVioR)

1. https://github.com/apache/spark/pull/23576

On Tue, Jun 11, 2019 at 7:06 AM Joe Ammann wrote:
> Hi all
>
> It took me some time to get the issues extracted into a piece of standalone code. I created the following gist:
>
> https://gist.github.com/jammann/b58bfbe0f4374b89ecea63c1e32c8f17
>
> It has messages for 4 topics A/B/C/D and a simple Python program which shows 6 use cases, with my expectations and observations with Spark 2.4.3.
>
> It would be great if you could have a look and check whether I'm doing something wrong, or whether this is indeed a limitation of Spark.
>
> On 6/5/19 5:35 PM, Jungtaek Lim wrote:
>> Nice to hear you're investigating the issue deeply.
>>
>> Btw, if attaching code is not easy, maybe you could share the logical/physical plan of any batch: "detail" in the SQL tab shows the plan as a string. Plans from sequential batches would be much more helpful, and the streaming query status in these batches (especially the watermark) would be helpful too.
>
> --
> CU, Joe

--
Name : Jungtaek Lim
Blog : http://medium.com/@heartsavior
Twitter : http://twitter.com/heartsavior
LinkedIn : http://www.linkedin.com/in/heartsavior
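The high-vs-low watermark distinction described above can be made concrete with a few lines of arithmetic (plain Python, illustrative only; timestamps in seconds):

```python
def high_watermark(event_times, delay):
    # Spark-style: highest observed event timestamp minus the allowed delay.
    return max(event_times) - delay

def low_watermark(event_times, delay):
    # Low-watermark style: lowest observed event timestamp minus the delay.
    return min(event_times) - delay

events = [100, 105, 130]  # event timestamps seen so far
delay = 10

hw = high_watermark(events, delay)   # 130 - 10 = 120
lw = low_watermark(events, delay)    # 100 - 10 = 90
# A late row with timestamp 110 is dropped under the high watermark
# (110 < 120) but kept under the low watermark (110 >= 90) — which is
# why Spark needs a generous delay compared to low-watermark engines.
print(hw, lw)
```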
best docker image to use
Hi,

I would like to run spark shell + scala in a docker environment, just to play with it on a development machine without having to install a JVM + a lot of things.

Is there something like an "official docker image" I am recommended to use? I saw some on Docker Hub, but it seems they are all contributions from pro-active individuals. I wonder whether the group maintaining Apache Spark also maintains docker images for use cases like this?

Thanks,
Marcelo.
Re: Getting driver logs in Standalone Cluster
Hi Patrick,

I guess the easiest way is to use log aggregation:
https://spark.apache.org/docs/latest/running-on-yarn.html#debugging-your-application

BR
Jean-Michel

-----Original Message-----
From: tkrol
Sent: Friday, June 7, 2019, 16:22
To: user@spark.apache.org
Subject: Getting driver logs in Standalone Cluster

Hey Guys,

I am wondering what is the best way to get driver logs in cluster mode on a standalone cluster? Normally I used to run in client mode, so I could capture the logs from the console. Now I've started running jobs in cluster mode; the driver obviously runs on a worker, and I can't see the logs.

I would like to store the logs (preferably in HDFS). Is there any easy way to do that?

Thanks
Re: [Pyspark 2.4] Best way to define activity within different time window
For grouping by each one: look into grouping sets
https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-multi-dimensional-aggregation.html

Am Di., 11. Juni 2019 um 06:09 Uhr schrieb Rishi Shah <rishishah.s...@gmail.com>:
> Thank you both for your input!
>
> To calculate a moving average of active users, could you comment on whether to go for an RDD-based implementation or a dataframe? If dataframe, will a window function work here?
>
> In general, how would Spark behave when working with a dataframe with date, week, month, quarter, and year columns and grouping by each one, one by one?
>
> On Sun, Jun 9, 2019 at 1:17 PM Jörn Franke wrote:
>> Depending on what accuracy is needed, HyperLogLogs can be an interesting alternative:
>> https://en.m.wikipedia.org/wiki/HyperLogLog
>>
>> Am 09.06.2019 um 15:59 schrieb big data:
>>
>> In my opinion, a bitmap is the best solution for active-user calculation. Other solutions are almost all based on a count(distinct) calculation process, which is slower.
>>
>> If you've implemented a bitmap solution, including how to build the bitmap and how to load it, then the bitmap is the best choice.
>>
>> On 2019/6/5, 6:49 PM, Rishi Shah wrote:
>>
>> Hi All,
>>
>> Is there a best practice around calculating daily, weekly, monthly, quarterly, and yearly active users?
>>
>> One approach is to create a window of daily bitmaps and aggregate it based on the period later. However, I was wondering if anyone has a better approach to tackling this problem.
>>
>> --
>> Regards,
>>
>> Rishi Shah

--
Regards,

Rishi Shah
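On the moving-average sub-question: over a date-ordered daily-count column, a trailing N-day moving average is what a window frame like `rowsBetween(-(N-1), 0)` with `avg` computes. A plain-Python sketch of the same arithmetic, for illustration only (the sample counts are made up):

```python
def moving_average(daily_counts, window=7):
    """Trailing moving average: for each day, average over the last
    `window` days seen so far. Equivalent in spirit to a Spark window
    frame of rowsBetween(-(window - 1), 0) over a date-ordered column."""
    out = []
    for i in range(len(daily_counts)):
        lo = max(0, i - window + 1)       # frame start, clamped at day 0
        chunk = daily_counts[lo:i + 1]
        out.append(sum(chunk) / len(chunk))
    return out

daily_active = [10, 12, 8, 14, 14]        # hypothetical daily active users
print(moving_average(daily_active, window=3))
```

For this kind of per-row windowed aggregation, the dataframe + window function route is generally preferable to a hand-rolled RDD implementation: it stays in the optimized SQL engine and keeps the code declarative.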
Re: Spark 2.2 With Column usage
Hi,

Why are you doing the following two lines?

.select("id", lit(referenceFiltered))
.selectexpr("id")

What are you trying to achieve? What's lit and what's referenceFiltered? What's the difference between select and selectexpr?

Please start at http://spark.apache.org/docs/latest/sql-programming-guide.html and then hop onto http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.package to get to know the Spark API better. I'm sure you'll quickly find out the answer(s).

Pozdrawiam,
Jacek Laskowski
https://about.me/JacekLaskowski
The Internals of Spark SQL https://bit.ly/spark-sql-internals
The Internals of Spark Structured Streaming https://bit.ly/spark-structured-streaming
The Internals of Apache Kafka https://bit.ly/apache-kafka-internals
Follow me at https://twitter.com/jaceklaskowski

On Sat, Jun 8, 2019 at 12:53 PM anbutech wrote:
> Thanks Jacek Laskowski Sir, but I didn't get the point here.
>
> Please advise whether the below is what you are expecting:
>
> dataset1.as("t1")
>   .join(dataset3.as("t2"), col(t1.col1) === col(t2.col1), JOINTYPE.Inner)
>   .join(dataset4.as("t3"), col(t3.col1) === col(t1.col1), JOINTYPE.Inner)
>   .select("id", lit(referenceFiltered))
>   .selectexpr("id")