Re: Structured Streaming Process Each Records Individually

2024-01-10 Thread Khalid Mammadov
Use foreachBatch or foreach methods: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#using-foreach-and-foreachbatch On Wed, 10 Jan 2024, 17:42 PRASHANT L, wrote: > Hi > I have a use case where I need to process json payloads coming from Kafka > using structured
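A minimal sketch of the foreachBatch approach (not from the original thread; the broker address, topic name and JSON schema are assumptions, and the Kafka connector package must be on the classpath):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import from_json, col
    from pyspark.sql.types import StructType, StringType

    spark = SparkSession.builder.appName("kafka-foreach-batch").getOrCreate()
    schema = StructType().add("id", StringType()).add("payload", StringType())

    def process_batch(batch_df, batch_id):
        # Each micro-batch arrives as a regular DataFrame; records can be handled one by one here
        for row in batch_df.toLocalIterator():
            print(batch_id, row.asDict())

    (spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "localhost:9092")  # assumed broker address
        .option("subscribe", "events")                        # assumed topic name
        .load()
        .select(from_json(col("value").cast("string"), schema).alias("json"))
        .select("json.*")
        .writeStream
        .foreachBatch(process_batch)
        .start())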

Re: Can not complete the read csv task

2023-10-14 Thread Khalid Mammadov
This command only defines a new DataFrame; in order to see some results you need to do something like merged_spark_data.show() on a new line. Regarding the error, I think it's a typical error that you get when you run Spark on Windows OS. You can suppress it using the Winutils tool (Google it or ChatGPT
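As a small illustration of the point about lazy evaluation (the file path is made up):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    merged_spark_data = spark.read.csv("data/merged.csv", header=True)  # only defines the DataFrame
    merged_spark_data.show()  # an action: triggers execution and prints the first rows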

Re: PySpark error java.lang.IllegalArgumentException

2023-07-07 Thread Khalid Mammadov
Perhaps the parquet file that is in that folder is corrupted? To check, try to read that file with pandas or other tools to see if you can read it without Spark. On Wed, 5 Jul 2023, 07:25 elango vaidyanathan, wrote: > > Hi team, > > Any updates on this below issue > > On Mon, 3 Jul 2023 at
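A hedged sketch of that check with pandas (the file path is hypothetical):

    import pandas as pd

    try:
        pdf = pd.read_parquet("/data/input/part-00000.parquet")  # hypothetical path
        print(pdf.head())
    except Exception as err:
        print("File appears unreadable or corrupted:", err)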

Re: Help me learn about JOB TASK and DAG in Apache Spark

2023-04-01 Thread Khalid Mammadov
Hey AN-TRUONG, I have some articles about this subject that should help, e.g. https://khalidmammadov.github.io/spark/spark_internals_rdd.html Also check other Spark internals articles on the web. Regards Khalid On Fri, 31 Mar 2023, 16:29 AN-TRUONG Tran Phan, wrote: > Thank you for your information, >

Re: Running Spark on Kubernetes (GKE) - failing on spark-submit

2023-02-14 Thread Khalid Mammadov
I am not a k8s expert, but I think you have a permission issue. Try 777 as an example to see if it works. On Mon, 13 Feb 2023, 21:42 karan alang, wrote: > Hello All, > > I'm trying to run a simple application on GKE (Kubernetes), and it is > failing: > Note : I have spark(bitnami spark chart)

Re: How to set a config for a single query?

2023-01-05 Thread Khalid Mammadov
Hi, I believe there is a feature in Spark specifically for this purpose. You can create a new Spark session and set those configs. Note that it's not the same as creating separate driver processes with separate sessions; here you will still have the same SparkContext that works as a backend for
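A minimal sketch of that feature, assuming the goal is an isolated SQL config for a single query:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    tuned = spark.newSession()  # same SparkContext underneath, but its own SQL conf and temp views
    tuned.conf.set("spark.sql.shuffle.partitions", "8")  # applies only to queries run via `tuned`

    print(spark.conf.get("spark.sql.shuffle.partitions"))  # original session keeps its value
    print(tuned.conf.get("spark.sql.shuffle.partitions"))  # 8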

Re: [Spark Core] [Advanced] [How-to] How to map any external field to job ids spawned by Spark.

2022-12-28 Thread Khalid Mammadov
There is a feature in SparkContext to set local properties (setLocalProperty) where you can set your Request ID and then, using a SparkListener instance, read that ID together with the Job ID in the onJobStart event. Hope this helps. On Tue, 27 Dec 2022, 13:04 Dhruv Toshniwal, wrote: > TL;Dr - >
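A hedged sketch of the Python side of that idea; the property name is an assumption, and the SparkListener that reads it in onJobStart would be implemented on the JVM side (Scala/Java):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    sc = spark.sparkContext

    sc.setLocalProperty("request.id", "req-12345")  # hypothetical request ID
    spark.range(10).count()                         # jobs triggered now carry that local property
    sc.setLocalProperty("request.id", None)         # clear it when the request is done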

Re: Moving to Spark 3x from Spark2

2022-09-01 Thread Khalid Mammadov
Hi Rajat, there were a lot of changes between those versions and, unfortunately, the only possible way to assess the impact is to do your own testing. Most probably you will have to make some changes to your codebase. Regards Khalid On Thu, 1 Sept 2022, 11:45 rajat kumar, wrote: > Hello Members, > > We

Re: Pyspark and multiprocessing

2022-07-21 Thread Khalid Mammadov
Pool.map() missing 1 required positional argument: 'iterable' > > So any hints on what to change? :) > > Spark has the pandas on spark API, and that is realy great. I prefer > pandas on spark API and pyspark over pandas. > > tor. 21. jul. 2022 kl. 09:18 skrev Khalid Mammadov &l

Re: Pyspark and multiprocessing

2022-07-21 Thread Khalid Mammadov
One quick observation is that you allocate all your local CPUs to Spark, then execute that app with 10 threads, i.e. 10 Spark apps, and so you would need 160 cores in total as each will need 16 CPUs IMHO. Wouldn't that create a CPU bottleneck? Also, on a side note, why do you need Spark if you use that on

Re: flatMap for dataframe

2022-02-09 Thread Khalid Mammadov
One way is split -> explode -> pivot. These are Column and DataFrame methods. Here are quick examples from the web: https://www.google.com/amp/s/sparkbyexamples.com/spark/spark-split-dataframe-column-into-multiple-columns/amp/
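A small sketch of the split -> explode -> pivot idea with made-up data and column names:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import split, explode, count

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("a", "x,y"), ("b", "y,z")], ["id", "tags"])

    exploded = df.withColumn("tag", explode(split("tags", ",")))      # one row per tag
    pivoted = exploded.groupBy("id").pivot("tag").agg(count("tag"))   # tag values become columns
    pivoted.show()
    # id | x | y | z   (1 where the tag is present, null otherwise; row order may vary)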

Re: why the pyspark RDD API is so slow?

2022-01-31 Thread Khalid Mammadov
Your Scala program does not use any Spark API, hence it is faster than the others. If you write the same code in pure Python I think it will be even faster than the Scala program, especially taking into account that these 2 programs run on a single VM. Regarding DataFrame and RDD, I would suggest using DataFrames
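To illustrate the suggestion, a minimal comparison (made-up computation) of the RDD API versus the DataFrame API, which benefits from the Catalyst/Tungsten optimisations:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # RDD API: Python lambdas are executed row by row in Python worker processes
    rdd_total = spark.sparkContext.parallelize(range(1_000_000)).map(lambda x: x * 2).sum()

    # DataFrame API: the same work expressed as expressions, executed by the JVM engine
    df_total = spark.range(1_000_000).selectExpr("sum(id * 2)").first()[0]

    print(rdd_total, df_total)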

Re: What are the most common operators for shuffle in Spark

2022-01-23 Thread Khalid Mammadov
I don't know the actual implementation, but to me it's still necessary: each worker reads data separately and reduces to get a local distinct; these will then need to be shuffled to find the actual distinct. On Sun, 23 Jan 2022, 17:39 ashok34...@yahoo.com.INVALID, wrote: > Hello, > > I know some
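A small sketch showing the shuffle in the query plan for a distinct, matching the explanation above (data is made up):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1,), (1,), (2,)], ["id"])

    df.distinct().explain()
    # The physical plan shows a partial HashAggregate per partition (the local distinct),
    # an Exchange hashpartitioning step (the shuffle), then the final HashAggregate.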

Re: How to make batch filter

2022-01-02 Thread Khalid Mammadov
I think you will get 1 partition as you have only one Executor/Worker (i.e. your local machine, a node). But your tasks (the smallest unit of work in the Spark framework) will be processed in parallel on your 4 cores, as Spark runs one task per core. You can also force it to repartition if you want
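A minimal sketch of checking the partition count and forcing a repartition (the input path is an assumption):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[4]").getOrCreate()
    df = spark.read.csv("data/input.csv", header=True)  # assumed path

    print(df.rdd.getNumPartitions())  # current number of partitions
    df4 = df.repartition(4)           # force 4 partitions so all 4 cores get a task
    print(df4.rdd.getNumPartitions())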

Re: docker image distribution in Kubernetes cluster

2021-12-08 Thread Khalid Mammadov
Hi Mitch, IMO it's done to provide the most flexibility. So, some users can have a limited/restricted version of the image, or one with some additional software that they use on the executors during processing. So, in your case you only need to provide the first one since the other two configs
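A hedged sketch of the configs in question (image names are made up; the same keys are normally passed via spark-submit --conf):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             # One common image used by both driver and executors by default:
             .config("spark.kubernetes.container.image", "myregistry/spark-py:3.2.0")
             # Only needed if the driver and executor images must differ:
             # .config("spark.kubernetes.driver.container.image", "myregistry/spark-driver:3.2.0")
             # .config("spark.kubernetes.executor.container.image", "myregistry/spark-executor:3.2.0")
             .getOrCreate())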

Re: Apache Spark 3.2.0 | Pyspark | Pycharm Setup

2021-11-17 Thread Khalid Mammadov
Hi Anil, you don't need to download and install Spark. It's enough to add pyspark to PyCharm as a package for your environment and start developing and testing locally. The thing is, PySpark includes a local Spark that is installed as part of pip install. When it comes to your particular issue, I
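A minimal sketch of that local setup: after adding the pyspark package to the project interpreter (pip install pyspark), a local session works out of the box:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .master("local[*]")
             .appName("pycharm-dev")
             .getOrCreate())

    spark.range(5).show()  # quick smoke test of the bundled local Spark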

Re: Performance of PySpark jobs on the Kubernetes cluster

2021-08-10 Thread Khalid Mammadov
Hi Mich, I think you need to check your code. If the code does not use the PySpark API effectively you may get this, i.e. if you use the pure Python/pandas API rather than PySpark's transform->transform->action pattern, e.g. df.select(..).withColumn(...)...count(). Hope this helps to put you in the right direction.
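A small illustration of that transform -> transform -> action shape (the file path and column names are made up):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, upper

    spark = SparkSession.builder.getOrCreate()
    df = spark.read.parquet("data/events")  # assumed path

    result = (df.select("user_id", "country")                   # transform
                .withColumn("country", upper(col("country")))   # transform
                .filter(col("country") == "UK")                 # transform
                .count())                                       # action: runs on the executors
    print(result)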

Re: Installing Distributed apache spark cluster with Cluster mode on Docker

2021-07-25 Thread Khalid Mammadov
if I can’t answer then someone else may. I am CCing and you can reply all. Hope this all helps. Regards Khalid Sent from my iPad > On 25 Jul 2021, at 10:50, Dinakar Chennubotla > wrote: > > Hi Khalid Mammadov, > > I am now reworking from scratch, i.e. on How to

Re: Installing Distributed apache spark cluster with Cluster mode on Docker

2021-07-24 Thread Khalid Mammadov
. https://github.com/khalidmammadov/spark_docker On Sat, Jul 24, 2021 at 4:07 PM Dinakar Chennubotla < chennu.bigd...@gmail.com> wrote: > Hi Khalid Mammadov, > > I tried the which says distributed mode Spark installation. But when run > below command it says " deployment mode =

Re: Installing Distributed apache spark cluster with Cluster mode on Docker

2021-07-24 Thread Khalid Mammadov
Databricks autoscaling works..). I am not sure about k8s TBH, perhaps it handles this more gracefully. On Sat, Jul 24, 2021 at 3:38 PM Dinakar Chennubotla < chennu.bigd...@gmail.com> wrote: > Hi Khalid Mammadov, > > Thank you for your response, > Yes, I did, I built standalone ap

Re: Installing Distributed apache spark cluster with Cluster mode on Docker

2021-07-24 Thread Khalid Mammadov
Hi, Have you checked out docs? https://spark.apache.org/docs/latest/spark-standalone.html Thanks, Khalid On Sat, Jul 24, 2021 at 1:45 PM Dinakar Chennubotla < chennu.bigd...@gmail.com> wrote: > Hi All, > > I am Dinakar, Hadoop admin, > could someone help me here, > > 1. I have a DEV-POC task

Re: Missing stack function from SQL functions API

2021-06-15 Thread Khalid Mammadov
Hi David, if you need an alternative way to do it you can use the below: df.select(expr("stack(2, 1,2,3)")) or df.withColumn('stacked', expr("stack(2, 1,2,3)")). Thanks Khalid On Mon, 14 Jun 2021, 10:14 , wrote: > I noticed that the stack SQL function >
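A runnable variant of that expr("stack(...)") workaround with a made-up DataFrame:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import expr

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, 10, 20)], ["id", "q1", "q2"])

    # stack(2, ...) unpivots the two quarter columns into two (quarter, value) rows
    df.select("id", expr("stack(2, 'q1', q1, 'q2', q2) AS (quarter, value)")).show()
    # id=1, quarter=q1, value=10
    # id=1, quarter=q2, value=20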

Re: convert java dataframe to pyspark dataframe

2021-03-31 Thread Khalid Mammadov
ed. So also wanted to ask if this is feasible and if yes do we need to send some special jars to executors so that it can execute udfs on the dataframe. On Wed, 31 Mar 2021 at 3:37 AM, Khalid Mammadov <khalidmammad...@gmail.com> wrote: Hi Aditya, I think you orig

Re: convert java dataframe to pyspark dataframe

2021-03-30 Thread Khalid Mammadov
Hi Aditya, I think your original question was about how to convert a DataFrame from a Spark session created in Java/Scala to a DataFrame in a Spark session created from Python (PySpark). So, as I have answered on your SO question: there is a missing call to *entry_point* before calling getDf()

Re: Spark SQL Dataset and BigDecimal

2021-02-18 Thread Khalid Mammadov
the same goes for this BigDecimal case. So I personally would go with Scala types and the compiler should do the rest for you. Cheers, Khalid Mammadov > On 18 Feb 2021, at 09:39, Ivan Petrov wrote: > > I'm fine with both. So does it make sense to use java.math.BigDecimal > everywher

Re: Spark SQL check timestamp with other table and update a column.

2020-11-21 Thread Khalid Mammadov
Hi, I am not sure if you were writing pseudo-code or real code, but there were a few issues in the SQL. I have reproduced your example in the Spark REPL and all worked as expected, and the result is the one you need. Please see the full code below: ## *Spark 3.0.0* >>> a = spark.read.csv("tab1",

Re: mission statement : unified

2020-10-25 Thread Khalid Mammadov
Correct. Also as explained in the book Learning Spark 2.0 by Databricks: Unified Analytics While the notion of unification is not unique to Spark, it is a core component of its design philosophy and evolution. In November 2016, the Association for Computing Machinery (ACM) recognized Apache

Re: Let multiple jobs share one rdd?

2020-09-24 Thread Khalid Mammadov
Perhaps you can use Global Temp Views? https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.createGlobalTempView On 24/09/2020 14:52, Gang Li wrote: Hi all, There are three jobs, among which the first rdd is the same. Can the first rdd be calculated once,
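A minimal sketch of the Global Temp View suggestion (names are made up), combined with cache() so the shared data is computed only once:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    shared = spark.range(100).withColumnRenamed("id", "value").cache()  # compute and cache once
    shared.createGlobalTempView("shared_data")

    # Any other session of the same application can reuse it via the global_temp database
    other = spark.newSession()
    other.sql("SELECT COUNT(*) FROM global_temp.shared_data").show()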