Re: [PySpark Structured Streaming] How to tune .repartition(N) ?

2023-10-04 Thread Raghavendra Ganesh
Hi, what is the purpose for which you want to use repartition()? Is it to reduce the number of files in Delta? Also note that there is the alternative option of using coalesce() instead of repartition().
-- Raghavendra

On Thu, Oct 5, 2023 at 10:15 AM Shao Yang Hong wrote:
> Hi all on user@spark: >
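A minimal sketch (not from the thread) contrasting the two calls; the DataFrame df and the output path are placeholders:

    // coalesce(n) only merges existing partitions, so it avoids a full shuffle
    // but can only reduce the partition count; repartition(n) always shuffles
    // and can rebalance or increase partitions. Fewer partitions at write time
    // means fewer output files.
    val compacted  = df.coalesce(8)      // narrow dependency, no shuffle
    val rebalanced = df.repartition(8)   // full shuffle, evenly sized partitions
    compacted.write.format("delta").mode("append").save("/tmp/out/delta-table")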

Re: Incremental Value dependents on another column of Data frame Spark

2023-05-23 Thread Raghavendra Ganesh
Given that you are already stating the above can be imagined as a partition, I can think of a mapPartitions iterator.

    val inputSchema = inputDf.schema
    val outputRdd = inputDf.rdd.mapPartitions(rows => new SomeClass(rows))
    val outputDf = sparkSession.createDataFrame(outputRdd,
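A self-contained sketch of the mapPartitions idea (not the code from the thread; the column names and running-total logic are illustrative, and it assumes the rows that depend on each other sit in the same partition):

    import org.apache.spark.sql.{Row, SparkSession}
    import org.apache.spark.sql.types.{DoubleType, StructField}

    val spark = SparkSession.builder().appName("incremental-demo").master("local[*]").getOrCreate()
    import spark.implicits._

    val inputDf = Seq(("a", 10.0), ("a", 5.0), ("a", 2.5)).toDF("key", "amount")
    val outputSchema = inputDf.schema.add(StructField("running", DoubleType))

    // Walk each partition with a plain iterator, carrying the running total
    // from row to row -- the state a per-row column expression cannot hold.
    val outputRdd = inputDf.rdd.mapPartitions { rows =>
      var running = 0.0
      rows.map { r =>
        running += r.getDouble(1)
        Row(r.getString(0), r.getDouble(1), running)
      }
    }
    val outputDf = spark.createDataFrame(outputRdd, outputSchema)
    outputDf.show()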

Re: Spark Aggregator with ARRAY input and ARRAY output

2023-04-23 Thread Raghavendra Ganesh
For simple array types, setting the encoder to ExpressionEncoder() should work.
-- Raghavendra

On Sun, Apr 23, 2023 at 9:20 PM Thomas Wang wrote:
> Hi Spark Community,
>
> I'm trying to implement a custom Spark Aggregator (a subclass of
> org.apache.spark.sql.expressions.Aggregator). Correct me if
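A hedged sketch of what that could look like for an element-wise array sum (the Aggregator itself is illustrative, not the one from the thread):

    import org.apache.spark.sql.Encoder
    import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
    import org.apache.spark.sql.expressions.Aggregator

    // Element-wise sum of equal-length arrays: IN, BUF and OUT are all Seq[Long].
    object ArraySum extends Aggregator[Seq[Long], Seq[Long], Seq[Long]] {
      def zero: Seq[Long] = Seq.empty
      def reduce(buf: Seq[Long], in: Seq[Long]): Seq[Long] =
        if (buf.isEmpty) in else buf.zip(in).map { case (a, b) => a + b }
      def merge(b1: Seq[Long], b2: Seq[Long]): Seq[Long] = reduce(b1, b2)
      def finish(reduction: Seq[Long]): Seq[Long] = reduction
      // As suggested above: a plain ExpressionEncoder() for the simple array type.
      def bufferEncoder: Encoder[Seq[Long]] = ExpressionEncoder()
      def outputEncoder: Encoder[Seq[Long]] = ExpressionEncoder()
    }

    // Usage, e.g. by wrapping it with org.apache.spark.sql.functions.udaf(ArraySum) on Spark 3+.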

Re: [PySpark] Getting the best row from each group

2022-12-20 Thread Raghavendra Ganesh
You can groupBy(country) and use the mapPartitions method, in which you can iterate over all rows keeping two variables for maxPopulationSoFar and the corresponding city, then return the city with the max population. As others suggested, it may also be possible to use bucketing; it would give a more friendly
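One way to realize that iterate-and-keep-the-max idea with the typed API (a sketch under assumed column names, not code from the thread):

    import org.apache.spark.sql.SparkSession

    case class CityRow(country: String, city: String, population: Long)

    val spark = SparkSession.builder().appName("best-row-per-group").master("local[*]").getOrCreate()
    import spark.implicits._

    val cities = Seq(
      CityRow("FR", "Paris", 2148000L),
      CityRow("FR", "Lyon", 513000L),
      CityRow("US", "New York", 8468000L)
    ).toDS()

    // Group by country, then walk each group's iterator keeping only the
    // most populous row seen so far.
    val bestPerCountry = cities
      .groupByKey(_.country)
      .mapGroups { (country, rows) => rows.maxBy(_.population) }
    bestPerCountry.show()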

Re: how to add a column for percent

2022-05-23 Thread Raghavendra Ganesh
withColumn takes a Column as the second argument, not a String. If you want formatting before show(), you can use the round() function.
-- Raghavendra

On Mon, May 23, 2022 at 11:35 AM wilson wrote:
> hello
>
> how to add a column for percent for the current row of counted data?
>
> scala>
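A sketch of the suggestion (assumes a spark-shell session; the counted data and column names are made up):

    import org.apache.spark.sql.functions.{col, round, sum}
    import spark.implicits._

    val counted = Seq(("a", 3L), ("b", 1L)).toDF("name", "cnt")
    val total = counted.agg(sum("cnt")).first().getLong(0)

    // percent is a Column expression; round(..., 2) keeps two decimals for display.
    counted.withColumn("percent", round(col("cnt") / total * 100, 2)).show()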

Re: Difference between windowing functions and aggregation functions on big data

2022-02-27 Thread Raghavendra Ganesh
What is optimal depends on the context of the problem. Is the intent here to find the best solution for the top n values within a group by? Both of the solutions look sub-optimal to me. A window function would be expensive as it needs an order by (which a top-n solution shouldn't need). It would be best to
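For reference, a sketch of the two variants being compared (assumes spark-shell; the data and n = 1 are illustrative):

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{col, max, row_number}
    import spark.implicits._

    val df = Seq(("a", 1), ("a", 5), ("b", 3)).toDF("grp", "value")

    // Window variant: requires sorting every group.
    val w = Window.partitionBy("grp").orderBy(col("value").desc)
    val topViaWindow = df.withColumn("rn", row_number().over(w))
      .filter(col("rn") === 1)
      .drop("rn")

    // Aggregation variant for n = 1: a plain groupBy/max, no per-group ordering.
    val topViaAgg = df.groupBy("grp").agg(max("value").alias("value"))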

Re: how to classify column

2022-02-11 Thread Raghavendra Ganesh
You could use the expr() function to achieve the same.

    .withColumn("newColumn", expr("case when score > 3 then 'good' else 'bad' end"))

-- Raghavendra

On Fri, Feb 11, 2022 at 5:59 PM frakass wrote:
> Hello
>
> I have a column whose value (an Int score) is from 0 to 5.
> I want to query that,
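A runnable version of that snippet (assumes spark-shell; the sample scores are made up):

    import org.apache.spark.sql.functions.expr
    import spark.implicits._

    val df = Seq(0, 2, 4, 5).toDF("score")
    // SQL CASE WHEN evaluated per row via expr().
    df.withColumn("newColumn", expr("case when score > 3 then 'good' else 'bad' end")).show()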

Re: Merge two dataframes

2021-05-12 Thread Raghavendra Ganesh
You can add an extra id column and perform an inner join.

    val df1_with_id = df1.withColumn("id", monotonically_increasing_id())
    val df2_with_id = df2.withColumn("id", monotonically_increasing_id())
    df1_with_id.join(df2_with_id, Seq("id"), "inner").drop("id").show()

Re: How to Spawn Child Thread or Sub-jobs in a Spark Session

2020-12-04 Thread Raghavendra Ganesh
There should not be any need to explicitly make the DF-2 and DF-3 computations parallel. Spark generates execution plans and can decide what to run in parallel (ideally you should see them running in parallel in the Spark UI). You need to cache DF-1 if possible (either in memory or on disk), otherwise the computation
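A sketch of caching the shared parent so the downstream frames reuse it (assumes spark-shell; the DF names mirror the thread, the transformations are placeholders):

    import org.apache.spark.sql.functions.col
    import org.apache.spark.storage.StorageLevel

    val df1 = spark.range(0, 1000000).toDF("id")
    df1.persist(StorageLevel.MEMORY_AND_DISK)   // computed once, reused below

    val df2 = df1.filter(col("id") % 2 === 0)
    val df3 = df1.filter(col("id") % 2 === 1)
    df2.count()   // first action materializes the cache
    df3.count()   // second job reads DF-1 from the cache instead of recomputing it
    df1.unpersist()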

Re: Count distinct and driver memory

2020-10-19 Thread Raghavendra Ganesh
Spark provides multiple options for caching (including disk). Have you tried caching to disk?
-- Raghavendra

On Mon, Oct 19, 2020 at 11:41 PM Lalwani, Jayesh wrote:
> I was caching it because I didn't want to re-execute the DAG when I ran
> the count query. If you have a Spark application
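For example (a sketch, not from the thread; the data is made up), the storage level controls where the cached blocks go:

    import org.apache.spark.storage.StorageLevel

    val df = spark.range(1, 1000000).toDF("n")
    df.persist(StorageLevel.DISK_ONLY)   // spill cached blocks to local disk only
    // other options include MEMORY_ONLY, MEMORY_AND_DISK, MEMORY_AND_DISK_SER, ...
    df.count()                           // an action materializes the cache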

Spark events log behavior in interactive vs batch job

2020-08-01 Thread Sriram Ganesh
Hi, I am writing Spark events and application logs to blob storage, using similar paths for both. For example: *spark.eventLog.dir = wasb://@/logs* and *application log dir = wasb://@/logs/app/*. Since I'm using blob
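For reference, a hedged sketch of how the event-log settings are typically supplied (the container and account names are placeholders, not the values elided above):

    // Equivalently set via --conf at spark-submit time or in spark-defaults.conf.
    val spark = org.apache.spark.sql.SparkSession.builder()
      .appName("events-to-blob")
      .config("spark.eventLog.enabled", "true")
      .config("spark.eventLog.dir", "wasb://<container>@<account>.blob.core.windows.net/logs")
      .getOrCreate()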

Re: Monitor executor and task memory getting used

2019-10-24 Thread Sriram Ganesh
I was wrong here. I am using a Spark standalone cluster, not YARN or Mesos. Is it possible to track Spark execution memory?

On Mon, Oct 21, 2019 at 5:42 PM Sriram Ganesh wrote:
> I looked into this. But I found it is possible like this:
>
> https://github.com/apache/s
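One avenue that works regardless of cluster manager (a sketch, not from the thread) is Spark's monitoring REST API, which exposes per-executor memory fields such as memoryUsed and maxMemory; the host, port, and application id below are placeholders:

    import scala.io.Source

    // The running application's UI (default port 4040) serves the REST API.
    val appsJson = Source.fromURL("http://localhost:4040/api/v1/applications").mkString
    // With a concrete <app-id> taken from the listing above:
    val executorsJson =
      Source.fromURL("http://localhost:4040/api/v1/applications/<app-id>/executors").mkString
    println(executorsJson)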

Re: Monitor executor and task memory getting used

2019-10-21 Thread Sriram Ganesh
Roman wrote:
> Take a look at this thread:
> <https://stackoverflow.com/questions/48768188/spark-execution-memory-monitoring>
>
> On Mon, Oct 21, 2019 at 13:45, Sriram Ganesh wrote:
>> Hi,
>>
>> I want to monitor how much memory the executor and

Monitor executor and task memory getting used

2019-10-21 Thread Sriram Ganesh
Hi, I want to monitor how much memory executors and tasks use for a given job. Is there any direct method available that can be used to track this metric?
-- *Sriram G*
*Tech*

Re: unsubscribe

2017-02-23 Thread Ganesh
Thank you for cat facts. "A group of cats is called a clowder." MEEOW. To unsubscribe please enter your credit card details followed by your PIN. CAT-FACTS

On 24/02/17 00:04, Donam Kim wrote:
> catunsub
2017-02-23 20:28 GMT+11:00 Ganesh Krishnan <m...@ganeshkrishnan.c

Re: unsubscribe

2017-02-23 Thread Ganesh Krishnan
Thank you for subscribing to "cat facts". Did you know that a cat's whiskers are used to determine if it can wiggle through a hole? To unsubscribe, reply with the keyword "catunsub". Thank you.

On Feb 23, 2017 8:25 PM, "Donam Kim" wrote:
> unsubscribe

Non-linear (curved?) regression line

2017-01-19 Thread Ganesh
Has anyone worked on non-linear/curved regression lines with Apache Spark? This seems like such a trivial issue, but I have given up after experimenting for nearly two weeks. The plot line is shown below and the raw data is in the table at the end. I just can't get Spark ML to give decent
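Not from this thread, but one common way to fit a curved trend with Spark ML is a polynomial expansion of the feature followed by plain linear regression; a sketch with toy data standing in for the missing table (assumes spark-shell):

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.feature.{PolynomialExpansion, VectorAssembler}
    import org.apache.spark.ml.regression.LinearRegression
    import spark.implicits._

    val data = Seq((1.0, 2.1), (2.0, 4.3), (3.0, 9.2), (4.0, 16.5), (5.0, 25.8)).toDF("x", "y")

    // Expand x into [x, x^2] so an ordinary linear model can fit a curve.
    val assembler = new VectorAssembler().setInputCols(Array("x")).setOutputCol("xVec")
    val poly = new PolynomialExpansion().setInputCol("xVec").setOutputCol("features").setDegree(2)
    val lr = new LinearRegression().setLabelCol("y").setFeaturesCol("features")

    val model = new Pipeline().setStages(Array(assembler, poly, lr)).fit(data)
    model.transform(data).select("x", "y", "prediction").show()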

VectorUDT and ml.Vector

2016-11-07 Thread Ganesh
is around 55 Gb of text data. Ganesh

Re: Spark Job not failing

2016-09-19 Thread sai ganesh
iled in 9.440 s
>> 16/09/19 18:52:52 INFO DAGScheduler: Job 19 failed: insertIntoJDBC at sparkjob.scala:143, took 9.449118 s
>> 16/09/19 18:52:52 INFO ApplicationMaster: Final app status: SUCCEEDED, exitCode: 0
>> 16/09/19 18:52:52 INFO SparkConte