Re: [PySpark Structured Streaming] How to tune .repartition(N) ?

2023-10-04 Thread Raghavendra Ganesh
Hi, what is the purpose for which you want to use repartition()? Is it to reduce the number of files in Delta? Also note that there is the alternative option of using coalesce() instead of repartition().
-- Raghavendra

On Thu, Oct 5, 2023 at 10:15 AM Shao Yang Hong wrote:
> Hi all on user@spark: >
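A minimal sketch (not from the thread) contrasting the two calls; the DataFrame df and the output path are placeholders:

    // coalesce(n) only merges existing partitions, so it avoids a full shuffle
    // but can only reduce the partition count; repartition(n) always shuffles
    // and can rebalance or increase partitions. Fewer partitions at write time
    // means fewer output files.
    val compacted  = df.coalesce(8)      // narrow dependency, no shuffle
    val rebalanced = df.repartition(8)   // full shuffle, evenly sized partitions
    compacted.write.format("delta").mode("append").save("/tmp/out/delta-table")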

Re: Incremental Value dependents on another column of Data frame Spark

2023-05-23 Thread Raghavendra Ganesh
Given that you are already stating the above can be imagined as a partition, I can think of a mapPartitions iterator.

    val inputSchema = inputDf.schema
    val outputRdd = inputDf.rdd.mapPartitions(rows => new SomeClass(rows))
    val outputDf = sparkSession.createDataFrame(outputRdd,
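A self-contained sketch of the mapPartitions idea (not the code from the thread; the column names and running-total logic are illustrative, and it assumes the rows that depend on each other sit in the same partition):

    import org.apache.spark.sql.{Row, SparkSession}
    import org.apache.spark.sql.types.{DoubleType, StructField}

    val spark = SparkSession.builder().appName("incremental-demo").master("local[*]").getOrCreate()
    import spark.implicits._

    val inputDf = Seq(("a", 10.0), ("a", 5.0), ("a", 2.5)).toDF("key", "amount")
    val outputSchema = inputDf.schema.add(StructField("running", DoubleType))

    // Walk each partition with a plain iterator, carrying the running total
    // from row to row -- the state a per-row column expression cannot hold.
    val outputRdd = inputDf.rdd.mapPartitions { rows =>
      var running = 0.0
      rows.map { r =>
        running += r.getDouble(1)
        Row(r.getString(0), r.getDouble(1), running)
      }
    }
    val outputDf = spark.createDataFrame(outputRdd, outputSchema)
    outputDf.show()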

Re: Spark Aggregator with ARRAY input and ARRAY output

2023-04-23 Thread Raghavendra Ganesh
For simple array types, setting the encoder to ExpressionEncoder() should work.
-- Raghavendra

On Sun, Apr 23, 2023 at 9:20 PM Thomas Wang wrote:
> Hi Spark Community,
>
> I'm trying to implement a custom Spark Aggregator (a subclass of
> org.apache.spark.sql.expressions.Aggregator). Correct me if
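A hedged sketch of what that could look like for an element-wise array sum (the Aggregator itself is illustrative, not the one from the thread):

    import org.apache.spark.sql.Encoder
    import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
    import org.apache.spark.sql.expressions.Aggregator

    // Element-wise sum of equal-length arrays: IN, BUF and OUT are all Seq[Long].
    object ArraySum extends Aggregator[Seq[Long], Seq[Long], Seq[Long]] {
      def zero: Seq[Long] = Seq.empty
      def reduce(buf: Seq[Long], in: Seq[Long]): Seq[Long] =
        if (buf.isEmpty) in else buf.zip(in).map { case (a, b) => a + b }
      def merge(b1: Seq[Long], b2: Seq[Long]): Seq[Long] = reduce(b1, b2)
      def finish(reduction: Seq[Long]): Seq[Long] = reduction
      // As suggested above: a plain ExpressionEncoder() for the simple array type.
      def bufferEncoder: Encoder[Seq[Long]] = ExpressionEncoder()
      def outputEncoder: Encoder[Seq[Long]] = ExpressionEncoder()
    }

    // Usage, e.g. by wrapping it with org.apache.spark.sql.functions.udaf(ArraySum) on Spark 3+.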

Re: [PySpark] Getting the best row from each group

2022-12-20 Thread Raghavendra Ganesh
You can groupBy(country) and use the mapPartitions method, in which you can iterate over all rows keeping two variables for maxPopulationSoFar and the corresponding city, then return the city with the max population. As others suggested, it may also be possible to use bucketing; it would give a more friendly
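One way to realize that iterate-and-keep-the-max idea with the typed API (a sketch under assumed column names, not code from the thread):

    import org.apache.spark.sql.SparkSession

    case class CityRow(country: String, city: String, population: Long)

    val spark = SparkSession.builder().appName("best-row-per-group").master("local[*]").getOrCreate()
    import spark.implicits._

    val cities = Seq(
      CityRow("FR", "Paris", 2148000L),
      CityRow("FR", "Lyon", 513000L),
      CityRow("US", "New York", 8468000L)
    ).toDS()

    // Group by country, then walk each group's iterator keeping only the
    // most populous row seen so far.
    val bestPerCountry = cities
      .groupByKey(_.country)
      .mapGroups { (country, rows) => rows.maxBy(_.population) }
    bestPerCountry.show()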

Re: how to add a column for percent

2022-05-23 Thread Raghavendra Ganesh
withColumn takes a Column as the second argument, not a String. If you want formatting before show(), you can use the round() function.
-- Raghavendra

On Mon, May 23, 2022 at 11:35 AM wilson wrote:
> hello
>
> how to add a column for percent for the current row of counted data?
>
> scala>
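A sketch of the suggestion (assumes a spark-shell session; the counted data and column names are made up):

    import org.apache.spark.sql.functions.{col, round, sum}
    import spark.implicits._

    val counted = Seq(("a", 3L), ("b", 1L)).toDF("name", "cnt")
    val total = counted.agg(sum("cnt")).first().getLong(0)

    // percent is a Column expression; round(..., 2) keeps two decimals for display.
    counted.withColumn("percent", round(col("cnt") / total * 100, 2)).show()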

Re: Difference between windowing functions and aggregation functions on big data

2022-02-27 Thread Raghavendra Ganesh
What is optimal depends on the context of the problem. Is the intent here to find the best solution for the top n values within a group by? Both of the solutions look sub-optimal to me. A window function would be expensive as it needs an order by (which a top-n solution shouldn't need). It would be best to
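For reference, a sketch of the two variants being compared (assumes spark-shell; the data and n = 1 are illustrative):

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{col, max, row_number}
    import spark.implicits._

    val df = Seq(("a", 1), ("a", 5), ("b", 3)).toDF("grp", "value")

    // Window variant: requires sorting every group.
    val w = Window.partitionBy("grp").orderBy(col("value").desc)
    val topViaWindow = df.withColumn("rn", row_number().over(w))
      .filter(col("rn") === 1)
      .drop("rn")

    // Aggregation variant for n = 1: a plain groupBy/max, no per-group ordering.
    val topViaAgg = df.groupBy("grp").agg(max("value").alias("value"))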

Re: how to classify column

2022-02-11 Thread Raghavendra Ganesh
You could use the expr() function to achieve the same.

    .withColumn("newColumn", expr("case when score > 3 then 'good' else 'bad' end"))

-- Raghavendra

On Fri, Feb 11, 2022 at 5:59 PM frakass wrote:
> Hello
>
> I have a column whose value (an Int score) is from 0 to 5.
> I want to query that,
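A runnable version of that snippet (assumes spark-shell; the sample scores are made up):

    import org.apache.spark.sql.functions.expr
    import spark.implicits._

    val df = Seq(0, 2, 4, 5).toDF("score")
    // SQL CASE WHEN evaluated per row via expr().
    df.withColumn("newColumn", expr("case when score > 3 then 'good' else 'bad' end")).show()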

Re: Merge two dataframes

2021-05-12 Thread Raghavendra Ganesh
You can add an extra id column and perform an inner join.

    val df1_with_id = df1.withColumn("id", monotonically_increasing_id())
    val df2_with_id = df2.withColumn("id", monotonically_increasing_id())
    df1_with_id.join(df2_with_id, Seq("id"), "inner").drop("id").show()

Re: How to Spawn Child Thread or Sub-jobs in a Spark Session

2020-12-04 Thread Raghavendra Ganesh
There should not be any need to explicitly make the DF-2 and DF-3 computations parallel. Spark generates execution plans and can decide what to run in parallel (ideally you should see them running in parallel in the Spark UI). You need to cache DF-1 if possible (either in memory or on disk), otherwise the computation
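A sketch of caching the shared parent so the downstream frames reuse it (assumes spark-shell; the DF names mirror the thread, the transformations are placeholders):

    import org.apache.spark.sql.functions.col
    import org.apache.spark.storage.StorageLevel

    val df1 = spark.range(0, 1000000).toDF("id")
    df1.persist(StorageLevel.MEMORY_AND_DISK)   // computed once, reused below

    val df2 = df1.filter(col("id") % 2 === 0)
    val df3 = df1.filter(col("id") % 2 === 1)
    df2.count()   // first action materializes the cache
    df3.count()   // second job reads DF-1 from the cache instead of recomputing it
    df1.unpersist()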

Re: Count distinct and driver memory

2020-10-19 Thread Raghavendra Ganesh
Spark provides multiple options for caching (including disk). Have you tried caching to disk?
-- Raghavendra

On Mon, Oct 19, 2020 at 11:41 PM Lalwani, Jayesh wrote:
> I was caching it because I didn't want to re-execute the DAG when I ran
> the count query. If you have a Spark application
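For example (a sketch, not from the thread; the data is made up), the storage level controls where the cached blocks go:

    import org.apache.spark.storage.StorageLevel

    val df = spark.range(1, 1000000).toDF("n")
    df.persist(StorageLevel.DISK_ONLY)   // spill cached blocks to local disk only
    // other options include MEMORY_ONLY, MEMORY_AND_DISK, MEMORY_AND_DISK_SER, ...
    df.count()                           // an action materializes the cache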

Spark events log behavior in interactive vs batch job

2020-08-01 Thread Sriram Ganesh
Hi, I am writing Spark events and application logs to blob storage, using similar paths for both. For example: *spark.eventLog.dir = wasb://@/logs* and *application log dir = wasb://@/logs/app/*. Since I'm using blob
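For reference, a hedged sketch of how the event-log settings are typically supplied (the container and account names are placeholders, not the values elided above):

    // Equivalently set via --conf at spark-submit time or in spark-defaults.conf.
    val spark = org.apache.spark.sql.SparkSession.builder()
      .appName("events-to-blob")
      .config("spark.eventLog.enabled", "true")
      .config("spark.eventLog.dir", "wasb://<container>@<account>.blob.core.windows.net/logs")
      .getOrCreate()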

Re: Monitor executor and task memory getting used

2019-10-24 Thread Sriram Ganesh
I was wrong here. I am using a Spark standalone cluster, not YARN or Mesos. Is it possible to track Spark execution memory?

On Mon, Oct 21, 2019 at 5:42 PM Sriram Ganesh wrote:
> I looked into this. But I found it is possible like this:
>
> https://github.com/apache/s
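One avenue that works regardless of cluster manager (a sketch, not from the thread) is Spark's monitoring REST API, which exposes per-executor memory fields such as memoryUsed and maxMemory; the host, port, and application id below are placeholders:

    import scala.io.Source

    // The running application's UI (default port 4040) serves the REST API.
    val appsJson = Source.fromURL("http://localhost:4040/api/v1/applications").mkString
    // With a concrete <app-id> taken from the listing above:
    val executorsJson =
      Source.fromURL("http://localhost:4040/api/v1/applications/<app-id>/executors").mkString
    println(executorsJson)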

Re: Monitor executor and task memory getting used

2019-10-21 Thread Sriram Ganesh
Roman wrote:
> Take a look at this thread:
> <https://stackoverflow.com/questions/48768188/spark-execution-memory-monitoring>
>
> On Mon, Oct 21, 2019 at 13:45, Sriram Ganesh wrote:
>> Hi,
>>
>> I want to monitor how much memory the executor and

Monitor executor and task memory getting used

2019-10-21 Thread Sriram Ganesh
Hi, I want to monitor how much memory executors and tasks use for a given job. Is there any direct method available that can be used to track this metric?
-- *Sriram G*
*Tech*

Re: unsubscribe

2017-02-23 Thread Ganesh
Thank you for cat facts. "A group of cats is called a clowder." MEEOW. To unsubscribe please enter your credit card details followed by your PIN. CAT-FACTS

On 24/02/17 00:04, Donam Kim wrote:
> catunsub
2017-02-23 20:28 GMT+11:00 Ganesh Krishnan <m...@ganeshkrishnan.c

Re: unsubscribe

2017-02-23 Thread Ganesh Krishnan
Thank you for subscribing to "cat facts". Did you know that a cat's whiskers are used to determine if it can wiggle through a hole? To unsubscribe, reply with the keyword "catunsub". Thank you.

On Feb 23, 2017 8:25 PM, "Donam Kim" wrote:
> unsubscribe

Non-linear (curved?) regression line

2017-01-19 Thread Ganesh
Has anyone worked on non-linear/curved regression lines with Apache Spark? This seems like such a trivial issue, but I have given up after experimenting for nearly two weeks. The plot line is shown below and the raw data is in the table at the end. I just can't get Spark ML to give decent
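Not from this thread, but one common way to fit a curved trend with Spark ML is a polynomial expansion of the feature followed by plain linear regression; a sketch with toy data standing in for the missing table (assumes spark-shell):

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.feature.{PolynomialExpansion, VectorAssembler}
    import org.apache.spark.ml.regression.LinearRegression
    import spark.implicits._

    val data = Seq((1.0, 2.1), (2.0, 4.3), (3.0, 9.2), (4.0, 16.5), (5.0, 25.8)).toDF("x", "y")

    // Expand x into [x, x^2] so an ordinary linear model can fit a curve.
    val assembler = new VectorAssembler().setInputCols(Array("x")).setOutputCol("xVec")
    val poly = new PolynomialExpansion().setInputCol("xVec").setOutputCol("features").setDegree(2)
    val lr = new LinearRegression().setLabelCol("y").setFeaturesCol("features")

    val model = new Pipeline().setStages(Array(assembler, poly, lr)).fit(data)
    model.transform(data).select("x", "y", "prediction").show()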

VectorUDT and ml.Vector

2016-11-07 Thread Ganesh
is around 55 Gb of text data. Ganesh

Re: Spark Job not failing

2016-09-19 Thread sai ganesh
iled in 9.440 s
>> 16/09/19 18:52:52 INFO DAGScheduler: Job 19 failed: insertIntoJDBC at sparkjob.scala:143, took 9.449118 s
>> 16/09/19 18:52:52 INFO ApplicationMaster: Final app status: SUCCEEDED, exitCode: 0
>> 16/09/19 18:52:52 INFO SparkConte