Hi Kushagra
I still think this is a bad idea. By definition, data in a dataframe or RDD
is unordered; you are imposing an order where there is none, and if it
works it will be by chance. For example, a simple repartition may disrupt
the row ordering. It is just too unpredictable.
I would suggest
That generation of row_number() has to be performed through a window call,
and I don't think there is any way around it without orderBy():

df1 = df1.select(
    F.row_number().over(Window.partitionBy().orderBy(df1["amount_6m"])).alias("row_num"),
    "amount_6m",
)
The problem is that without partitionBy()
Hi Kushagra,
I believe you are referring to this warning below
WARN window.WindowExec: No Partition Defined for Window operation! Moving
all data to a single partition, this can cause serious performance
degradation.
I don't know an easy way around it. If the operation is only once you may
be
Thanks a lot Mich, this works, though I have to test it for scalability.
I have one question though: if we don't specify any column in partitionBy,
will it shuffle all the records onto one executor? Because this is what
seems to be happening.
Thanks once again!
Regards
Kushagra Deep
On Tue, May
Ok, this should hopefully work as it uses row_number.
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
import pyspark.sql.functions as F
from pyspark.sql.functions import row_number

def spark_session(appName):
    return SparkSession.builder \
        .appName(appName) \
        .enableHiveSupport() \
        .getOrCreate()
The use case is to calculate PSI/CSI values. And yes, the union is one-to-one
by row, as you showed.
On Tue, May 18, 2021, 20:39 Mich Talebzadeh
wrote:
Hi Kushagra,
A bit late on this but what is the business use case for this merge?
You have two data frames each with one column and you want to UNION them in
a certain way but the correlation is not known. In other words this UNION
is as is?
amount_6m | amount_9m
100
On Mon, May 17, 2021 at 2:31 PM Lalwani, Jayesh wrote:
If the UDFs are computationally expensive, I wouldn't solve this problem with
UDFs at all. If they are working in an iterative manner, and assuming each
iteration is independent of other iterations (yes, I know that's a big
assumption), I would think about exploding your dataframe to have a
In our case, these UDFs are quite expensive and worked on in an
iterative manner, so being able to cache the two "sides" of the graphs
independently will speed up the development cycle. Otherwise, if you
modify foo() here, then you have to recompute bar and baz, even though
they're unchanged.
Why join here - just add two columns to the DataFrame directly?
On Mon, May 17, 2021 at 1:04 PM Andrew Melo wrote:
Anyone have ideas about the below Q?
It seems to me that given that "diamond" DAG, since Spark could see that
the rows haven't been shuffled/filtered, it could do some type of
"zip join" to push them together, but I've not been able to get a plan
that doesn't do a hash/sort merge join.
Cheers
Hi,
In the case where the left and right hand side share a common parent like:
df = spark.read.someDataframe().withColumn('rownum', row_number().over(w))  # w: some Window spec; row_number() needs .over()
df1 = df.withColumn('c1', expensive_udf1('foo')).select('c1', 'rownum')
df2 = df.withColumn('c2', expensive_udf2('bar')).select('c2', 'rownum')
Yeah I don't think that's going to work - you aren't guaranteed to get 1,
2, 3, etc. I think row_number() might be what you need to generate a join
ID.
RDD has a .zip method, but (unless I'm forgetting!) DataFrame does not. You
could .zip two RDDs you get from DataFrames and manually convert the
Thanks Raghvendra
Will the ids for corresponding rows always be the same? Since
monotonically_increasing_id() returns a number based on the partition ID and
the row number within the partition, will it be the same for corresponding
rows? Also is it guaranteed that the two dataframes will be divided into
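That concern is well founded: the generated ID encodes the partitioning, with the partition index in the upper 31 bits and a per-partition counter in the lower 33 bits, so the same logical row gets a different ID under different partitioning. A plain-Python illustration of that documented layout (the helper name is mine):

```python
def monotonic_id(partition_index: int, row_in_partition: int) -> int:
    # Mirrors the documented layout of monotonically_increasing_id():
    # partition index in the upper 31 bits, per-partition record
    # number in the lower 33 bits.
    return (partition_index << 33) + row_in_partition

# The third row overall gets a different id depending on partitioning:
print(monotonic_id(0, 2))  # single partition: id 2
print(monotonic_id(1, 0))  # two partitions of two rows each: id 8589934592
```

So the join on this ID only pairs rows correctly if both DataFrames happen to be partitioned identically.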
You can add an extra id column and perform an inner join.
val df1_with_id = df1.withColumn("id", monotonically_increasing_id())
val df2_with_id = df2.withColumn("id", monotonically_increasing_id())
df1_with_id.join(df2_with_id, Seq("id"), "inner").drop("id").show()
Hi All,
I have two dataframes
df1
amount_6m
100
200
300
400
500
And a second data df2 below
amount_9m
500
600
700
800
900
The number of rows is same in both dataframes.
Can I merge the two dataframes to achieve the below df?
df3
amount_6m | amount_9m
100       | 500
200       | 600
300       | 700
400       | 800
500       | 900
Hi folks,
I have a need to "append" two dataframes -- I was hoping to use unionAll,
but it seems that this operation treats the underlying dataframes as a
sequence of columns, rather than a map.
In particular, my problem is that the columns in the two DFs are not in the
same order -- notice that my
How about the following ?
scala> df.registerTempTable("df")
scala> df1.registerTempTable("df1")
scala> sql("select customer_id, uri, browser, epoch from df union select customer_id, uri, browser, epoch from df1").show()
+-----------+---+-------+-----+
|customer_id|uri|browser|epoch|
Not a bad idea, I suspect, but it doesn't help me. I dumbed down the repro
to ask for help. In reality one of my dataframes is a Cassandra DF.
So cassDF.registerTempTable("df1") registers the temp table in a different
SQLContext (new CassandraSQLContext(sc)).
scala> sql("select customer_id, uri,
I see - you were trying to union a non-Cassandra DF with Cassandra DF :-(
On Fri, Oct 30, 2015 at 12:57 PM, Yana Kadiyska
wrote:
From: Yana Kadiyska <yana.kadiy...@gmail.com>
Date: Friday, October 30, 2015 at 3:57 PM
To: Ted Yu <yuzhih...@gmail.com>
Cc: user@spark.apache.org