Re: Merge two dataframes

2021-05-19 Thread ayan guha
Hi Kushagra I still think this is a bad idea. By definition data in a dataframe or rdd is unordered, you are imposing an order where there is none, and if it works it will be by chance. For example a simple repartition may disrupt the row ordering. It is just too unpredictable. I would suggest

Re: Merge two dataframes

2021-05-19 Thread Mich Talebzadeh
That generation of row_number() has to be performed through a window call and I don't think there is any way around it without orderBy() df1 = df1.select(F.row_number().over(Window.partitionBy().orderBy(df1['amount_6m'])).alias("row_num"),"amount_6m") The problem is that without partitionBy()

Re: Merge two dataframes

2021-05-19 Thread Mich Talebzadeh
Hi Kushagra, I believe you are referring to this warning below WARN window.WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation. I don't know an easy way around it. If the operation is only once you may be

Re: Merge two dataframes

2021-05-18 Thread kushagra deep
Thanks a lot Mich , this works though I have to test for scalability. I have one question though . If we dont specify any column in partitionBy will it shuffle all the records in one executor ? Because this is what seems to be happening. Thanks once again ! Regards Kushagra Deep On Tue, May

Re: Merge two dataframes

2021-05-18 Thread Mich Talebzadeh
Ok, this should hopefully work as it uses row_number. from pyspark.sql.window import Window import pyspark.sql.functions as F from pyspark.sql.functions import row_number def spark_session(appName): return SparkSession.builder \ .appName(appName) \ .enableHiveSupport() \

Re: Merge two dataframes

2021-05-18 Thread kushagra deep
The use case is to calculate PSI/CSI values . And yes the union is one to one row as you showed. On Tue, May 18, 2021, 20:39 Mich Talebzadeh wrote: > > Hi Kushagra, > > A bit late on this but what is the business use case for this merge? > > You have two data frames each with one column and you

Re: Merge two dataframes

2021-05-18 Thread Mich Talebzadeh
Hi Kushagra, A bit late on this but what is the business use case for this merge? You have two data frames each with one column and you want to UNION them in a certain way but the correlation is not known. In other words this UNION is as is? amount_6m | amount_9m 100

Re: Merge two dataframes

2021-05-17 Thread Andrew Melo
On Mon, May 17, 2021 at 2:31 PM Lalwani, Jayesh wrote: > > If the UDFs are computationally expensive, I wouldn't solve this problem with > UDFs at all. If they are working in an iterative manner, and assuming each > iteration is independent of other iterations (yes, I know that's a big >

Re: Merge two dataframes

2021-05-17 Thread Lalwani, Jayesh
If the UDFs are computationally expensive, I wouldn't solve this problem with UDFs at all. If they are working in an iterative manner, and assuming each iteration is independent of other iterations (yes, I know that's a big assumptiuon), I would think about exploding your dataframe to have a

Re: Merge two dataframes

2021-05-17 Thread Andrew Melo
In our case, these UDFs are quite expensive and worked on in an iterative manner, so being able to cache the two "sides" of the graphs independently will speed up the development cycle. Otherwise, if you modify foo() here, then you have to recompute bar and baz, even though they're unchanged.

Re: Merge two dataframes

2021-05-17 Thread Sean Owen
Why join here - just add two columns to the DataFrame directly? On Mon, May 17, 2021 at 1:04 PM Andrew Melo wrote: > Anyone have ideas about the below Q? > > It seems to me that given that "diamond" DAG, that spark could see > that the rows haven't been shuffled/filtered, it could do some type

Re: Merge two dataframes

2021-05-17 Thread Andrew Melo
Anyone have ideas about the below Q? It seems to me that given that "diamond" DAG, that spark could see that the rows haven't been shuffled/filtered, it could do some type of "zip join" to push them together, but I've not been able to get a plan that doesn't do a hash/sort merge join Cheers

Re: Merge two dataframes

2021-05-12 Thread Andrew Melo
Hi, In the case where the left and right hand side share a common parent like: df = spark.read.someDataframe().withColumn('rownum', row_number()) df1 = df.withColumn('c1', expensive_udf1('foo')).select('c1', 'rownum') df2 = df.withColumn('c2', expensive_udf2('bar')).select('c2', 'rownum')

Re: Merge two dataframes

2021-05-12 Thread Sean Owen
Yeah I don't think that's going to work - you aren't guaranteed to get 1, 2, 3, etc. I think row_number() might be what you need to generate a join ID. RDD has a .zip method, but (unless I'm forgetting!) DataFrame does not. You could .zip two RDDs you get from DataFrames and manually convert the

Re: Merge two dataframes

2021-05-12 Thread kushagra deep
Thanks Raghvendra Will the ids for corresponding columns be same always ? Since monotonic_increasing_id() returns a number based on partitionId and the row number of the partition ,will it be same for corresponding columns? Also is it guaranteed that the two dataframes will be divided into

Re: Merge two dataframes

2021-05-12 Thread Raghavendra Ganesh
You can add an extra id column and perform an inner join. val df1_with_id = df1.withColumn("id", monotonically_increasing_id()) val df2_with_id = df2.withColumn("id", monotonically_increasing_id()) df1_with_id.join(df2_with_id, Seq("id"), "inner").drop("id").show() +-+-+

Merge two dataframes

2021-05-12 Thread kushagra deep
Hi All, I have two dataframes df1 amount_6m 100 200 300 400 500 And a second data df2 below amount_9m 500 600 700 800 900 The number of rows is same in both dataframes. Can I merge the two dataframes to achieve below df df3 amount_6m | amount_9m 100

how to merge two dataframes

2015-10-30 Thread Yana Kadiyska
Hi folks, I have a need to "append" two dataframes -- I was hoping to use UnionAll but it seems that this operation treats the underlying dataframes as sequence of columns, rather than a map. In particular, my problem is that the columns in the two DFs are not in the same order --notice that my

Re: how to merge two dataframes

2015-10-30 Thread Ted Yu
How about the following ? scala> df.registerTempTable("df") scala> df1.registerTempTable("df1") scala> sql("select customer_id, uri, browser, epoch from df union select customer_id, uri, browser, epoch from df1").show() +---+-+---+-+ |customer_id|

Re: how to merge two dataframes

2015-10-30 Thread Yana Kadiyska
Not a bad idea I suspect but doesn't help me. I dumbed down the repro to ask for help. In reality one of my dataframes is a cassandra DF. So cassDF.registerTempTable("df1") registers the temp table in a different SQL Context (new CassandraSQLContext(sc)). scala> sql("select customer_id, uri,

Re: how to merge two dataframes

2015-10-30 Thread Ted Yu
I see - you were trying to union a non-Cassandra DF with Cassandra DF :-( On Fri, Oct 30, 2015 at 12:57 PM, Yana Kadiyska wrote: > Not a bad idea I suspect but doesn't help me. I dumbed down the repro to > ask for help. In reality one of my dataframes is a cassandra DF.

Re: how to merge two dataframes

2015-10-30 Thread Silvio Fiorito
l.com>" <yana.kadiy...@gmail.com<mailto:yana.kadiy...@gmail.com>> Date: Friday, October 30, 2015 at 3:57 PM To: Ted Yu <yuzhih...@gmail.com<mailto:yuzhih...@gmail.com>> Cc: "user@spark.apache.org<mailto:user@spark.apache.org>" <user@spark.