Understanding about joins in spark

2022-06-27 Thread Sid
Hi Team, as per my understanding (assume a large dataset): when we apply joins, data from different executors is shuffled so that rows with the same "keys" land in one partition. This is done for both dataframes, right? For example, key A for df1 will be sorted and kept in
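
To see this in the physical plan, here is a minimal sketch with toy data; the config setting only exists to stop Spark from broadcasting such a tiny table, so the shuffle (sort-merge) join is visible:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("join-shuffle-demo").getOrCreate()
    import spark.implicits._

    // Force a shuffle join even for toy data by disabling broadcast joins.
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

    val df1 = Seq(("A", 1), ("B", 2)).toDF("key", "v1")
    val df2 = Seq(("A", 10), ("C", 30)).toDF("key", "v2")

    // Both inputs are exchanged by hash of the join key, so rows with the same
    // key from df1 and df2 land in the same partition before being joined.
    val joined = df1.join(df2, Seq("key"))
    joined.explain()   // look for "Exchange hashpartitioning(key, ...)" under BOTH inputs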

Re: what is the right syntax for self joins in Spark 2.3.0 ?

2018-03-12 Thread Tathagata Das
You have understood the problem right. However, note that your interpretation of the output (K, leftValue, null), (K, leftValue, rightValue1), (K, leftValue, rightValue2) is subject to knowledge of the semantics of the join. That is, if you are processing the output rows *manually*, you are

Re: what is the right syntax for self joins in Spark 2.3.0 ?

2018-03-10 Thread kant kodali
I will attempt to answer this. Since rightValue1 and rightValue2 have the same key "K" (two matches), why would it ever be the case that *rightValue2* replaces *rightValue1*, which replaces *null*? Moreover, why does the user need to care? The result in this case (after getting 2 matches) should be

Re: what is the right syntax for self joins in Spark 2.3.0 ?

2018-03-08 Thread Tathagata Das
This doc is unrelated to the stream-stream join we added in Structured Streaming. :) That said, we added append mode first because it is easier to reason about the semantics of append mode, especially in the context of outer joins. You output a row only when you know it won't ever be changed. The
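
For context, here is a minimal sketch of the kind of stream-stream left outer join being discussed, using the built-in rate source as a stand-in for real inputs; the column names, watermarks, and time-range condition are all illustrative:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.expr

    val spark = SparkSession.builder.appName("stream-stream-outer-join-sketch").getOrCreate()

    val impressions = spark.readStream.format("rate").option("rowsPerSecond", "5").load()
      .selectExpr("value AS impressionAdId", "timestamp AS impressionTime")
      .withWatermark("impressionTime", "10 minutes")

    val clicks = spark.readStream.format("rate").option("rowsPerSecond", "5").load()
      .selectExpr("value AS clickAdId", "timestamp AS clickTime")
      .withWatermark("clickTime", "10 minutes")

    // The (impression, null) row for an unmatched impression can be emitted only
    // once the watermark guarantees no matching click will ever arrive, which is
    // why append mode is the natural first mode to support for outer joins.
    val joined = impressions.join(
      clicks,
      expr("""clickAdId = impressionAdId AND
              clickTime >= impressionTime AND
              clickTime <= impressionTime + interval 20 minutes"""),
      "leftOuter")

    joined.writeStream.format("console").start()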

Re: what is the right syntax for self joins in Spark 2.3.0 ?

2018-03-08 Thread Gourav Sengupta
super interesting. On Wed, Mar 7, 2018 at 11:44 AM, kant kodali wrote: > It looks to me that the StateStore described in this doc > > Actually > has full outer join and every other join

Re: what is the right syntax for self joins in Spark 2.3.0 ?

2018-03-07 Thread kant kodali
It looks to me that the StateStore described in this doc actually has full outer join, and every other join is a filter of that. Also, the doc talks about update mode, but it looks like Spark 2.3 ended up with append

Re: what is the right syntax for self joins in Spark 2.3.0 ?

2018-03-06 Thread kant kodali
Sorry I meant Spark 2.4 in my previous email On Tue, Mar 6, 2018 at 9:15 PM, kant kodali wrote: > Hi TD, > > I agree I think we are better off either with a full fix or no fix. I am > ok with the complete fix being available in master or some branch. I guess > the solution

Re: what is the right syntax for self joins in Spark 2.3.0 ?

2018-03-06 Thread kant kodali
Hi TD, I agree I think we are better off either with a full fix or no fix. I am ok with the complete fix being available in master or some branch. I guess the solution for me is to just build from the source. On a similar note, I am not finding any JIRA tickets related to full outer joins and

Re: what is the right syntax for self joins in Spark 2.3.0 ?

2018-03-06 Thread Tathagata Das
I thought about it. I am not 100% sure whether this fix should go into 2.3.1. There are two parts to this bug fix to enable self-joins. 1. Enabling deduping of leaf logical nodes by extending MultiInstanceRelation - this is safe to backport into the 2.3 branch as it does not touch

Joins in spark for large tables

2018-02-28 Thread KhajaAsmath Mohammed
Hi, is there any good approach to reduce shuffling in Spark? I have two tables and both of them are large. Any suggestions? I have only seen broadcast joins suggested, but that will not work in my case. Thanks, Asmath
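
If broadcast is out and both sides really are large, one option is to pre-bucket both tables on the join key so the join itself needs no shuffle. A sketch with made-up table and column names:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("bucketed-join-sketch").getOrCreate()

    // Stand-ins for the two large tables; in practice these are the real inputs.
    val orders   = spark.range(0, 1000000).selectExpr("id AS customer_id", "id % 100 AS amount")
    val payments = spark.range(0, 1000000).selectExpr("id AS customer_id", "id % 7 AS paid")

    // Pay the shuffle cost once at write time: bucket (and sort) both tables by the join key.
    orders.write.bucketBy(200, "customer_id").sortBy("customer_id").saveAsTable("orders_bucketed")
    payments.write.bucketBy(200, "customer_id").sortBy("customer_id").saveAsTable("payments_bucketed")

    // With identical bucketing on both sides, the sort-merge join reads matching
    // buckets directly instead of shuffling either table.
    val joined = spark.table("orders_bucketed").join(spark.table("payments_bucketed"), "customer_id")
    joined.explain()   // ideally no Exchange under the SortMergeJoin

Bucketing pays off mainly when the same key is joined on repeatedly, since the shuffle happens once at write time rather than on every join.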

Re: what is the right syntax for self joins in Spark 2.3.0 ?

2018-02-22 Thread kant kodali
Hi TD, I pulled your commit that is listed on this ticket https://issues.apache.org/jira/browse/SPARK-23406 specifically I did the following steps and self joins work after I cherry-pick your commit! Good Job! I was hoping it will be part of 2.3.0 but looks like it is targeted for 2.3.1 :( git

Re: what is the right syntax for self joins in Spark 2.3.0 ?

2018-02-22 Thread Tathagata Das
Hey, Thanks for testing out stream-stream joins and reporting this issue. I am going to take a look at this. TD On Tue, Feb 20, 2018 at 8:20 PM, kant kodali wrote: > if I change it to the below code it works. However, I don't believe it is > the solution I am looking

Re: what is the right syntax for self joins in Spark 2.3.0 ?

2018-02-20 Thread kant kodali
if I change it to the below code it works. However, I don't believe it is the solution I am looking for. I want to be able to do it in raw SQL, and moreover, if a user gives a big chained raw Spark SQL join query, I am not even sure how to make copies of the dataframe to achieve the self-join. Is

Re: what is the right syntax for self joins in Spark 2.3.0 ?

2018-02-20 Thread kant kodali
If I change it to this On Tue, Feb 20, 2018 at 7:52 PM, kant kodali wrote: > Hi All, > > I have the following code > > import org.apache.spark.sql.streaming.Trigger > > val jdf = spark.readStream.format("kafka").option("kafka.bootstrap.servers", >

what is the right syntax for self joins in Spark 2.3.0 ?

2018-02-20 Thread kant kodali
Hi All, I have the following code import org.apache.spark.sql.streaming.Trigger val jdf = spark.readStream.format("kafka").option("kafka.bootstrap.servers", "localhost:9092").option("subscribe", "join_test").option("startingOffsets", "earliest").load(); jdf.createOrReplaceTempView("table")
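
The query itself is truncated above; a sketch of what a raw-SQL self-join over the registered view looks like (the column names and join condition here are made up, and this is the shape of query that failed before the fix tracked in SPARK-23406):

    import org.apache.spark.sql.streaming.Trigger

    // Requires the spark-sql-kafka-0-10 package on the classpath.
    val jdf = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "join_test")
      .option("startingOffsets", "earliest")
      .load()

    jdf.createOrReplaceTempView("events")

    // Self-join expressed directly in SQL against the same streaming view.
    val selfJoined = spark.sql(
      """SELECT a.key AS k, a.value AS left_value, b.value AS right_value
        |FROM events a JOIN events b
        |ON CAST(a.key AS STRING) = CAST(b.key AS STRING)""".stripMargin)

    selfJoined.writeStream
      .format("console")
      .trigger(Trigger.ProcessingTime("5 seconds"))
      .start()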

Re: DataFrame joins with Spark-Java

2017-11-29 Thread Rishi Mishra
Hi Sushma, can you try the left anti join below? In my example, name and id together make up the key. df1.alias("a").join(df2.alias("b"), col("a.name").equalTo(col("b.name")).and(col("a.id").equalTo(col("b.id"))), "left_anti").selectExpr("name", "id").show(10,
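
A self-contained version of the same idea in Scala, with toy data (in the real case df1 would hold today's records and df2 yesterday's):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col

    val spark = SparkSession.builder.appName("left-anti-join-sketch").getOrCreate()
    import spark.implicits._

    val df1 = Seq(("alice", 1), ("bob", 2), ("carol", 3)).toDF("name", "id")   // today
    val df2 = Seq(("alice", 1), ("bob", 2)).toDF("name", "id")                 // yesterday

    // left_anti keeps only df1 rows with no matching (name, id) in df2,
    // i.e. the records that are new today.
    val newRecords = df1.alias("a").join(
      df2.alias("b"),
      col("a.name") === col("b.name") && col("a.id") === col("b.id"),
      "left_anti")

    newRecords.show(10, false)   // prints only ("carol", 3)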

DataFrame joins with Spark-Java

2017-11-29 Thread sushma spark
Dear Friends, I am new to Spark DataFrames. My requirement is: I have dataframe1 containing today's records and dataframe2 containing yesterday's records. I need to compare today's records with yesterday's records and find the new records which do not exist in yesterday's records, based

Re: Joins in Spark

2017-05-02 Thread Angel Francisco Orta
Sorry, I had a typo; I meant repartitionby("fieldofjoin"). On 2 May 2017, 9:44 p.m., "KhajaAsmath Mohammed" wrote: Hi Angel, I am trying the code below but I don't see the partitioning on the dataframe. val iftaGPSLocation_df = sqlContext.sql(iftaGPSLocQry)

Re: Joins in Spark

2017-05-02 Thread KhajaAsmath Mohammed
Hi Angel, I am trying the code below but I don't see the partitioning on the dataframe. val iftaGPSLocation_df = sqlContext.sql(iftaGPSLocQry) import sqlContext._ import sqlContext.implicits._ datapoint_prq_df.join(geoCacheLoc_df) val tableA =

Re: Joins in Spark

2017-05-02 Thread Angel Francisco Orta
Have you tried partitioning by the join field and running the join in segments, filtering both tables to the same segment of data? Example: val tableA = DfA.partitionby("joinField").filter("firstSegment") val tableB = DfB.partitionby("joinField").filter("firstSegment") tableA.join(tableB) On 2
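
Reading "partitionby" above as the standard repartition call on the join column (DataFrames have no partitionby method), a self-contained sketch of the segment-by-segment idea, with made-up column names and toy data, follows:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col

    val spark = SparkSession.builder.appName("segmented-join-sketch").getOrCreate()
    import spark.implicits._

    // Toy stand-ins for the two large tables.
    val dfA = Seq(("k1", "2017-04", "a1"), ("k2", "2017-05", "a2")).toDF("joinField", "segment", "payloadA")
    val dfB = Seq(("k1", "2017-04", "b1"), ("k2", "2017-05", "b2")).toDF("joinField", "segment", "payloadB")

    // Process one segment at a time so each join only shuffles a slice of the data,
    // with both sides hash-partitioned on the join key first.
    val segments = Seq("2017-04", "2017-05")
    for (seg <- segments) {
      val a = dfA.filter(col("segment") === seg).repartition(col("joinField"))
      val b = dfB.filter(col("segment") === seg).repartition(col("joinField"))
      a.join(b, Seq("joinField")).write.mode("append").parquet("/tmp/joined_by_segment")
    }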

Re: Joins in Spark

2017-05-02 Thread KhajaAsmath Mohammed
Table 1 (192 GB) is partitioned by year and month; the 192 GB of data is for one month, i.e. April. Table 2 (92 GB) is not partitioned. I have to perform a join on these tables now. On Tue, May 2, 2017 at 1:27 PM, Angel Francisco Orta <angel.francisco.o...@gmail.com> wrote: > Hello, > > Is the

Re: Joins in Spark

2017-05-02 Thread Angel Francisco Orta
Hello, are the tables partitioned? If yes, what is the partition field? Thanks. On 2 May 2017, 8:22 p.m., "KhajaAsmath Mohammed" wrote: Hi, I am trying to join two big tables in Spark and the job has been running for quite a long time without any results. Table 1:

Joins in Spark

2017-05-02 Thread KhajaAsmath Mohammed
Hi, I am trying to join two big tables in Spark and the job has been running for quite a long time without any results. Table 1: 192 GB. Table 2: 92 GB. Does anyone have a better solution to get the results fast? Thanks, Asmath

RE: OutOfMemory when doing joins in spark 2.0 while same code runs fine in spark 1.5.2

2016-07-21 Thread Ravi Aggarwal

Re: OutOfMemory when doing joins in spark 2.0 while same code runs fine in spark 1.5.2

2016-07-20 Thread Ian O'Connell

RE: OutOfMemory when doing joins in spark 2.0 while same code runs fine in spark 1.5.2

2016-06-14 Thread Ravi Aggarwal
sort-merge join. Can we deduce anything from this? Thanks, Ravi

RE: OutOfMemory when doing joins in spark 2.0 while same code runs fine in spark 1.5.2

2016-06-10 Thread Ravi Aggarwal
CatalystTypeConverters.convertToScala(Cast(Literal(value._2), colDataType).eval(), colDataType) }).toArray Row(recordFields: _*) } rowRdd } } Thanks, Ravi

Re: OutOfMemory when doing joins in spark 2.0 while same code runs fine in spark 1.5.2

2016-06-09 Thread Ted Yu
> Could someone please guide me through next steps? > Thanks, Ravi > Computer Scientist @ Adobe

OutOfMemory when doing joins in spark 2.0 while same code runs fine in spark 1.5.2

2016-06-09 Thread raaggarw

Joins in Spark

2016-03-19 Thread Stuti Awasthi
Hi All, I have to join 2 files, both not very big (say a few MBs only), but the result can be huge, say 500 GBs to TBs of data. I have tried using Spark's join() function, but I'm noticing that the join executes on only 1 or 2 nodes at most. Since I have a cluster of 5 nodes, I

Re: Joins in Spark

2016-03-19 Thread Rishi Mishra
My suspicion is that your input files have only a few partitions, hence only a small number of tasks are started. Can you provide some more details, like how you load the files and how the result size gets to around 500 GBs? Regards, Rishitesh Mishra, SnappyData (http://www.snappydata.io/)
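
A sketch of that diagnosis and the usual fix, assuming a spark-shell style sc and made-up file paths and key extraction:

    // Small files often come in as only 1-2 partitions, so the join runs as 1-2 tasks.
    val left  = sc.textFile("/data/fileA.txt", minPartitions = 40)
      .map(line => (line.split(",")(0), line))
    val right = sc.textFile("/data/fileB.txt", minPartitions = 40)
      .map(line => (line.split(",")(0), line))

    // Asking join() for more output partitions spreads the huge join result across
    // many tasks on all 5 nodes instead of the handful the small inputs produced.
    val joined = left.join(right, numPartitions = 200)
    println(joined.partitions.length)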

Re: Multiple joins in Spark

2015-10-20 Thread Xiao Li
Are you using hiveContext? First, build your Spark using the following command: mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive -Phive-thriftserver -DskipTests clean package Then, try this sample program object SimpleApp { case class Individual(name: String, surname: String, birthDate:

Re: Multiple joins in Spark

2015-10-20 Thread Shyam Parimal Katti
When I do the steps above and run a query like this: sqlContext.sql("select * from ...") I get an exception: org.apache.spark.sql.AnalysisException: Non-local session path expected to be non-null; at org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:260). I cannot paste the

Multiple joins in Spark

2015-10-16 Thread Shyam Parimal Katti
Hello All, I have the following SQL query: select a.a_id, b.b_id, c.c_id from table_a a join table_b b on a.a_id = b.a_id join table_c c on b.b_id = c.b_id In Scala I have done this so far: table_a_rdd = sc.textFile(...) table_b_rdd = sc.textFile(...) table_c_rdd = sc.textFile(...)

Re: Multiple joins in Spark

2015-10-16 Thread Xiao Li
Hi Shyam, you can still use SQL to do the same thing in Spark. For example: val df1 = sqlContext.createDataFrame(rdd) val df2 = sqlContext.createDataFrame(rdd2) val df3 = sqlContext.createDataFrame(rdd3) df1.registerTempTable("tab1") df2.registerTempTable("tab2")
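
A fuller, self-contained sketch of that approach (the case classes, data, and table names are made up; sc and sqlContext as in a Spark 1.x shell):

    case class A(a_id: Long, a_val: String)
    case class B(b_id: Long, a_id: Long)
    case class C(c_id: Long, b_id: Long)

    val df1 = sqlContext.createDataFrame(sc.parallelize(Seq(A(1, "x"), A(2, "y"))))
    val df2 = sqlContext.createDataFrame(sc.parallelize(Seq(B(10, 1), B(20, 2))))
    val df3 = sqlContext.createDataFrame(sc.parallelize(Seq(C(100, 10))))

    df1.registerTempTable("table_a")
    df2.registerTempTable("table_b")
    df3.registerTempTable("table_c")

    // The multi-join SQL from the original question now runs against the temp tables.
    val result = sqlContext.sql(
      """SELECT a.a_id, b.b_id, c.c_id
        |FROM table_a a
        |JOIN table_b b ON a.a_id = b.a_id
        |JOIN table_c c ON b.b_id = c.b_id""".stripMargin)
    result.show()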

Re: Multiple joins in Spark

2015-10-16 Thread Xiao Li
Hi, Shyam, The method registerTempTable is to register a [DataFrame as a temporary table in the Catalog using the given table name. In the Catalog, Spark maintains a concurrent hashmap, which contains the pair of the table names and the logical plan. For example, when we submit the following

Re: Support for skewed joins in Spark

2015-05-04 Thread ๏̯͡๏
into the disk temporarily and use disk files to do the join. Best Regards, Shixiong Zhu 2015-03-13 9:37 GMT+08:00 Soila Pertet Kavulya skavu...@gmail.com: Does Spark support skewed joins similar to Pig which distributes large keys over multiple partitions? I tried using the RangePartitioner

Re: Support for skewed joins in Spark

2015-03-12 Thread Shixiong Zhu
. Best Regards, Shixiong Zhu 2015-03-13 9:37 GMT+08:00 Soila Pertet Kavulya skavu...@gmail.com: Does Spark support skewed joins similar to Pig which distributes large keys over multiple partitions? I tried using the RangePartitioner but I am still experiencing failures because some keys are too

Re: Support for skewed joins in Spark

2015-03-12 Thread Soila Pertet Kavulya
9:37 GMT+08:00 Soila Pertet Kavulya skavu...@gmail.com: Does Spark support skewed joins similar to Pig which distributes large keys over multiple partitions? I tried using the RangePartitioner but I am still experiencing failures because some keys are too large to fit in a single partition. I

Support for skewed joins in Spark

2015-03-12 Thread Soila Pertet Kavulya
Does Spark support skewed joins similar to Pig, which distributes large keys over multiple partitions? I tried using the RangePartitioner but I am still experiencing failures because some keys are too large to fit in a single partition. I cannot use broadcast variables to work around this because
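
One common workaround, not something Spark did automatically at the time, is to salt the skewed keys so a single hot key is spread over many partitions. A sketch using the DataFrame API with toy data:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    val spark = SparkSession.builder.appName("salted-join-sketch").getOrCreate()
    import spark.implicits._

    // "facts" is heavily skewed on one key; "dims" is the smaller side of the join.
    val facts = spark.range(0, 1000000).select(lit("hot").as("key"), $"id".as("v"))
    val dims  = Seq(("hot", "h"), ("cold", "c")).toDF("key", "attr")

    val numSalts = 16

    // Random salt on the skewed side; replicate the other side once per salt value
    // so every salted variant of a key still finds its match.
    val saltedFacts = facts.withColumn("salt", (rand() * numSalts).cast("int"))
    val saltedDims  = dims.select($"key", $"attr",
      explode(array((0 until numSalts).map(lit): _*)).as("salt"))

    // The effective join key is now (key, salt), so the rows for "hot" are spread
    // over numSalts partitions instead of overwhelming a single one.
    val joined = saltedFacts.join(saltedDims, Seq("key", "salt")).drop("salt")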

Joins in Spark

2014-12-22 Thread Deep Pradhan
Hi, I have two RDDs, vertices and edges. Vertices is a plain RDD and edges is a pair RDD. I want to take a three-way join of these two. Joins work only when both RDDs are pair RDDs, right? So how am I supposed to take a three-way join of these RDDs? Thank you

Re: Joins in Spark

2014-12-22 Thread madhu phatak
Hi, you can map your vertices RDD as follows: val pairVertices = verticesRDD.map(vertice => (vertice, null)) The above gives you a pair RDD. After the join, make sure you remove the superfluous null values. On Tue, Dec 23, 2014 at 10:36 AM, Deep Pradhan pradhandeep1...@gmail.com wrote: Hi, I have two
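
A self-contained sketch of that suggestion with toy vertex and edge data (sc as in a shell); the "three-way" join becomes two successive pair-RDD joins:

    val vertices = sc.parallelize(Seq(1L, 2L, 3L))
    val edges    = sc.parallelize(Seq((1L, 2L), (2L, 3L)))      // (src, dst) pairs

    // Turn the plain vertex RDD into a pair RDD so it can take part in joins.
    val pairVertices = vertices.map(v => (v, ()))

    // Join on the source vertex, re-key by destination, then join again:
    // only edges whose src AND dst both exist in `vertices` survive.
    val bySrc    = edges.join(pairVertices)                     // (src, (dst, ()))
    val byDst    = bySrc.map { case (src, (dst, _)) => (dst, src) }
    val threeWay = byDst.join(pairVertices)                     // (dst, (src, ()))
      .map { case (dst, (src, _)) => (src, dst) }               // drop the placeholders

    threeWay.collect().foreach(println)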

Joins in Spark

2014-12-22 Thread pradhandeep

Fwd: Joins in Spark

2014-12-22 Thread Deep Pradhan
This gives me two pair RDDs: one is the edgesRDD and the other is the verticesRDD with each vertex padded with the value null. But I have to take a three-way join of these two RDDs, and I have only one common attribute between them. How can I go about doing the three-way join?