Hi Jacek, Thank you for the workaround. It is really working in this way: pos.as("p1").join(pos.as("p2")).filter($"p1.POSITION_ID0"===$"p2.POSITION_ID") I have checked, that in this way I get the same execution plan as for the join with renamed columns.
Best, Michael On Mon, Jan 15, 2018 at 10:33 PM, Jacek Laskowski <ja...@japila.pl> wrote: > Hi Michael, > > scala> spark.version > res0: String = 2.4.0-SNAPSHOT > > scala> val r1 = spark.range(1) > r1: org.apache.spark.sql.Dataset[Long] = [id: bigint] > > scala> r1.as("left").join(r1.as("right")).filter($"left.id" === > $"right.id").show > +---+---+ > | id| id| > +---+---+ > | 0| 0| > +---+---+ > > Am I missing something? When aliasing a table, use the identifier in column > refs (inside). > > > Pozdrawiam, > Jacek Laskowski > ---- > https://about.me/JacekLaskowski > Mastering Spark SQL https://bit.ly/mastering-spark-sql > Spark Structured Streaming https://bit.ly/spark-structured-streaming > Mastering Kafka Streams https://bit.ly/mastering-kafka-streams > Follow me at https://twitter.com/jaceklaskowski > > On Mon, Jan 15, 2018 at 3:26 PM, Michael Shtelma <mshte...@gmail.com> wrote: >> >> Hi Jacek & Gengliang, >> >> let's take a look at the following query: >> >> val pos = spark.read.parquet(prefix + "POSITION.parquet") >> pos.createOrReplaceTempView("POSITION") >> spark.sql("SELECT POSITION.POSITION_ID FROM POSITION POSITION JOIN >> POSITION POSITION1 ON POSITION.POSITION_ID0 = POSITION1.POSITION_ID >> ").collect() >> >> This query is working for me right now using spark 2.2. >> >> Now we can try implementing the same logic with DataFrame API: >> >> pos.join(pos, pos("POSITION_ID0")===pos("POSITION_ID")).collect() >> >> I am getting the following error: >> >> "Join condition is missing or trivial. >> >> Use the CROSS JOIN syntax to allow cartesian products between these >> relations.;" >> >> I have tried using alias function, but without success: >> >> val pos2 = pos.alias("P2") >> pos.join(pos2, pos("POSITION_ID0")===pos2("POSITION_ID")).collect() >> >> This also leads us to the same error. >> Am I missing smth about the usage of alias? >> >> Now let's rename the columns: >> >> val pos3 = pos.toDF(pos.columns.map(_ + "_2"): _*) >> pos.join(pos3, pos("POSITION_ID0")===pos3("POSITION_ID_2")).collect() >> >> It works! >> >> There is one more really odd thing about all this: a colleague of mine >> has managed to get the same exception ("Join condition is missing or >> trivial") also using original SQL query, but I think he has been using >> empty tables. >> >> Thanks, >> Michael >> >> >> On Mon, Jan 15, 2018 at 11:27 AM, Gengliang Wang >> <gengliang.w...@databricks.com> wrote: >> > Hi Michael, >> > >> > You can use `Explain` to see how your query is optimized. >> > >> > https://docs.databricks.com/spark/latest/spark-sql/language-manual/explain.html >> > I believe your query is an actual cross join, which is usually very slow >> > in >> > execution. >> > >> > To get rid of this, you can set `spark.sql.crossJoin.enabled` as true. >> > >> > >> > 在 2018年1月15日,下午6:09,Jacek Laskowski <ja...@japila.pl> 写道: >> > >> > Hi Michael, >> > >> > -dev +user >> > >> > What's the query? How do you "fool spark"? >> > >> > Pozdrawiam, >> > Jacek Laskowski >> > ---- >> > https://about.me/JacekLaskowski >> > Mastering Spark SQL https://bit.ly/mastering-spark-sql >> > Spark Structured Streaming https://bit.ly/spark-structured-streaming >> > Mastering Kafka Streams https://bit.ly/mastering-kafka-streams >> > Follow me at https://twitter.com/jaceklaskowski >> > >> > On Mon, Jan 15, 2018 at 10:23 AM, Michael Shtelma <mshte...@gmail.com> >> > wrote: >> >> >> >> Hi all, >> >> >> >> If I try joining the table with itself using join columns, I am >> >> getting the following error: >> >> "Join condition is missing or trivial. Use the CROSS JOIN syntax to >> >> allow cartesian products between these relations.;" >> >> >> >> This is not true, and my join is not trivial and is not a real cross >> >> join. I am providing join condition and expect to get maybe a couple >> >> of joined rows for each row in the original table. >> >> >> >> There is a workaround for this, which implies renaming all the columns >> >> in source data frame and only afterwards proceed with the join. This >> >> allows us to fool spark. >> >> >> >> Now I am wondering if there is a way to get rid of this problem in a >> >> better way? I do not like the idea of renaming the columns because >> >> this makes it really difficult to keep track of the names in the >> >> columns in result data frames. >> >> Is it possible to deactivate this check? >> >> >> >> Thanks, >> >> Michael >> >> >> >> --------------------------------------------------------------------- >> >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org >> >> >> > >> > > > --------------------------------------------------------------------- To unsubscribe e-mail: user-unsubscr...@spark.apache.org