Oops, I screwed up my example. This is what it should be:

case class Test(id: Int)
val test = Seq(
  Test(1),
  Test(2),
  Test(3)
).toDS
test.as("t1").joinWith(test.as("t2"), $"t1.id" === $"t2.id").show
val testMapped = test.map(t => t.copy(id = t.id + 1))
testMapped.as("t1").joinWith(testMapped.as("t2"), $"t1.id" === $"t2.id").show
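For anyone else who hits this before it's fixed, here's one possible
workaround (just a sketch, not something I've verified; it assumes a 1.6
spark-shell with sqlContext and its implicits in scope): round-trip the
mapped Dataset through its RDD so the two sides of the self-join no longer
share the mapped logical plan:

// Untested workaround sketch: createDataset on the underlying RDD gives
// the Dataset a fresh logical plan, so the aliases should resolve again.
import sqlContext.implicits._
val testMapped = test.map(t => t.copy(id = t.id + 1))
val rebuilt = sqlContext.createDataset(testMapped.rdd)
rebuilt.as("t1").joinWith(rebuilt.as("t2"), $"t1.id" === $"t2.id").show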
On Fri, May 27, 2016 at 11:16 AM Tim Gautier <tim.gaut...@gmail.com> wrote:

> I figured out the trigger. Turns out it wasn't because I loaded it from
> the database, it was because the first thing I do after loading is to
> lowercase all the strings. After a Dataset has been mapped, the resulting
> Dataset can't be self-joined. Here's a test case that illustrates the
> issue:
>
> case class Test(id: Int)
> val test = Seq(
>   Test(1),
>   Test(2),
>   Test(3)
> ).toDS
> test.as("t1").joinWith(test.as("t2"), $"t1.id" === $"t2.id").show // <-- works fine
> val testMapped = test.map(_.id + 1) // add 1 to each
> testMapped.as("t1").joinWith(testMapped.as("t2"), $"t1.id" === $"t2.id").show // <-- error
>
> On Fri, May 27, 2016 at 10:44 AM Tim Gautier <tim.gaut...@gmail.com> wrote:
>
>> I stand corrected. I just created a test table with a single int field
>> to test with, and the Dataset loaded from that works with no issues.
>> I'll see if I can track down exactly what the difference might be.
>>
>> On Fri, May 27, 2016 at 10:29 AM Tim Gautier <tim.gaut...@gmail.com> wrote:
>>
>>> I'm using 1.6.1.
>>>
>>> I'm not sure what good fake data would do, since it doesn't seem to
>>> have anything to do with the data itself. It has to do with how the
>>> Dataset was created. Both datasets have exactly the same data in them,
>>> but the one created from a SQL query fails where the one created from a
>>> Seq works. The case class is just a few Option[Int] and Option[String]
>>> fields, nothing special.
>>>
>>> Obviously there's some sort of difference between the two datasets,
>>> but Spark tells me they're exactly the same type with exactly the same
>>> data, so I couldn't create a test case without actually accessing a SQL
>>> database.
>>>
>>> On Fri, May 27, 2016 at 10:15 AM Ted Yu <yuzhih...@gmail.com> wrote:
>>>
>>>> Which release of Spark are you using?
>>>>
>>>> Is it possible to come up with fake data that shows what you
>>>> described?
>>>>
>>>> Thanks
>>>>
>>>> On Fri, May 27, 2016 at 8:24 AM, Tim Gautier <tim.gaut...@gmail.com> wrote:
>>>>
>>>>> Unfortunately I can't show exactly the data I'm using, but this is
>>>>> what I'm seeing:
>>>>>
>>>>> I have a case class 'Product' that represents a table in our
>>>>> database. I load that data via
>>>>> sqlContext.read.format("jdbc").options(...).load.as[Product]
>>>>> and register it in a temp table 'product'.
>>>>>
>>>>> For testing, I created a new Dataset that has only 3 records in it:
>>>>>
>>>>> val ts = sqlContext.sql("select * from product where product_catalog_id in (1, 2, 3)").as[Product]
>>>>>
>>>>> I also created another one using the same case class and data, but
>>>>> from a sequence instead:
>>>>>
>>>>> val ds: Dataset[Product] = Seq(
>>>>>   Product(Some(1), ...),
>>>>>   Product(Some(2), ...),
>>>>>   Product(Some(3), ...)
>>>>> ).toDS
>>>>>
>>>>> The spark shell tells me these are exactly the same type at this
>>>>> point, but they don't behave the same:
>>>>>
>>>>> ts.as("ts1").joinWith(ts.as("ts2"), $"ts1.product_catalog_id" === $"ts2.product_catalog_id")
>>>>> ds.as("ds1").joinWith(ds.as("ds2"), $"ds1.product_catalog_id" === $"ds2.product_catalog_id")
>>>>>
>>>>> Again, Spark tells me these self joins return exactly the same type,
>>>>> but when I do a .show on them, only the one created from a Seq works.
>>>>> The one created by reading from the database throws this error:
>>>>>
>>>>> org.apache.spark.sql.AnalysisException: cannot resolve
>>>>> 'ts1.product_catalog_id' given input columns: [..., product_catalog_id, ...];
>>>>>
>>>>> Is this a bug? Is there any way to make the Dataset loaded from a
>>>>> table behave like the one created from a sequence?
>>>>>
>>>>> Thanks,
>>>>> Tim
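For the jdbc-loaded case in the quoted thread, another possible angle
(again only an untested sketch; it assumes the `ts` Dataset and the
product_catalog_id column from the example above): drop down to
DataFrames and rename the join column on each side, so nothing depends
on alias resolution at all:

// Untested sketch: side-step Dataset alias resolution by renaming the
// join column on each side before a plain DataFrame join.
// Assumes `ts` is the jdbc-loaded Dataset[Product] from the example.
val left  = ts.toDF().withColumnRenamed("product_catalog_id", "left_id")
val right = ts.toDF().withColumnRenamed("product_catalog_id", "right_id")
left.join(right, $"left_id" === $"right_id").show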