A quick unit test attempt at replacing map with as[] didn't get far. I'm only working against 1.6.1 at the moment, though. I was going to try 2.0, but I'm having a hard time building a working spark-sql jar from source; the only ones I've managed to make are intended for the full assembly fat jar.
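For concreteness, here is roughly the shape of what the test does (a minimal, self-contained sketch against the 1.6.x API with made-up data, a local SparkContext, and a guessed target type for the as[...] call; the real unit test differs):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object JoinWithAsSketch {
  def main(args: Array[String]): Unit = {
    // Local context just to make the sketch runnable on its own
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("joinWith-as-sketch"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Two small Dataset[(Int, Int)] standing in for the real inputs
    val left  = Seq((1, 10), (2, 20)).toDS().as("left")
    val right = Seq((1, 100), (3, 300)).toDS().as("right")

    // left_outer joinWith, then try to recover Option semantics with as[...]
    // instead of map
    val joined = left
      .joinWith(right, $"left._1" === $"right._1", "left_outer")
      .as[(Option[(Int, Int)], Option[(Int, Int)])]

    // Running an action is what forces encoder resolution and codegen
    joined.collect().foreach(println)

    sc.stop()
  }
}

Running the action on the joined result is where the error below shows up.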
Example of the error from calling joinWith as left_outer and then .as[(Option[T], U)], where T and U are both Int:

[info] newinstance(class scala.Tuple2,decodeusingserializer(input[0, StructType(StructField(_1,IntegerType,true), StructField(_2,IntegerType,true))],scala.Option,true),decodeusingserializer(input[1, StructType(StructField(_1,IntegerType,true), StructField(_2,IntegerType,true))],scala.Option,true),false,ObjectType(class scala.Tuple2),None)
[info] :- decodeusingserializer(input[0, StructType(StructField(_1,IntegerType,true), StructField(_2,IntegerType,true))],scala.Option,true)
[info] :  +- input[0, StructType(StructField(_1,IntegerType,true), StructField(_2,IntegerType,true))]
[info] +- decodeusingserializer(input[1, StructType(StructField(_1,IntegerType,true), StructField(_2,IntegerType,true))],scala.Option,true)
[info]    +- input[1, StructType(StructField(_1,IntegerType,true), StructField(_2,IntegerType,true))]

Cause: java.util.concurrent.ExecutionException: java.lang.Exception: failed to compile:
org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 32, Column 60:
No applicable constructor/method found for actual parameters "org.apache.spark.sql.catalyst.InternalRow";
candidates are: "public static java.nio.ByteBuffer java.nio.ByteBuffer.wrap(byte[])",
"public static java.nio.ByteBuffer java.nio.ByteBuffer.wrap(byte[], int, int)"

The generated code is passing InternalRow objects into ByteBuffer.wrap. This starts from two Datasets of type Dataset[(Int, Int)] joined with the expression $"left._1" === $"right._1". I'll have to spend some time getting a better understanding of this analysis phase, but hopefully I can come up with something.

On Wed, Jun 1, 2016 at 3:43 PM, Michael Armbrust <mich...@databricks.com> wrote:

> Option should play nicely with encoders, but it's always possible there
> are bugs. I think those function signatures are slightly more expensive
> (one extra object allocation), and they're not as Java-friendly, so we
> probably don't want them to be the default.
>
> That said, I would like to enable that kind of sugar while still taking
> advantage of all the optimizations going on under the covers. Can you get
> it to work if you use `as[...]` instead of `map`?
>
> On Wed, Jun 1, 2016 at 11:59 AM, Richard Marscher <rmarsc...@localytics.com> wrote:
>
>> Ah, thanks, I missed seeing the PR for
>> https://issues.apache.org/jira/browse/SPARK-15441. If the rows become
>> null objects, then I can implement methods that map those back to
>> results that align more closely with the RDD interface.
>>
>> As a follow-on, I'm curious about thoughts regarding enriching the
>> Dataset join interface versus a package or users adding sugar for
>> themselves. I haven't considered the implications of what the
>> optimizations in Datasets, Tungsten, and/or bytecode generation can do
>> now regarding joins, so I may be missing a critical benefit there
>> around, say, avoiding Options in favor of nulls. If nothing else, I
>> guess Option doesn't have a first-class Encoder or DataType yet, and
>> maybe for good reasons.
>>
>> I did find the RDD join interface elegant, though. In an ideal world, an
>> API comparable to the following would be nice:
>> https://gist.github.com/rmarsch/3ea78b3a9a8a0e83ce162ed947fcab06
>>
>> On Wed, Jun 1, 2016 at 1:42 PM, Michael Armbrust <mich...@databricks.com> wrote:
>>
>>> Thanks for the feedback.
>>> I think this will address at least some of the problems you are
>>> describing: https://github.com/apache/spark/pull/13425
>>>
>>> On Wed, Jun 1, 2016 at 9:58 AM, Richard Marscher <rmarsc...@localytics.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> I've been working on transitioning from RDD to Datasets in our codebase
>>>> in anticipation of being able to leverage features of 2.0.
>>>>
>>>> I'm having a lot of difficulty with the impedance mismatches between
>>>> how outer joins worked with RDD versus Dataset. The Dataset joins feel
>>>> like a big step backwards, IMO. With RDD, leftOuterJoin would give you
>>>> Option types for the results from the right side of the join. This
>>>> follows idiomatic Scala in avoiding nulls and was easy to work with.
>>>>
>>>> Now with Dataset there is only joinWith, where you specify the join
>>>> type, but it loses all the semantics of identifying missing data from
>>>> outer joins. I could write some enriched methods on Dataset with an
>>>> implicit class to abstract the messiness away if Dataset nulled out all
>>>> mismatching data from an outer join; however, the problem goes even
>>>> further in that the values aren't always null. Integer, for example,
>>>> defaults to -1 instead of null. Now it's completely ambiguous which
>>>> data in the join was actually there versus populated via this atypical
>>>> semantic.
>>>>
>>>> Are there additional options available to work around this issue? I can
>>>> convert to RDD and back to Dataset, but that's less than ideal.
>>>>
>>>> Thanks,
>>>> --
>>>> *Richard Marscher*
>>>> Senior Software Engineer
>>>> Localytics
>>>> Localytics.com <http://localytics.com/> | Our Blog
>>>> <http://localytics.com/blog> | Twitter <http://twitter.com/localytics>
>>>> | Facebook <http://facebook.com/localytics> | LinkedIn
>>>> <http://www.linkedin.com/company/1148792?trk=tyah>
>>
>> --
>> *Richard Marscher*
>> Senior Software Engineer
>> Localytics
>> Localytics.com <http://localytics.com/> | Our Blog
>> <http://localytics.com/blog> | Twitter <http://twitter.com/localytics> |
>> Facebook <http://facebook.com/localytics> | LinkedIn
>> <http://www.linkedin.com/company/1148792?trk=tyah>

--
*Richard Marscher*
Senior Software Engineer
Localytics
Localytics.com <http://localytics.com/> | Our Blog <http://localytics.com/blog> |
Twitter <http://twitter.com/localytics> | Facebook <http://facebook.com/localytics> |
LinkedIn <http://www.linkedin.com/company/1148792?trk=tyah>