A quick unit test attempt at replacing map with as[] didn't get far. I'm
only working against 1.6.1 at the moment, though; I was going to try 2.0,
but I'm having a hard time building a working spark-sql jar from source.
The only ones I've managed to make are intended for the full assembly fat
jar.


Here is an example of the error from calling joinWith with left_outer and
then .as[(Option[T], Option[U])], where T and U are both (Int, Int):

[info] newinstance(class scala.Tuple2,decodeusingserializer(input[0, StructType(StructField(_1,IntegerType,true), StructField(_2,IntegerType,true))],scala.Option,true),decodeusingserializer(input[1, StructType(StructField(_1,IntegerType,true), StructField(_2,IntegerType,true))],scala.Option,true),false,ObjectType(class scala.Tuple2),None)
[info] :- decodeusingserializer(input[0, StructType(StructField(_1,IntegerType,true), StructField(_2,IntegerType,true))],scala.Option,true)
[info] :  +- input[0, StructType(StructField(_1,IntegerType,true), StructField(_2,IntegerType,true))]
[info] +- decodeusingserializer(input[1, StructType(StructField(_1,IntegerType,true), StructField(_2,IntegerType,true))],scala.Option,true)
[info]    +- input[1, StructType(StructField(_1,IntegerType,true), StructField(_2,IntegerType,true))]

Cause: java.util.concurrent.ExecutionException: java.lang.Exception: failed
to compile: org.codehaus.commons.compiler.CompileException: File
'generated.java', Line 32, Column 60: No applicable constructor/method
found for actual parameters "org.apache.spark.sql.catalyst.InternalRow";
candidates are: "public static java.nio.ByteBuffer
java.nio.ByteBuffer.wrap(byte[])", "public static java.nio.ByteBuffer
java.nio.ByteBuffer.wrap(byte[], int, int)"

The generated code is passing InternalRow objects into ByteBuffer.wrap,
which only accepts byte arrays.

This starts from two Datasets of type Dataset[(Int, Int)] joined with the
expression $"left._1" === $"right._1"; a minimal sketch of the failing call
is below. I'll have to spend some time getting a better understanding of
this analysis phase, but hopefully I can come up with something.
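
Roughly the shape of the failing call (test data and names are mine, not
the actual suite; the tree above shows serializer-based Option decoding,
so this assumes kryo encoders for the Option sides):

  import org.apache.spark.sql.{Encoder, Encoders}
  import sqlContext.implicits._  // assumes a Spark 1.6.x SQLContext named sqlContext

  // Hypothetical encoder setup for the Option-typed result tuple
  implicit val optEnc: Encoder[Option[(Int, Int)]] =
    Encoders.kryo[Option[(Int, Int)]]
  implicit val pairEnc: Encoder[(Option[(Int, Int)], Option[(Int, Int)])] =
    Encoders.tuple(optEnc, optEnc)

  val left  = Seq((1, 10), (2, 20)).toDS().as("left")
  val right = Seq((1, 100), (3, 300)).toDS().as("right")

  // left_outer joinWith, then as[] onto Option-typed sides; collect() forces
  // code generation, which is where the CompileException surfaces
  left.joinWith(right, $"left._1" === $"right._1", "left_outer")
    .as[(Option[(Int, Int)], Option[(Int, Int)])]
    .collect()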

On Wed, Jun 1, 2016 at 3:43 PM, Michael Armbrust <mich...@databricks.com>
wrote:

> Option should play nicely with encoders, but it's always possible there
> are bugs.  I think those function signatures are slightly more expensive
> (one extra object allocation), and they're not as Java-friendly, so we
> probably don't want them to be the default.
>
> That said, I would like to enable that kind of sugar while still taking
> advantage of all the optimizations going on under the covers.  Can you get
> it to work if you use `as[...]` instead of `map`?
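>
> A sketch of the two shapes in question (helper names and signatures are
> illustrative, not an existing API):
>
>   import org.apache.spark.sql.{Column, Dataset, Encoder}
>
>   // map-based: decode each joined pair by hand, wrapping the nullable side
>   def viaMap[L, R](l: Dataset[L], r: Dataset[R], cond: Column)(
>       implicit enc: Encoder[(L, Option[R])]): Dataset[(L, Option[R])] =
>     l.joinWith(r, cond, "left_outer").map { case (a, b) => (a, Option(b)) }
>
>   // as[]-based: ask the encoder framework to produce the Option directly
>   def viaAs[L, R](l: Dataset[L], r: Dataset[R], cond: Column)(
>       implicit enc: Encoder[(L, Option[R])]): Dataset[(L, Option[R])] =
>     l.joinWith(r, cond, "left_outer").as[(L, Option[R])]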
>
> On Wed, Jun 1, 2016 at 11:59 AM, Richard Marscher <
> rmarsc...@localytics.com> wrote:
>
>> Ah thanks, I missed seeing the PR for
>> https://issues.apache.org/jira/browse/SPARK-15441. If the missing rows
>> become null objects, then I can implement methods that map them back to
>> results that align more closely with the RDD interface.
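>>
>> A rough sketch of the kind of enrichment I mean (names are mine; this
>> assumes the outer join really does null out the missing side):
>>
>>   import org.apache.spark.sql.{Dataset, Encoder}
>>
>>   object OuterJoinSyntax {
>>     implicit class RightOuterOps[L, R](private val ds: Dataset[(L, R)])
>>         extends AnyVal {
>>       // Wrap the possibly-null right side of a left outer joinWith in
>>       // Option, so misses come back as None instead of null
>>       def rightAsOption(
>>           implicit enc: Encoder[(L, Option[R])]): Dataset[(L, Option[R])] =
>>         ds.map { case (l, r) => (l, Option(r)) }
>>     }
>>   }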
>>
>> As a follow-on, I'm curious about thoughts on enriching the Dataset join
>> interface itself versus leaving that to a separate package or to users
>> sugaring for themselves. I haven't considered the implications of what
>> Datasets, Tungsten, and/or bytecode generation can now optimize in joins,
>> so I may be missing a critical benefit there, say around avoiding Options
>> in favor of nulls. If nothing else, Option doesn't have a first-class
>> Encoder or DataType yet, and maybe for good reasons.
>>
>> I did find the RDD join interface elegant, though. In an ideal world, an
>> API comparable to the following would be nice:
>> https://gist.github.com/rmarsch/3ea78b3a9a8a0e83ce162ed947fcab06
>>
>>
>> On Wed, Jun 1, 2016 at 1:42 PM, Michael Armbrust <mich...@databricks.com>
>> wrote:
>>
>>> Thanks for the feedback.  I think this will address at least some of the
>>> problems you are describing: https://github.com/apache/spark/pull/13425
>>>
>>> On Wed, Jun 1, 2016 at 9:58 AM, Richard Marscher <
>>> rmarsc...@localytics.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> I've been working on transitioning from RDD to Datasets in our codebase
>>>> in anticipation of being able to leverage features of 2.0.
>>>>
>>>> I'm having a lot of difficulty with the impedance mismatch between how
>>>> outer joins worked with RDD versus how they work with Dataset. The
>>>> Dataset joins feel like a big step backwards, IMO. With RDD,
>>>> leftOuterJoin would give you Option types for the results from the
>>>> right side of the join. This follows idiomatic Scala in avoiding nulls
>>>> and was easy to work with.
>>>>
>>>> Now with Dataset there is only joinWith, where you specify the join
>>>> type, but it loses all the semantics of identifying missing data from
>>>> outer joins. If Dataset nulled out all mismatching data from an outer
>>>> join, I could write enriched methods on Dataset with an implicit class
>>>> to abstract the messiness away; the problem goes even further, though,
>>>> in that the values aren't always null. Integer, for example, defaults
>>>> to -1 instead of null. That makes it completely ambiguous which data in
>>>> the join was actually there versus populated via this atypical
>>>> semantic.
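>>>>
>>>> To make the contrast concrete, a sketch with toy data (names and
>>>> values are mine):
>>>>
>>>>   val l = sc.parallelize(Seq((1, "a"), (2, "b")))
>>>>   val r = sc.parallelize(Seq((1, "x")))
>>>>   l.leftOuterJoin(r).collect()
>>>>   // Array((1,(a,Some(x))), (2,(b,None))) -- misses are explicit Options
>>>>
>>>>   import sqlContext.implicits._
>>>>   val dl = Seq((1, "a"), (2, "b")).toDS().as("l")
>>>>   val dr = Seq((1, "x")).toDS().as("r")
>>>>   dl.joinWith(dr, $"l._1" === $"r._1", "left_outer").collect()
>>>>   // misses come back as defaults (null for String, -1 for Int), not None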
>>>>
>>>> Are there additional options available to work around this issue? I can
>>>> convert to RDD and back to Dataset but that's less than ideal.
>>>>
>>>> Thanks,
>>>> --
>>>> *Richard Marscher*
>>>> Senior Software Engineer
>>>> Localytics
>>>>
>>>
>>>
>>
>>
>> --
>> *Richard Marscher*
>> Senior Software Engineer
>> Localytics
>>
>
>


-- 
*Richard Marscher*
Senior Software Engineer
Localytics
