I am curious if there is a way to call this so that it becomes a compile error rather than runtime error:
// Note misspelled count and name
ds.groupBy($"name").count.select('nam, $"coun").show

More specifically, what are the best type safety guarantees that Datasets provide? It seems like with DataFrames there is still the unsafety of specifying column names by string/symbol and expecting the type to be correct and to exist, but if you do something like this, then downstream code is safer:

// This is Array[(String, Long)] instead of Array[sql.Row]
ds.groupBy($"name").count.select('name.as[String], 'count.as[Long]).collect()

Does that seem like a correct understanding of Datasets?

On Sat, Jun 18, 2016 at 6:39 AM, Pedro Rodriguez <ski.rodrig...@gmail.com> wrote:

> Looks like it was my own fault. I had Spark 2.0 cloned/built, but had the
> spark shell in my path, so somehow 1.6.1 was being used instead of 2.0.
> Thanks
>
> On Sat, Jun 18, 2016 at 1:16 AM, Takeshi Yamamuro <linguin....@gmail.com> wrote:
>
>> Which version do you use?
>> It passed in 2.0-preview, as follows:
>>
>> Spark context available as 'sc' (master = local[*], app id = local-1466234043659).
>> Spark session available as 'spark'.
>>
>> Welcome to
>>       ____              __
>>      / __/__  ___ _____/ /__
>>     _\ \/ _ \/ _ `/ __/ '_/
>>    /___/ .__/\_,_/_/ /_/\_\   version 2.0.0-preview
>>       /_/
>>
>> Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_31)
>> Type in expressions to have them evaluated.
>> Type :help for more information.
>>
>> scala> val ds = Seq[Tuple2[Int, Int]]((1, 0), (2, 0)).toDS
>> hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
>> ds: org.apache.spark.sql.Dataset[(Int, Int)] = [_1: int, _2: int]
>>
>> scala> ds.groupBy($"_1").count.select($"_1", $"count").show
>> +---+-----+
>> | _1|count|
>> +---+-----+
>> |  1|    1|
>> |  2|    1|
>> +---+-----+
>>
>> On Sat, Jun 18, 2016 at 3:09 PM, Pedro Rodriguez <ski.rodrig...@gmail.com> wrote:
>>
>>> I went ahead and downloaded/compiled Spark 2.0 to try your code snippet,
>>> Takeshi. It unfortunately doesn't compile:
>>>
>>> scala> val ds = Seq[Tuple2[Int, Int]]((1, 0), (2, 0)).toDS
>>> ds: org.apache.spark.sql.Dataset[(Int, Int)] = [_1: int, _2: int]
>>>
>>> scala> ds.groupBy($"_1").count.select($"_1", $"count").show
>>> <console>:28: error: type mismatch;
>>>  found   : org.apache.spark.sql.ColumnName
>>>  required: org.apache.spark.sql.TypedColumn[(org.apache.spark.sql.Row, Long),?]
>>>        ds.groupBy($"_1").count.select($"_1", $"count").show
>>>                                              ^
>>>
>>> I also tried Xinh's suggestion using the code snippet below (partially
>>> from the Spark docs):
>>>
>>> scala> val ds = Seq(Person("Andy", 32), Person("Andy", 2), Person("Pedro", 24), Person("Bob", 42)).toDS()
>>> scala> ds.groupBy(_.name).count.select($"name".as[String]).show
>>> org.apache.spark.sql.AnalysisException: cannot resolve 'name' given input columns: [];
>>> scala> ds.groupBy(_.name).count.select($"_1".as[String]).show
>>> org.apache.spark.sql.AnalysisException: cannot resolve 'name' given input columns: [];
>>> scala> ds.groupBy($"name").count.select($"_1".as[String]).show
>>> org.apache.spark.sql.AnalysisException: cannot resolve '_1' given input columns: [];
>>>
>>> It looks like the columns are empty for some reason. The code below works
>>> fine for the simple aggregate:
>>>
>>> scala> ds.groupBy(_.name).count.show
>>>
>>> It would be great to see an idiomatic example of using aggregates like
>>> these mixed with spark.sql.functions.
>>>
>>> Pedro
>>>
>>> On Fri, Jun 17, 2016 at 9:59 PM, Pedro Rodriguez <ski.rodrig...@gmail.com> wrote:
>>>
>>>> Thanks Xinh and Takeshi,
>>>>
>>>> I am trying to avoid map, since my impression is that it uses a Scala
>>>> closure and so is not optimized as well as column-wise operations are.
>>>>
>>>> Looks like the $ notation is the way to go; thanks for the help. Is
>>>> there an explanation of how this works? I imagine it is a method/function
>>>> with its name defined as $ in Scala?
>>>>
>>>> Lastly, are there preliminary Spark 2.0 docs? If there isn't a good
>>>> description/guide of this syntax, I would be willing to contribute some
>>>> documentation.
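[Editor's note on the $ question above: it is indeed a method literally named $, supplied by an implicit class on StringContext that Spark brings into scope via spark.implicits._. A minimal standalone sketch of the mechanism, using a toy Column type rather than Spark's real one:]

```scala
// Sketch of how $"uid" works. A toy Column stands in for Spark's;
// only the StringContext trick itself is the point here.
case class Column(name: String)

object ColumnSyntax {
  implicit class StringToColumn(val sc: StringContext) {
    // $"uid" desugars to StringContext("uid").$(), so this method runs.
    def $(args: Any*): Column = Column(sc.s(args: _*))
  }
}

import ColumnSyntax._

val col = $"uid"
// col == Column("uid")
```

[The same pattern underlies any custom string interpolator in Scala; Spark just names its interpolator method $.]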
>>>>
>>>> Pedro
>>>>
>>>> On Fri, Jun 17, 2016 at 8:53 PM, Takeshi Yamamuro <linguin....@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> In 2.0, you can say:
>>>>>
>>>>> val ds = Seq[Tuple2[Int, Int]]((1, 0), (2, 0)).toDS
>>>>> ds.groupBy($"_1").count.select($"_1", $"count").show
>>>>>
>>>>> // maropu
>>>>>
>>>>> On Sat, Jun 18, 2016 at 7:53 AM, Xinh Huynh <xinh.hu...@gmail.com> wrote:
>>>>>
>>>>>> Hi Pedro,
>>>>>>
>>>>>> In 1.6.1, you can do:
>>>>>>
>>>>>> >> ds.groupBy(_.uid).count().map(_._1)
>>>>>> or
>>>>>> >> ds.groupBy(_.uid).count().select($"value".as[String])
>>>>>>
>>>>>> It doesn't have exactly the same syntax as for DataFrame:
>>>>>> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset
>>>>>>
>>>>>> It might be different in 2.0.
>>>>>>
>>>>>> Xinh
>>>>>>
>>>>>> On Fri, Jun 17, 2016 at 3:33 PM, Pedro Rodriguez <ski.rodrig...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi All,
>>>>>>>
>>>>>>> I am working on using Datasets in 1.6.1 and eventually in 2.0 when it's
>>>>>>> released.
>>>>>>>
>>>>>>> I am running the aggregate code below on a dataset whose rows have a
>>>>>>> field uid:
>>>>>>>
>>>>>>> ds.groupBy(_.uid).count()
>>>>>>> // res0: org.apache.spark.sql.Dataset[(String, Long)] = [_1: string, _2: bigint]
>>>>>>>
>>>>>>> This works as expected; however, attempts to run select statements
>>>>>>> after it fail:
>>>>>>>
>>>>>>> ds.groupBy(_.uid).count().select(_._1)
>>>>>>> // error: missing parameter type for expanded function ((x$2) => x$2._1)
>>>>>>> //        ds.groupBy(_.uid).count().select(_._1)
>>>>>>>
>>>>>>> I have tried several variants, but nothing seems to work. Below is
>>>>>>> the equivalent DataFrame code, which works as expected:
>>>>>>>
>>>>>>> df.groupBy("uid").count().select("uid")
>>>>>>>
>>>>>>> Thanks!
>>>>>>> --
>>>>>>> Pedro Rodriguez
>>>>>>> PhD Student in Distributed Machine Learning | CU Boulder
>>>>>>> UC Berkeley AMPLab Alumni
>>>>>>> ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423
>>>>>>> Github: github.com/EntilZha | LinkedIn: https://www.linkedin.com/in/pedrorodriguezscience

>>>>> --
>>>>> ---
>>>>> Takeshi Yamamuro

--
Pedro Rodriguez
PhD Student in Distributed Machine Learning | CU Boulder
UC Berkeley AMPLab Alumni

ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423
Github: github.com/EntilZha | LinkedIn: https://www.linkedin.com/in/pedrorodriguezscience
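[Editor's note on the compile-error question at the top of the thread: string/symbol column references can only fail when the query is analyzed at runtime, whereas lambda-based field access fails at compile time. A hypothetical, Spark-free sketch of that difference — the names below are illustrative, not Spark API:]

```scala
// Hypothetical sketch: string-keyed access fails at runtime,
// typed field access fails at compile time.
case class Record(name: String, count: Long)

// DataFrame-style: the column is a string, so a typo like "coun" is only
// detected when the lookup runs (analogous to an AnalysisException).
def selectByName(r: Record, column: String): Any = column match {
  case "name"  => r.name
  case "count" => r.count
  case other   => throw new IllegalArgumentException(s"cannot resolve '$other'")
}

// Dataset-style: the column is a field access, so a typo such as r.coun
// would be rejected by the compiler before anything runs.
def selectTyped(r: Record): Long = r.count
```

[In the same spirit, ds.map(_.name) is checked by the Scala compiler, while ds.select($"name") is only checked when Spark analyzes the plan — which is why the typed .as[String]/.as[Long] style in the opening message makes downstream code safer without making the column names themselves compile-checked.]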