Thanks Xinh and Takeshi. I am trying to avoid map because my impression is that it runs an opaque Scala closure, which Spark cannot optimize as well as column-wise operations.
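
To make the comparison concrete, here is a rough sketch of the two styles side by side. It assumes the Spark 2.0 API (SparkSession, and groupByKey in place of the typed groupBy from 1.6.1); the Event case class and the MapVsColumn object are made-up names for illustration, and only the uid field comes from my original question. Printing the physical plans should be enough to see how the two queries differ.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical row type; only the uid field comes from the thread.
case class Event(uid: String)

object MapVsColumn {
  def main(args: Array[String]): Unit = {
    // Local session for experimentation (assumes Spark 2.0 on the classpath).
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("map-vs-column")
      .getOrCreate()
    import spark.implicits._

    val ds = Seq(Event("a"), Event("a"), Event("b")).toDS()

    // Closure-based: the lambdas are opaque to the optimizer, so rows are
    // deserialized into JVM objects before each function is applied.
    val viaMap = ds.groupByKey(_.uid).count().map(_._1)

    // Column-based: the whole query is expressed as column expressions
    // that the optimizer can inspect.
    val viaColumn = ds.groupBy($"uid").count().select($"uid")

    // Print both physical plans to compare them.
    viaMap.explain()
    viaColumn.explain()

    spark.stop()
  }
}
```
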
Looks like the $ notation is the way to go, thanks for the help. Is there an explanation of how this works? I imagine it is a method/function with its name defined as $ in Scala?

Lastly, are there preliminary Spark 2.0 docs? If there isn't a good description/guide for this syntax, I would be willing to contribute some documentation.

Pedro

On Fri, Jun 17, 2016 at 8:53 PM, Takeshi Yamamuro <linguin....@gmail.com> wrote:

> Hi,
>
> In 2.0, you can say;
> val ds = Seq[Tuple2[Int, Int]]((1, 0), (2, 0)).toDS
> ds.groupBy($"_1").count.select($"_1", $"count").show
>
> // maropu
>
> On Sat, Jun 18, 2016 at 7:53 AM, Xinh Huynh <xinh.hu...@gmail.com> wrote:
>
>> Hi Pedro,
>>
>> In 1.6.1, you can do:
>>
>> ds.groupBy(_.uid).count().map(_._1)
>> or
>> ds.groupBy(_.uid).count().select($"value".as[String])
>>
>> It doesn't have the exact same syntax as for DataFrame.
>> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset
>>
>> It might be different in 2.0.
>>
>> Xinh
>>
>> On Fri, Jun 17, 2016 at 3:33 PM, Pedro Rodriguez <ski.rodrig...@gmail.com> wrote:
>>
>>> Hi All,
>>>
>>> I am working on using Datasets in 1.6.1 and eventually 2.0 when it's
>>> released.
>>>
>>> I am running the aggregate code below on a dataset whose rows have a
>>> field uid:
>>>
>>> ds.groupBy(_.uid).count()
>>> // res0: org.apache.spark.sql.Dataset[(String, Long)] = [_1: string, _2: bigint]
>>>
>>> This works as expected; however, attempts to run select statements
>>> afterwards fail:
>>> ds.groupBy(_.uid).count().select(_._1)
>>> // error: missing parameter type for expanded function ((x$2) => x$2._1)
>>> ds.groupBy(_.uid).count().select(_._1)
>>>
>>> I have tried several variants, but nothing seems to work. Below is the
>>> equivalent DataFrame code, which works as expected:
>>> df.groupBy("uid").count().select("uid")
>>>
>>> Thanks!
>>> --
>>> Pedro Rodriguez
>>> PhD Student in Distributed Machine Learning | CU Boulder
>>> UC Berkeley AMPLab Alumni
>>>
>>> ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423
>>> Github: github.com/EntilZha | LinkedIn:
>>> https://www.linkedin.com/in/pedrorodriguezscience
>>
>
> --
> ---
> Takeshi Yamamuro

--
Pedro Rodriguez
PhD Student in Distributed Machine Learning | CU Boulder
UC Berkeley AMPLab Alumni

ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423
Github: github.com/EntilZha | LinkedIn:
https://www.linkedin.com/in/pedrorodriguezscience
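
P.S. On the question of how the $ notation could work: below is a plain-Scala sketch of one mechanism that would produce this syntax, with no Spark dependency. Col and StringToCol are made-up names for illustration, and this is not Spark's actual source. The idea is that $"uid" is ordinary string-interpolation syntax, so an implicit class on StringContext with a method literally named $ is enough; as far as I can tell, the Spark implicits import brings in something along these lines.

```scala
object DollarSketch {
  // Hypothetical stand-in for a column type; it only records the column name.
  final case class Col(name: String)

  // An implicit class on StringContext adds a method literally named `$`,
  // so the compiler rewrites $"uid" into StringContext("uid").$().
  implicit class StringToCol(val sc: StringContext) extends AnyVal {
    // sc.s(...) performs the standard interpolation to build the name.
    def $(args: Any*): Col = Col(sc.s(args: _*))
  }

  def main(args: Array[String]): Unit = {
    val field = "count"
    println($"uid")             // prints Col(uid)
    println($"${field}_total")  // prints Col(count_total)
  }
}
```

If this is right, $ is not special syntax at all: any prefix placed in front of a string literal is looked up as a method on StringContext, which is why the implicits import has to be in scope for it to compile.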