Looks like it was my own fault. I had Spark 2.0 cloned and built, but an older spark-shell was on my PATH, so 1.6.1 was being used instead of 2.0. Thanks
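(Editor's note: for anyone hitting the same mismatch, a quick way to see which spark-shell the shell resolves; the build path in the comment is hypothetical.)

```shell
# Check which spark-shell is first on the PATH; an older install can
# shadow a freshly built one.
command -v spark-shell || echo "spark-shell not on PATH"

# Print the resolved version, if any (silently skipped when not installed).
spark-shell --version 2>/dev/null || true

# To be certain the new build is used, invoke it by its full path, e.g.:
#   ~/spark-2.0.0-preview/bin/spark-shell
```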
On Sat, Jun 18, 2016 at 1:16 AM, Takeshi Yamamuro <linguin....@gmail.com> wrote:

> Which version do you use?
> It passed in 2.0-preview as follows:
> ---
>
> Spark context available as 'sc' (master = local[*], app id = local-1466234043659).
> Spark session available as 'spark'.
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/ '_/
>    /___/ .__/\_,_/_/ /_/\_\   version 2.0.0-preview
>       /_/
>
> Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_31)
> Type in expressions to have them evaluated.
> Type :help for more information.
>
> scala> val ds = Seq[Tuple2[Int, Int]]((1, 0), (2, 0)).toDS
> hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
> ds: org.apache.spark.sql.Dataset[(Int, Int)] = [_1: int, _2: int]
>
> scala> ds.groupBy($"_1").count.select($"_1", $"count").show
> +---+-----+
> | _1|count|
> +---+-----+
> |  1|    1|
> |  2|    1|
> +---+-----+
>
> On Sat, Jun 18, 2016 at 3:09 PM, Pedro Rodriguez <ski.rodrig...@gmail.com> wrote:
>
>> I went ahead and downloaded/compiled Spark 2.0 to try your code snippet, Takeshi. Unfortunately, it doesn't compile.
>>
>> scala> val ds = Seq[Tuple2[Int, Int]]((1, 0), (2, 0)).toDS
>> ds: org.apache.spark.sql.Dataset[(Int, Int)] = [_1: int, _2: int]
>>
>> scala> ds.groupBy($"_1").count.select($"_1", $"count").show
>> <console>:28: error: type mismatch;
>>  found   : org.apache.spark.sql.ColumnName
>>  required: org.apache.spark.sql.TypedColumn[(org.apache.spark.sql.Row, Long),?]
>>               ds.groupBy($"_1").count.select($"_1", $"count").show
>>                                               ^
>>
>> I also gave a try to Xinh's suggestion using the code snippet below (partially from the Spark docs):
>>
>> scala> val ds = Seq(Person("Andy", 32), Person("Andy", 2), Person("Pedro", 24), Person("Bob", 42)).toDS()
>>
>> scala> ds.groupBy(_.name).count.select($"name".as[String]).show
>> org.apache.spark.sql.AnalysisException: cannot resolve 'name' given input columns: [];
>>
>> scala> ds.groupBy(_.name).count.select($"_1".as[String]).show
>> org.apache.spark.sql.AnalysisException: cannot resolve 'name' given input columns: [];
>>
>> scala> ds.groupBy($"name").count.select($"_1".as[String]).show
>> org.apache.spark.sql.AnalysisException: cannot resolve '_1' given input columns: [];
>>
>> It looks like the columns are empty for some reason. The code below works fine for the simple aggregate:
>>
>> scala> ds.groupBy(_.name).count.show
>>
>> It would be great to see an idiomatic example of using aggregates like these mixed with spark.sql.functions.
>>
>> Pedro
>>
>> On Fri, Jun 17, 2016 at 9:59 PM, Pedro Rodriguez <ski.rodrig...@gmail.com> wrote:
>>
>>> Thanks Xinh and Takeshi,
>>>
>>> I am trying to avoid map, since my impression is that it uses a Scala closure and so is not optimized as well as column-wise operations are.
>>>
>>> Looks like the $ notation is the way to go; thanks for the help. Is there an explanation of how this works? I imagine it is a method/function with its name defined as $ in Scala?
>>>
>>> Lastly, are there preliminary Spark 2.0 docs? If there isn't a good description/guide of using this syntax, I would be willing to contribute some documentation.
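(Editor's note: to the question about how `$"name"` works: Spark defines an implicit class on `StringContext` (named `StringToColumn`, brought into scope by `import spark.implicits._` / `sqlContext.implicits._`), which turns `$` into a custom string interpolator returning a `ColumnName`. A minimal Spark-free sketch of that mechanism, where `Column` is a stand-in for Spark's type:)

```scala
// Sketch of how Spark's $"col" syntax works: `$"uid"` desugars to
// `StringContext("uid").$()`, so an implicit class adding a `$` method
// to StringContext makes the notation compile. Column here is a
// stand-in type, not Spark's.
case class Column(name: String)

object Interp {
  implicit class StringToColumn(val sc: StringContext) extends AnyVal {
    // sc.s(...) performs ordinary s-interpolation on the parts.
    def $(args: Any*): Column = Column(sc.s(args: _*))
  }
}

object Demo {
  import Interp._

  def main(args: Array[String]): Unit = {
    val c = $"uid"      // same mechanism Spark uses for $"uid"
    println(c.name)     // prints "uid"
  }
}
```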
>>>
>>> Pedro
>>>
>>> On Fri, Jun 17, 2016 at 8:53 PM, Takeshi Yamamuro <linguin....@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> In 2.0, you can say:
>>>>
>>>> val ds = Seq[Tuple2[Int, Int]]((1, 0), (2, 0)).toDS
>>>> ds.groupBy($"_1").count.select($"_1", $"count").show
>>>>
>>>> // maropu
>>>>
>>>> On Sat, Jun 18, 2016 at 7:53 AM, Xinh Huynh <xinh.hu...@gmail.com> wrote:
>>>>
>>>>> Hi Pedro,
>>>>>
>>>>> In 1.6.1, you can do:
>>>>>
>>>>> ds.groupBy(_.uid).count().map(_._1)
>>>>> or
>>>>> ds.groupBy(_.uid).count().select($"value".as[String])
>>>>>
>>>>> It doesn't have the exact same syntax as for DataFrame:
>>>>> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset
>>>>>
>>>>> It might be different in 2.0.
>>>>>
>>>>> Xinh
>>>>>
>>>>> On Fri, Jun 17, 2016 at 3:33 PM, Pedro Rodriguez <ski.rodrig...@gmail.com> wrote:
>>>>>
>>>>>> Hi All,
>>>>>>
>>>>>> I am working on using Datasets in 1.6.1 and eventually 2.0 when it's released.
>>>>>>
>>>>>> I am running the aggregate code below, where the dataset's rows have a field uid:
>>>>>>
>>>>>> ds.groupBy(_.uid).count()
>>>>>> // res0: org.apache.spark.sql.Dataset[(String, Long)] = [_1: string, _2: bigint]
>>>>>>
>>>>>> This works as expected; however, attempts to run select statements after it fail:
>>>>>>
>>>>>> ds.groupBy(_.uid).count().select(_._1)
>>>>>> // error: missing parameter type for expanded function ((x$2) => x$2._1)
>>>>>> //        ds.groupBy(_.uid).count().select(_._1)
>>>>>>
>>>>>> I have tried several variants, but nothing seems to work. Below is the equivalent DataFrame code, which works as expected:
>>>>>>
>>>>>> df.groupBy("uid").count().select("uid")
>>>>>>
>>>>>> Thanks!
>>>>>> --
>>>>>> Pedro Rodriguez
>>>>>> PhD Student in Distributed Machine Learning | CU Boulder
>>>>>> UC Berkeley AMPLab Alumni
>>>>>>
>>>>>> ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423
>>>>>> Github: github.com/EntilZha | LinkedIn: https://www.linkedin.com/in/pedrorodriguezscience
>>>>
>>>> --
>>>> ---
>>>> Takeshi Yamamuro

--
Pedro Rodriguez
PhD Student in Distributed Machine Learning | CU Boulder
UC Berkeley AMPLab Alumni

ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423
Github: github.com/EntilZha | LinkedIn: https://www.linkedin.com/in/pedrorodriguezscience
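(Editor's note: the "missing parameter type" error at the start of this thread is ordinary Scala type inference rather than a Spark bug. `Dataset.select` takes `Column`/`TypedColumn` arguments, not a function, so in `select(_._1)` the lambda's parameter type has nothing to be inferred from; `map`, whose parameter is declared as a function type, infers it fine. A Spark-free sketch of the same inference rule, with illustrative names:)

```scala
object InferenceDemo {
  // `transform` declares a function parameter, so in `transform(xs)(_._1)`
  // the lambda's parameter type is fixed by List[(Int, String)] and the
  // shorthand `_._1` compiles.
  def transform[A, B](xs: List[A])(f: A => B): List[B] = xs.map(f)

  val firsts: List[Int] = transform(List((1, "a"), (2, "b")))(_._1)

  // By contrast, a method whose parameter is not a function type (like
  // Dataset.select(cols: Column*)) gives `_._1` no expected function type,
  // so the compiler reports "missing parameter type".

  def main(args: Array[String]): Unit = {
    println(firsts)   // prints List(1, 2)
  }
}
```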