[Spark Structured Streaming] How to select events by latest timestamp and aggregate count

2017-10-08 Thread Li Zuwei
I would like to perform a Structured Streaming aggregation over a windowing period, given the following data schema. The objective is to keep only the latest event for each user, then aggregate the count of each event type for each location. time location user type 1 A
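The two steps described in this post (keep each user's latest event, then count event types per location) can be pinned down with a small plain-Python sketch before translating them to a streaming query. This is only a reference for the intended semantics on a finite batch, not a Structured Streaming implementation; the function name `latest_event_counts` and the sample rows are hypothetical, since the data in the original post is truncated.

```python
from collections import Counter


def latest_event_counts(events):
    """Count event types per location, using only each user's latest event.

    `events` is an iterable of (time, location, user, event_type) tuples.
    """
    latest = {}
    for time, location, user, event_type in events:
        # Keep only the most recent event seen so far for each user.
        if user not in latest or time > latest[user][0]:
            latest[user] = (time, location, event_type)
    # Count the surviving events per (location, event_type) pair.
    return Counter((loc, typ) for _, loc, typ in latest.values())


# Hypothetical rows, not the data from the original post.
sample = [
    (1, "A", "u1", "enter"),
    (2, "A", "u1", "exit"),
    (1, "B", "u2", "enter"),
    (3, "A", "u3", "exit"),
]
counts = latest_event_counts(sample)
```

In a streaming job the same logic would additionally need a window and a watermark to bound how long per-user state is kept; that part is not covered by this sketch.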

EMR: Use extra mounted EBS volumes for spark.local.dir

2017-10-08 Thread Tushar Sudake
Hello everyone, I'm using 'r4.8xlarge' instances on EMR for my Spark application. To each node, I'm attaching one 512 GB EBS volume. By logging in to the nodes, I tried to verify that EMR automatically sets this volume for 'spark.local.dir', but couldn't find any such configuration. Can

Re: Implement Dataset reader from SEQ file with protobuf to Dataset

2017-10-08 Thread Michael Armbrust
spark-avro would be a good example to start with. On Sun, Oct 8, 2017 at 3:00 AM, Serega Sheypak wrote: > Hi, did anyone try to implement Spark SQL dataset reader from SEQ file > with protobuf inside to Dataset? > > Imagine I

Re: How to convert Array of Json rows into Dataset of specific columns in Spark 2.2.0?

2017-10-08 Thread kant kodali
I have the following so far private StructType getSchema() { return new StructType() .add("name", StringType) .add("address", StringType) .add("docs", StringType); } ds.select(explode_outer(from_json(ds.col("value"),

Re: Quick one... AWS SDK version?

2017-10-08 Thread Jonathan Kelly
Tushar, Yes, the hadoop-aws jar installed on an emr-5.8.0 cluster was built with AWS Java SDK 1.11.160, if that’s what you mean. ~ Jonathan On Sun, Oct 8, 2017 at 8:42 AM Tushar Sudake wrote: > Hi Jonathan, > > Does that mean Hadoop-AWS 2.7.3 too is built against AWS SDK

Re: Quick one... AWS SDK version?

2017-10-08 Thread Tushar Sudake
Hi Jonathan, Does that mean Hadoop-AWS 2.7.3 too is built against AWS SDK 1.11.160 and not 1.7.4? Thanks. On Oct 7, 2017 3:50 PM, "Jean Georges Perrin" wrote: Hey Marco, I am actually reading from S3 and I use 2.7.3, but I inherited the project and they use some AWS API from

Equivalent of Redshift ListAgg function in Spark (PySpark)

2017-10-08 Thread Somasundaram Sekar
Hi, I want to concatenate multiple column values into a single column after grouping the DataFrame. I want a functional equivalent of the Redshift ListAgg function pg_catalog.Listagg(column, '|') within GROUP( ORDER BY column) AS name LISTAGG Function : For each group in a query, the
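The behaviour being asked for (per group, order the values and join them with a delimiter) can be stated as a small plain-Python sketch. It only illustrates the expected result, not a Spark implementation; the helper name `listagg`, the column names, and the sample rows are hypothetical, and skipping `None` values is an assumption of this sketch. In PySpark, the usual building blocks for the same result are `collect_list`, `sort_array`, and `concat_ws` from `pyspark.sql.functions`.

```python
from collections import defaultdict


def listagg(rows, key, value, sep="|"):
    """Group `rows` (dicts) by `key` and join the sorted `value`s with `sep`."""
    groups = defaultdict(list)
    for row in rows:
        # Assumption of this sketch: None values are skipped.
        if row[value] is not None:
            groups[row[key]].append(row[value])
    # Sort within each group, mirroring the ORDER BY inside WITHIN GROUP.
    return {k: sep.join(str(v) for v in sorted(vs)) for k, vs in groups.items()}


# Hypothetical rows for illustration only.
rows = [
    {"dept": "x", "name": "carol"},
    {"dept": "x", "name": "alice"},
    {"dept": "y", "name": "bob"},
    {"dept": "x", "name": None},
]
result = listagg(rows, key="dept", value="name")
```

Here `result` maps each group key to one delimited string, which corresponds to one output row per group in the SQL form.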

Implement Dataset reader from SEQ file with protobuf to Dataset

2017-10-08 Thread Serega Sheypak
Hi, did anyone try to implement Spark SQL dataset reader from SEQ file with protobuf inside to Dataset? Imagine I have protobuf def Person - name: String - lastName: String - phones: List[String] and generated scala case class: case class Person(name:String, lastName: String, phones: