custom joins on dataframe

2017-07-22 Thread Stephen Fletcher
Normally a family of joins (left, right outer, inner) is performed on two dataframes using columns for the comparison, i.e. left("acol") === right("acol"). The comparison operator of the "left" dataframe does something internally and produces a column that I assume is used by the join. What I want
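The mechanism the post describes can be sketched as follows. This is a minimal illustration, not the poster's code: the DataFrames `left` and `right` and their columns are hypothetical, and it assumes a local Spark 2.x session. The point is that `===` on a `Column` builds an expression `Column` (wrapping `EqualTo`), and `join` accepts any boolean `Column` as its condition, not just an equality.

```scala
// Sketch, assuming Spark 2.x on the classpath and two hypothetical
// DataFrames that both carry an "acol" column.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.abs

val spark = SparkSession.builder().master("local[*]").appName("joins").getOrCreate()
import spark.implicits._

val left  = Seq((1, "a"), (2, "b")).toDF("acol", "l")
val right = Seq((1, "x"), (3, "y")).toDF("acol", "r")

// Standard equi-join: the === comparison yields an expression Column.
val equi = left.join(right, left("acol") === right("acol"), "inner")

// A "custom" join is just a different boolean expression Column.
val custom = left.join(right, abs(left("acol") - right("acol")) <= 1, "left_outer")
```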

KTable like functionality in structured streaming

2017-05-16 Thread Stephen Fletcher
Are there any plans to add Kafka Streams KTable-like functionality in structured streaming for Kafka sources? Allowing querying keyed messages using Spark SQL, maybe calling KTables in the backend

Re: Spark books

2017-05-03 Thread Stephen Fletcher
Zeming, Jacek also has a really good online Spark book for Spark 2, "Mastering Apache Spark". I found it very helpful when trying to understand Spark 2's encoders. His book is here: https://www.gitbook.com/book/jaceklaskowski/mastering-apache-spark/details On Wed, May 3, 2017 at 8:16 PM, Neelesh Salia

Contributed to spark

2017-04-07 Thread Stephen Fletcher
I'd like to eventually contribute to Spark, but I'm noticing since Spark 2 the query planner is heavily used throughout the Dataset code base. Are there any sites I can go to that explain the technical details, more than just from a high-level perspective

reducebykey

2017-04-07 Thread Stephen Fletcher
Are there plans to add reduceByKey to dataframes? Since switching over to Spark 2 I find myself increasingly dissatisfied with the idea of converting dataframes to RDDs to do procedural programming on grouped data (both from an ease-of-programming stance and a performance stance). So I've been using Dataf
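A reduceByKey-style aggregation is already expressible in the Spark 2 Dataset API without dropping to RDDs, via `groupByKey` followed by `reduceGroups`. A minimal sketch, assuming a local Spark 2.x session; the sample data is illustrative:

```scala
// Sketch: reduceByKey semantics on a Dataset (Spark 2.x).
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("reduce").getOrCreate()
import spark.implicits._

val ds = Seq(("a", 1), ("b", 2), ("a", 3)).toDS()

// Equivalent in spirit to rdd.reduceByKey(_ + _):
val reduced = ds
  .groupByKey(_._1)                              // key by the first tuple element
  .reduceGroups((x, y) => (x._1, x._2 + y._2))   // pairwise reduce within each key
  .map { case (_, kv) => kv }                    // drop the duplicated key column
// expected contents: ("a", 4) and ("b", 2)
```

Unlike `df.groupBy(...).agg(...)`, this keeps the typed, procedural style the post asks for, at the cost of going through the (slower) object serializer rather than Catalyst expressions.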

Re: attempting to map Dataset[Row]

2017-02-26 Thread Stephen Fletcher
fer[Row]() buff += row (key,buff) } } ... On Sun, Feb 26, 2017 at 7:31 AM, Stephen Fletcher < stephen.fletc...@gmail.com> wrote: > I'm attempting to perform a map on a Dataset[Row] but getting an error on > decode when attempting to pass a custom encoder. > M

attempting to map Dataset[Row]

2017-02-26 Thread Stephen Fletcher
I'm attempting to perform a map on a Dataset[Row] but getting an error on decode when attempting to pass a custom encoder. My code looks similar to the following: val source = spark.read.format("parquet").load("/emrdata/sources/very_large_ds") source.map{ row => { val key = row(0) }

DataFrame equivalent to RDD.partitionByKey

2016-08-09 Thread Stephen Fletcher
Is there a DataFrameReader equivalent to the RDD's partitionByKey? I'm reading data from a file data source, and I want the data I'm reading in to be partitioned the same way as the data I'm processing through a Spark streaming RDD in the process.
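The closest DataFrame-side equivalents can be sketched as follows. This is illustrative, not the poster's setup: the input/output paths and the "key" column are hypothetical, and it assumes a local Spark 2.x session. `repartition(n, col)` hash-partitions rows by a column in memory (analogous to `partitionBy` on a pair RDD), while `DataFrameWriter.partitionBy` controls the on-disk directory layout.

```scala
// Sketch: key-based partitioning for DataFrames (Spark 2.x).
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[*]").appName("partition").getOrCreate()

val df = spark.read.parquet("/some/input/path")   // hypothetical path

// Hash-partition in memory by a key column, into 200 partitions:
val byKey = df.repartition(200, col("key"))       // "key" is a hypothetical column

// Or partition the files on disk by the same column when writing:
byKey.write.partitionBy("key").parquet("/some/output/path")
```

Note there is no reader-side option to impose a partitioner while loading; partitioning is applied after the read (or baked into the layout by a previous partitioned write).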