custom joins on dataframe

2017-07-22 Thread Stephen Fletcher
Normally a family of joins (left, right outer, inner) is performed on two
dataframes using columns for the comparison, i.e. left("acol") ===
right("acol"). The comparison operator of the "left" dataframe does
something internally and produces a column that I assume is used by the
join.

What I want is to create my own comparison operation (I have a case where I
want to use some fuzzy matching between rows, and if they fall within some
threshold we allow the join to happen),

so it would look something like

left.join(right, my_fuzzy_udf(left("cola"), right("cola")))

where my_fuzzy_udf is my defined UDF. My main concern is the column that
would have to be output: what would its value be, i.e. what would the
function need to return that the UDF subsystem would then turn into a column
to be evaluated by the join?
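
For illustration, here is a minimal sketch of what this could look like,
assuming the fuzzy comparison can be expressed as a UDF that returns Boolean
(the join condition is just a Boolean-typed Column, so a Boolean-returning
UDF applied to the two columns should be usable directly as the join
expression). The similarity function and threshold below are hypothetical
placeholders, and left/right are the author's dataframes:

import org.apache.spark.sql.functions.udf

// Hypothetical similarity measure; substitute a real fuzzy metric here.
def similarity(a: String, b: String): Double =
  if (a == b) 1.0 else 0.0

val threshold = 0.8

// Because the UDF returns Boolean, the resulting Column can serve as a join condition.
val myFuzzyUdf = udf((a: String, b: String) => similarity(a, b) >= threshold)

val joined = left.join(right, myFuzzyUdf(left("cola"), right("cola")))

Note that a condition like this is opaque to the optimizer, so Spark will
generally fall back to a broadcast nested loop or cartesian-style join rather
than a hash join, which can be expensive on large inputs.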


Thanks in advance for any advice


KTable like functionality in structured streaming

2017-05-16 Thread Stephen Fletcher
Are there any plans to add Kafka Streams KTable-like functionality to
structured streaming for Kafka sources, allowing keyed messages to be queried
with Spark SQL, maybe by calling KTables in the back end?
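
As a rough illustration (not an answer to the roadmap question): something
loosely resembling KTable semantics, i.e. keeping the latest value per key,
can be approximated today with a stateful operation, assuming a Spark version
that has mapGroupsWithState (2.2+). The broker, topic, and KV case class
below are hypothetical:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.GroupState

case class KV(key: String, value: String)

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

val messages = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:9092") // hypothetical broker
  .option("subscribe", "topic")                   // hypothetical topic
  .load()
  .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")
  .as[KV]

// Keep the most recent value seen for each key, similar in spirit to a KTable.
val latestPerKey = messages
  .groupByKey(_.key)
  .mapGroupsWithState[String, KV] { (key, rows, state: GroupState[String]) =>
    val latest = rows.map(_.value).toSeq.lastOption.orElse(state.getOption).getOrElse("")
    state.update(latest)
    KV(key, latest)
  }

The resulting stream would be written in update output mode; querying it with
Spark SQL would still need a sink (for example a memory table), which is where
this falls short of a true KTable.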


Re: Spark books

2017-05-03 Thread Stephen Fletcher
Zeming,

Jacek also has a really good online Spark book for Spark 2, "Mastering
Apache Spark". I found it very helpful when trying to understand Spark 2's
encoders.

His book is here:
https://www.gitbook.com/book/jaceklaskowski/mastering-apache-spark/details


On Wed, May 3, 2017 at 8:16 PM, Neelesh Salian wrote:

> The Apache Spark documentation is good to begin with.
> All the programming guides, particularly.
>
>
> On Wed, May 3, 2017 at 5:07 PM, ayan guha  wrote:
>
>> I would suggest do not buy any book, just start with databricks community
>> edition
>>
>> On Thu, May 4, 2017 at 9:30 AM, Tobi Bosede  wrote:
>>
>>> Well that is the nature of technology, ever evolving. There will always
>>> be new concepts. If you're trying to get started ASAP and the internet
>>> isn't enough, I'd recommend buying a book and using Spark 1.6. A lot of
>>> production stacks are still on that version and the knowledge from
>>> mastering 1.6 is transferable to 2+. I think that beats waiting forever.
>>>
>>> On Wed, May 3, 2017 at 6:35 PM, Zeming Yu  wrote:
>>>
 I'm trying to decide whether to buy the book Learning Spark, Spark for
 machine learning, etc., or wait for a new edition covering the new concepts
 like DataFrames and Datasets. Anyone got any suggestions?

>>
>> --
>> Best Regards,
>> Ayan Guha
>
> --
> Regards,
> Neelesh S. Salian


Contributing to spark

2017-04-07 Thread Stephen Fletcher
I'd like to eventually contribute to Spark, but I'm noticing that since Spark 2
the query planner is heavily used throughout the Dataset code base. Are there
any sites I can go to that explain the technical details, more than just
from a high-level perspective?
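
One hands-on way to see how the planner is wired into the Dataset API (a
sketch, not a substitute for design documentation, assuming an existing
SparkSession named spark as in the other threads) is to inspect the plans
Spark produces for a query; Dataset.explain and the queryExecution field
expose the parsed, analyzed, optimized, and physical plans that Catalyst
builds:

val df = spark.range(10).selectExpr("id", "id % 3 AS k").groupBy("k").count()

// Print the logical and physical plans in one go.
df.explain(extended = true)

// The same plans are available programmatically.
println(df.queryExecution.logical)
println(df.queryExecution.optimizedPlan)
println(df.queryExecution.executedPlan)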


reducebykey

2017-04-07 Thread Stephen Fletcher
Are there plans to add reduceByKey to DataFrames? Since switching over to
Spark 2 I find myself increasingly dissatisfied with the idea of converting
DataFrames to RDDs to do procedural programming on grouped data (from both an
ease-of-programming stance and a performance stance). So I've been using the
Dataset's experimental groupByKey and flatMapGroups, which perform
extremely well, I'm guessing because of the encoders, but the amount of
data being transferred is a little excessive. Are there any plans to port
reduceByKey (and additionally a reduceByKeyLeft and reduceByKeyRight)?
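
As a possibly relevant aside: the typed API already exposes something close to
this. KeyValueGroupedDataset.reduceGroups merges values per key with a binary
function, much like reduceByKey on an RDD. A minimal sketch, with a
hypothetical Event case class:

import org.apache.spark.sql.{Dataset, SparkSession}

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

case class Event(key: String, amount: Long)

val events: Dataset[Event] =
  Seq(Event("a", 1L), Event("a", 2L), Event("b", 5L)).toDS()

// reduceGroups combines the values for each key pairwise, so only the
// reduced value per key has to be produced downstream.
val totals: Dataset[(String, Event)] =
  events
    .groupByKey(_.key)
    .reduceGroups((x, y) => Event(x.key, x.amount + y.amount))

Whether this cuts down data movement the way RDD reduceByKey's map-side
combine does depends on how the aggregation gets planned, but it at least
stays in the Dataset API.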


Re: attempting to map Dataset[Row]

2017-02-26 Thread Stephen Fletcher
Sorry, here's the whole code:

import scala.collection.mutable.ArrayBuffer
import org.apache.spark.sql.Row

val source =
  spark.read.format("parquet").load("/emrdata/sources/very_large_ds")

implicit val mapEncoder =
  org.apache.spark.sql.Encoders.kryo[(Any, ArrayBuffer[Row])]

source.map { row =>
  val key = row(0)
  val buff = new ArrayBuffer[Row]()
  buff += row
  (key, buff)
}

...
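
If the implicit isn't being picked up, one thing worth trying (only a sketch,
not a confirmed fix for the decode error) is passing the Kryo encoder to map
explicitly, since Dataset.map takes the encoder through an implicit second
parameter list:

val mapped = source.map { row =>
  val key = row(0)
  (key, ArrayBuffer(row))
}(mapEncoder)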

On Sun, Feb 26, 2017 at 7:31 AM, Stephen Fletcher <
stephen.fletc...@gmail.com> wrote:

> I'm attempting to perform a map on a Dataset[Row] but getting an error on
> decode when attempting to pass a custom encoder.
>  My code looks similar to the following:
>
>
> val source = spark.read.format("parquet").load("/emrdata/sources/very_
> large_ds")
>
>
>
> source.map{ row => {
>   val key = row(0)
>
>}
> }
>


attempting to map Dataset[Row]

2017-02-26 Thread Stephen Fletcher
I'm attempting to perform a map on a Dataset[Row] but getting an error on
decode when attempting to pass a custom encoder.
 My code looks similar to the following:


val source =
spark.read.format("parquet").load("/emrdata/sources/very_large_ds")



source.map{ row => {
  val key = row(0)

   }
}


DataFrame equivalent to RDD.partitionByKey

2016-08-09 Thread Stephen Fletcher
Is there a DataFrameReader equivalent to the RDD's partitionByKey? I'm
reading data from a file data source, and I want the data I'm reading in to
be partitioned the same way as the data I'm processing through a Spark
Streaming RDD in the same process.
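
As a hedged aside (RDDs expose partitionBy on pair RDDs rather than a
partitionByKey, and DataFrameReader itself has no partitioning hook that I'm
aware of), the closest DataFrame-side equivalent is to repartition by the key
column right after the read. The path, column name, and partition count below
are hypothetical, and an active SparkSession named spark is assumed:

import org.apache.spark.sql.functions.col

val df = spark.read.parquet("/data/source") // hypothetical path

// Hash-partition the DataFrame on the key column so its layout roughly
// matches an RDD partitioned with a HashPartitioner on the same key.
val byKey = df.repartition(200, col("key"))

Note that this gives hash partitioning on the expression; it does not
guarantee co-partitioning with a separately built streaming RDD, so a join
between the two may still shuffle.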