Custom joins on DataFrames
Normally, a family of joins (left, right outer, inner) is performed on two DataFrames using columns for the comparison, e.g. left("acol") === right("acol"). The comparison operator on the "left" DataFrame does something internally and produces a Column that I assume is used by the join. What I want is to create my own comparison operation (I have a case where I want to use fuzzy matching between rows, and if they fall within some threshold we allow the join to happen), so it would look something like left.join(right, my_fuzzy_udf(left("cola"), right("cola"))), where my_fuzzy_udf is my defined UDF. My main concern is the column that would have to be output: what would its value be, i.e. what would the function need to return that the UDF subsystem would then turn into a Column to be evaluated by the join? Thanks in advance for any advice.
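For what it's worth, a join condition is just a Column of BooleanType, so a UDF that returns Boolean can be used directly as the join expression; Spark evaluates it per candidate row pair. A minimal sketch, assuming two DataFrames named left and right that each have a string column "cola" (the threshold of 2 is arbitrary):

import org.apache.spark.sql.functions.{levenshtein, udf}

// Option 1: no UDF needed if edit distance is a good enough fuzzy measure.
val joined1 = left.join(right, levenshtein(left("cola"), right("cola")) <= 2)

// Option 2: a custom matcher as a UDF. The function just returns Boolean and
// the udf() wrapper turns it into a Column the join can evaluate.
val myFuzzyUdf = udf { (a: String, b: String) =>
  a != null && b != null && math.abs(a.length - b.length) <= 2 // toy fuzzy check
}
val joined2 = left.join(right, myFuzzyUdf(left("cola"), right("cola")))

One caveat: an opaque UDF condition can't be used as an equi-join key, so Spark falls back to comparing every row pair (a nested-loop/cartesian-style plan), which gets expensive on large inputs.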
KTable-like functionality in Structured Streaming
Are there any plans to add Kafka Streams KTable-like functionality to Structured Streaming for Kafka sources? This would allow querying keyed messages using Spark SQL, perhaps by calling KTables on the backend.
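In the meantime, one rough approximation is a streaming aggregation that keeps the latest value per key, exposed through the in-memory sink for ad-hoc SQL. This is only a sketch under assumptions (topic name "events", string keys and values); last() is order-sensitive and the memory sink is meant for testing, so it is nowhere near real KTable semantics:

import org.apache.spark.sql.functions.last

val latestPerKey = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "events")
  .load()
  .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")
  .groupBy("key")
  .agg(last("value").as("value"))

latestPerKey.writeStream
  .outputMode("complete")
  .format("memory")
  .queryName("ktable_view")
  .start()

// spark.sql("SELECT value FROM ktable_view WHERE key = 'some-key'")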
Re: Spark books
Zeming, Jacek also has a really good online Spark book for Spark 2, "Mastering Apache Spark". I found it very helpful when trying to understand Spark 2's encoders. His book is here: https://www.gitbook.com/book/jaceklaskowski/mastering-apache-spark/details

On Wed, May 3, 2017 at 8:16 PM, Neelesh Salian wrote:
> The Apache Spark documentation is good to begin with.
> All the programming guides, particularly.
>
> On Wed, May 3, 2017 at 5:07 PM, ayan guha wrote:
>> I would suggest you do not buy any book; just start with the Databricks Community
>> Edition.
>>
>> On Thu, May 4, 2017 at 9:30 AM, Tobi Bosede wrote:
>>> Well, that is the nature of technology: ever evolving. There will always
>>> be new concepts. If you're trying to get started ASAP and the internet
>>> isn't enough, I'd recommend buying a book and using Spark 1.6. A lot of
>>> production stacks are still on that version, and the knowledge from
>>> mastering 1.6 is transferable to 2+. I think that beats waiting forever.
>>>
>>> On Wed, May 3, 2017 at 6:35 PM, Zeming Yu wrote:
>>> I'm trying to decide whether to buy the book Learning Spark, Spark for
>>> Machine Learning, etc., or wait for a new edition covering the new concepts
>>> like DataFrames and Datasets. Anyone got any suggestions?
>>
>> --
>> Best Regards,
>> Ayan Guha
>
> --
> Regards,
> Neelesh S. Salian
Contributing to Spark
I'd like to eventually contribute to Spark, but I've noticed that since Spark 2 the query planner is heavily used throughout the Dataset code base. Are there any sites I can go to that explain the technical details, beyond just a high-level perspective?
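One low-risk way to start digging in is to look at the plans Catalyst produces for a Dataset from the shell; both calls below are standard public API, and stepping from the printed plan nodes into the source is a decent map of the code base:

import spark.implicits._

val ds = spark.range(100).filter($"id" % 2 === 0)

// Prints the parsed, analyzed, and optimized logical plans plus the physical plan.
ds.explain(true)

// The QueryExecution behind every Dataset exposes the same stages programmatically.
println(ds.queryExecution.optimizedPlan)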
reduceByKey
Are there plans to add reduceByKey to DataFrames? Since switching over to Spark 2 I find myself increasingly dissatisfied with the idea of converting DataFrames to RDDs to do procedural programming on grouped data (from both an ease-of-programming stance and a performance stance). So I've been using the Dataset's experimental groupByKey and flatMapGroups, which perform extremely well, I'm guessing because of the encoders, but the amount of data being transferred is a little excessive. Are there any plans to port reduceByKey (and additionally a reduceByKeyLeft and reduceByKeyRight)?
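For what it's worth, the Dataset API already has a reduceByKey-shaped operation: groupByKey followed by reduceGroups, which runs through an Aggregator under the hood and so can reduce partially on the map side instead of shipping whole groups to the function the way flatMapGroups does. A sketch with a hypothetical Purchase dataset:

import spark.implicits._

case class Purchase(user: String, amount: Double)

val purchases = Seq(
  Purchase("a", 1.0), Purchase("a", 2.5), Purchase("b", 4.0)
).toDS()

// Dataset[(String, Purchase)]: one reduced value per key.
val totals = purchases
  .groupByKey(_.user)
  .reduceGroups((x, y) => Purchase(x.user, x.amount + y.amount))

totals.show()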
Re: attempting to map Dataset[Row]
Sorry, here's the whole code:

import scala.collection.mutable.ArrayBuffer
import org.apache.spark.sql.Row

val source = spark.read.format("parquet").load("/emrdata/sources/very_large_ds")

implicit val mapEncoder = org.apache.spark.sql.Encoders.kryo[(Any, ArrayBuffer[Row])]

source.map { row =>
  val key = row(0)
  val buff = new ArrayBuffer[Row]()
  buff += row
  (key, buff)
}
...

On Sun, Feb 26, 2017 at 7:31 AM, Stephen Fletcher <stephen.fletc...@gmail.com> wrote:
> I'm attempting to perform a map on a Dataset[Row] but getting an error on
> decode when attempting to pass a custom encoder.
> My code looks similar to the following:
>
> val source = spark.read.format("parquet").load("/emrdata/sources/very_large_ds")
>
> source.map { row =>
>   val key = row(0)
> }
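A guess at the decode error, for what it's worth: the tuple's first element is typed Any, so the encoder has no concrete class to decode back to. Giving the result a concrete type may be enough. A sketch, assuming the first column is a String:

import scala.collection.mutable.ArrayBuffer
import org.apache.spark.sql.{Encoders, Row}
import spark.implicits._ // supplies the String key encoder used below

implicit val enc = Encoders.kryo[(String, ArrayBuffer[Row])]

val keyed = source.map { row =>
  (row.getString(0), ArrayBuffer(row)) // concrete key type instead of Any
}

Alternatively, if the end goal is to bucket rows by key, groupByKey avoids both the hand-built buffer and the custom encoder for the grouping step:

val grouped = source.groupByKey(row => row.getString(0)) // then mapGroups/flatMapGroups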
attempting to map Dataset[Row]
I'm attempting to perform a map on a Dataset[Row] but getting an error on decode when attempting to pass a custom encoder. My code looks similar to the following:

val source = spark.read.format("parquet").load("/emrdata/sources/very_large_ds")

source.map { row =>
  val key = row(0)
}
DataFrame equivalent to RDD.partitionBy
Is there a DataFrameReader equivalent to the RDD's partitionBy? I'm reading data from a file data source, and I want the data I'm reading in to be partitioned the same way as the data I'm processing through a Spark Streaming RDD elsewhere in the process.
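As far as I know, DataFrameReader itself has no partitioner hook; the closest equivalent is to repartition by the key column immediately after the read, which hash-partitions rows the way a HashPartitioner would on a pair RDD. A sketch with a hypothetical path, column name, and partition count:

import spark.implicits._

val df = spark.read.parquet("/emrdata/sources/some_ds")
  .repartition(48, $"key") // 48 partitions, rows hashed by the "key" column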