Normally a family of joins (left, right outer, inner) is performed on two
dataframes using columns for the comparison, i.e. left("acol") ===
right("acol"). The comparison operator on the "left" dataframe does
something internally and produces a Column that I assume is used by the
join.
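A minimal sketch of that pattern (the dataframe contents and the "acol" join key below are illustrative, not from the thread):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("join-sketch").getOrCreate()
import spark.implicits._

// Two small dataframes sharing a join column.
val left  = Seq((1, "l1"), (2, "l2")).toDF("acol", "lval")
val right = Seq((2, "r2"), (3, "r3")).toDF("acol", "rval")

// left("acol") === right("acol") builds a Column expression; the join
// operator hands that expression to the planner as the join condition.
val inner  = left.join(right, left("acol") === right("acol"), "inner")
val louter = left.join(right, left("acol") === right("acol"), "left_outer")

inner.show()
```

The same Column expression works for every join type; only the join-type string changes.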
What I want
Are there any plans to add Kafka Streams KTable-like functionality in
structured streaming for Kafka sources? That would allow querying keyed
messages using Spark SQL, maybe calling KTables in the backend.
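In the meantime, the KTable idea (a running "latest value per key" view) can be approximated with a plain Spark SQL aggregation. The sketch below uses a static DataFrame standing in for Kafka's key/value/offset columns; the same aggregation applies to a streaming DataFrame read from a Kafka source (all names and data are illustrative):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Stand-ins for Kafka's key/value/offset columns.
val msgs = Seq(("k1", "a", 1L), ("k2", "b", 2L), ("k1", "c", 3L))
  .toDF("key", "value", "offset")

// Keep the newest value per key: max of (offset, value) orders by offset
// first, so the struct with the highest offset wins per key.
val latest = msgs
  .groupBy($"key")
  .agg(max(struct($"offset", $"value")).as("newest"))
  .select($"key", $"newest.value".as("value"))
```

On a streaming source this would run under outputMode("complete") with a sink (e.g. a memory table) that can then be queried with Spark SQL.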
Zeming,
Jacek also has a really good online Spark book for Spark 2, "Mastering
Apache Spark". I found it very helpful when trying to understand Spark 2's
encoders.
His book is here:
https://www.gitbook.com/book/jaceklaskowski/mastering-apache-spark/details
On Wed, May 3, 2017 at 8:16 PM, Neelesh Salia
I'd like to eventually contribute to Spark, but I'm noticing that since Spark 2
the query planner is heavily used throughout the Dataset code base. Are there
any sites I can go to that explain the technical details, beyond just
a high-level perspective?
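Short of such a site, the planner's stages can be inspected directly from any Dataset, which helps when reading the code base. A small sketch (the query itself is illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq((1, "a"), (2, "b")).toDF("id", "name").filter($"id" > 1)

// Print the parsed, analyzed, optimized, and physical plans.
df.explain(true)

// The same stages are also reachable programmatically via QueryExecution,
// which is the object the Dataset code base builds around.
val qe = df.queryExecution
println(qe.optimizedPlan)
```

Stepping through `queryExecution` in a debugger is a practical way to see how Catalyst drives the Dataset API.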
Are there plans to add reduceByKey to dataframes? Since switching over to
Spark 2 I find myself increasingly dissatisfied with the idea of converting
dataframes to RDDs to do procedural programming on grouped data (from both an
ease-of-programming stance and a performance stance). So I've been using
DataFrame code along these lines, collecting each group's rows into a buffer:

  val buff = ArrayBuffer[Row]()
  buff += row
  (key, buff)
  }
}
...
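For what it's worth, the Dataset API's closest analogue to reduceByKey is groupByKey followed by reduceGroups (or mapGroups), which avoids the round-trip through RDDs. A sketch with illustrative data:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val ds = Seq(("a", 1), ("b", 2), ("a", 3)).toDS()

// reduceByKey-style aggregation without dropping to RDDs:
// group by the key, then reduce each group pairwise.
val summed = ds
  .groupByKey(_._1)
  .reduceGroups((x, y) => (x._1, x._2 + y._2))
  .map { case (key, (_, total)) => (key, total) }
```

reduceGroups keeps the computation inside the typed Dataset API, so the planner can still optimize around it, unlike an opaque RDD conversion.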
On Sun, Feb 26, 2017 at 7:31 AM, Stephen Fletcher <
stephen.fletc...@gmail.com> wrote:
I'm attempting to perform a map on a Dataset[Row] but getting an error on
decode when attempting to pass a custom encoder.
My code looks similar to the following:
val source = spark.read.format("parquet")
  .load("/emrdata/sources/very_large_ds")

source.map { row =>
  val key = row(0)
  ...
}
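Two ways around that error, sketched with an in-memory DataFrame in place of the parquet source (names are illustrative): map to a typed value so spark.implicits supplies the encoder, or pass an explicit RowEncoder built from the source schema (Spark 2.x API):

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.catalyst.encoders.RowEncoder

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Stand-in for the parquet source from the thread.
val source = Seq((1, "a"), (2, "b")).toDF("id", "name")

// Option 1: map to a typed value; spark.implicits provides the Int encoder.
val keys = source.map(row => row.getInt(0))

// Option 2: stay with Row by supplying an encoder derived from the schema.
implicit val rowEnc = RowEncoder(source.schema)
val rows = source.map(row => row)
```

Option 1 is usually preferable: mapping to a tuple or case class gives the planner a known schema without a hand-built encoder.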
Is there a DataFrameReader equivalent to the RDD's partitionByKey? I'm
reading data from a file data source and I want the data I'm reading in
to be partitioned the same way as the data I'm processing through a
Spark Streaming RDD.
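There is no such option on DataFrameReader itself; one workaround is to repartition by the key column right after load, which is analogous to partitionBy with a HashPartitioner on a pair RDD. The column name and partition count below are illustrative:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Stand-in for the file source; any loaded DataFrame works the same way.
val df = Seq((1, "a"), (2, "b"), (1, "c")).toDF("key", "value")

// Hash-partition rows by "key" into 8 partitions, so rows sharing a key
// land in the same partition -- the usual goal of partitionBy on pair RDDs.
val byKey = df.repartition(8, $"key")
```

To get the same placement as an existing RDD pipeline, the partition count (and keying column) must match what the streaming side uses.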