You can look at the PRs that migrate the built-in streaming data sources to the
V2 API:
https://github.com/apache/spark/pulls?utf8=%E2%9C%93&q=is%3Apr+migrate+in%3Atitle+is%3Aclosed+
On Thu, Mar 22, 2018 at 12:58 PM, Thakrar, Jayesh <
jthak...@conversantmedia.com> wrote:
> Thanks Wenchen - yes, I did
Thanks Wenchen - yes, I did refer to the Spark built-in sources as mentioned
earlier and have been using the Kafka streaming source as a reference example.
The built-in ones work and use internalCreateDataFrame - that's where I got
the idea of using that method to set "isStreaming" to true.
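For anyone following that approach, here is a minimal sketch of the call being
described. internalCreateDataFrame is private[sql] in Spark 2.3, so the file
has to live under the org.apache.spark.sql package; the signature below is
from 2.3 and may change, and StreamingFrames is just a made-up helper name:

  // Sketch only: must be compiled under the org.apache.spark.sql package
  // tree, because internalCreateDataFrame is private[sql].
  package org.apache.spark.sql

  import org.apache.spark.rdd.RDD
  import org.apache.spark.sql.catalyst.InternalRow
  import org.apache.spark.sql.types.StructType

  object StreamingFrames {
    // Wraps an RDD[InternalRow] in a DataFrame flagged with isStreaming = true,
    // the same trick the built-in sources use from getBatch.
    def streamingDataFrame(
        sqlContext: SQLContext,
        rows: RDD[InternalRow],
        schema: StructType): DataFrame =
      sqlContext.internalCreateDataFrame(rows, schema, isStreaming = true)
  }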
org.apache.spark.sql.execution.streaming.Source is for internal use only.
The official streaming data source API is the data source v2 API. You can take
a look at the Spark built-in streaming data sources as examples. Note: data
source v2 is still experimental; you may need to update your code in a new
release.
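For readers trying the v2 route, here is a rough sketch of a micro-batch
source against the experimental 2.3.0 interfaces. The interface names were
renamed in later releases, and the toy CounterSource and its offset logic are
invented for illustration; treat this as a sketch, not a reference:

  import java.util.Optional

  import org.apache.spark.sql.Row
  import org.apache.spark.sql.sources.v2.{DataSourceOptions, DataSourceV2, MicroBatchReadSupport}
  import org.apache.spark.sql.sources.v2.reader.{DataReader, DataReaderFactory}
  import org.apache.spark.sql.sources.v2.reader.streaming.{MicroBatchReader, Offset}
  import org.apache.spark.sql.types.{LongType, StructField, StructType}

  // Offsets in v2 are just JSON-serializable positions in the stream.
  case class CounterOffset(value: Long) extends Offset {
    override def json(): String = value.toString
  }

  // Entry point Spark instantiates; MicroBatchReadSupport marks the source
  // as usable with readStream.
  class CounterSource extends DataSourceV2 with MicroBatchReadSupport {
    override def createMicroBatchReader(
        schema: Optional[StructType],
        checkpointLocation: String,
        options: DataSourceOptions): MicroBatchReader = new CounterReader
  }

  class CounterReader extends MicroBatchReader {
    private var start = 0L
    private var end = 0L

    override def setOffsetRange(start: Optional[Offset], end: Optional[Offset]): Unit = {
      this.start = start.orElse(CounterOffset(0L)).asInstanceOf[CounterOffset].value
      // Each micro-batch advances by a fixed number of rows in this toy source.
      this.end = end.orElse(CounterOffset(this.start + 10L)).asInstanceOf[CounterOffset].value
    }
    override def getStartOffset(): Offset = CounterOffset(start)
    override def getEndOffset(): Offset = CounterOffset(end)
    override def deserializeOffset(json: String): Offset = CounterOffset(json.toLong)
    override def commit(end: Offset): Unit = ()
    override def stop(): Unit = ()

    override def readSchema(): StructType = StructType(StructField("value", LongType) :: Nil)
    override def createDataReaderFactories(): java.util.List[DataReaderFactory[Row]] =
      java.util.Arrays.asList[DataReaderFactory[Row]](new CounterFactory(start, end))
  }

  // Each factory is serialized to an executor and produces one partition's reader.
  class CounterFactory(from: Long, until: Long) extends DataReaderFactory[Row] {
    override def createDataReader(): DataReader[Row] = new DataReader[Row] {
      private var i = from - 1
      override def next(): Boolean = { i += 1; i < until }
      override def get(): Row = Row(i)
      override def close(): Unit = ()
    }
  }

A source like this should be loadable by class name, e.g.
spark.readStream.format("fully.qualified.CounterSource").load().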
Hi Ryan,
Thanks for the quick reply - I like the Iceberg approach and will keep an eye
on it.
Creating a custom batch/non-streaming data source is not difficult; the issue
I have is with creating a streaming data source.
Similar to a batch source, you need to implement a simple trait -
org.apache.spark.sql.execution.streaming.Source.
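For context, that trait looks roughly like this in Spark 2.3 (paraphrased from
the Spark source; it is internal to Spark and not a stable API):

  package org.apache.spark.sql.execution.streaming

  import org.apache.spark.sql.DataFrame
  import org.apache.spark.sql.types.StructType

  trait Source {
    def schema: StructType                                      // schema of the streamed data
    def getOffset: Option[Offset]                               // latest available offset, if any
    def getBatch(start: Option[Offset], end: Offset): DataFrame // data between two offsets
    def commit(end: Offset): Unit                               // data up to `end` has been processed
    def stop(): Unit                                            // release resources
  }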
Jayesh,
We're working on a new API for building sources, DataSourceV2. That API
allows you to produce UnsafeRow and we are very likely going to change that
to InternalRow (SPARK-23325). There's an experimental version in the latest
2.3.0 release if you'd like to try it out.
Here's an example implementation…
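To make the batch side of that API concrete, here is a minimal sketch against
the same 2.3.0 experimental interfaces (SimpleSource and its five-row output
are made up for illustration; interface names changed in later releases):

  import java.util.{Arrays => JArrays, List => JList}

  import org.apache.spark.sql.Row
  import org.apache.spark.sql.sources.v2.{DataSourceOptions, DataSourceV2, ReadSupport}
  import org.apache.spark.sql.sources.v2.reader.{DataReader, DataReaderFactory, DataSourceReader}
  import org.apache.spark.sql.types.{LongType, StructField, StructType}

  // Minimal batch DataSourceV2 that produces five rows of one long column.
  class SimpleSource extends DataSourceV2 with ReadSupport {
    override def createReader(options: DataSourceOptions): DataSourceReader =
      new DataSourceReader {
        override def readSchema(): StructType =
          StructType(StructField("id", LongType) :: Nil)
        override def createDataReaderFactories(): JList[DataReaderFactory[Row]] =
          JArrays.asList[DataReaderFactory[Row]](new RangeFactory(0L, 5L))
      }
  }

  // One factory per partition; the reader it creates runs on an executor.
  class RangeFactory(from: Long, until: Long) extends DataReaderFactory[Row] {
    override def createDataReader(): DataReader[Row] = new DataReader[Row] {
      private var i = from - 1
      override def next(): Boolean = { i += 1; i < until }
      override def get(): Row = Row(i)
      override def close(): Unit = ()
    }
  }

It can then be read by class name, e.g.
spark.read.format(classOf[SimpleSource].getName).load().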
Re: Hadoop versioning – it seems reasonable for us to publish an image per
Hadoop version. We should essentially have image-configuration parity with
what we publish as distributions on the Spark website.
Sometimes jars need to be swapped out entirely instead of being strictly
additive…
I would like to add that many people run Spark behind corporate proxies. It's
very common to add an HTTP proxy to extraJavaOptions, so being able to provide
custom extraJavaOptions should be supported.
Also, Hadoop FS 2.7.3 is pretty limited with respect to S3 buckets. You cannot
use temporary AWS tokens. You ca…
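For example, something along these lines, using the standard JVM proxy flags
(the proxy host/port, class name, and jar are placeholders):

  spark-submit \
    --conf "spark.driver.extraJavaOptions=-Dhttp.proxyHost=proxy.example.com -Dhttp.proxyPort=8080 -Dhttps.proxyHost=proxy.example.com -Dhttps.proxyPort=8080" \
    --conf "spark.executor.extraJavaOptions=-Dhttp.proxyHost=proxy.example.com -Dhttp.proxyPort=8080" \
    --class com.example.MyApp \
    my-app.jar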
I created SPARK-23579 and apache/spark#20731 to try to add some more
information to the documentation. It's been open for a few days now, and I was
wondering who I should contact to take a look at it. I'm hoping someone here
can give me some pointers. Perhaps I should mention some specific…
Because these are not exposed in the public API, it is difficult (if not
impossible) to create custom structured streaming sources.
Consequently, one has to create streaming sources in packages under
org.apache.spark.sql.
Any pointers or info is greatly appreciated.
Is this in spark-shell or a spark-submit job?
If it is a spark-submit job, is it running in local or cluster mode?
One reliable way of adding jars is to use the command-line option "--jars".
See
http://spark.apache.org/docs/latest/submitting-applications.html#advanced-dependency-management
for more info.
If you add jars…
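For example (paths, master, and class name are placeholders; --jars takes a
comma-separated list, as described on the page linked above):

  spark-submit \
    --class com.example.MyApp \
    --master yarn \
    --deploy-mode cluster \
    --jars /path/to/dep1.jar,/path/to/dep2.jar \
    my-app.jar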
The difficulty with a custom Spark config is that you need to be careful that
the Spark config the user provides does not conflict with the auto-generated
portions of the Spark config necessary to make Spark on K8s work. So part of
any "API" definition might need to be what Spark config is considered…