Re: Any reason for not exposing internalCreateDataFrame or isStreaming beyond sql package?

2018-03-22 Thread Wenchen Fan
You can look at the PRs that migrate builtin streaming data sources to the V2 API: https://github.com/apache/spark/pulls?utf8=%E2%9C%93&q=is%3Apr+migrate+in%3Atitle+is%3Aclosed+

Re: Any reason for not exposing internalCreateDataFrame or isStreaming beyond sql package?

2018-03-22 Thread Thakrar, Jayesh
Thanks Wenchen - yes, I did refer to the Spark built-in sources as mentioned earlier and have been using the Kafka streaming source as a reference example. The built-in ones work and use internalCreateDataFrame - and that's where I got the idea of using that method to set "isStreaming" to true.
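For reference, the call in question looks roughly like the sketch below. The signature is paraphrased from the Spark 2.3 sources, and the sqlContext/rdd/schema names are placeholders, so treat it as illustrative rather than exact:

    // SparkSession/SQLContext.internalCreateDataFrame is private[sql]; its shape is roughly:
    //   private[sql] def internalCreateDataFrame(
    //       catalystRows: RDD[InternalRow],
    //       schema: StructType,
    //       isStreaming: Boolean = false): DataFrame
    //
    // The built-in Kafka streaming source's getBatch uses it along these lines:
    val df = sqlContext.internalCreateDataFrame(
      rdd,                 // RDD[InternalRow] assembled by the source
      schema,              // the source's StructType
      isStreaming = true)  // marks the resulting DataFrame as streaming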

Re: Any reason for not exposing internalCreateDataFrame or isStreaming beyond sql package?

2018-03-22 Thread Wenchen Fan
org.apache.spark.sql.execution.streaming.Source is for internal use only. The official streaming data source API is the data source V2 API. You can take a look at the Spark built-in streaming data sources as examples. Note: data source V2 is still experimental; you may need to update your code in a ne…

Re: Any reason for not exposing internalCreateDataFrame or isStreaming beyond sql package?

2018-03-22 Thread Thakrar, Jayesh
Hi Ryan, thanks for the quick reply - I like the Iceberg approach and will keep an eye on it. Creating a custom batch/non-streaming data source is not difficult; the issue I have is with creating a streaming data source. Similar to a batch source, you need to implement a simple trait - org.apache.spark.sq…
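The internal trait being referred to looks roughly like the paraphrase below (from memory of the Spark 2.x sources, so the exact members and defaults may differ slightly):

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.execution.streaming.Offset
    import org.apache.spark.sql.types.StructType

    // Paraphrase of org.apache.spark.sql.execution.streaming.Source (internal API):
    trait Source {
      def schema: StructType                                        // schema of the data from this source
      def getOffset: Option[Offset]                                 // latest available offset, if any
      def getBatch(start: Option[Offset], end: Offset): DataFrame   // rows between two offsets
      def commit(end: Offset): Unit = {}                            // data up to `end` has been processed
      def stop(): Unit                                              // release resources
    }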

Re: Any reason for not exposing internalCreateDataFrame or isStreaming beyond sql package?

2018-03-22 Thread Ryan Blue
Jayesh, we're working on a new API for building sources, DataSourceV2. That API allows you to produce UnsafeRow, and we are very likely going to change that to InternalRow (SPARK-23325). There's an experimental version in the latest 2.3.0 release if you'd like to try it out. Here's an example impl…
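To give a flavour of the experimental API, here is a minimal sketch of a batch read path using the interface names as I recall them from the 2.3.0 release (DataSourceV2, ReadSupport, DataSourceReader, DataReaderFactory, DataReader); the API is still evolving, so names and signatures may differ in later releases, and the toy source itself is made up:

    import java.util.{Arrays => JArrays, List => JList}
    import org.apache.spark.sql.Row
    import org.apache.spark.sql.sources.v2.{DataSourceOptions, DataSourceV2, ReadSupport}
    import org.apache.spark.sql.sources.v2.reader.{DataReader, DataReaderFactory, DataSourceReader}
    import org.apache.spark.sql.types.{LongType, StructField, StructType}

    // A toy source that returns the numbers 0..9 in a single partition.
    class SimpleSource extends DataSourceV2 with ReadSupport {
      override def createReader(options: DataSourceOptions): DataSourceReader = new SimpleReader
    }

    class SimpleReader extends DataSourceReader {
      override def readSchema(): StructType = StructType(Seq(StructField("value", LongType)))
      override def createDataReaderFactories(): JList[DataReaderFactory[Row]] =
        JArrays.asList[DataReaderFactory[Row]](new SimpleFactory)
    }

    class SimpleFactory extends DataReaderFactory[Row] {
      override def createDataReader(): DataReader[Row] = new SimpleDataReader
    }

    class SimpleDataReader extends DataReader[Row] {
      private var current = -1L
      override def next(): Boolean = { current += 1; current < 10 }
      override def get(): Row = Row(current)
      override def close(): Unit = {}
    }

A source like this can then be loaded with spark.read.format(classOf[SimpleSource].getName).load().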

Re: Toward an "API" for spark images used by the Kubernetes back-end

2018-03-22 Thread Matt Cheah
Re: Hadoop versioning – it seems reasonable enough for us to be publishing an image per Hadoop version. We should essentially have image configuration parity with what we publish as distributions on the Spark website. Sometimes jars need to be swapped out entirely instead of being strictly a…

Re: Toward an "API" for spark images used by the Kubernetes back-end

2018-03-22 Thread Lalwani, Jayesh
I would like to add that many people run Spark behind corporate proxies. It’s very common to add HTTP proxy settings to extraJavaOptions, so being able to provide custom extraJavaOptions should be supported. Also, Hadoop FS 2.7.3 is pretty limited with respect to S3 buckets. You cannot use temporary AWS tokens. You ca…
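For example, a typical invocation looks something like the following (the proxy host/port, class, and jar names are placeholders; which proxy system properties are needed depends on the environment):

    spark-submit \
      --conf "spark.driver.extraJavaOptions=-Dhttp.proxyHost=proxy.mycorp.example -Dhttp.proxyPort=8080 -Dhttps.proxyHost=proxy.mycorp.example -Dhttps.proxyPort=8080" \
      --conf "spark.executor.extraJavaOptions=-Dhttp.proxyHost=proxy.mycorp.example -Dhttp.proxyPort=8080" \
      --class com.example.MyApp \
      my-app.jar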

Improving documentation

2018-03-22 Thread Mr riaas mokiem
I created SPARK-23579 and apache/spark#20731 to try to add some more information to the documentation. They have been open for a few days now and I was wondering whom I should contact to take a look at this. I'm hoping someone here can give me some pointers. Perhaps I should mention some speci…

Any reason for not exposing internalCreateDataFrame or isStreaming beyond sql package?

2018-03-22 Thread Thakrar, Jayesh
Because these are not exposed in the public API, it's not possible (or at least difficult) to create custom structured streaming sources. Consequently, one has to create streaming sources in packages under org.apache.spark.sql. Any pointers or info would be greatly appreciated.
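To illustrate the workaround being described (a hypothetical package name, used purely to reach private[sql] members; not something the Spark project recommends):

    // Hypothetical: this only compiles because the file is declared inside the
    // org.apache.spark.sql package tree, which private[sql] members require.
    package org.apache.spark.sql.execution.streaming.mysource

    // ... a custom streaming Source whose getBatch can now call
    //   sqlContext.internalCreateDataFrame(rows, schema, isStreaming = true)
    // because the caller lives under org.apache.spark.sql.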

Re: "Spark.jars not adding jars to classpath"

2018-03-22 Thread Thakrar, Jayesh
Is this in spark-shell or a spark-submit job? If a spark-submit job, is it local or cluster mode? One reliable way of adding jars is to use the command-line option "--jars". See http://spark.apache.org/docs/latest/submitting-applications.html#advanced-dependency-management for more info. If you add ja…
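For instance (the class, master, and jar paths below are placeholders):

    # Jars listed in --jars are shipped with the application and added to
    # both the driver and executor classpaths.
    spark-submit \
      --class com.example.MyApp \
      --master yarn \
      --deploy-mode cluster \
      --jars /path/to/dep1.jar,/path/to/dep2.jar \
      my-app.jar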

Re: Toward an "API" for spark images used by the Kubernetes back-end

2018-03-22 Thread Rob Vesse
The difficulty with a custom Spark config is that you need to be careful that the Spark config the user provides does not conflict with the auto-generated portions of the Spark config necessary to make Spark on K8S work. So part of any “API” definition might need to be what Spark config is cons…