Re: DataSourceV2 write input requirements

2018-03-28 Thread Russell Spitzer
Ah yeah sorry I got a bit mixed up. On Wed, Mar 28, 2018 at 7:54 PM Ted Yu wrote: > bq. this shuffle could outweigh the benefits of the organized data if the cardinality is lower. > I wonder if you meant higher in place of the last word above. > Cheers > On Wed, Mar …

Re: DataSourceV2 write input requirements

2018-03-28 Thread Ted Yu
bq. this shuffle could outweigh the benefits of the organized data if the cardinality is lower. I wonder if you meant higher in place of the last word above. Cheers On Wed, Mar 28, 2018 at 7:50 PM, Russell Spitzer wrote: > For added color, one thing that I may want …

Re: DataSourceV2 write input requirements

2018-03-28 Thread Russell Spitzer
For added color, one thing that I may want to consider as a data source implementer is the cost/benefit of applying a particular clustering. For example, a dataset with low cardinality in the clustering key could benefit greatly from clustering on that key before writing to Cassandra since …
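
For illustration only (this is not part of the API being discussed), the kind of pre-write organization described here can be approximated by hand today. A minimal Scala sketch, assuming a hypothetical events DataFrame keyed by a partition_key column and the Cassandra connector's data source format:

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.col

    // Sketch only: cluster rows by the Cassandra partition key before writing,
    // so each Spark task mostly writes to one group of Cassandra partitions.
    // The `events` DataFrame, column name, keyspace, and table are hypothetical.
    def writeClustered(events: DataFrame): Unit = {
      events
        .repartition(col("partition_key"))           // hash-cluster equal keys into the same task
        .sortWithinPartitions(col("partition_key"))  // order rows within each task
        .write
        .format("org.apache.spark.sql.cassandra")    // spark-cassandra-connector format
        .options(Map("keyspace" -> "ks", "table" -> "events"))
        .mode("append")
        .save()
    }

The proposal in this thread is essentially about letting the source declare this layout so Spark inserts the shuffle and sort itself, rather than each job doing it manually as above.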

Re: DataSourceV2 write input requirements

2018-03-28 Thread Patrick Woody
> Spark would always apply the required clustering and sort order because they are required by the data source. It is reasonable for a source to reject data that isn’t properly prepared. For example, data must be written to HTable files with keys in order or else the files are invalid.

Re: DataSourceV2 write input requirements

2018-03-28 Thread Ryan Blue
How would Spark determine whether or not to apply a recommendation - a cost threshold? Spark would always apply the required clustering and sort order because they are required by the data source. It is reasonable for a source to reject data that isn’t properly prepared. For example, data must be …
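
To make the shape of the idea concrete, a purely hypothetical Scala sketch follows; none of these trait or method names come from the actual DataSourceV2 proposal, they only illustrate a sink declaring the clustering and sort order it needs so that Spark can prepare the data before the write:

    // Hypothetical interface, for illustration only; the real DataSourceV2 write
    // API under discussion may use different names, types, and semantics.
    trait ClusteredWriteSupport {
      // Columns the incoming rows must be clustered (co-partitioned) by.
      def requiredClustering: Seq[String]

      // Per-partition sort order the rows must arrive in, e.g. HTable key order.
      def requiredOrdering: Seq[String]
    }

    // A key-value sink such as Cassandra or HBase might implement it along these lines:
    class ExampleKeyValueSink extends ClusteredWriteSupport {
      override def requiredClustering: Seq[String] = Seq("partition_key")
      override def requiredOrdering: Seq[String] = Seq("partition_key", "clustering_key")
    }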

Re: Toward an "API" for spark images used by the Kubernetes back-end

2018-03-28 Thread Kimoon Kim
Thanks for starting this discussion. When I was troubleshooting Spark on K8s, I often faced a need to turn on debug messages on the driver and executor pods of my jobs, which would be possible if I somehow put the right log4j.properties file inside the pods. I know I can build custom Docker …
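
One common workaround, shown here only as a sketch (the path is hypothetical and the log4j.properties file must already exist inside the pod image), is to point the driver and executor JVMs at a custom log4j configuration via standard Spark settings passed at submit time:

    // Sketch only: the two standard Spark settings that point the driver and
    // executor JVMs at a custom log4j.properties. They normally need to be set
    // at submit time (--conf flags or spark-defaults.conf), because the driver
    // JVM is already running by the time application code executes. The path is
    // hypothetical and the file must already be present inside the pod image.
    val debugLoggingConf: Map[String, String] = Map(
      "spark.driver.extraJavaOptions" ->
        "-Dlog4j.configuration=file:/opt/spark/conf/log4j.properties",
      "spark.executor.extraJavaOptions" ->
        "-Dlog4j.configuration=file:/opt/spark/conf/log4j.properties"
    )

    // Rendered as spark-submit flags:
    val asFlags: Seq[String] = debugLoggingConf.map { case (k, v) => s"--conf $k=$v" }.toSeq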

Re: [Spark R] Proposal: Exposing RBackend in RRunner

2018-03-28 Thread Reynold Xin
If you need the functionality, I would recommend just copying the code over to your project and using it that way. On Wed, Mar 28, 2018 at 9:02 AM Felix Cheung wrote: > I think the difference is py4j is a public library whereas the R backend is specific to SparkR.

Re: [Spark R] Proposal: Exposing RBackend in RRunner

2018-03-28 Thread Felix Cheung
I think the difference is py4j is a public library whereas the R backend is specific to SparkR. Can you elaborate on what you need JVMObjectTracker for? We have provided convenient R APIs to call into the JVM: sparkR.callJMethod, for example. … From: Jeremy Liu

Re: Build issues with apache-spark-on-k8s.

2018-03-28 Thread Anirudh Ramanathan
As Lucas said, those directories are generated and copied when you run a full maven build with the -Pkubernetes flag specified (or follow the instructions in https://spark.apache.org/docs/latest/building-spark.html#building-a-runnable-distribution). Also, using the Kubernetes integration in the main …

Re: [Kubernetes] structured-streaming driver restarts / roadmap

2018-03-28 Thread Anirudh Ramanathan
We discussed this early on in our fork and I think we should have this in a JIRA and discuss it further. It's something we want to address in the future. One proposed method is using a StatefulSet of size 1 for the driver. This ensures recovery but at the same time takes away from the completion …

Re: DataSourceV2 write input requirements

2018-03-28 Thread Patrick Woody
How would Spark determine whether or not to apply a recommendation - a cost threshold? And yes, it would be good to flesh out what information we get from Spark in the datasource when providing these recommendations/requirements - I could see statistics and the existing outputPartitioning/Ordering …

[Kubernetes] structured-streaming driver restarts / roadmap

2018-03-28 Thread Lucas Kacher
As a carry-over from the apache-spark-on-k8s project, it would be useful to have a configurable restart policy for jobs submitted with the Kubernetes resource manager. See the following issues: https://github.com/apache-spark-on-k8s/spark/issues/133 …

Re: Build issues with apache-spark-on-k8s.

2018-03-28 Thread Lucas Kacher
Are you building on the fork or on the official release now? I built v2.3.0 from source without issue. One thing I noticed is that I needed to run the build-image command from the bin/ that was placed in dist/, as opposed to the one in the repo (as that's how it copies the necessary targets).

Build issues with apache-spark-on-k8s.

2018-03-28 Thread Atul Sowani
Hi, I built apache-spark-on-k8s from source on Ubuntu 16.04 and it built without errors. Next, I wanted to create Docker images, so, as explained at https://apache-spark-on-k8s.github.io/userdocs/running-on-kubernetes.html, I used sbin/build-push-docker-images.sh to create them. While using …