Re: org.apache.spark.shuffle.FetchFailedException: Too large frame:

2018-05-02 Thread Pralabh Kumar
I am performing a join operation. If I convert the reduce-side join to a map-side join (no shuffle will happen), I assume this error shouldn't occur in that case. Let me know if this understanding is correct. On Tue, May 1, 2018 at 9:37 PM, Ryan Blue wrote: > This is usually caused by
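
A minimal sketch of the map-side conversion being described, using a broadcast join so the large side is never shuffled; the table paths and the join column "id" are illustrative assumptions, not details from the thread:

```
// Sketch: replace a shuffle (reduce-side) join with a broadcast
// (map-side) join. Paths and the join key are hypothetical.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder().appName("broadcast-join").getOrCreate()

val large = spark.read.parquet("/data/large_table")
val small = spark.read.parquet("/data/small_table")

// broadcast() ships the small table to every executor, so the large
// table is joined locally and no shuffle frames need to be fetched.
val joined = large.join(broadcast(small), Seq("id"))
```

Spark will also broadcast automatically when the small side is below spark.sql.autoBroadcastJoinThreshold, so raising that threshold is another way to get the same plan.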

Re: SparkR test failures in PR builder

2018-05-02 Thread Kazuaki Ishizaki
I am not familiar with SparkR or CRAN. However, I remember that we had a similar situation before, and there was great work done at that time. Having just visited this PR, I think we have a similar situation (i.e., a format error) again: https://github.com/apache/spark/pull/20005 Any other comments

SparkR test failures in PR builder

2018-05-02 Thread Joseph Bradley
Hi all, Does anyone know why the PR builder keeps failing on SparkR's CRAN checks? I've seen this in a lot of unrelated PRs. E.g.: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90065/console Hossein spotted this line: ``` * checking CRAN incoming feasibility ...Error in

AccumulatorV2 vs AccumulableParam (V1)

2018-05-02 Thread Sergey Zhemzhitsky
Hello guys, I've started to migrate my Spark jobs from Accumulators V1 to AccumulatorV2 and ran into the following issues: 1. LegacyAccumulatorWrapper now requires the result type of the AccumulableParam to implement equals. Otherwise the AccumulableParam, automatically wrapped into
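
A minimal sketch of what the AccumulatorV2 side of such a migration can look like; the MaxLong accumulator below is an illustrative assumption, not one of the jobs from the thread:

```
// Sketch: a V1 AccumulableParam rewritten as an AccumulatorV2.
import org.apache.spark.util.AccumulatorV2

class MaxLong extends AccumulatorV2[Long, Long] {
  private var _max = Long.MinValue

  override def isZero: Boolean = _max == Long.MinValue
  override def copy(): MaxLong = { val c = new MaxLong; c._max = _max; c }
  override def reset(): Unit = _max = Long.MinValue
  override def add(v: Long): Unit = _max = math.max(_max, v)
  override def merge(other: AccumulatorV2[Long, Long]): Unit =
    _max = math.max(_max, other.value)
  override def value: Long = _max
}

// Usage: register with the SparkContext before using it in a job.
// sc.register(new MaxLong, "max seen")
```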

Re: Custom datasource as a wrapper for existing ones?

2018-05-02 Thread Jakub Wozniak
Hello, Thanks a lot for your answers. We normally look for some stability, so the use of internal APIs that are subject to change without warning is somewhat questionable. As to the approach of putting this functionality on top of Spark instead of in a datasource - this works but poses a

Re: Custom datasource as a wrapper for existing ones?

2018-05-02 Thread Jörn Franke
Some note on the internal API - it used to change with each release, which was quite annoying because other data sources (Avro, HadoopOffice, etc.) had to follow up on this. In the end it is an internal API and thus is not guaranteed to be stable. If you want to have something stable you have to

Re: Custom datasource as a wrapper for existing ones?

2018-05-02 Thread Jörn Franke
At some point in time, Spark used an internal API for the formats shipped with Spark (e.g. Parquet) that is not the data source API. You can look at how this is implemented for Parquet and co. in the Spark source code. Maybe this is the issue you are facing? Have you tried to put your

Re: [build system] meet your build engineer @ spark ai summit SF 2018

2018-05-02 Thread shane knapp
whoops, i forgot to add that we'll have some demos/tutorials running of the latest projects coming out of the lab: https://ray.readthedocs.io/ https://clipper.ai/ i'll post an update closer to the date, but i'm really excited about the new projects! :) On Wed, May 2, 2018 at 11:11 AM, shane

[build system] meet your build engineer @ spark ai summit SF 2018

2018-05-02 Thread shane knapp
hey everyone! if you ever wanted to meet the one-man operation that keeps things going, talk about future build system plans, complain about the fact that we're still on centos 6 (yes, i know), or just say hi, i'll be manning the RISELab booth at summit all three days! :) shane -- Shane Knapp

Custom datasource as a wrapper for existing ones?

2018-05-02 Thread jwozniak
Hello, At CERN we are developing a Big Data system called NXCALS that uses Spark as its extraction API. We have implemented a custom datasource that wraps 2 existing ones (Parquet and HBase) in order to hide the implementation details (location of the Parquet files, HBase tables, etc.) and to
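
One way such a wrapper can be exposed without touching internal APIs is a thin facade on top of the public read API that hides where the data lives; a hedged sketch follows, where the object name and path layout are hypothetical, not NXCALS internals:

```
// Sketch: a facade that hides storage details (Parquet paths, HBase
// tables) behind a domain-level call. Names and paths are illustrative.
import org.apache.spark.sql.{DataFrame, SparkSession}

object ExtractionApi {
  // Callers ask for a signal by name; the facade resolves the storage.
  def readSignal(spark: SparkSession, system: String, signal: String): DataFrame =
    spark.read.parquet(s"/nxcals/$system/$signal")
}
```

A full custom datasource would instead implement the public RelationProvider interface and delegate to the wrapped sources, but the facade keeps everything on stable, public API.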