Unit test error

2016-04-28 Thread JaeSung Jun
Hi All, I'm developing a custom data source & relation provider based on Spark 1.6.1. Each unit test has its own SparkContext, and each runs successfully when run one at a time. But when running under sbt (sbt test), an error pops up when initializing the SparkContext, like the following:
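
A common cause here (assuming the elided error is the usual "Only one SparkContext may be running in this JVM") is sbt running suites concurrently in one JVM. A minimal sketch of one workaround; the suite name and test body below are illustrative, not from the original message:

    // build.sbt: run test suites serially so only one SparkContext is live per JVM
    parallelExecution in Test := false

    // In the test sources: one SparkContext per suite, stopped in afterAll
    import org.apache.spark.{SparkConf, SparkContext}
    import org.scalatest.{BeforeAndAfterAll, FunSuite}

    class MyRelationProviderSuite extends FunSuite with BeforeAndAfterAll {
      @transient private var sc: SparkContext = _

      override def beforeAll(): Unit = {
        sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("test"))
      }

      override def afterAll(): Unit = {
        if (sc != null) sc.stop()
      }

      test("relation provider smoke test") {
        assert(sc.parallelize(1 to 10).count() == 10)
      }
    }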

Re: Creating Spark Extras project, was Re: SPARK-13843 and future of streaming backends

2016-04-28 Thread Luciano Resende
Just want to provide a quick update that we have submitted the "Spark Extras" proposal for review by the Apache board (see link below with the contents). https://docs.google.com/document/d/1zRFGG4414LhbKlGbYncZ13nyX34Rw4sfWhZRA5YBtIE/edit?usp=sharing Note that we are in the quest for a project

Re: SparkR unit test failures on local master

2016-04-28 Thread Gayathri Murali
I just rebuilt Spark and tried to run the tests again. Same failure. On Thu, Apr 28, 2016 at 1:39 PM, Shivaram Venkataraman < shiva...@eecs.berkeley.edu> wrote: > I just ran the tests using a recently synced master branch and the > tests seemed to work fine. My guess is some of the Java classes

ConvertToSafe being done before functions.explode

2016-04-28 Thread Hamel Kothari
Hi all, I've been looking at some of my query plans and noticed that pretty much every explode that I run (which is always over a column with ArrayData) is prefixed with a ConvertToSafe call in the physical plan. Looking at Generate.scala it looks like it doesn't override canProcessUnsafeRows in
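
For anyone who wants to reproduce this, a small sketch against the Spark 1.6 APIs (the column names are invented, and an existing SparkContext sc is assumed, e.g. from spark-shell):

    import org.apache.spark.sql.SQLContext
    import org.apache.spark.sql.functions.{col, explode}

    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // A DataFrame with an array column
    val df = Seq((1, Seq("a", "b")), (2, Seq("c"))).toDF("id", "items")

    // explain(true) prints the physical plan; look for ConvertToSafe above Generate
    df.select(col("id"), explode(col("items")).as("item")).explain(true)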

Re: SparkR unit test failures on local master

2016-04-28 Thread Shivaram Venkataraman
I just ran the tests using a recently synced master branch and the tests seemed to work fine. My guess is some of the Java classes changed and you need to rebuild Spark? Thanks Shivaram On Thu, Apr 28, 2016 at 1:19 PM, Gayathri Murali wrote: > Hi All, > > I am
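
For anyone hitting the same failure, a full rebuild along those lines would look something like this (standard Spark build commands, not quoted from the thread):

    # Rebuild Spark, refresh the local SparkR package, then re-run the R tests
    build/mvn -DskipTests clean package
    ./R/install-dev.sh
    ./R/run-tests.sh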

SparkR unit test failures on local master

2016-04-28 Thread Gayathri Murali
Hi All, I am running the SparkR unit tests (./R/run-tests.sh) on a local master branch and I am seeing the following issues with the SparkR ML wrapper test cases. Failed - 1. Error: glm and predict (@test_mllib.R#31)

Re: Tungsten off heap memory access for C++ libraries

2016-04-28 Thread jpivar...@gmail.com
jpivar...@gmail.com wrote > P.S. Concerning Java/C++ bindings, there are many. I tried JNI, JNA, > BridJ, and JavaCPP personally, but in the end picked JNA because of its > (comparatively) large user base. If Spark will be using Djinni, that could > be a symmetry-breaking consideration and I'll

Re: Tungsten off heap memory access for C++ libraries

2016-04-28 Thread jpivar...@gmail.com
Hi, I'm coming from the particle physics community and I'm also very interested in the development of this project. We have a huge C++ codebase and would like to start using the higher-level abstractions of Spark in our data analyses. To this end, I've been developing code that copies data from
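
For context, a minimal JNA sketch of the kind of binding being discussed; the library name "fastcalc" and function "process_doubles" are hypothetical, not from the thread:

    import com.sun.jna.{Library, Memory, Native}

    // Hypothetical native interface; the C side would be:
    //   void process_doubles(double* buf, int n);
    trait FastCalc extends Library {
      def process_doubles(buf: Memory, n: Int): Unit
    }

    val lib = Native.loadLibrary("fastcalc", classOf[FastCalc]).asInstanceOf[FastCalc]

    // Allocate off-heap memory, fill it, and hand the raw pointer to the native side
    val n = 1024
    val buf = new Memory(n * 8L)
    (0 until n).foreach(i => buf.setDouble(i * 8L, i.toDouble))
    lib.process_doubles(buf, n)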

Re: HDFS as Shuffle Service

2016-04-28 Thread Andrew Ray
Yes, HDFS has serious problems with creating lots of files. But we can always just create a single merged file on HDFS per task. On Apr 28, 2016 11:17 AM, "Reynold Xin" wrote: Hm while this is an attractive idea in theory, in practice I think you are substantially

Re: Spark ML - Scaling logistic regression for many features

2016-04-28 Thread Daniel Siegmann
FYI: https://issues.apache.org/jira/browse/SPARK-14464 I have submitted a PR as well. On Fri, Mar 18, 2016 at 7:15 AM, Nick Pentreath wrote: > No, I didn't yet - feel free to create a JIRA. > > > > On Thu, 17 Mar 2016 at 22:55 Daniel Siegmann

Re: HDFS as Shuffle Service

2016-04-28 Thread Mark Hamstra
Ah, got it. While that would be useful, it doesn't address the more general (and potentially even more beneficial) case where the total number of worker nodes is fully elastic. That already starts to push you in the direction of splitting Spark worker and HDFS data nodes into disjoint sets, and

Re: RDD.broadcast

2016-04-28 Thread Reynold Xin
This would be a nice feature for broadcast joins. It is just a little bit complicated to do and as a result hasn't been prioritized as highly yet. On Thu, Apr 28, 2016 at 5:51 AM, wrote: > I was aiming to show the operations with pseudo-code, but I apparently > failed,
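
For reference, the DataFrame API already exposes a broadcast hint for joins (org.apache.spark.sql.functions.broadcast, available since 1.5); the thread is about getting something analogous at the RDD level. A sketch with made-up data, assuming an existing SparkContext sc:

    import org.apache.spark.sql.SQLContext
    import org.apache.spark.sql.functions.broadcast

    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val largeDF = sc.parallelize(1 to 1000000).map(i => (i % 1000, i)).toDF("uid", "value")
    val smallDF = sc.parallelize(0 until 1000).map(i => (i, s"user_$i")).toDF("uid", "name")

    // broadcast() hints the planner to ship smallDF to every executor
    // instead of shuffling both sides
    val joined = largeDF.join(broadcast(smallDF), "uid")
    joined.explain()  // look for BroadcastHashJoin in the plan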

Re: HDFS as Shuffle Service

2016-04-28 Thread Michael Gummelt
Not disjoint. Colocated. By "shrinking", I don't mean any nodes are going away. I mean executors are decreasing in number, which is the case with dynamic allocation. HDFS nodes aren't decreasing in number though, and we can still colocate on those nodes, as always. On Thu, Apr 28, 2016 at

Re: HDFS as Shuffle Service

2016-04-28 Thread Michael Gummelt
Yea, it's an open question. I'm willing to create some benchmarks, but I'd first like to know that the feature would be accepted assuming the results are reasonable. Can a committer give me a thumbs up? On Thu, Apr 28, 2016 at 11:17 AM, Reynold Xin wrote: > Hm while this

Re: HDFS as Shuffle Service

2016-04-28 Thread Mark Hamstra
So you are only considering the case where your set of HDFS nodes is disjoint from your dynamic set of Spark Worker nodes? That would seem to be a pretty significant sacrifice of data locality. On Thu, Apr 28, 2016 at 11:15 AM, Michael Gummelt wrote: > > if after a

Re: HDFS as Shuffle Service

2016-04-28 Thread Reynold Xin
Hm while this is an attractive idea in theory, in practice I think you are substantially overestimating HDFS' ability to handle a lot of small, ephemeral files. It has never really been optimized for that use case. On Thu, Apr 28, 2016 at 11:15 AM, Michael Gummelt wrote:

Re: HDFS as Shuffle Service

2016-04-28 Thread Michael Gummelt
> if after a work-load burst your cluster dynamically changes from 10,000 workers to 1,000, will the typical HDFS replication factor be sufficient to retain access to the shuffle files in HDFS HDFS isn't resizing. Spark is. HDFS files should be HA and durable. On Thu, Apr 28, 2016 at 11:08 AM,

Re: HDFS as Shuffle Service

2016-04-28 Thread Mark Hamstra
Yes, replicated and distributed shuffle materializations are a key requirement to maintain performance in a fully elastic cluster where Executors aren't just reallocated across an essentially fixed number of Worker nodes, but rather the number of Workers itself is dynamic. Retaining the file

Re: HDFS as Shuffle Service

2016-04-28 Thread Michael Gummelt
> Why would you run the shuffle service on 10K nodes but Spark executors on just 100 nodes? wouldn't you also run that service just on the 100 nodes? We have to start the service beforehand, out of band, and we don't know a priori where the Spark executors will land. Those 100 executors could

Re: Using Spark when data definitions are unknowable at compile time

2016-04-28 Thread Dean Wampler
I would start with using DataFrames and the Row API, because you can fetch fields by index. Presumably, you'll parse the incoming data and determine what fields have what types, etc. Or, will someone specify the
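
A small sketch of that approach with the Spark 1.6 Row API (the field names and sqlContext setup are invented for illustration):

    import org.apache.spark.sql.{Row, SQLContext}
    import org.apache.spark.sql.types._

    val sqlContext = new SQLContext(sc)

    // Schema decided at runtime, e.g. after parsing the incoming data
    val schema = StructType(Seq(
      StructField("name", StringType),
      StructField("age", IntegerType)))

    val rows = sc.parallelize(Seq(Row("alice", 30), Row("bob", 25)))
    val df = sqlContext.createDataFrame(rows, schema)

    // Fetch fields positionally, with no compile-time knowledge of the schema
    df.rdd.map(row => (row.getString(0), row.getInt(1))).collect()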

Spark streaming concurrent job scheduling question

2016-04-28 Thread Renyi Xiong
Hi, I am trying to run an I/O-intensive RDD in parallel with a CPU-intensive RDD within an application, through a window like below (pseudo-code): var ssc = new StreamingContext(sc, 1min); var ds1 = ... var ds2 = ds1.Window(2min).ForeachRDD(...) ds1.ForeachRDD(...) I'd like ds1 to start its job at the 1-minute interval
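
By default Spark Streaming runs one output job at a time; the knob usually mentioned for this is spark.streaming.concurrentJobs, which is undocumented, so treat it with care. A rough Scala equivalent of the setup above, with made-up batch work:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Minutes, StreamingContext}

    val conf = new SparkConf()
      .setMaster("local[4]")
      .setAppName("concurrent-jobs")
      // Undocumented: allow more than one streaming output job to run at once
      .set("spark.streaming.concurrentJobs", "2")

    val ssc = new StreamingContext(conf, Minutes(1))
    val ds1 = ssc.socketTextStream("localhost", 9999)

    // CPU-intensive job, every 1-minute batch
    ds1.foreachRDD(rdd => println("records this batch: " + rdd.count()))

    // I/O-intensive job over a 2-minute window
    ds1.window(Minutes(2)).foreachRDD(rdd =>
      rdd.saveAsTextFile("/tmp/out-" + System.currentTimeMillis))

    ssc.start()
    ssc.awaitTermination()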

certification suite?

2016-04-28 Thread William Benton
Hi all, Does anyone happen to know what tests Databricks uses for the Spark distribution certification suite? Is it simply the tests that run as CI on Spark pull requests, or is there something more involved? The web site (

RE: RDD.broadcast

2016-04-28 Thread Ioannis.Deligiannis
I was aiming to show the operations with pseudo-code, but I apparently failed, so Java it is ☺ Assume the following 3 datasets on HDFS: 1. RDD1: User (1 million rows – 2 GB). Columns: uid, locationId, (extra stuff) 2. RDD2: Actions (1 billion rows – 500 GB). Columns: uid_1, uid_2

Re: RDD.broadcast

2016-04-28 Thread Marcin Tustin
I don't know what your notation really means. I'm very much unclear on why you can't use the filter method for 1. If you're talking about splitting/bucketing rather than filtering as such, I think that is a specific lacuna in Spark's API. I've generally found the join API to be entirely adequate for my
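
For what it's worth, the filter-based split looks like this; the caveat (and perhaps the lacuna) is that each filter is a separate pass over the data unless the RDD is cached first:

    val rdd = sc.parallelize(1 to 100)

    // Cache first, since each filter below triggers its own scan
    rdd.cache()
    val evens = rdd.filter(_ % 2 == 0)
    val odds  = rdd.filter(_ % 2 != 0)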

Re: RDD.broadcast

2016-04-28 Thread Mike Hynes
I second knowing the use case, out of interest. I can imagine a case where knowledge of the RDD key distribution would help local computations, for relatively few keys, but would be interested to hear your motive. Essentially, are you trying to achieve what would be an all-reduce type operation in

Re: RDD.broadcast

2016-04-28 Thread Marcin Tustin
Why would you ever need to do this? I'm genuinely curious. I view collects as being solely for interactive work. On Thursday, April 28, 2016, wrote: > Hi, > > > > It is a common pattern to process an RDD, collect (typically a subset) to > the driver and then

Re: Decrease shuffle in TreeAggregate with coalesce ?

2016-04-28 Thread Guillaume Pitel
Long story short: the performance issue appeared with a recompiled version of the source TGZ downloaded from the Spark website. The problem disappears with 1.6.2-SNAPSHOT (branch-1.6). Guillaume Do you have code which can reproduce this performance drop in treeReduce? It would be helpful

RDD.broadcast

2016-04-28 Thread Ioannis.Deligiannis
Hi, It is a common pattern to process an RDD, collect (typically a subset) to the driver and then broadcast back. Adding an RDD method that can do that using the torrent broadcast mechanics would be much more efficient. In addition, it would not require the Driver to also utilize its Heap
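
For readers skimming the thread, the pattern as it has to be written today looks roughly like this (a sketch with made-up data); the proposal is for something like rdd.broadcast() that builds the torrent broadcast directly from executors, skipping the driver round-trip:

    // Today: collect a subset to the driver, then broadcast it back out
    val subset = sc.parallelize(1 to 1000000)
      .filter(_ % 1000 == 0)
      .collect()                      // funnels everything through the driver heap

    val bc = sc.broadcast(subset.toSet)

    // Use the broadcast value executor-side
    val hits = sc.parallelize(1 to 1000000).filter(x => bc.value.contains(x))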

Re: Decrease shuffle in TreeAggregate with coalesce ?

2016-04-28 Thread Guillaume Pitel
On 27/04/2016 at 19:41, Joseph Bradley wrote: Do you have code which can reproduce this performance drop in treeReduce? It would be helpful to debug. In the 1.6 release, we profiled it via the various MLlib algorithms and did not see performance drops. That would be difficult, but if we
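
For readers following along, treeAggregate's depth parameter is the lever being discussed; it is a real RDD API, though the aggregation below is a toy example:

    val data = sc.parallelize(1 to 1000000, numSlices = 200)

    // depth controls how many levels of partial aggregation run on executors
    // before results reach the driver (the default is 2)
    val sum = data.treeAggregate(0L)(
      seqOp = (acc, x) => acc + x,
      combOp = _ + _,
      depth = 3)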

Re: HDFS as Shuffle Service

2016-04-28 Thread Sean Owen
Why would you run the shuffle service on 10K nodes but Spark executors on just 100 nodes? Wouldn't you also run that service just on the 100 nodes? What does plumbing it through HDFS buy you in comparison? There's some additional overhead, and if anything you lose some control over locality, in a