Unit test error

2016-04-28 Thread JaeSung Jun
Hi All, I'm developing a custom data source & relation provider based on Spark 1.6.1. Each unit test has its own SparkContext, and each runs successfully when run one at a time. But when running under sbt (sbt test), an error pops up when initializing the SparkContext, like the following:
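
A common cause here (assuming the elided error is the usual "Only one SparkContext may be running in this JVM") is sbt running suites concurrently in one JVM. A minimal sketch of one workaround; the suite name and test body below are illustrative, not from the original message:

    // build.sbt: run test suites serially so only one SparkContext is live per JVM
    parallelExecution in Test := false

    // In the test sources: one SparkContext per suite, stopped in afterAll
    import org.apache.spark.{SparkConf, SparkContext}
    import org.scalatest.{BeforeAndAfterAll, FunSuite}

    class MyRelationProviderSuite extends FunSuite with BeforeAndAfterAll {
      @transient private var sc: SparkContext = _

      override def beforeAll(): Unit = {
        sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("test"))
      }

      override def afterAll(): Unit = {
        if (sc != null) sc.stop()
      }

      test("relation provider smoke test") {
        assert(sc.parallelize(1 to 10).count() == 10)
      }
    }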

Re: Creating Spark Extras project, was Re: SPARK-13843 and future of streaming backends

2016-04-28 Thread Luciano Resende
Just want to provide a quick update that we have submitted the "Spark Extras" proposal for review by the Apache board (see link below with the contents). https://docs.google.com/document/d/1zRFGG4414LhbKlGbYncZ13nyX34Rw4sfWhZRA5YBtIE/edit?usp=sharing Note that we are in the quest for a project

Re: SparkR unit test failures on local master

2016-04-28 Thread Gayathri Murali
I just rebuilt Spark and tried to run the tests again. Same failure. On Thu, Apr 28, 2016 at 1:39 PM, Shivaram Venkataraman < shiva...@eecs.berkeley.edu> wrote: > I just ran the tests using a recently synced master branch and the > tests seemed to work fine. My guess is some of the Java classes

ConvertToSafe being done before functions.explode

2016-04-28 Thread Hamel Kothari
Hi all, I've been looking at some of my query plans and noticed that pretty much every explode that I run (which is always over a column with ArrayData) is prefixed with a ConvertToSafe call in the physical plan. Looking at Generate.scala it looks like it doesn't override canProcessUnsafeRows in
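
For anyone who wants to reproduce this, a small sketch against the Spark 1.6 APIs (the column names are invented, and an existing SparkContext sc is assumed, e.g. from spark-shell):

    import org.apache.spark.sql.SQLContext
    import org.apache.spark.sql.functions.{col, explode}

    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // A DataFrame with an array column
    val df = Seq((1, Seq("a", "b")), (2, Seq("c"))).toDF("id", "items")

    // explain(true) prints the physical plan; look for ConvertToSafe above Generate
    df.select(col("id"), explode(col("items")).as("item")).explain(true)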

Re: SparkR unit test failures on local master

2016-04-28 Thread Shivaram Venkataraman
I just ran the tests using a recently synced master branch and the tests seemed to work fine. My guess is some of the Java classes changed and you need to rebuild Spark? Thanks Shivaram On Thu, Apr 28, 2016 at 1:19 PM, Gayathri Murali wrote: > Hi All, > > I am
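
For anyone hitting the same failure, a full rebuild along those lines would look something like this (standard Spark build commands, not quoted from the thread):

    # Rebuild Spark, refresh the local SparkR package, then re-run the R tests
    build/mvn -DskipTests clean package
    ./R/install-dev.sh
    ./R/run-tests.sh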

SparkR unit test failures on local master

2016-04-28 Thread Gayathri Murali
Hi All, I am running the SparkR unit tests (./R/run-tests.sh) on a local master branch and I am seeing the following issues with the SparkR ML wrapper test cases. Failed - 1. Error: glm and predict (@test_mllib.R#31)

Re: Tungsten off heap memory access for C++ libraries

2016-04-28 Thread jpivar...@gmail.com
jpivar...@gmail.com wrote > P.S. Concerning Java/C++ bindings, there are many. I tried JNI, JNA, > BridJ, and JavaCPP personally, but in the end picked JNA because of its > (comparatively) large user base. If Spark will be using Djinni, that could > be a symmetry-breaking consideration and I'll

Re: Tungsten off heap memory access for C++ libraries

2016-04-28 Thread jpivar...@gmail.com
Hi, I'm coming from the particle physics community and I'm also very interested in the development of this project. We have a huge C++ codebase and would like to start using the higher-level abstractions of Spark in our data analyses. To this end, I've been developing code that copies data from
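
For context, a minimal JNA sketch of the kind of binding being discussed; the library name "fastcalc" and function "process_doubles" are hypothetical, not from the thread:

    import com.sun.jna.{Library, Memory, Native}

    // Hypothetical native interface; the C side would be:
    //   void process_doubles(double* buf, int n);
    trait FastCalc extends Library {
      def process_doubles(buf: Memory, n: Int): Unit
    }

    val lib = Native.loadLibrary("fastcalc", classOf[FastCalc]).asInstanceOf[FastCalc]

    // Allocate off-heap memory, fill it, and hand the raw pointer to the native side
    val n = 1024
    val buf = new Memory(n * 8L)
    (0 until n).foreach(i => buf.setDouble(i * 8L, i.toDouble))
    lib.process_doubles(buf, n)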

Re: HDFS as Shuffle Service

2016-04-28 Thread Andrew Ray
Yes, HDFS has serious problems with creating lots of files. But we can always just create a single merged file on HDFS per task. On Apr 28, 2016 11:17 AM, "Reynold Xin" wrote: Hm while this is an attractive idea in theory, in practice I think you are substantially

Re: Spark ML - Scaling logistic regression for many features

2016-04-28 Thread Daniel Siegmann
FYI: https://issues.apache.org/jira/browse/SPARK-14464 I have submitted a PR as well. On Fri, Mar 18, 2016 at 7:15 AM, Nick Pentreath wrote: > No, I didn't yet - feel free to create a JIRA. > > > > On Thu, 17 Mar 2016 at 22:55 Daniel Siegmann

Re: HDFS as Shuffle Service

2016-04-28 Thread Mark Hamstra
Ah, got it. While that would be useful, it doesn't address the more general (and potentially even more beneficial) case where the total number of worker nodes is fully elastic. That already starts to push you in the direction of splitting Spark worker and HDFS data nodes into disjoint sets, and

Re: RDD.broadcast

2016-04-28 Thread Reynold Xin
This would be a nice feature for broadcast joins. It is just a little bit complicated to do and as a result hasn't been prioritized as highly yet. On Thu, Apr 28, 2016 at 5:51 AM, wrote: > I was aiming to show the operations with pseudo-code, but I apparently > failed,
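
For reference, the DataFrame API already exposes a broadcast hint for joins (org.apache.spark.sql.functions.broadcast, available since 1.5); the thread is about getting something analogous at the RDD level. A sketch with made-up data, assuming an existing SparkContext sc:

    import org.apache.spark.sql.SQLContext
    import org.apache.spark.sql.functions.broadcast

    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val largeDF = sc.parallelize(1 to 1000000).map(i => (i % 1000, i)).toDF("uid", "value")
    val smallDF = sc.parallelize(0 until 1000).map(i => (i, s"user_$i")).toDF("uid", "name")

    // broadcast() hints the planner to ship smallDF to every executor
    // instead of shuffling both sides
    val joined = largeDF.join(broadcast(smallDF), "uid")
    joined.explain()  // look for BroadcastHashJoin in the plan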

Re: HDFS as Shuffle Service

2016-04-28 Thread Michael Gummelt
Not disjoint. Colocated. By "shrinking", I don't mean any nodes are going away. I mean executors are decreasing in number, which is the case with dynamic allocation. HDFS nodes aren't decreasing in number though, and we can still colocate on those nodes, as always. On Thu, Apr 28, 2016 at

Re: HDFS as Shuffle Service

2016-04-28 Thread Michael Gummelt
Yea, it's an open question. I'm willing to create some benchmarks, but I'd first like to know that the feature would be accepted assuming the results are reasonable. Can a committer give me a thumbs up? On Thu, Apr 28, 2016 at 11:17 AM, Reynold Xin wrote: > Hm while this

Re: HDFS as Shuffle Service

2016-04-28 Thread Mark Hamstra
So you are only considering the case where your set of HDFS nodes is disjoint from your dynamic set of Spark Worker nodes? That would seem to be a pretty significant sacrifice of data locality. On Thu, Apr 28, 2016 at 11:15 AM, Michael Gummelt wrote: > > if after a

Re: HDFS as Shuffle Service

2016-04-28 Thread Reynold Xin
Hm while this is an attractive idea in theory, in practice I think you are substantially overestimating HDFS' ability to handle a lot of small, ephemeral files. It has never really been optimized for that use case. On Thu, Apr 28, 2016 at 11:15 AM, Michael Gummelt wrote:

Re: HDFS as Shuffle Service

2016-04-28 Thread Michael Gummelt
> if after a work-load burst your cluster dynamically changes from 10,000 workers to 1,000, will the typical HDFS replication factor be sufficient to retain access to the shuffle files in HDFS HDFS isn't resizing. Spark is. HDFS files should be HA and durable. On Thu, Apr 28, 2016 at 11:08 AM,

Re: HDFS as Shuffle Service

2016-04-28 Thread Mark Hamstra
Yes, replicated and distributed shuffle materializations are a key requirement to maintain performance in a fully elastic cluster where Executors aren't just reallocated across an essentially fixed number of Worker nodes, but rather the number of Workers itself is dynamic. Retaining the file

Re: HDFS as Shuffle Service

2016-04-28 Thread Michael Gummelt
> Why would you run the shuffle service on 10K nodes but Spark executors on just 100 nodes? wouldn't you also run that service just on the 100 nodes? We have to start the service beforehand, out of band, and we don't know a priori where the Spark executors will land. Those 100 executors could

Re: Using Spark when data definitions are unknowable at compile time

2016-04-28 Thread Dean Wampler
I would start with using DataFrames and the Row API, because you can fetch fields by index. Presumably, you'll parse the incoming data and determine what fields have what types, etc. Or, will someone specify the
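
A small sketch of that approach with the Spark 1.6 Row API (the field names and sqlContext setup are invented for illustration):

    import org.apache.spark.sql.{Row, SQLContext}
    import org.apache.spark.sql.types._

    val sqlContext = new SQLContext(sc)

    // Schema decided at runtime, e.g. after parsing the incoming data
    val schema = StructType(Seq(
      StructField("name", StringType),
      StructField("age", IntegerType)))

    val rows = sc.parallelize(Seq(Row("alice", 30), Row("bob", 25)))
    val df = sqlContext.createDataFrame(rows, schema)

    // Fetch fields positionally, with no compile-time knowledge of the schema
    df.rdd.map(row => (row.getString(0), row.getInt(1))).collect()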

Spark streaming concurrent job scheduling question

2016-04-28 Thread Renyi Xiong
Hi, I am trying to run an I/O-intensive RDD in parallel with a CPU-intensive RDD within an application, through a window like below (pseudo-code): var ssc = new StreamingContext(sc, 1min); var ds1 = ... var ds2 = ds1.Window(2min).ForeachRDD(...) ds1.ForeachRDD(...) I'd like ds1 to start its job at the 1-minute interval
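
By default Spark Streaming runs one output job at a time; the knob usually mentioned for this is spark.streaming.concurrentJobs, which is undocumented, so treat it with care. A rough Scala equivalent of the setup above, with made-up batch work:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Minutes, StreamingContext}

    val conf = new SparkConf()
      .setMaster("local[4]")
      .setAppName("concurrent-jobs")
      // Undocumented: allow more than one streaming output job to run at once
      .set("spark.streaming.concurrentJobs", "2")

    val ssc = new StreamingContext(conf, Minutes(1))
    val ds1 = ssc.socketTextStream("localhost", 9999)

    // CPU-intensive job, every 1-minute batch
    ds1.foreachRDD(rdd => println("records this batch: " + rdd.count()))

    // I/O-intensive job over a 2-minute window
    ds1.window(Minutes(2)).foreachRDD(rdd =>
      rdd.saveAsTextFile("/tmp/out-" + System.currentTimeMillis))

    ssc.start()
    ssc.awaitTermination()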

certification suite?

2016-04-28 Thread William Benton
Hi all, Does anyone happen to know what tests Databricks uses for the Spark distribution certification suite? Is it simply the tests that run as CI on Spark pull requests, or is there something more involved? The web site (

RE: RDD.broadcast

2016-04-28 Thread Ioannis.Deligiannis
I was aiming to show the operations with pseudo-code, but I apparently failed, so Java it is ☺ Assume the following 3 datasets on HDFS: 1. RDD1: User (1 million rows – 2 GB). Columns: uid, locationId, (extra stuff) 2. RDD2: Actions (1 billion rows – 500 GB). Columns: uid_1, uid_2

Re: RDD.broadcast

2016-04-28 Thread Marcin Tustin
I don't know what your notation really means. I'm very much unclear on why you can't use the filter method for 1. If you're talking about splitting/bucketing rather than filtering as such, I think that is a specific lacuna in Spark's API. I've generally found the join API to be entirely adequate for my
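
For what it's worth, the filter-based split looks like this; the caveat (and perhaps the lacuna) is that each filter is a separate pass over the data unless the RDD is cached first:

    val rdd = sc.parallelize(1 to 100)

    // Cache first, since each filter below triggers its own scan
    rdd.cache()
    val evens = rdd.filter(_ % 2 == 0)
    val odds  = rdd.filter(_ % 2 != 0)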

Re: RDD.broadcast

2016-04-28 Thread Mike Hynes
I second knowing the use case, out of interest. I can imagine a case where knowledge of the RDD key distribution would help local computations, for relatively few keys, but would be interested to hear your motive. Essentially, are you trying to achieve what would be an all-reduce type operation in

Re: RDD.broadcast

2016-04-28 Thread Marcin Tustin
Why would you ever need to do this? I'm genuinely curious. I view collects as being solely for interactive work. On Thursday, April 28, 2016, wrote: > Hi, > > > > It is a common pattern to process an RDD, collect (typically a subset) to > the driver and then

Re: Decrease shuffle in TreeAggregate with coalesce ?

2016-04-28 Thread Guillaume Pitel
Long story short: the performance issue appeared with a recompiled version of the source TGZ downloaded from the Spark website. The problem disappears with 1.6.2-SNAPSHOT (branch-1.6). Guillaume Do you have code which can reproduce this performance drop in treeReduce? It would be helpful

RDD.broadcast

2016-04-28 Thread Ioannis.Deligiannis
Hi, It is a common pattern to process an RDD, collect (typically a subset) to the driver and then broadcast back. Adding an RDD method that can do that using the torrent broadcast mechanics would be much more efficient. In addition, it would not require the Driver to also utilize its Heap
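
For readers skimming the thread, the pattern as it has to be written today looks roughly like this (a sketch with made-up data); the proposal is for something like rdd.broadcast() that builds the torrent broadcast directly from executors, skipping the driver round-trip:

    // Today: collect a subset to the driver, then broadcast it back out
    val subset = sc.parallelize(1 to 1000000)
      .filter(_ % 1000 == 0)
      .collect()                      // funnels everything through the driver heap

    val bc = sc.broadcast(subset.toSet)

    // Use the broadcast value executor-side
    val hits = sc.parallelize(1 to 1000000).filter(x => bc.value.contains(x))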

Re: Decrease shuffle in TreeAggregate with coalesce ?

2016-04-28 Thread Guillaume Pitel
On 27/04/2016 at 19:41, Joseph Bradley wrote: Do you have code which can reproduce this performance drop in treeReduce? It would be helpful to debug. In the 1.6 release, we profiled it via the various MLlib algorithms and did not see performance drops. That would be difficult, but if we
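
For readers following along, treeAggregate's depth parameter is the lever being discussed; it is a real RDD API, though the aggregation below is a toy example:

    val data = sc.parallelize(1 to 1000000, numSlices = 200)

    // depth controls how many levels of partial aggregation run on executors
    // before results reach the driver (the default is 2)
    val sum = data.treeAggregate(0L)(
      seqOp = (acc, x) => acc + x,
      combOp = _ + _,
      depth = 3)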

Re: HDFS as Shuffle Service

2016-04-28 Thread Sean Owen
Why would you run the shuffle service on 10K nodes but Spark executors on just 100 nodes? Wouldn't you also run that service just on the 100 nodes? What does plumbing it through HDFS buy you in comparison? There's some additional overhead, and if anything you lose some control over locality, in a