Re: Unit test logs in Jenkins?

2015-04-02 Thread Steve Loughran
> On 2 Apr 2015, at 06:31, Patrick Wendell wrote: > > Hey Marcelo, > > Great question. Right now, some of the more active developers have an > account that allows them to log into this cluster to inspect logs (we > copy the logs from each run to a node on that cluster). The > infrastructure is

org.spark-project.jetty and guava repo locations

2015-04-02 Thread Niranda Perera
Hi, I am looking for the org.spark-project.jetty and org.spark-project.guava repo locations, but I'm unable to find them in the Maven repository. Are these publicly available? Regards, -- Niranda

Support for cycles in spark streaming topology ?????

2015-04-02 Thread anshu shukla
I didn't find any documentation regarding support for cycles in a Spark topology, although Storm supports this using manual configuration in the acker function logic (setting it to a particular count). By cycles I don't mean infinite loops. Can anybody please help me with that? -- Thanks &

[sql] Dataframe how to check null values

2015-04-02 Thread Peter Rudenko
Hi, I need to implement MeanImputor: impute missing values with the mean. If I set missing values to null, DataFrame aggregation works properly, but a UDF treats null values as 0.0. Here’s an example: val df = sc.parallelize(Array(1.0,2.0, null, 3.0, 5.0, null)).toDF df.agg(avg("_1")).firs

Re: org.spark-project.jetty and guava repo locations

2015-04-02 Thread Ted Yu
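Take a look at the maven-shade-plugin in pom.xml. Here is the snippet for org.spark-project.jetty : <relocation> <pattern>org.eclipse.jetty</pattern> <shadedPattern>org.spark-project.jetty</shadedPattern> <includes> <include>org.eclipse.jetty.**</include> </includes> </relocation> On Thu, Apr 2, 2015 at 3:59 AM, Ni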

Re: [sql] Dataframe how to check null values

2015-04-02 Thread Dean Wampler
I'm afraid you're a little stuck. In Scala, the types Int, Long, Float, Double, Byte, and Boolean look like reference types in source code, but they are compiled to the corresponding JVM primitive types, which can't be null. That's why you get the warning about ==. It might be your best choice is
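
One way to keep a missing value distinguishable from 0.0 is to model it as Option[Double] instead of a primitive Double. The sketch below is plain Scala with no Spark dependency; the object MeanImputeSketch and its method names are hypothetical, and this is not necessarily the approach recommended in the thread.

```scala
// Hypothetical helper for illustration; plain Scala collections, no Spark.
object MeanImputeSketch {
  // Mean of the values that are present; None when every value is missing.
  def meanOfPresent(values: Seq[Option[Double]]): Option[Double] = {
    val present = values.flatten
    if (present.isEmpty) None else Some(present.sum / present.size)
  }

  // Replace each missing value with the mean of the present ones.
  // If nothing is present there is no mean, so the input is returned unchanged.
  def impute(values: Seq[Option[Double]]): Seq[Option[Double]] =
    meanOfPresent(values) match {
      case Some(m) => values.map(_.orElse(Some(m)))
      case None    => values
    }
}
```

Because None is a distinct value, a missing entry can never be mistaken for a real 0.0, which is the confusion a primitive Double invites.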

Re: Support for cycles in spark streaming topology ?????

2015-04-02 Thread Sean Owen
You can have diamonds but not cycles in the dependency graph. But what you are describing really sounds like simple iteration, since presumably you mean that the state of each element in the 'cycle' changes each time, and so isn't really the same element each time, and eventually you decide to sto
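
The "simple iteration" described here can be pictured as a driver-side loop that derives a new dataset from the previous one until a stopping condition holds. A minimal sketch in plain Scala, with a Seq standing in for an RDD; the object name, the halving rule, and the threshold in the usage line are assumptions made up for illustration:

```scala
// Hypothetical helper for illustration; a Seq stands in for an RDD.
object IterateSketch {
  // Apply `step` repeatedly until `done` holds or `maxRounds` is exhausted.
  // Every round builds a new collection from the previous one, so the
  // dependency graph is a chain of distinct datasets rather than a cycle.
  @scala.annotation.tailrec
  def iterate[A](state: Seq[A], maxRounds: Int)(step: Seq[A] => Seq[A])(done: Seq[A] => Boolean): Seq[A] =
    if (maxRounds <= 0 || done(state)) state
    else iterate(step(state), maxRounds - 1)(step)(done)
}
```

For example, `IterateSketch.iterate(Seq(8.0, 3.0), 10)(_.map(_ / 2))(_.forall(_ <= 1.0))` halves every element until all are at most 1.0.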

Re: One corrupt gzip in a directory of 100s

2015-04-02 Thread Romi Kuntsman
Hi Ted, Not sure what's the config value, I'm using s3n filesystem and not s3. The error that I get is the following: (so does that mean it's 4 retries?) Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 0.0 failed 4 times, most recent failure: Lost ta

Re: One corrupt gzip in a directory of 100s

2015-04-02 Thread Ted Yu
S3n is governed by the same config parameter. Cheers > On Apr 2, 2015, at 7:33 AM, Romi Kuntsman wrote: > > Hi Ted, > Not sure what's the config value, I'm using s3n filesystem and not s3. > > The error that I get is the following: > (so does that mean it's 4 retries?) > > Caused by: org.

Re: Unit test logs in Jenkins?

2015-04-02 Thread Nicholas Chammas
This is secondary to Marcelo’s question, but I wanted to comment on this: "Its main limitation is more cultural than technical: you need to get people to care about intermittent test runs, otherwise you can end up with failures that nobody keeps on top of." This is a problem that plagues Spark as we

Re: Unit test logs in Jenkins?

2015-04-02 Thread shane knapp
i agree with all of this. but can we please break up the tests and make them shorter? :) On Thu, Apr 2, 2015 at 8:54 AM, Nicholas Chammas wrote: > This is secondary to Marcelo’s question, but I wanted to comment on this: > > Its main limitation is more cultural than technical: you need to get

Test all the things (Was: Unit test logs in Jenkins?)

2015-04-02 Thread Nicholas Chammas
(Renaming thread so as to un-hijack Marcelo's request.) Sure, we definitely want tests running faster. Part of "testing all the things" will be factoring out stuff from the various builds that can be run just once. We've also tried in the past (with little success) to parallelize test execution

Re: Spark config option 'expression language' feedback request

2015-04-02 Thread Imran Rashid
IMO, Spark's config is kind of a mess right now. I completely agree with Reynold that Spark's handling of config ought to be super-simple; it's not the kind of thing we want to put much effort into in Spark itself. It sounds so trivial that everyone wants to redo it, but then all these additional featu

Re: Stochastic gradient descent performance

2015-04-02 Thread Joseph Bradley
It looks like SPARK-3250 was applied to the sample() which GradientDescent uses, and that should kick in for your minibatchFraction <= 0.4. Based on your numbers, aggregation seems like the main issue, though I hesitate to optimize aggregation based on local tests for data sizes that small. The f

Re: [sql] Dataframe how to check null values

2015-04-02 Thread Reynold Xin
Incidentally, we were discussing this yesterday. Here are some thoughts on null handling in SQL/DataFrames. Would be great to get some feedback. 1. Treat floating point NaN and null as the same "null" value. This would be consistent with most SQL databases, and Pandas. This would also require some
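
To see why point 1 matters for aggregates, note that NaN propagates through floating-point arithmetic, so a single NaN turns a plain sum or mean into NaN unless it is treated as missing. The sketch below shows this in plain Scala only; it does not reproduce Spark SQL's behavior, and the object and method names are hypothetical.

```scala
// Hypothetical helper for illustration; plain Scala, not Spark SQL semantics.
object NanAsNullSketch {
  // Mean that treats NaN entries as missing; None if nothing usable remains.
  def meanIgnoringNaN(xs: Seq[Double]): Option[Double] = {
    val usable = xs.filterNot(_.isNaN)
    if (usable.isEmpty) None else Some(usable.sum / usable.size)
  }
}
```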

RE: Stochastic gradient descent performance

2015-04-02 Thread Ulanov, Alexander
Hi Joseph, Thank you for the suggestion! It seems that instead of sample it is better to shuffle data and then access it sequentially by mini-batches. Could you suggest how to implement it? With regard to aggregate (reduce), I am wondering why it is so slow in local mode. Could you elaborate on
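
For a local collection, the shuffle-then-batch idea can be sketched as follows. This is only an illustration on an in-memory Seq, with a hypothetical object name and parameters; it says nothing about how well the approach carries over to an RDD on a cluster, which is the open question in this thread.

```scala
import scala.util.Random

// Hypothetical helper for illustration; works on an in-memory Seq, not an RDD.
object MiniBatchSketch {
  // Shuffle once with a fixed seed, then cut the shuffled data into
  // consecutive mini-batches; the last batch may be smaller than batchSize.
  def batches[A](data: Seq[A], batchSize: Int, seed: Long): Seq[Seq[A]] = {
    require(batchSize > 0, "batchSize must be positive")
    new Random(seed).shuffle(data).grouped(batchSize).toVector
  }
}
```

Each element lands in exactly one batch per pass, whereas independent sampling per step can repeat or skip elements.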

Re: Unit test logs in Jenkins?

2015-04-02 Thread Marcelo Vanzin
On Thu, Apr 2, 2015 at 3:01 AM, Steve Loughran wrote: >>> That would be really helpful to debug build failures. The scalatest >>> output isn't all that helpful. >>> > > Potentially an issue with the test runner, rather than the tests themselves. Sorry, that was me over-generalizing. The output is

Re: Stochastic gradient descent performance

2015-04-02 Thread Joseph Bradley
When you say "It seems that instead of sample it is better to shuffle data and then access it sequentially by mini-batches," are you sure that holds true for a big dataset in a cluster? As far as implementing it, I haven't looked carefully at GapSamplingIterator (in RandomSampler.scala) myself, bu

Re: Test all the things (Was: Unit test logs in Jenkins?)

2015-04-02 Thread shane knapp
cool. FYI, i'm at databricks today and talked w/patrick, josh and davies about this. we have some great ideas to actually make this happen and will be pushing over the next few weeks to get it done. :) On Thu, Apr 2, 2015 at 9:21 AM, Nicholas Chammas wrote: > (Renaming thread so as to un-hija

Re: Stochastic gradient descent performance

2015-04-02 Thread Shivaram Venkataraman
I haven't looked closely at the sampling issues, but regarding the aggregation latency, there are fixed overheads (in local and distributed mode) with the way aggregation is done in Spark. Launching a stage of tasks, fetching outputs from the previous stage etc. all have overhead, so I would say it

RE: Stochastic gradient descent performance

2015-04-02 Thread Ulanov, Alexander
Hi Shivaram, It sounds really interesting! With this figure we can estimate whether it is worth running an iterative algorithm on Spark. For example, for SGD on Imagenet (450K samples) we will spend 450K*50ms=6.25 hours to traverse all data one example at a time, not considering the data loading, comp
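
The estimate is a single multiplication: 450K steps at 50 ms of fixed overhead each comes to 22,500 seconds, or 6.25 hours. A small sketch for trying other sample counts or overheads; the object and parameter names are hypothetical:

```scala
// Hypothetical helper for illustration.
object PassTimeSketch {
  // Hours for one pass when each of `steps` updates pays `overheadMs`
  // milliseconds of fixed latency (data loading and compute not included).
  def hours(steps: Long, overheadMs: Double): Double =
    steps * overheadMs / 1000.0 / 3600.0
}
```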

Re: Using CUDA within Spark / boosting linear algebra

2015-04-02 Thread Xiangrui Meng
This is great! Thanks! -Xiangrui On Wed, Apr 1, 2015 at 12:11 PM, Ulanov, Alexander wrote: > FYI, I've added instructions to Netlib-java wiki, Sam added the link to them > from the project's readme.md > https://github.com/fommil/netlib-java/wiki/NVBLAS > > Best regards, Alexander > -Original

Re: Using CUDA within Spark / boosting linear algebra

2015-04-02 Thread Evan R. Sparks
Yeah, thanks Alex! On Thu, Apr 2, 2015 at 5:05 PM, Xiangrui Meng wrote: > This is great! Thanks! -Xiangrui > > On Wed, Apr 1, 2015 at 12:11 PM, Ulanov, Alexander > wrote: > > FYI, I've added instructions to Netlib-java wiki, Sam added the link to > them from the project's readme.md > > https://

Re: Support for cycles in spark streaming topology ?????

2015-04-02 Thread Tathagata Das
Just to add to that, DStream.transform allows you to apply an arbitrary RDD-to-RDD function. Inside that you can do iterative RDD operations as well. On Thu, Apr 2, 2015 at 6:27 AM, Sean Owen wrote: > You can have diamonds but not cycles in the dependency graph. > > But what you are describing really