Join operation on DStreams

2015-09-21 Thread guoxu1231
Hi Spark Experts, I'm trying to use join(otherStream, [numTasks]) on DStreams, which must be called on two DStreams of (K, V) and (K, W) pairs. Usually on a plain RDD we could use keyBy(f) to build the (K, V) pair, however I could not find it on DStream. My question is: What is the
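
A minimal sketch of the usual workaround, assuming two hypothetical case classes and streams: map plays the role of keyBy(f) by pairing each element with its key, after which the pair-DStream join applies.

    import org.apache.spark.streaming.dstream.DStream

    case class Order(userId: String, amount: Double)
    case class Profile(userId: String, country: String)

    def joinByUser(orders: DStream[Order],
                   profiles: DStream[Profile]): DStream[(String, (Order, Profile))] = {
      // map builds the (K, V) shape that join expects, like keyBy(f) on an RDD
      val keyedOrders = orders.map(o => (o.userId, o))
      val keyedProfiles = profiles.map(p => (p.userId, p))
      keyedOrders.join(keyedProfiles)
    }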

RE: SparkR package path

2015-09-21 Thread Sun, Rui
Hossein, Any strong reason to download and install the SparkR source package separately from the Spark distribution? An R user can simply download the Spark distribution, which contains the SparkR source and binary packages, and directly use sparkR. No need to install the SparkR package at all. From:

Re: Why there is no snapshots for 1.5 branch?

2015-09-21 Thread Mark Hamstra
Yeah, whoever is maintaining the scripts and snapshot builds has fallen down on the job -- but there is nothing preventing you from checking out branch-1.5 and creating your own build, which is arguably a smarter thing to do anyway. If I'm going to use a non-release build, then I want the full

Re: Why there is no snapshots for 1.5 branch?

2015-09-21 Thread Fengdong Yu
Do you mean you want to publish the artifact to your private repository? If so, please use ‘sbt publish’ and add the following in your build.sbt: publishTo := { val nexus = "https://YOUR_PRIVATE_REPO_HOSTS/"; if (version.value.endsWith("SNAPSHOT")) Some("snapshots" at nexus +
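
The truncated fragment above, reconstructed as a full build.sbt stanza — a sketch only; the repository host and repository paths are placeholders to adapt to your own Nexus layout:

    publishTo := {
      val nexus = "https://YOUR_PRIVATE_REPO_HOST/"
      if (version.value.endsWith("SNAPSHOT"))
        Some("snapshots" at nexus + "content/repositories/snapshots")
      else
        Some("releases" at nexus + "content/repositories/releases")
    }

    // credentials for the Nexus instance, kept outside the build file
    credentials += Credentials(Path.userHome / ".ivy2" / ".credentials")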

Re: Why there is no snapshots for 1.5 branch?

2015-09-21 Thread Bin Wang
But I cannot find 1.5.1-SNAPSHOT either at https://repository.apache.org/content/groups/snapshots/org/apache/spark/spark-core_2.10/ Mark Hamstra wrote on Tue, Sep 22, 2015 at 12:55 PM: > There is no 1.5.0-SNAPSHOT because 1.5.0 has already been released. The > current head of

Why there is no snapshots for 1.5 branch?

2015-09-21 Thread Bin Wang
I'd like to use some important bug fixes in the 1.5 branch, and I looked on the Apache Maven host but couldn't find any snapshot for the 1.5 branch. https://repository.apache.org/content/groups/snapshots/org/apache/spark/spark-core_2.10/1.5.0-SNAPSHOT/ I can find 1.4.X and 1.6.0 versions, so why is there no

Re: Why there is no snapshots for 1.5 branch?

2015-09-21 Thread Bin Wang
My project is using sbt (or maven), which needs to download dependencies from a Maven repo. I have my own private Maven repo with Nexus, but I don't know how to push my own build to it; can you give me a hint? Mark Hamstra wrote on Tue, Sep 22, 2015 at 1:25 PM: > Yeah, whoever is

Re: Why there is no snapshots for 1.5 branch?

2015-09-21 Thread Bin Wang
However, I found some scripts in dev/audit-release; can I use them? Bin Wang wrote on Tue, Sep 22, 2015 at 1:34 PM: > No, I mean push Spark to my private repository. Spark doesn't have a > build.sbt as far as I can see. > > Fengdong Yu wrote on Tue, Sep 22, 2015 at 1:29 PM: > >> Do you

Re: DataFrames Aggregate does not spill?

2015-09-21 Thread Reynold Xin
What's the plan if you run explain? In 1.5 the default should be TungstenAggregate, which does spill (switching from hash-based aggregation to sort-based aggregation). On Mon, Sep 21, 2015 at 5:34 PM, Matt Cheah wrote: > Hi everyone, > > I’m debugging some slowness and
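
A quick way to check, sketched against a hypothetical df with key/value columns (the column names are illustrative):

    import org.apache.spark.sql.functions.{avg, sum}

    val aggregated = df.groupBy("key").agg(sum("value"), avg("value"))
    aggregated.explain()  // on 1.5, look for TungstenAggregate in the physical plan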

Re: passing SparkContext as parameter

2015-09-21 Thread Priya Ch
Can I use this sparkContext on executors? In my application, I have a scenario of reading certain records from a DB into an RDD, hence I need the sparkContext to read from the DB (Cassandra in our case). If the sparkContext can't be sent to executors, what is the workaround for this? On Mon, Sep 21,

Re: passing SparkContext as parameter

2015-09-21 Thread Romi Kuntsman
sparkContext is available on the driver, not on executors. To read from Cassandra, you can use something like this: https://github.com/datastax/spark-cassandra-connector/blob/master/doc/2_loading.md *Romi Kuntsman*, *Big Data Engineer* http://www.totango.com On Mon, Sep 21, 2015 at 2:27 PM,
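
A minimal sketch following the linked connector docs (keyspace, table, and column names are placeholders); the point is that the executors never need the SparkContext itself:

    import com.datastax.spark.connector._  // adds cassandraTable to SparkContext

    val rows = sc.cassandraTable("my_keyspace", "my_table")  // RDD[CassandraRow]
    rows.map(row => row.getString("some_column")).take(10).foreach(println)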

Re: passing SparkContext as parameter

2015-09-21 Thread Ted Yu
You can use a broadcast variable for passing connection information. Cheers > On Sep 21, 2015, at 4:27 AM, Priya Ch wrote: > > Can I use this sparkContext on executors? > In my application, I have a scenario of reading certain records from a DB into > an RDD. Hence I
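
A minimal sketch of this idea: broadcast plain connection parameters rather than the (non-serializable) SparkContext, and open connections on the executors. ConnectionInfo and DbClient are hypothetical stand-ins for your own driver code, and rdd is assumed to exist already.

    case class ConnectionInfo(host: String, port: Int, keyspace: String)

    val connInfo = sc.broadcast(ConnectionInfo("cassandra-host", 9042, "my_keyspace"))

    val enriched = rdd.mapPartitions { iter =>
      val client = DbClient.connect(connInfo.value)  // one connection per partition
      try iter.map(record => client.lookup(record)).toList.iterator
      finally client.close()
    }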

Forecasting Library For Apache Spark

2015-09-21 Thread Mohamed Baddar
Hello everybody, this is my first mail on the list, and I would like to introduce myself first :) My name is Mohamed Baddar. I work as a Big Data and Analytics Software Engineer at BADRIT (http://badrit.com/), a software startup with a focus on Big Data; I have also been working for 6+ years at IBM

Re: Forecasting Library For Apache Spark

2015-09-21 Thread Mohamed Baddar
Thanks Corey for the suggestion, I will check it out. On Mon, Sep 21, 2015 at 2:43 PM, Corey Nolet wrote: > Mohamed, > > Have you checked out the Spark Timeseries [1] project? Non-seasonal ARIMA > was added to this recently and seasonal ARIMA should be following shortly. > > [1]

Re: Forecasting Library For Apache Spark

2015-09-21 Thread Corey Nolet
Mohamed, Have you checked out the Spark Timeseries [1] project? Non-seasonal ARIMA was added to this recently and seasonal ARIMA should be following shortly. [1] https://github.com/cloudera/spark-timeseries On Mon, Sep 21, 2015 at 7:47 AM, Mohamed Baddar wrote: >
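
A minimal sketch against the spark-timeseries ARIMA API as it stood around this time — the package path and exact signatures are assumptions, so check the project README before relying on them:

    import org.apache.spark.mllib.linalg.Vectors
    import com.cloudera.sparkts.models.ARIMA  // assumed location of the ARIMA model

    val ts = Vectors.dense(12.0, 14.1, 13.7, 15.2, 16.0, 15.5, 17.1, 18.3)
    val model = ARIMA.fitModel(1, 0, 1, ts)  // non-seasonal ARIMA(p=1, d=0, q=1)
    println(model.forecast(ts, 5))           // forecast 5 steps ahead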

Re: DataFrames Aggregate does not spill?

2015-09-21 Thread Matt Cheah
I was executing on Spark 1.4 so I didn’t notice the Tungsten option would make spilling happen in 1.5. I’ll upgrade to 1.5 and see how that turns out. Thanks! From: Reynold Xin Date: Monday, September 21, 2015 at 5:36 PM To: Matt Cheah Cc:

Re: How to modify Hadoop APIs used by Spark?

2015-09-21 Thread Dogtail L
Oh, I want to modify the existing Hadoop InputFormat. On Mon, Sep 21, 2015 at 4:23 PM, Ted Yu wrote: > Can you clarify what you want to do: > If you modify the existing Hadoop InputFormat, etc., it would be a matter of > rebuilding Hadoop and building Spark using the custom-built

DataFrames Aggregate does not spill?

2015-09-21 Thread Matt Cheah
Hi everyone, I’m debugging some slowness and apparent memory pressure + GC issues after I ported some workflows from raw RDDs to Data Frames. In particular, I’m looking into an aggregation workflow that computes many aggregations per key at once. My workflow before was doing a fairly

SparkR package path

2015-09-21 Thread Hossein
Hi dev list, The SparkR backend assumes the SparkR source files are located under "SPARK_HOME/R/lib/." This directory is created by running R/install-dev.sh. This setting makes sense for Spark developers, but if an R user downloads and installs the SparkR source package, the source files are going to be in

Re: Why there is no snapshots for 1.5 branch?

2015-09-21 Thread Mark Hamstra
There is no 1.5.0-SNAPSHOT because 1.5.0 has already been released. The current head of branch-1.5 is 1.5.1-SNAPSHOT -- soon to be 1.5.1 release candidates and then the 1.5.1 release. On Mon, Sep 21, 2015 at 9:51 PM, Bin Wang wrote: > I'd like to use some important bug fixes

Re: Why there is no snapshots for 1.5 branch?

2015-09-21 Thread Bin Wang
No, I mean push Spark to my private repository. Spark doesn't have a build.sbt as far as I can see. Fengdong Yu wrote on Tue, Sep 22, 2015 at 1:29 PM: > Do you mean you want to publish the artifact to your private repository? > > If so, please use ‘sbt publish’ > > and add the following in

Spark SQL DataFrame 1.5.0 is extremely slow for take(1) or head() or first()

2015-09-21 Thread Jerry Lam
Hi Spark Developers, I just ran some very simple operations on a dataset. I was surprised by the execution plan of take(1), head() or first(). For your reference, this is what I did in pyspark 1.5: df=sqlContext.read.parquet("someparquetfiles") df.head() The above lines take over 15 minutes. I

Test workflow - blacklist entire suites and run any independently

2015-09-21 Thread Adam Roberts
Hi, is there an existing way to blacklist any test suite? Ideally we'd have a text file with a series of names (let's say comma separated) and if a name matches the fully qualified class name of a suite, that suite will be skipped. Perhaps we can achieve this via ScalaTest or Maven?
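
One mechanism that already exists in ScalaTest, sketched here: tagging a whole suite with @Ignore skips it without any blacklist file (the suite name is illustrative):

    import org.scalatest.{FunSuite, Ignore}

    @Ignore  // every test in this suite is skipped and reported as ignored
    class FlakyNetworkSuite extends FunSuite {
      test("something environment-specific") { /* ... */ }
    }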

Re: Unsubscribe

2015-09-21 Thread Richard Hillegas
To unsubscribe from the dev list, please send a message to dev-unsubscr...@spark.apache.org as described here: http://spark.apache.org/community.html#mailing-lists. Thanks, -Rick Dulaj Viduranga wrote on 09/21/2015 10:15:58 AM: > From: Dulaj Viduranga

Re: Spark SQL DataFrame 1.5.0 is extremely slow for take(1) or head() or first()

2015-09-21 Thread Jerry Lam
I just noticed you found 1.4 has the same issue. I added that to the ticket as well. On Mon, Sep 21, 2015 at 1:43 PM, Jerry Lam wrote: > Hi Yin, > > You are right! I just tried the scala version with the above lines, it > works as expected. > I'm not sure if it happens

Re: Spark SQL DataFrame 1.5.0 is extremely slow for take(1) or head() or first()

2015-09-21 Thread Jerry Lam
Hi Yin, You are right! I just tried the Scala version with the above lines; it works as expected. I'm not sure if it also happens in 1.4 for pyspark, but I thought the pyspark code just calls the Scala code via Py4J. I didn't expect this bug to be pyspark-specific. That surprises me actually a

Re: Spark SQL DataFrame 1.5.0 is extremely slow for take(1) or head() or first()

2015-09-21 Thread Yin Huai
Looks like the problem is that df.rdd does not work very well with limit. In Scala, df.limit(1).rdd will also trigger the issue you observed. I will add this to the JIRA. On Mon, Sep 21, 2015 at 10:44 AM, Jerry Lam wrote: > I just noticed you found 1.4 has the same issue. I
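
A sketch of the two paths being compared, grounded in Yin's observation above and reusing the parquet path from Jerry's report:

    val df = sqlContext.read.parquet("someparquetfiles")
    df.take(1)               // fast: the physical plan carries the limit
    df.limit(1).rdd.count()  // slow: going through df.rdd does not push the limit down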

Re: Spark SQL DataFrame 1.5.0 is extremely slow for take(1) or head() or first()

2015-09-21 Thread Yin Huai
Seems 1.4 has the same issue. On Mon, Sep 21, 2015 at 10:01 AM, Yin Huai wrote: > btw, does 1.4 have the same problem? > > On Mon, Sep 21, 2015 at 10:01 AM, Yin Huai wrote: > >> Hi Jerry, >> >> Looks like it is a Python-specific issue. Can you create

Re: Spark SQL DataFrame 1.5.0 is extremely slow for take(1) or head() or first()

2015-09-21 Thread Yin Huai
btw, does 1.4 have the same problem? On Mon, Sep 21, 2015 at 10:01 AM, Yin Huai wrote: > Hi Jerry, > > Looks like it is a Python-specific issue. Can you create a JIRA? > > Thanks, > > Yin > > On Mon, Sep 21, 2015 at 8:56 AM, Jerry Lam wrote: > >> Hi

Unsubscribe

2015-09-21 Thread Dulaj Viduranga
Unsubscribe - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org

Re: Spark SQL DataFrame 1.5.0 is extremely slow for take(1) or head() or first()

2015-09-21 Thread Yin Huai
Hi Jerry, Looks like it is a Python-specific issue. Can you create a JIRA? Thanks, Yin On Mon, Sep 21, 2015 at 8:56 AM, Jerry Lam wrote: > Hi Spark Developers, > > I just ran some very simple operations on a dataset. I was surprise by the > execution plan of take(1),

Re: JENKINS: downtime next week, wed and thurs mornings (9-23 and 9-24)

2015-09-21 Thread shane knapp
quick update: we actually did some of the maintenance on our systems after the berkeley-wide outage caused by one of our (non-jenkins) servers halting and catching fire. we'll still have some downtime early wednesday, but tomorrow's will be cancelled. i'll send out another update real soon now

Re: how to send additional configuration to the RDD after it was lazily created

2015-09-21 Thread Romi Kuntsman
What new information do you know after creating the RDD that you didn't know at the time of its creation? I think the whole point is that an RDD is immutable; you can't change it once it has been created. Perhaps you need to refactor your logic to know the parameters earlier, or create a whole new RDD

Re: How to modify Hadoop APIs used by Spark?

2015-09-21 Thread Ted Yu
Can you clarify what you want to do: If you modify the existing Hadoop InputFormat, etc., it would be a matter of rebuilding Hadoop and building Spark using the custom-built Hadoop as a dependency. Or do you introduce a new InputFormat? Cheers On Mon, Sep 21, 2015 at 1:20 PM, Dogtail Ray
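
For the second case, a minimal sketch of how a new InputFormat plugs straight into Spark without rebuilding Hadoop — MyInputFormat is a hypothetical subclass of the new-API FileInputFormat, and the input path is illustrative:

    import org.apache.hadoop.io.{LongWritable, Text}

    // MyInputFormat extends org.apache.hadoop.mapreduce.lib.input.FileInputFormat[LongWritable, Text]
    val rdd = sc.newAPIHadoopFile[LongWritable, Text, MyInputFormat]("hdfs:///data/input")
    rdd.map { case (_, text) => text.toString }.take(5).foreach(println)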

Re: Null Value in DecimalType column of DataFrame

2015-09-21 Thread Reynold Xin
+dev list Hi Dirceu, The answer to whether throwing an exception or returning null is better depends on your use case. If you are debugging and want to find bugs in your program, you might prefer throwing an exception. However, if you are running on a large real-world dataset (i.e. data is
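
A minimal sketch of the behaviour being discussed, assuming a hypothetical df with a string "price" column: a failed cast to DecimalType yields null instead of throwing, and the bad rows can be pulled out separately for inspection.

    import org.apache.spark.sql.functions.col
    import org.apache.spark.sql.types.DecimalType

    val parsed = df.withColumn("price_dec", col("price").cast(DecimalType(10, 2)))
    // rows where the original value was present but failed to parse
    val badRows = parsed.filter(col("price_dec").isNull && col("price").isNotNull)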

How to modify Hadoop APIs used by Spark?

2015-09-21 Thread Dogtail Ray
Hi all, I find that Spark uses some Hadoop APIs such as InputFormat, InputSplit, etc., and I want to modify these Hadoop APIs. Do you know how I can integrate my modified Hadoop code into Spark? Great thanks!

Re: Test workflow - blacklist entire suites and run any independently

2015-09-21 Thread Adam Roberts
Thanks Josh, I should have added that we've tried -DwildcardSuites with Maven and we use this helpful feature regularly (although this does result in building plenty of tests and running other tests in other modules too), so I'm wondering if there's a more "streamlined" way - e.g. with JUnit