Hi Spark Experts,
I'm trying to use join(otherStream, [numTasks]) on DStreams, and it
requires being called on two DStreams of (K, V) and (K, W) pairs.
With a plain RDD we can use keyBy(f) to build the (K, V) pairs;
however, I could not find keyBy in DStream.
My question is:
What is the
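(For anyone landing on this thread: DStream has no keyBy, but the same (K, V) pairing can be built with map, e.g. `stream.map(rec => (keyFn(rec), rec))` in Scala. A plain-Python sketch of the per-batch semantics, with hypothetical records and key function, just to illustrate what join expects:)

```python
# Plain-Python sketch of what keyBy + join do on a single micro-batch.
# Records and key function below are hypothetical; in Spark you would write
# stream.map(lambda r: (key_fn(r), r)) since DStream lacks keyBy.
from collections import defaultdict

def key_by(records, key_fn):
    """keyBy equivalent: pair each record with its key."""
    return [(key_fn(r), r) for r in records]

def inner_join(left, right):
    """Inner join of (K, V) and (K, W) lists, like DStream.join on one batch."""
    index = defaultdict(list)
    for k, w in right:
        index[k].append(w)
    return [(k, (v, w)) for k, v in left for w in index.get(k, [])]

users = key_by([{"id": 1, "name": "a"}, {"id": 2, "name": "b"}],
               lambda r: r["id"])
clicks = key_by([{"id": 1, "url": "/x"}], lambda r: r["id"])
print(inner_join(users, clicks))  # only key 1 appears in both streams
```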
Hossein,
Is there any strong reason to download and install the SparkR source package separately
from the Spark distribution?
An R user can simply download the Spark distribution, which contains the SparkR
source and binary package, and use SparkR directly. There is no need to install the
SparkR package at all.
From:
Yeah, whoever is maintaining the scripts and snapshot builds has fallen
down on the job -- but there is nothing preventing you from checking out
branch-1.5 and creating your own build, which is arguably a smarter thing
to do anyway. If I'm going to use a non-release build, then I want the
full
Do you mean you want to publish the artifact to your private repository?
if so, please use ‘sbt publish’
and add the following to your build.sbt:
publishTo := {
  val nexus = "https://YOUR_PRIVATE_REPO_HOSTS/"  // placeholder host
  if (version.value.endsWith("SNAPSHOT"))
    Some("snapshots" at nexus + "content/repositories/snapshots")  // standard Nexus paths
  else
    Some("releases" at nexus + "service/local/staging/deploy/maven2")
}
But I cannot find 1.5.1-SNAPSHOT either at
https://repository.apache.org/content/groups/snapshots/org/apache/spark/spark-core_2.10/
Mark Hamstra wrote on Tuesday, September 22, 2015 at 12:55 PM:
> There is no 1.5.0-SNAPSHOT because 1.5.0 has already been released. The
> current head of
I'd like to use some important bug fixes in the 1.5 branch, and I looked in the
Apache Maven snapshot repository, but I can't find any snapshot for the 1.5 branch.
https://repository.apache.org/content/groups/snapshots/org/apache/spark/spark-core_2.10/1.5.0-SNAPSHOT/
I can find 1.4.x and 1.6.0 versions; why is there no
My project uses sbt (or Maven), which needs to download dependencies from
a Maven repo. I have my own private Maven repo with Nexus, but I don't know
how to push my own build to it. Can you give me a hint?
Mark Hamstra wrote on Tuesday, September 22, 2015 at 1:25 PM:
> Yeah, whoever is
However, I found some scripts in dev/audit-release; can I use them?
Bin Wang wrote on Tuesday, September 22, 2015 at 1:34 PM:
> No, I mean push Spark to my private repository. Spark doesn't have a
> build.sbt as far as I can see.
>
> Fengdong Yu wrote on Tuesday, September 22, 2015 at 1:29 PM:
>
>> Do you
What's the plan if you run explain()?
In 1.5 the default should be TungstenAggregate, which does spill (switching
from hash-based aggregation to sort-based aggregation).
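(A toy model of that fallback for readers unfamiliar with it; the key-count threshold and the sum aggregation below are made up, and the real TungstenAggregate implementation is far more involved:)

```python
from collections import defaultdict

def aggregate_with_spill(pairs, max_keys_in_memory):
    """Toy model: hash-based aggregation that 'spills' sorted runs when the
    hash map exceeds a budget, then merges the runs (sort-based fallback)."""
    acc, runs = {}, []
    for k, v in pairs:
        if k not in acc and len(acc) >= max_keys_in_memory:
            runs.append(sorted(acc.items()))  # spill a sorted partial run
            acc = {}
        acc[k] = acc.get(k, 0) + v
    runs.append(sorted(acc.items()))
    merged = defaultdict(int)                 # merge phase over the sorted runs
    for run in runs:
        for k, v in run:
            merged[k] += v
    return dict(merged)

print(aggregate_with_spill([("a", 1), ("b", 2), ("a", 3), ("c", 4)], 2))
```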
On Mon, Sep 21, 2015 at 5:34 PM, Matt Cheah wrote:
> Hi everyone,
>
> I’m debugging some slowness and
Can I use this SparkContext on executors?
In my application, I have a scenario of reading certain records from a DB
(Cassandra in our case) into an RDD, hence I need the SparkContext to read
from the DB. If the SparkContext can't be sent to executors, what is the
workaround for this?
On Mon, Sep 21,
SparkContext is available on the driver, not on executors.
To read from Cassandra, you can use something like this:
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/2_loading.md
*Romi Kuntsman*, *Big Data Engineer*
http://www.totango.com
On Mon, Sep 21, 2015 at 2:27 PM,
You can use broadcast variable for passing connection information.
Cheers
> On Sep 21, 2015, at 4:27 AM, Priya Ch wrote:
>
> can i use this sparkContext on executors ??
> In my application, i have scenario of reading from db for certain records in
> rdd. Hence I
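(The pattern suggested here, broadcasting the serializable connection *info* and building the actual connection lazily on each worker, can be sketched in plain Python; all names below are hypothetical:)

```python
# Hypothetical sketch: ship serializable connection info (which is what a
# broadcast variable can carry), and open the real, non-serializable
# connection lazily, once per worker process.
conn_info = {"hosts": ["cassandra-1"], "keyspace": "ks"}  # broadcast-able

class Connection:
    """Stand-in for a real DB connection (not serializable in practice)."""
    def __init__(self, info):
        self.host = info["hosts"][0]
    def read(self, key):
        return f"row-{key}@{self.host}"

_conn = None  # per-worker singleton

def get_connection(info):
    global _conn
    if _conn is None:
        _conn = Connection(info)
    return _conn

def process_partition(keys, info):
    """What a mapPartitions function would do on an executor."""
    conn = get_connection(info)
    return [conn.read(k) for k in keys]

print(process_partition([1, 2], conn_info))
```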
Hello everybody, this is my first mail on the list, and I would like to
introduce myself first :)
My name is Mohamed Baddar. I work as a Big Data and Analytics Software
Engineer at BADRIT (http://badrit.com/), a software startup with a focus on
Big Data. I have also been working for 6+ years at IBM.
Thanks Corey for the suggestion, I will check it out.
On Mon, Sep 21, 2015 at 2:43 PM, Corey Nolet wrote:
> Mohamed,
>
> Have you checked out the Spark Timeseries [1] project? Non-seasonal ARIMA
> was added to this recently and seasonal ARIMA should be following shortly.
>
> [1]
Mohamed,
Have you checked out the Spark Timeseries [1] project? Non-seasonal ARIMA
was added to this recently and seasonal ARIMA should be following shortly.
[1] https://github.com/cloudera/spark-timeseries
On Mon, Sep 21, 2015 at 7:47 AM, Mohamed Baddar
wrote:
>
I was executing on Spark 1.4 so I didn't notice the Tungsten option would
make spilling happen in 1.5. I'll upgrade to 1.5 and see how that turns out.
Thanks!
From: Reynold Xin
Date: Monday, September 21, 2015 at 5:36 PM
To: Matt Cheah
Cc:
Oh, I want to modify existing Hadoop InputFormat.
On Mon, Sep 21, 2015 at 4:23 PM, Ted Yu wrote:
> Can you clarify what you want to do:
> If you modify existing hadoop InputFormat, etc, it would be a matter of
> rebuilding hadoop and build Spark using the custom built
Hi everyone,
I'm debugging some slowness and apparent memory pressure + GC issues after I
ported some workflows from raw RDDs to Data Frames. In particular, I'm
looking into an aggregation workflow that computes many aggregations per key
at once.
My workflow before was doing a fairly
Hi dev list,
SparkR backend assumes SparkR source files are located under
"SPARK_HOME/R/lib/." This directory is created by running R/install-dev.sh.
This setting makes sense for Spark developers, but if an R user downloads
and installs the SparkR source package, the source files are going to be in
There is no 1.5.0-SNAPSHOT because 1.5.0 has already been released. The
current head of branch-1.5 is 1.5.1-SNAPSHOT -- soon to be 1.5.1 release
candidates and then the 1.5.1 release.
On Mon, Sep 21, 2015 at 9:51 PM, Bin Wang wrote:
> I'd like to use some important bug fixes
No, I mean push Spark to my private repository. Spark doesn't have a
build.sbt as far as I can see.
Fengdong Yu wrote on Tuesday, September 22, 2015 at 1:29 PM:
> Do you mean you want to publish the artifact to your private repository?
>
> if so, please using ‘sbt publish’
>
> add the following in
Hi Spark Developers,
I just ran some very simple operations on a dataset. I was surprised by the
execution plan of take(1), head() or first().
For your reference, this is what I did in pyspark 1.5:
df=sqlContext.read.parquet("someparquetfiles")
df.head()
The above lines take over 15 minutes. I
Hi, is there an existing way to blacklist any test suite?
Ideally we'd have a text file with a series of names (let's say comma
separated) and if a name matches with the fully qualified class name for a
suite, this suite will be skipped.
Perhaps we can achieve this via ScalaTest or Maven?
To unsubscribe from the dev list, please send a message to
dev-unsubscr...@spark.apache.org as described here:
http://spark.apache.org/community.html#mailing-lists.
Thanks,
-Rick
Dulaj Viduranga wrote on 09/21/2015 10:15:58 AM:
> From: Dulaj Viduranga
I just noticed you found 1.4 has the same issue. I added that as well in
the ticket.
On Mon, Sep 21, 2015 at 1:43 PM, Jerry Lam wrote:
> Hi Yin,
>
> You are right! I just tried the scala version with the above lines, it
> works as expected.
> I'm not sure if it happens
Hi Yin,
You are right! I just tried the scala version with the above lines, it
works as expected.
I'm not sure if it also happens in 1.4 for pyspark, but I thought the
pyspark code just calls the scala code via py4j. I didn't expect that this
bug is pyspark-specific. That surprises me actually a
Looks like the problem is that df.rdd does not work very well with limit. In
Scala, df.limit(1).rdd will also trigger the issue you observed. I will add
this to the JIRA.
On Mon, Sep 21, 2015 at 10:44 AM, Jerry Lam wrote:
> I just noticed you found 1.4 has the same issue. I
Seems 1.4 has the same issue.
On Mon, Sep 21, 2015 at 10:01 AM, Yin Huai wrote:
> btw, does 1.4 have the same problem?
>
> On Mon, Sep 21, 2015 at 10:01 AM, Yin Huai wrote:
>
>> Hi Jerry,
>>
>> Looks like it is a Python-specific issue. Can you create
btw, does 1.4 have the same problem?
On Mon, Sep 21, 2015 at 10:01 AM, Yin Huai wrote:
> Hi Jerry,
>
> Looks like it is a Python-specific issue. Can you create a JIRA?
>
> Thanks,
>
> Yin
>
> On Mon, Sep 21, 2015 at 8:56 AM, Jerry Lam wrote:
>
>> Hi
Hi Jerry,
Looks like it is a Python-specific issue. Can you create a JIRA?
Thanks,
Yin
On Mon, Sep 21, 2015 at 8:56 AM, Jerry Lam wrote:
> Hi Spark Developers,
>
> I just ran some very simple operations on a dataset. I was surprised by the
> execution plan of take(1),
quick update: we actually did some of the maintenance on our systems
after the berkeley-wide outage caused by one of our (non-jenkins)
servers halting and catching fire.
we'll still have some downtime early wednesday, but tomorrow's will be
cancelled. i'll send out another update real soon now
What new information do you know after creating the RDD that you didn't
know at the time of its creation?
I think the whole point is that an RDD is immutable; you can't change it once
it has been created.
Perhaps you need to refactor your logic to know the parameters earlier, or
create a whole new RDD
Can you clarify what you want to do:
If you modify an existing Hadoop InputFormat, etc., it would be a matter of
rebuilding Hadoop and building Spark using the custom-built Hadoop as a
dependency.
Or do you introduce a new InputFormat?
Cheers
On Mon, Sep 21, 2015 at 1:20 PM, Dogtail Ray
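(For reference, a sketch of that flow; the custom version string below is hypothetical, but -Dhadoop.version is the documented knob for building Spark against a particular Hadoop:)

```shell
# 1) Inside the modified Hadoop source tree: install it into the local
#    Maven repository (the "2.6.0-custom" version string is hypothetical).
mvn install -DskipTests

# 2) Inside the Spark source tree: build Spark against that Hadoop version.
./build/mvn -DskipTests -Dhadoop.version=2.6.0-custom package
```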
+dev list
Hi Dirceu,
The answer to whether throwing an exception or returning null is better
depends on your use case. If you are debugging and want to find bugs with
your program, you might prefer throwing an exception. However, if you are
running on a large real-world dataset (i.e. data is
Hi all,
I find that Spark uses some Hadoop APIs such as InputFormat, InputSplit,
etc., and I want to modify these Hadoop APIs. Do you know how I can
integrate my modified Hadoop code into Spark? Many thanks!
Thanks Josh, I should have added that we've tried with -DwildcardSuites
and Maven and we use this helpful feature regularly (although this does
result in building plenty of tests and running other tests in other
modules too), so I'm wondering if there's a more "streamlined" way - e.g. with
junit