After sbt/sbt gen-idea, do not import it as an SBT project; instead choose
"Open Project" and point it to the spark folder. -Xiangrui
On Tue, Apr 8, 2014 at 10:45 PM, Sean Owen so...@cloudera.com wrote:
I let IntelliJ read the Maven build directly and that works fine.
--
Sean Owen | Director, Data
On Tue, Apr 8, 2014 at 2:33 PM, Adam Novak ano...@soe.ucsc.edu wrote:
What, exactly, needs to be true about the RDDs that you pass to Graph() to
be sure of constructing a valid graph? (Do they need to have the same
number of partitions? The same number of partitions and no empty
partitions?
Hi All,
I am getting this exception when calling ssc.start() to start the streaming
context.
ERROR KafkaReceiver - Error receiving data
akka.actor.InvalidActorNameException: actor name [NetworkReceiver-0] is not
unique!
at
Hi to everybody,
I'm new to Spark and I'd like to know whether running Spark on top of YARN or
Mesos could affect its performance (and by how much). Is there any doc about
this?
Best,
Flavio
A JVM can easily be limited in how much memory it uses with the -Xmx
parameter, but Python doesn't have built-in memory limits in such a
first-class way. Maybe the memory limits aren't making it to the Python
executors.
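For reference, a minimal sketch of capping the executor JVM heap from the application side (the app name and the 10g size are placeholders, not values from this thread):

    import org.apache.spark.{SparkConf, SparkContext}

    // Bound each executor JVM's heap, roughly the -Xmx of the executor process.
    // Python worker processes are forked separately and are not covered by this setting.
    val conf = new SparkConf()
      .setAppName("memory-limit-example")   // placeholder
      .set("spark.executor.memory", "10g")  // heap per executor
    val sc = new SparkContext(conf)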
What was your SPARK_MEM setting? The JVM below seems to be using 603201
For 1, persist can be used to save an RDD to disk using the various
persistence levels. When a persistence level is set on an RDD, that RDD is
saved to memory/disk/elsewhere when it is evaluated, so that it can be
re-used. The level is applied to that RDD, so that subsequent uses of the RDD can
use the
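A minimal sketch of that persist behavior, assuming sc is an existing SparkContext and with a made-up input path:

    import org.apache.spark.storage.StorageLevel

    val cleaned = sc.textFile("hdfs:///data/input.txt").map(_.trim)
    cleaned.persist(StorageLevel.MEMORY_AND_DISK)  // only recorded for now

    cleaned.count()             // first action materializes and stores the partitions
    cleaned.distinct().count()  // later uses read the stored data instead of recomputing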
Which persistence level are you talking about? MEMORY_AND_DISK?
Sent from my mobile phone
On Apr 9, 2014 2:28 PM, Surendranauth Hiraman suren.hira...@velos.io
wrote:
Thanks, Andrew. That helps.
For 1, it sounds like the data for the RDD is held in memory and then only
written to disk after
Marco,
If you call spark-ec2 launch without specifying an AMI, it will default to
the Spark-provided AMI.
Nick
On Wed, Apr 9, 2014 at 9:43 AM, Marco Costantini
silvio.costant...@granatads.com wrote:
Hi there,
To answer your question: no, there is no reason NOT to use an AMI that
Spark has
And for the record, that AMI is ami-35b1885c. Again, you don't need to
specify it explicitly; spark-ec2 will default to it.
On Wed, Apr 9, 2014 at 11:08 AM, Nicholas Chammas
nicholas.cham...@gmail.com wrote:
Marco,
If you call spark-ec2 launch without specifying an AMI, it will default to
Up until last week we had no problems running a Spark standalone cluster. We
now have a problem registering executors with the driver node in any
application. Although we can run start-all and see the worker on port 8080, no
executors are registered with the block manager.
The feedback we have is scant but
Ah, tried that. I believe this is an HVM AMI? We are exploring paravirtual
AMIs.
On Wed, Apr 9, 2014 at 11:17 AM, Nicholas Chammas
nicholas.cham...@gmail.com wrote:
And for the record, that AMI is ami-35b1885c. Again, you don't need to
specify it explicitly; spark-ec2 will default to it.
The AMI should automatically switch between PVM and HVM based on the
instance type you specify on the command line. For reference (note you
don't need to specify this on the command line), the PVM ami id
is ami-5bb18832 in us-east-1.
FWIW we maintain the list of AMI IDs (across regions and PVM,
The typical way to handle that use case would be to join the 3 files
together into one RDD and then do the factorization on that. There will
definitely be network traffic during the initial join to get everything
into one table, and after that there will likely be more network traffic
for various
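A rough sketch of that join-then-factorize flow, assuming sc is an existing SparkContext and three made-up files that share a key in their first column (not the actual data layout from this thread):

    import org.apache.spark.SparkContext._

    def keyed(path: String) = sc.textFile(path).map { line =>
      val f = line.split(",")
      (f(0), f(1))
    }

    val a = keyed("hdfs:///data/file1.csv")
    val b = keyed("hdfs:///data/file2.csv")
    val c = keyed("hdfs:///data/file3.csv")

    // The joins shuffle records over the network to co-locate them by key;
    // the factorization then runs over the single combined RDD.
    val combined = a.join(b).join(c)  // (key, ((aVal, bVal), cVal))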
The groupByKey would be aware of the subsequent persist -- that's part of
the reason why operations are lazy. As for whether it's materialized in
memory first and then flushed to disk vs. streamed to disk, I'm not sure of
the exact behavior.
What I'd expect to happen would be that the RDD is
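A small sketch of that lazy interaction between groupByKey and persist (the data is illustrative, and sc is assumed to be an existing SparkContext):

    import org.apache.spark.SparkContext._
    import org.apache.spark.storage.StorageLevel

    val pairs   = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
    val grouped = pairs.groupByKey()         // nothing is computed yet
    grouped.persist(StorageLevel.DISK_ONLY)  // the level is recorded before evaluation
    grouped.count()                          // the shuffle runs here and honors the persist level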
Dear list,
A quick question about spark streaming:
Say I have this stage set up in my Spark Streaming cluster:
batched TCP stream ==> map(expensive computation) ==> ReduceByKey
I know I can set the number of tasks for ReduceByKey.
But I didn't find a place to specify the parallelism for the
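For what it's worth, a minimal sketch of that pipeline, assuming ssc is an existing StreamingContext; expensiveComputation, keyOf, the host/port, and the task count of 32 are all placeholders:

    import org.apache.spark.streaming.StreamingContext._

    // reduceByKey takes an explicit partition count, which sets the task count for that stage;
    // the map stage inherits the parallelism of the batched input stream.
    val lines   = ssc.socketTextStream("stream-host", 9999)
    val scored  = lines.map(record => expensiveComputation(record))
    val reduced = scored.map(x => (keyOf(x), 1L)).reduceByKey(_ + _, 32)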
Hi Dave,
This is the HBase solution to the poor scan performance issue:
https://issues.apache.org/jira/browse/HBASE-8369
I encountered the same issue before.
To the best of my knowledge, this is not a MapReduce issue. It is an HBase
issue. If you are planning to swap out MapReduce and replace it with
Also, the driver can run on one of the slave nodes (you will still need a Spark
master though, for resource allocation etc.).
Regards
Mayur
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
On Tue, Apr 8, 2014 at 2:46 PM, Nan Zhu
I am pretty new to Spark and I am trying to run the Spark shell on a YARN
cluster from the CLI (in yarn-client mode). I am able to start the shell with
the following command:
SPARK_JAR=../spark-0.9.0-incubating/jars/spark-assembly-0.9.0-incubating-hadoop2.2.0.jar \
SPARK_YARN_APP_JAR=emptyfile
Never mind...plz return it later with interest
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-RDD-to-Shark-table-IN-MEMORY-conversion-tp3682p4014.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
A Spark Streaming job was running on two worker nodes and then there was an
error on one of the nodes. The Spark job showed as running, but no progress was
being made and no new messages were being processed. Based on the driver log
files I see the following errors.
I would expect the stream reading would
Howdy.
Is it possible to initiate Spark jobs from Oozie (presumably as a Java action)?
If so, are there known limitations to this? And would anybody have a pointer
to an example?
Thanks,
Nate
Thanks Prabeesh.
On Wed, Apr 9, 2014 at 12:37 AM, prabeesh k prabsma...@gmail.com wrote:
Please refer
http://prabstechblog.blogspot.in/2014/04/creating-single-jar-for-spark-project.html
Regards,
prabeesh
On Wed, Apr 9, 2014 at 1:04 PM, Pradeep baji
pradeep.chanum...@gmail.com wrote:
Hi everyone,
We have just posted Spark 0.9.1, which is a maintenance release with
bug fixes, performance improvements, better stability with YARN, and
improved parity between the Scala and Python APIs. We recommend that all 0.9.0
users upgrade to this stable release.
This is the first release since Spark
A small additional note: Please use the direct download links on the Spark
Downloads page (http://spark.apache.org/downloads.html). The Apache mirrors
take a day or so to sync from the main repo, so may not work immediately.
TD
On Wed, Apr 9, 2014 at 2:54 PM, Tathagata Das
Hi all,
I have been able to run LR in local mode, but I am facing a problem running
it in cluster mode. Below is the source script and the stack trace when
running it in cluster mode. I used sbt package to build the project, and I am
not sure what it is complaining about.
Another question I have is for
Hi,
I want to enable Spark Master HA in Spark. The documentation specifies that we
can do this with the help of ZooKeeper. But what I am worried about is how to
configure one master with the other, and similarly, how do the workers know that
they have two masters? Where do you specify the multi-master
Thanks TD for managing this release, and thanks to everyone who contributed!
Matei
On Apr 9, 2014, at 2:59 PM, Tathagata Das tathagata.das1...@gmail.com wrote:
A small additional note: Please use the direct download links in the Spark
Downloads page. The Apache mirrors take a day or so to
Hi Jenny,
How are you packaging your jar?
Can you please confirm whether you have included the MLlib jar inside the fat jar
you have created for your code.
libraryDependencies += "org.apache.spark" % "spark-mllib_2.9.3" %
"0.8.1-incubating"
Thanks,
Jagat Singh
On Thu, Apr 10, 2014 at 8:05 AM, Jenny
The only way I know to do this is to use Mesos with ZooKeeper: you specify a
ZooKeeper URL as the Spark URL that contains multiple ZooKeeper hosts. Multiple
Mesos masters are then elected through ZooKeeper leader election until the current
leader dies, at which point Mesos will elect another master (if still
Hey Patrick,
I've created SPARK-1458 https://issues.apache.org/jira/browse/SPARK-1458 to
track this request, in case the team/community wants to implement it in the
future.
Nick
On Sat, Feb 22, 2014 at 7:25 PM, Nicholas Chammas
nicholas.cham...@gmail.com wrote:
No use case at the moment.
Thanks Dmitriy. But I want multi-master support when running Spark
standalone. Also, I want to know if this multi-master setup works if I use
spark-shell.
On Wed, Apr 9, 2014 at 3:26 PM, Dmitriy Lyubimov dlie...@gmail.com wrote:
The only way i know to do this is to use mesos with zookeepers.
Hi Matei, thanks for working with me to find these issues.
To summarize, the issues I've seen are:
0.9.0:
- https://issues.apache.org/jira/browse/SPARK-1323
SNAPSHOT 2014-03-18:
- When persist() is used and batchSize=1, java.lang.OutOfMemoryError:
Java heap space. To me this indicates a memory
Hi Jagat,
yes, I did specify mllib in build.sbt
name := "Spark LogisticRegression"
version := "1.0"
scalaVersion := "2.10.3"
libraryDependencies += "org.apache.spark" % "spark-core_2.10" %
"0.9.0-incubating"
libraryDependencies += "org.apache.spark" % "spark-mllib_2.10" %
"0.9.0-incubating"
A very nice addition for us PySpark users in 0.9.1 is RDD.repartition(),
which is not mentioned in the release notes
(http://spark.apache.org/releases/spark-release-0-9-1.html)!
This is super helpful for when you create an RDD from a gzipped file and
then need to explicitly shuffle
Okay, thanks. Do you have any info on how large your records and data file are?
I’d like to reproduce and fix this.
Matei
On Apr 9, 2014, at 3:52 PM, Jim Blomo jim.bl...@gmail.com wrote:
Hi Matei, thanks for working with me to find these issues.
To summarize, the issues I've seen are:
I set SPARK_MEM in the driver process by setting
spark.executor.memory to 10G. Each machine had 32G of RAM and a
dedicated 32G spill volume. I believe all of the units are in pages,
and the page size is the standard 4K. There are 15 slave nodes in the
cluster and the sizes of the datasets I'm
Ah, looks good now. It took me a minute to realize that without a hard
refresh, the docs page was missing the RDD class doc page...
And thanks for updating the release notes.
On Wed, Apr 9, 2014 at 7:21 PM, Tathagata Das
tathagata.das1...@gmail.com wrote:
Thanks Nick for pointing that out! I
Good question. This is something we wanted to fix, but unfortunately I'm
not sure how to do it without changing the RDD API, which is undesirable
now that the 1.0 branch has been cut. We should figure something out for 1.1,
though.
I've created https://issues.apache.org/jira/browse/SPARK-1460
It is as Jagat said. The Masters do not need to know about one another, as
ZooKeeper manages their implicit communication. As for Workers (and
applications, such as spark-shell), once a Worker is registered with
*some* Master,
its metadata is stored in ZooKeeper such that if another Master is
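As a minimal sketch of the application side, assuming two Masters on placeholder hostnames: list them both in the master URL, and whichever Master is currently active accepts the registration while ZooKeeper handles failover.

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("ha-example")                        // placeholder
      .setMaster("spark://master1:7077,master2:7077")  // every standalone Master, comma-separated
    val sc = new SparkContext(conf)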
Thank you, it works.
After my operation over p, I return p.toIterator, because mapPartitions has
an iterator return type, is that right?
rdd.mapPartitions { D => { val p = D.toArray; ...; p.toIterator } }
Yeah, should be right
--
Nan Zhu
On Wednesday, April 9, 2014 at 8:54 PM, wxhsdp wrote:
Thank you, it works.
After my operation over p, I return p.toIterator, because mapPartitions has
an iterator return type, is that right?
rdd.mapPartitions { D => { val p = D.toArray; ...; p.toIterator } }
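Spelled out a bit more, the confirmed pattern looks like this, assuming an RDD of numbers; the doubling step just stands in for the real per-partition operation:

    val result = rdd.mapPartitions { iter =>
      val p = iter.toArray            // materialize the whole partition
      val transformed = p.map(_ * 2)  // placeholder for the actual work over p
      transformed.toIterator          // hand back an iterator, matching mapPartitions' return type
    }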
This dataset is uncompressed text at ~54GB. stats() returns (count:
56757667, mean: 1001.68740583, stdev: 601.775217822, max: 8965, min:
343)
On Wed, Apr 9, 2014 at 6:59 PM, Matei Zaharia matei.zaha...@gmail.com wrote:
Okay, thanks. Do you have any info on how large your records and data file