I am a PhD student working on a research project related to Apache Spark. I
am trying to modify some of the Spark source code so that, instead of
sending the final result RDD from the worker nodes to a master node, I
send the final result RDDs to some different node. In order to do this,
Hi all,
I have been working hard to make it easier for developers to make
community contributions to the Spark GraphX algorithm library. At the core
of this I found that the Pregel API is a difficult concept to understand
and I think I can help make it better.
Can you please review
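For reference, a minimal sketch of how the current Pregel API is typically
used (this is the single-source shortest paths example from the GraphX
programming guide; sc is assumed to be an existing SparkContext):

import org.apache.spark.graphx._
import org.apache.spark.graphx.util.GraphGenerators

// Random graph for illustration; edge attributes become distances.
val graph: Graph[Long, Double] =
  GraphGenerators.logNormalGraph(sc, numVertices = 100).mapEdges(e => e.attr.toDouble)
val sourceId: VertexId = 42
val initialGraph = graph.mapVertices((id, _) =>
  if (id == sourceId) 0.0 else Double.PositiveInfinity)
val sssp = initialGraph.pregel(Double.PositiveInfinity)(
  (id, dist, newDist) => math.min(dist, newDist),     // vertex program
  triplet => {                                        // send messages along edges
    if (triplet.srcAttr + triplet.attr < triplet.dstAttr) {
      Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
    } else {
      Iterator.empty
    }
  },
  (a, b) => math.min(a, b)                            // merge incoming messages
)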
Hi List,
I'm following this example here
https://github.com/databricks/learning-spark/tree/master/mini-complete-example
with the following:
$SPARK_HOME/bin/spark-submit \
--deploy-mode cluster \
--master spark://host.domain.ex:7077 \
--class
Thanks for the response. I'll admit I'm rather new to Mesos. Due to the
nature of my setup I can't use the Mesos web portal effectively because I'm
not connected by VPN, so the local network links from the mesos-master
dashboard I SSH tunnelled aren't working.
Anyway, I was able to dig up some
Hi,
I think for local mode, the number N (the number of threads) basically equals
the number of available cores in ONE executor (worker), not N workers. You
could picture local[N] as one worker with N cores. I'm not sure you
can set the memory usage for each thread; for Spark the memory is
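A small sketch of that view of local mode (the app name is made up):

import org.apache.spark.{SparkConf, SparkContext}

// local[4] behaves like a single executor (the driver JVM) with 4 task slots.
// Per-thread memory cannot be configured; the threads share the driver JVM's
// heap, which is sized at launch (e.g. spark-submit --driver-memory 4g).
val conf = new SparkConf().setAppName("LocalModeSample").setMaster("local[4]")
val sc = new SparkContext(conf)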
For Gradle, there are:
https://github.com/musketyr/gradle-fatjar-plugin
https://github.com/johnrengelman/shadow
FYI
On Sun, Mar 29, 2015 at 4:29 PM, jay vyas jayunit100.apa...@gmail.com
wrote:
Thanks for posting this! I've run into similar issues before, and generally
it's a bad idea to swap
LGTM. Could you open a JIRA and send a PR? Thanks.
Best Regards,
Shixiong Zhu
2015-03-28 7:14 GMT+08:00 Manoj Samel manojsamelt...@gmail.com:
I looked at the 1.3.0 code and figured out where this can be added.
In org.apache.spark.deploy.yarn, ApplicationMaster.scala:282 is
actorSystem =
Has anyone else faced this issue of running spark-shell (yarn client mode) in
an environment with strict firewall rules (on fixed allowed incoming ports)?
How can this be rectified?
Thanks,
Manish
From: Manish Gupta 8
Sent: Thursday, March 26, 2015 4:09 PM
To: user@spark.apache.org
Subject:
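One possible direction, purely as a hedged sketch: the driver in yarn-client
mode listens on several randomly chosen ports, and the port-related properties
listed in the Spark 1.x "configuring ports for network security" documentation
can pin them to fixed values that a firewall can whitelist (the port numbers
below are placeholders; for spark-shell they would normally go into
spark-defaults.conf or be passed with --conf):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.driver.port", "40000")          // driver <-> executor RPC
  .set("spark.fileserver.port", "40001")      // driver HTTP file server
  .set("spark.broadcast.port", "40002")       // HTTP broadcast server
  .set("spark.replClassServer.port", "40003") // spark-shell class server
  .set("spark.blockManager.port", "40004")    // block manager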
Hi guys,
I am using Spark Streaming to receive messages from Kafka, process them,
and then send them back to Kafka. However, the Kafka consumer cannot receive
any messages. Can anyone share ideas?
Here is my code:
object SparkStreamingSampleDirectApproach {
def main(args: Array[String]): Unit =
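The code itself is cut off above; for reference, a minimal skeleton of the
direct approach on Spark 1.3 might look like the following (the object name,
broker address, topic, and batch interval are placeholders, and only the
consuming side is shown):

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object DirectApproachSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("DirectApproachSketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))
    val kafkaParams = Map("metadata.broker.list" -> "localhost:9092") // placeholder broker
    val topics = Set("input-topic")                                   // placeholder topic
    val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)
    messages.map(_._2).print() // processing / producing back to Kafka would go here
    ssc.start()
    ssc.awaitTermination()
  }
}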
Hi,
I have noticed in the Spark UI that workers and executors run in several
states: ALIVE, LOADING, RUNNING, DEAD, etc.
What exactly do these states mean, and what effect do they have on working
with those executors?
For example, can an executor not be used in the LOADING state?
cheers
--
A LOADING Executor is on the way to RUNNING, but hasn't yet been registered
with the Master, so it isn't quite ready to do useful work.
On Mar 29, 2015, at 9:09 PM, Niranda Perera niranda.per...@gmail.com wrote:
Hi,
I have noticed in the Spark UI that workers and executors run in several
Thanks to you again, Sean.
The thing is that we persist and count that RDD in the hope that all later
actions on it won't trigger the previous recalculations. It's not really
about performance here; it's because the recalculation contains UUID
generation, which should stay the same for further actions.
I
Go through this once, if you haven't read it already.
https://spark.apache.org/docs/latest/tuning.html
Thanks
Best Regards
On Sat, Mar 28, 2015 at 7:33 PM, nsareen nsar...@gmail.com wrote:
Hi All,
I'm facing performance issues with my Spark implementation, and was briefly
investigating
Yes, that worked - thank you very much.
On Sun, Mar 29, 2015 at 9:05 AM Ted Yu yuzhih...@gmail.com wrote:
Jenkins build failed too:
https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-1.3-Maven-with-YARN/HADOOP_PROFILE=hadoop-2.4,label=centos/326/consoleFull
For the moment, you
It depends on the task that you are doing. I think what is displayed on the
web UI (attached pic) is for your complete application.
Thanks
Best Regards
On Fri, Mar 27, 2015 at 8:28 PM, Laeeq Ahmed laeeqsp...@yahoo.com.invalid
wrote:
Hi,
Is shuffle read and write for the complete streaming
Don't call .collect() if your data size is huge; you can simply do a count() to
trigger the execution.
Can you paste your exception stack trace so that we'll know what's happening?
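A tiny sketch of the difference (the RDD name is assumed for illustration):

val n = result.count()        // runs the full job but returns only a Long to the driver
// val all = result.collect() // ships every record to the driver; avoid on large data
println("records: " + n)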
Thanks
Best Regards
On Fri, Mar 27, 2015 at 9:18 PM, Zsolt Tóth toth.zsolt@gmail.com
wrote:
Hi,
I have a simple
You need to set your master as local[2]; 2 is the minimum number of threads
(in local mode) needed to consume and process a Streaming application.
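As a quick sketch (host and port are placeholders), the word-count example only
produces output with at least two local threads, because one of them is occupied
by the socket receiver:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("NetworkWordCount").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(1))
val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))
words.map(word => (word, 1)).reduceByKey(_ + _).print()
ssc.start()
ssc.awaitTermination()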
Thanks
Best Regards
On Sun, Mar 29, 2015 at 12:56 PM, mehak.soni mehak.s...@gmail.com wrote:
I am trying to run the NetworkWordCount.java in Spark
Try these:
- Disable shuffle spill: spark.shuffle.spill=false (it might end up in OOM)
- Enable log rotation:
sparkConf.set("spark.executor.logs.rolling.strategy", "size")
.set("spark.executor.logs.rolling.size.maxBytes", "1024")
.set("spark.executor.logs.rolling.maxRetainedFiles", "3")
Also see what's really
I tried pulling the source and building for the first time, but cannot get
past the "object NoRelation is not a member of package
org.apache.spark.sql.catalyst.plans.logical" error below on the 1.3 branch.
I can build the 1.2 branch.
I have tried with both -Dscala-2.11 and 2.10 (after running the
Nathan:
Please look in log files for any of the following:
doCleanupRDD():
case e: Exception => logError("Error cleaning RDD " + rddId, e)
doCleanupShuffle():
case e: Exception => logError("Error cleaning shuffle " + shuffleId, e)
doCleanupBroadcast():
case e: Exception =>
Jenkins build failed too:
https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-1.3-Maven-with-YARN/HADOOP_PROFILE=hadoop-2.4,label=centos/326/consoleFull
For the moment, you can apply the following change:
diff --git
You may simply pass in JavaSparkContext.sc
On 3/29/15 9:25 PM, Vincent He wrote:
All,
I am trying Spark SQL with Java, and I find that HiveContext does not accept
JavaSparkContext; is this true? Or is there any special build of Spark I need
to do (I build with Hive and the Thrift server)? Can we use HiveContext in
No luck, it does not work. Does anyone know whether there is some special setting
for the spark-sql CLI so we do not need to write code to use Spark SQL? Does anyone
have a simple example of this? I appreciate any help. Thanks in advance.
On Sat, Mar 28, 2015 at 9:05 AM, Ted Yu yuzhih...@gmail.com wrote:
See
Thanks.
It does not work and cannot pass compilation, as the HiveContext constructor does
not accept JavaSparkContext, and JavaSparkContext is not a subclass of
SparkContext.
Does anyone else have any ideas? I suspect this is supported now.
On Sun, Mar 29, 2015 at 8:54 AM, Cheng Lian lian.cs@gmail.com
All,
I am trying Spark SQL with Java, and I find that HiveContext does not accept
JavaSparkContext; is this true? Or is there any special build of Spark I need to do
(I build with Hive and the Thrift server)? Can we use HiveContext in Java?
Thanks in advance.
I mean JavaSparkContext has a field named sc, whose type is
SparkContext. You may pass this sc to HiveContext.
On 3/29/15 9:59 PM, Vincent He wrote:
Thanks.
It does not work and cannot pass compilation, as the HiveContext constructor
does not accept JavaSparkContext, and JavaSparkContext is not
How does one consume parameters passed to a Scala script via spark-shell
-i?
1. If I use an object with a main() method, the println outputs nothing as
if not called:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
object Test {
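One workaround, as a sketch (the property name is illustrative): spark-shell -i
only interprets the file, so a main() method is never invoked on its own. Putting
the statements at the top level and passing values in as system properties, e.g.
spark-shell --driver-java-options "-Dscript.arg=hello" -i test.scala, lets the
script read them:

val arg = sys.props.getOrElse("script.arg", "default") // illustrative property name
println("got argument: " + arg)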
persist() completes immediately since it only marks the RDD for
persistence. count() triggers computation of the RDD, and as the RDD is
computed it will be persisted. The following transform should
therefore only start after count(), and therefore after the persistence
completes. I think there might be
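A small sketch of that ordering (sourceRdd and the UUID tagging are assumed for
illustration):

import java.util.UUID

val tagged = sourceRdd.map(line => (UUID.randomUUID().toString, line))
tagged.persist()  // only marks the RDD for caching; returns immediately
tagged.count()    // forces computation; partitions are cached as they are computed
val nonEmpty = tagged.filter(_._2.nonEmpty).count() // reuses cached partitions, same UUIDs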
Hi Burak,
Unfortunately, I am expected to do my work in the HDInsight environment, which
only supports Spark 1.2.0 with Microsoft's flavor. I cannot simply replace
it with Spark 1.3.
I think the problem I am observing is caused by the kmeans|| initialization
step. I will open another thread to discuss
Sean
Thank you very much for your response. I have a requirement to run a function
only over the new inputs in a Spark Streaming sliding window, i.e. the
latest batch of events only. Do I just get a new DStream using a slide
duration equal to the window duration? Such as:
val sparkConf = new
Hi.
rdd.persist()
rdd.count()
rdd.transform()...
is there a chance transform() runs before persist() is complete?
--
RGRDZ Harut
You're just describing the operation of Spark Streaming at its
simplest, without windowing. You get non-overlapping RDDs of the most
recent data each time.
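A minimal sketch of that (sparkConf, the host/port, and processNewEvents are
assumed for illustration):

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sparkConf, Seconds(30)) // batch interval only, no window()
val events = ssc.socketTextStream("localhost", 9999)
events.foreachRDD { rdd =>
  // each rdd holds only the events received during the latest batch interval
  processNewEvents(rdd) // hypothetical function applied to the newest data only
}
ssc.start()
ssc.awaitTermination()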
On Sun, Mar 29, 2015 at 8:44 AM, Deenar Toraskar
deenar.toras...@gmail.com wrote:
Sean
Thank you very much for your response. I have a
I figured out that Logging is a DeveloperApi and it should not be used outside
Spark code, so everything is fine now. Thanks again, Marcelo.
On 24 Mar 2015, at 20:06, Marcelo Vanzin van...@cloudera.com wrote:
From the exception it seems like your app is also repackaging Scala
classes
Hi,
My streaming app uses org.apache.httpcomponents:httpclient:4.3.6, but
Spark uses 4.2.6, and I believe that's what's causing the following error.
I've tried setting
spark.executor.userClassPathFirst and spark.driver.userClassPathFirst to true
in the config, but that does not solve it either.
I am trying to run NetworkWordCount.java from the Spark Streaming examples. I
was able to run it using run-example.
I am now trying to run the same code from an app I created. This is the
code; it looks pretty much the same as the existing code:
import scala.Tuple2;
import
Hi,
I have opened a couple of threads asking about a k-means performance problem
in Spark. I think I made a little progress.
Previously I used the simplest form, KMeans.train(rdd, k, maxIterations). It
uses the kmeans|| initialization algorithm, which is supposed to be a
faster version of kmeans++ and
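For reference, a sketch of switching off the kmeans|| step through the builder
API (rdd, k, and maxIterations are assumed to exist):

import org.apache.spark.mllib.clustering.KMeans

val model = new KMeans()
  .setK(k)
  .setMaxIterations(maxIterations)
  .setInitializationMode("random") // the default is "k-means||"
  .run(rdd)                        // rdd: RDD[org.apache.spark.mllib.linalg.Vector]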
I don't think you can guarantee that there is no recomputation. Even
if you persist(), you might lose the block and have to recompute it.
You can persist your UUIDs to storage like HDFS. They won't change
then of course. I suppose you still face a much narrower problem, that
the act of computing
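A sketch of persisting the IDs to storage (the path and the element type are
assumptions):

// Write the UUID-tagged RDD out once, then always read the stored copy back,
// so the IDs can never be regenerated differently.
tagged.saveAsObjectFile("hdfs:///tmp/tagged-rdd")
val stable = sc.objectFile[(String, String)]("hdfs:///tmp/tagged-rdd")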
I left a comment on your Stack Overflow question earlier. Can you share the output
in your stderr log from your Mesos task? It
can be found in the Mesos UI by going to its sandbox.
Tim
Sent from my iPhone
On Mar 29, 2015, at 12:14 PM, seglo wla...@gmail.com wrote:
The latter part of this
Hi,
I am trying to use Spark for my own applications, and I am currently
profiling the performance in local mode. I have a couple of questions:
1. When I set spark.master to local[N], it means Spark will use up to N worker
*threads* on the single machine. Is this equivalent to saying there are N
Mesosphere did a great job of simplifying the process of running Spark on
Mesos. I am using this guide to set up a development Mesos cluster on Google
Cloud Compute.
https://mesosphere.com/docs/tutorials/run-spark-on-mesos/
I can run the example that's in the guide by using spark-shell (finding
The latter part of this question where I try to submit the application by
referring to it on HDFS is very similar to the recent question
Spark-submit not working when application jar is in hdfs
I pushed a hotfix to the branch. Should work now.
On Sun, Mar 29, 2015 at 9:23 AM, Marty Bower sp...@mjhb.com wrote:
Yes, that worked - thank you very much.
On Sun, Mar 29, 2015 at 9:05 AM Ted Yu yuzhih...@gmail.com wrote:
Jenkins build failed too:
Hi Wisely,
I am running on Amazon EC2 instances, so I cannot doubt the hardware.
Moreover, my other pipelines run successfully except for this one, which involves
broadcasting a large object.
My spark-env.sh settings are:
SPARK_MASTER_IP=MASTER-IP
SPARK_LOCAL_IP=LOCAL-IP
SPARK_DRIVER_MEMORY=24g
/20150329-232522-84118794-5050-18181-/executors/5/runs/latest/stderr
Hi Sean,
I think your point about the ETL costs is the winning argument here, but I
would like to see more research on the topic.
What I would like to see researched is the ability to run a specialized set
of common algorithms in a fast local mode, just like a compiler optimizer
can decide to inline
Thanks Michael!
Can you please point me to the docs / source location for that automatic
casting? I'm just using it to extract the data and put it in a Map[String,
Any] (long story on the reason...) so I think the casting rules won't
know what to cast it to... right? I guess I can have the JSON /
Thanks for posting this! I've run into similar issues before, and generally
it's a bad idea to swap the libraries out and pray for the best, so the
shade functionality is probably the best feature.
Unfortunately, I'm not sure how well SBT and Gradle support shading... how
do folks using next-gen
Made it work by using yarn-cluster as master instead of local.