Tobias,
Understood, and thanks for the quick resolution of the problem.
Thanks
~Rahul
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/serialization-issue-in-case-of-case-class-is-more-than-1-tp20334p20446.html
Sent from the Apache Spark User List mailing list
Is there any tool to profile GraphX code in a cluster? Is there a way to
know the messages exchanged among the nodes in a cluster?
WebUI does not give all the information.
Thank You
Hi,
1) Yes, you can. Spark supports a lot of file formats on HDFS/S3, and it also
supports Cassandra and JDBC in general.
2) Yes. Spark has a JDBC Thrift server to which you can attach BI tools. I
suggest you pay attention to your query response time requirements.
3) no you can go with
Hi, all:
According to https://github.com/apache/spark/pull/2732, when a Spark job fails
or exits nonzero in yarn-cluster mode, spark-submit should get the
corresponding return code of the Spark job. But when I tried it on a spark-1.1.1 YARN
cluster, spark-submit returned zero anyway.
Here is my spark
I tried it in Spark client mode, and spark-submit got the correct return code from
the Spark job. But in yarn-cluster mode, it failed.
From: lin_q...@outlook.com
To: u...@spark.incubator.apache.org
Subject: Issue on [SPARK-3877][YARN]: Return code of the spark-submit in
yarn-cluster mode
Date: Fri, 5
Oh, sorry. So neither SQL nor Spark SQL is preferred. Then you may write
your own aggregation with `aggregateByKey`:

users.aggregateByKey((0, Set.empty[String]))({ case ((count, seen), user) =>
  (count + 1, seen + user)
}, { case ((count0, seen0), (count1, seen1)) =>
  (count0 + count1, seen0 ++
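The snippet is cut off above; as an illustration only, the seqOp/combOp semantics it relies on can be sketched without Spark (hypothetical data, plain Python standing in for an RDD):

```python
# Sketch of aggregateByKey((0, Set.empty[String]))(seqOp, combOp) semantics,
# without Spark: seqOp folds one value into a per-partition accumulator,
# combOp merges accumulators produced by different partitions.

def seq_op(acc, user):
    count, seen = acc
    return (count + 1, seen | {user})

def comb_op(a, b):
    (count0, seen0), (count1, seen1) = a, b
    return (count0 + count1, seen0 | seen1)

def aggregate_by_key(partitions, zero, seq, comb):
    """partitions: list of lists of (key, value) pairs."""
    merged = {}
    for part in partitions:
        local = {}
        for key, value in part:
            local[key] = seq(local.get(key, zero), value)
        for key, acc in local.items():
            merged[key] = comb(merged[key], acc) if key in merged else acc
    return merged

partitions = [[("site", "alice"), ("site", "bob")], [("site", "alice")]]
result = aggregate_by_key(partitions, (0, set()), seq_op, comb_op)
# result["site"] is (3, {'alice', 'bob'}): 3 visits, 2 distinct users
```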
Hi,
I don’t think it’s a problem of Spark Streaming; judging from the call stack, it’s
a problem that occurs when the BlockManager starts to initialize itself. Would you
mind checking your Spark configuration, hardware, and deployment? Mostly I
think it’s not a problem of Spark.
Thanks
Saisai
From:
I tried another test code:

def main(args: Array[String]) {
  if (args.length != 1) {
    Util.printLog(ERROR, "Args error - arg1: BASE_DIR")
    exit(101)
  }
  val currentFile = args(0).toString
  val DB = "test_spark"
  val tableName = "src"
  val sparkConf = new
What's the status of this application in the yarn web UI?
Best Regards,
Shixiong Zhu
2014-12-05 17:22 GMT+08:00 LinQili lin_q...@outlook.com:
I tried another test code:
def main(args: Array[String]) {
  if (args.length != 1) {
    Util.printLog(ERROR, "Args error - arg1: BASE_DIR")
Hi all,
I've tried the above example on Gist, but it doesn't work (at least for me).
Did anyone get this:
14/12/05 10:44:40 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.IncompatibleClassChangeError: Found interface
org.apache.hadoop.mapreduce.TaskAttemptContext, but class
Hello,
For some (unknown) reason some of my tasks, which fetch data from Cassandra,
are failing quite often, and apparently the master removes a task which fails
more than 4 times (in my case).
Is there any way to increase the number of re-tries ?
best,
/Shahab
Hi,
I have a graph where each vertex keeps several messages for some faraway
neighbours (I mean, not only immediate neighbours; at most k hops away, e.g. k
= 5).
Now, I propose to distribute these messages to their corresponding
destinations (say, "faraway neighbours"):
- by using pregel
Hi,
I am using SparkSQL on 1.1.0 branch.
The following code leads to a scala.MatchError
at
org.apache.spark.sql.catalyst.expressions.Cast.cast$lzycompute(Cast.scala:247)
val scm = StructType(inputRDD.schema.fields.init :+
  StructField("list",
    ArrayType(
      StructType(
Here's an example of a Cassandra etl that you can follow which should exit on
its own. I'm using it as a blueprint for revolving spark streaming apps on top
of.
For me, I kill the streaming app with System.exit after a sufficient amount of
data is collected.
That seems to work for most any
It worked, man.. Thanks a lot :)
Hello,
I'm currently developing a Spark Streaming application and trying to write
my first unit test. I've used Java for this application, and I also need to
use Java (and JUnit) for writing unit tests.
I could not find any documentation that focuses on Spark Streaming unit
testing, all I could
Hi, Alexey,
I'm getting the same error on startup with Spark 1.1.0. Everything works
fine fortunately.
The error is mentioned in the logs in
https://issues.apache.org/jira/browse/SPARK-4498, so maybe it will also be
fixed in Spark 1.2.0 and 1.1.2. I have no insight into it unfortunately.
On Tue,
Any thoughts, how could Spark SQL help in our scenario ?
You can just do something like this, the Spark Cassandra Connector handles the
rest
KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Map(KafkaTopicRaw -> 10), StorageLevel.DISK_ONLY_2)
  .map { case (_, line) => line.split(",") }
Hi,
using PySpark 1.1.0 on YARN 2.5.0. All operations run nicely in parallel - I
can see multiple python processes spawned on each nodemanager, but for some
reason, when running cartesian, there is only a single python process running on
each node. The task is indicating thousands of partitions
It is controlled by spark.task.maxFailures. See
http://spark.apache.org/docs/latest/configuration.html#scheduling.
On Fri, Dec 5, 2014 at 11:02 AM, shahab shahab.mok...@gmail.com wrote:
Hello,
By some (unknown) reasons some of my tasks, that fetch data from
Cassandra, are failing so often,
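For context, the policy behind spark.task.maxFailures amounts to a bounded retry loop; a rough Python sketch (illustrative names, not Spark's internals):

```python
class TaskFailedError(Exception):
    pass

def run_with_retries(task, max_failures=4):
    """Retry `task` until it succeeds or has failed `max_failures` times,
    mirroring the spark.task.maxFailures policy."""
    failures = 0
    while True:
        try:
            return task()
        except Exception as exc:
            failures += 1
            if failures >= max_failures:
                raise TaskFailedError(
                    "task failed %d times, aborting" % failures) from exc

# A flaky task that succeeds on its third attempt.
attempts = {"n": 0}
def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise IOError("transient Cassandra read failure")
    return "ok"

print(run_with_retries(flaky, max_failures=4))  # ok
```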
Generally I don't think frequent-item-set algorithms are that useful.
They're simple and not probabilistic; they don't tell you what sets
occurred unusually frequently. Usually people ask for frequent item
set algos when they really mean they want to compute item similarity
or make
hi,
What is the best way to process a batch window in Spark Streaming:
kafkaStream.foreachRDD(rdd => {
  rdd.collect().foreach(event => {
    // process the event
    process(event)
  })
})

Or

kafkaStream.foreachRDD(rdd => {
  rdd.map(event => {
    //
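The practical difference between the two variants is where the work runs: collect() pulls the whole batch to the driver and processes it serially there, while map keeps the processing on the executors. A toy model in plain Python (lists standing in for RDD partitions; illustrative only):

```python
# Toy model: an "RDD" is a list of partitions; the driver is this process.
batch = [["e1", "e2"], ["e3", "e4"]]  # two partitions of events

def process(event):
    return event.upper()

# Variant 1: rdd.collect().foreach(process) -- the whole batch is first
# materialized on the driver, losing parallelism and risking OOM for
# large batches.
collected = [e for part in batch for e in part]
driver_side = [process(e) for e in collected]

# Variant 2: rdd.map(process) -- each partition is processed where it
# lives; only the (usually smaller) results ever move.
executor_side = [[process(e) for e in part] for part in batch]

print(driver_side)  # ['E1', 'E2', 'E3', 'E4']
```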
On Fri, Dec 5, 2014 at 7:12 AM, Tobias Pfeiffer t...@preferred.jp wrote:
Rahul,
On Fri, Dec 5, 2014 at 2:50 PM, Rahul Bindlish
rahul.bindl...@nectechnologies.in wrote:
I have done so; that's why Spark is able to load the objectfile [e.g.
person_obj]
and Spark has maintained the serialVersionUID.
Please specify '-DskipTests' on the command line.
Cheers
On Dec 5, 2014, at 3:52 AM, Emre Sevinc emre.sev...@gmail.com wrote:
Hello,
I'm currently developing a Spark Streaming application and trying to write my
first unit test. I've used Java for this application, and I also need to use
Java
I would like to subscribe to the
user@spark.apache.org list.
Regards,
Ningjun Wang
Consulting Software Engineer
LexisNexis
121 Chanlon Road
New Providence, NJ 07974-1541
Hi Ningjun
Please send email to this address to get subscribed:
user-subscr...@spark.apache.org
On Dec 5, 2014, at 10:36 PM, Wang, Ningjun (LNG-NPV)
ningjun.w...@lexisnexis.com wrote:
I would like to subscribe to the user@spark.apache.org
Regards,
Ningjun Wang
Consulting Software
Hi all,
I'm trying to run some spark job with spark-shell. What I want to do is
just to count the number of lines in a file.
I start the spark-shell with the default argument i.e just with
./bin/spark-shell.
Load the text file with sc.textFile(path) and then call count on my data.
When I do
How big is your file? It's probably of a size that the Hadoop
InputFormat would make 52 splits for it. Data drives partitions, not
processing resource. Really, 8 splits is the minimum parallelism you
want. Several times your # of cores is better.
On Fri, Dec 5, 2014 at 8:51 AM, Jaonary Rabarisoa
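The arithmetic behind the split count is roughly file size over split size; a back-of-envelope sketch (the 1.7 GB and 32 MB figures below are illustrative, not Hadoop's exact defaults):

```python
import math

def num_splits(file_size_bytes, split_size_bytes):
    """Hadoop-style estimate: one split per full or partial block."""
    return max(1, math.ceil(file_size_bytes / split_size_bytes))

# Illustrative numbers: a ~1.7 GB file with a 32 MB split size yields ~55
# splits, i.e. data size -- not core count -- drives the partition count.
splits = num_splits(int(1.7 * 1024**3), 32 * 1024**2)
print(splits)
```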
Hi Akhil, Vyas, Helena,
Thank you for your suggestions.
As Akhil suggested earlier, I have implemented the batch Duration into
JavaStreamingContext and waitForTermination(Duration).
The approach Helena suggested is Scala oriented.
But the issue now is that I want to set Cassandra as my output.
Ok, I misunderstood the meaning of the partition. In fact, my file is 1.7G
big, and with a smaller file I get a different number of partitions. Thanks
for this clarification.
On Fri, Dec 5, 2014 at 4:15 PM, Sean Owen so...@cloudera.com wrote:
How big is your file? it's probably of a size that the
I think what you are looking for is something like:
JavaRDD<Double> pricesRDD = javaFunctions(sc).cassandraTable("ks", "tab",
    mapColumnTo(Double.class)).select("price");
JavaRDD<Person> rdd = javaFunctions(sc).cassandraTable("ks", "people",
    mapRowTo(Person.class));
noted here:
Hi all,
I'm trying to run clustering with the kmeans algorithm. The size of my data
set is about 240k vectors of dimension 384.
Solving the problem with the kmeans available in julia (kmean++)
http://clusteringjl.readthedocs.org/en/latest/kmeans.html
take about 8 minutes on a single core.
Hello,
Specifying '-DskipTests' on commandline worked, though I can't be sure
whether first running 'sbt assembly' also contributed to the solution.
(I've tried 'sbt assembly' because branch-1.1's README says to use sbt).
Thanks for the answer.
Kind regards,
Emre Sevinç
Spark has much more overhead, since it's set up to distribute the
computation. Julia isn't distributed, and so has no such overhead in a
completely in-core implementation. You generally use Spark when you
have a problem large enough to warrant distributing, or, your data
already lives in a
Hi ,
Is it possible to catch exceptions using PySpark, so that in case of error the
program will not fail and exit?
For example, if I am using (key, value) RDD functionality but the data doesn't
actually have the (key, value) format, PySpark will throw an exception (like
ValueError) that I am unable to catch.
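One common workaround is to do the fragile parsing inside a helper that catches the exception itself, then drop the failures, rather than letting the error escape the worker. A sketch in plain Python (in PySpark the helper would be passed to rdd.flatMap; the data and names are illustrative):

```python
def safe_kv(line):
    """Return [(key, value)] for well-formed lines, [] otherwise,
    so malformed records are dropped instead of killing the job."""
    try:
        key, value = line.split(",", 1)
        return [(key.strip(), value.strip())]
    except ValueError:
        return []

lines = ["a,1", "malformed", "b,2"]
# In PySpark this would be: rdd.flatMap(safe_kv)
pairs = [kv for line in lines for kv in safe_kv(line)]
print(pairs)  # [('a', '1'), ('b', '2')]
```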
Thank you Helena,
But I would like to explain my problem space:
The output is supposed to be Cassandra. To achieve that, I have to use the
spark-cassandra-connector APIs.
So going with a bottom-up approach, to write to Cassandra, I have to use:
javaFunctions(JavaRDD object rdd,
This is a typical use case: people who buy electric razors also tend to
buy batteries and shaving gel along with them. The goal is to build a model
which will look through POS records and find which product categories have
a higher likelihood of appearing together in a given transaction.
What would
Could you post your script to reproduce the results (and also how to
generate the dataset)? That will help us to investigate it.
On Fri, Dec 5, 2014 at 8:40 AM, Jaonary Rabarisoa jaon...@gmail.com wrote:
Hmm, here I use spark on local mode on my laptop with 8 cores. The data is
on my local
Is there a way I can have a JDBC connection open through a streaming job. I
have a foreach which is running once per batch. However, I don’t want to
open the connection for each batch but would rather have a persistent
connection that I can reuse. How can I do this?
Thanks.
Asim
Hi,
Seems adding the cassandra connector and spark streaming causes issues. I've
added my build and code file. Running sbt compile gives weird errors like
Seconds is not part of org.apache.spark.streaming, and object Receiver is not a
member of package org.apache.spark.streaming.receiver. If I
The code is really simple:

object TestKMeans {
  def main(args: Array[String]) {
    val conf = new SparkConf()
      .setAppName("Test KMeans")
      .setMaster("local[8]")
      .set("spark.executor.memory", "8g")
    val sc = new SparkContext(conf)
    val numClusters = 500;
I've done this:
1. foreachPartition
2. Open connection.
3. foreach inside the partition.
4. close the connection.
Slightly crufty, but works. Would love to see a better approach.
Regards,
Ashic.
Date: Fri, 5 Dec 2014 12:32:24 -0500
Subject: Spark Streaming Reusing JDBC Connections
From:
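Those four steps can be sketched as follows; `Connection` below is a dummy stand-in for a real JDBC client, and the partition lists stand in for an RDD:

```python
class Connection:
    """Stand-in for a JDBC connection; tracks state for illustration."""
    opened = 0

    def __init__(self):
        Connection.opened += 1
        self.writes = []

    def write(self, record):
        self.writes.append(record)

    def close(self):
        pass

def foreach_partition(partitions, handler):  # 1. foreachPartition
    for part in partitions:
        handler(part)

def save_partition(part):
    conn = Connection()          # 2. open one connection per partition
    try:
        for record in part:      # 3. foreach inside the partition
            conn.write(record)
    finally:
        conn.close()             # 4. close the connection

partitions = [[1, 2, 3], [4, 5]]
foreach_partition(partitions, save_partition)
print(Connection.opened)  # 2 -- one connection per partition, not per record
```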
Hi
Could anyone help with what would be a better/optimized configuration for driver
memory, worker memory, degree of parallelism, etc., to be
configured when we are running 1 master node (itself also acting as a slave node)
and 1 slave node? Both have 32 GB RAM with 4 cores.
On this, I
I am using a custom hadoop input format which works well on smaller files
but fails with a file at about 4GB size - the format is generating about
800 splits and all variables in my code are longs -
Any suggestions? Is anyone reading files of this size?
Exception in thread main
Can you try with maven ?
diff --git a/streaming/pom.xml b/streaming/pom.xml
index b8b8f2e..6cc8102 100644
--- a/streaming/pom.xml
+++ b/streaming/pom.xml
@@ -68,6 +68,11 @@
     <artifactId>junit-interface</artifactId>
     <scope>test</scope>
   </dependency>
+  <dependency>
+
Hey Sourav, are you able to run a simple shuffle in a spark-shell?
2014-12-05 1:20 GMT-08:00 Shao, Saisai saisai.s...@intel.com:
Hi,
I don’t think it’s a problem of Spark Streaming, seeing for call stack,
it’s the problem when BlockManager starting to initializing itself. Would
you mind
Hey, the default port is 7077. Not sure if you actually meant to put 7070.
As a rule of thumb, you can go to the Master web UI and copy and paste the
URL at the top left corner. That almost always works unless your cluster
has a weird proxy set up.
2014-12-04 14:26 GMT-08:00 Xingwei Yang
Hey Tobias,
As you suspect, the reason why it's slow is because the resource manager in
YARN takes a while to grant resources. This is because YARN needs to first
set up the application master container, and then this AM needs to request
more containers for Spark executors. I think this accounts
At 2014-12-05 02:26:52 -0800, Yifan LI iamyifa...@gmail.com wrote:
I have a graph where each vertex keeps several messages for some faraway
neighbours (I mean, not only immediate neighbours; at most k hops away, e.g.
k = 5).
Now, I propose to distribute these messages to their
Hi Tobias,
What version are you using? In some recent versions, we had a couple of
large hardcoded sleeps on the Spark side.
-Sandy
On Fri, Dec 5, 2014 at 11:15 AM, Andrew Or and...@databricks.com wrote:
Hey Tobias,
As you suspect, the reason why it's slow is because the resource manager
If you're only interested in a particular instant, a simpler way is to
check the executors page on the Spark UI:
http://spark.apache.org/docs/latest/monitoring.html. By default each
executor runs one task per core, so you can see how many tasks are being
run at a given time and this translates
Sorry... I really don't have enough maven know-how to do this quickly. I tried the
pom below, and IntelliJ could find org.apache.spark.streaming.StreamingContext
and org.apache.spark.streaming.Seconds, but not
org.apache.spark.streaming.receiver.Receiver. Is there something specific I can
try?
Hey Stuti,
Did you start your standalone Master and Workers? You can do this through
sbin/start-all.sh (see
http://spark.apache.org/docs/latest/spark-standalone.html). Otherwise, I
would recommend launching your application from the command line through
bin/spark-submit. I am not sure if we
I'm a bit confused regarding expected behavior of unions. I'm running on 8
cores. I have an RDD that is used to collect cluster associations (cluster id,
content id, distance) for internal clusters as well as leaf clusters since I'm
doing hierarchical k-means and need all distances for sorting
Hi Steve et al.,
It is possible that there's just a lot of skew in your data, in which case
repartitioning is a good idea. Depending on how large your input data is
and how much skew you have, you may want to repartition to a larger number
of partitions. By the way you can just call
My submissions of Spark on YARN (CDH 5.2) resulted in a few thousand steps.
If I was running this on standalone cluster mode the query finished in 55s
but on YARN, the query was still running 30min later. Would the hard coded
sleeps potentially be in play here?
On Fri, Dec 5, 2014 at 11:23 Sandy
Increasing max failures is a way to do it, but it's probably a better idea
to keep your tasks from failing in the first place. Are your tasks failing
with exceptions from Spark or your application code? If from Spark, what is
the stack trace? There might be a legitimate Spark bug such that even
The command ran fine for me on master. Note that Hive does print an
exception in the logs, but that exception does not propagate to user code.
On Thu, Dec 4, 2014 at 11:31 PM, Jianshi Huang jianshi.hu...@gmail.com
wrote:
Hi,
I got exception saying Hive: NoSuchObjectException(message:table
And that is no different from how Hive has worked for a long time.
On Fri, Dec 5, 2014 at 11:42 AM, Michael Armbrust mich...@databricks.com
wrote:
The command ran fine for me on master. Note that Hive does print an
exception in the logs, but that exception does not propagate to user code.
It does not appear that the in-memory caching currently preserves the
information about the partitioning of the data so this optimization will
probably not work.
On Thu, Dec 4, 2014 at 8:42 PM, nitin nitin2go...@gmail.com wrote:
With some quick googling, I learnt that we can provide
Just an FYI - I can submit the SparkPi app to YARN in cluster mode on a
1-node m3.xlarge EC2 instance and the app finishes running
successfully in about 40 seconds. I just figured the 30 - 40 sec run time
was normal b/c of the submitting overhead that Andrew mentioned.
Denny, you can
Hi Denny,
Those sleeps were only at startup, so if jobs are taking significantly
longer on YARN, that should be a different problem. When you ran on YARN,
did you use the --executor-cores, --executor-memory, and --num-executors
arguments? When running against a standalone cluster, by default
No, RDDs are immutable. union() creates a new RDD, and does not modify
an existing RDD. Maybe this obviates the question. I'm not sure what
you mean about releasing from memory. If you want to repartition the
unioned RDD, you repartition the result of union(), not anything else.
On Fri, Dec 5,
Hi Ron,
Out of curiosity, why do you think that union is modifying an existing RDD
in place? In general all transformations, including union, will create new
RDDs, not modify old RDDs in place.
Here's a quick test:
scala> val firstRDD = sc.parallelize(1 to 5)
firstRDD:
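The same point, in a Spark-free sketch (Python tuples standing in for immutable RDDs):

```python
# Tuples are immutable, like RDDs: "union" builds a new value and leaves
# both inputs untouched.
first = (1, 2, 3, 4, 5)
second = (6, 7)

unioned = first + second  # like firstRDD.union(secondRDD): a NEW value

print(first)    # (1, 2, 3, 4, 5) -- unchanged, nothing modified in place
print(unioned)  # (1, 2, 3, 4, 5, 6, 7)
```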
Hey Sandy,
What are those sleeps for and do they still exist? We have seen about a
1min to 1:30 executor startup time, which is a large chunk for jobs that
run in ~10min.
Thanks,
Arun
On Fri, Dec 5, 2014 at 3:20 PM, Sandy Ryza sandy.r...@cloudera.com wrote:
Hi Denny,
Those sleeps were only
Likely this is not the case here, yet one thing to point out with YARN
parameters like --num-executors is that they should be specified *before*
app jar and app args on spark-submit command line otherwise the app only
gets the default number of containers which is 2.
On Dec 5, 2014 12:22 PM, Sandy
Hey Arun,
The sleeps would only cause maximum like 5 second overhead. The idea was
to give executors some time to register. On more recent versions, they
were replaced with the spark.scheduler.minRegisteredResourcesRatio and
spark.scheduler.maxRegisteredResourcesWaitingTime. As of 1.1, by
foreach also creates a new RDD, and does not modify an existing RDD.
However, in practice, nothing stops you from fiddling with the Java
objects inside an RDD when you get a reference to them in a method
like this. This is definitely a bad idea, as there is certainly no
guarantee that any other
What kind of Query are you performing?
You should set something like 2 partitions per core; that would be 400 MB per
partition.
As you have a lot of ram I suggest to cache the whole table, performance will
increase a lot.
Paolo
Sent from my Windows Phone
From:
All values in Hive are always nullable, though you should still not be
seeing this error.
It should be addressed by this patch:
https://github.com/apache/spark/pull/3150
On Fri, Dec 5, 2014 at 2:36 AM, Hao Ren inv...@gmail.com wrote:
Hi,
I am using SparkSQL on 1.1.0 branch.
The following
I'm experiencing the same problem when I try to run my app in a standalone
Spark cluster.
My use case, however, is closer to the problem documented in this thread:
http://apache-spark-user-list.1001560.n3.nabble.com/Please-help-running-a-standalone-app-on-a-Spark-cluster-td1596.html.
The
Also, are you using the latest master in this experiment? A PR merged
into master a couple of days ago will speed up the k-means three times.
See
https://github.com/apache/spark/commit/7fc49ed91168999d24ae7b4cc46fbb4ec87febc1
Sincerely,
DB Tsai
Can you try to run the same job using the assembly packaged by
make-distribution as we discussed in the other thread.
Sincerely,
DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai
On Fri, Dec 5, 2014 at
Hi,
The following example code is able to build the correct model.weights, but its
prediction value is zero. Am I passing the PredictOnValues incorrectly? I
also coded a batch version based on LinearRegressionWithSGD() with the same
train and test data, iteration, stepsize info, and it was
I've read in the documentation that RDDs can be run concurrently when
submitted in separate threads. I'm curious how the scheduler would handle
propagating these down to the tasks.
I have 3 RDDs:
- one RDD which loads some initial data, transforms it and caches it
- two RDDs which use the cached
You can set SPARK_PREPEND_CLASSES=1 and it should pick your new mllib
classes whenever you compile them.
I don't see anything similar for examples/, so if you modify example
code you need to re-build the examples module (package or install
- just compile won't work, since you need to build the
Apriori can be thought as a post-processing on product similarity graph...I
call it product similarity but for each product you build a node which
keeps distinct users visiting the product and two product nodes are
connected by an edge if the intersection > 0...you are assuming if no one
user
I use Spark in Java.
I want to access the vectors of RowMatrix M, thus I use M.rows(), which is
an RDD<Vector>.
I want to transform it to JavaRDD<Vector>, so I used the following command:
JavaRDD<Vector> data = JavaRDD.fromRDD(M.rows(),
    scala.reflect.ClassTag$.MODULE$.apply(Vector.class));
However, it
Yeah the main way to do this would be to have your own static cache of
connections. These could be using an object in Scala or just a static
variable in Java (for instance a set of connections that you can
borrow from).
- Patrick
On Thu, Dec 4, 2014 at 5:26 PM, Tobias Pfeiffer t...@preferred.jp
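A minimal sketch of such a static pool, in Python rather than Scala/Java, with a dummy class in place of a real JDBC connection:

```python
import queue

class FakeConnection:
    """Placeholder for a real JDBC connection."""
    count = 0
    def __init__(self):
        FakeConnection.count += 1

# Static (per-process / per-JVM) pool, created once and reused across batches.
_POOL = queue.Queue()
for _ in range(2):
    _POOL.put(FakeConnection())

def with_connection(work):
    conn = _POOL.get()        # borrow
    try:
        return work(conn)
    finally:
        _POOL.put(conn)       # return for the next batch

# Many "batches" reuse the same two connections.
for batch in range(10):
    with_connection(lambda conn: None)

print(FakeConnection.count)  # 2 -- no per-batch reconnect
```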
I suddenly also ran into the issue that maven is trying to download
snapshots that don't exist for other sub-projects.
Did something change in the maven build?
Does maven not have the capability to smartly compile the other sub-projects
that a sub-project depends on?
I'd rather avoid mvn install.
I see. The resulting SchemaRDD is returned so, like Michael said, the
exception does not propagate to user code.
However, printing out the following log is confusing :)
scala> sql("drop table if exists abc")
14/12/05 16:27:02 INFO ParseDriver: Parsing command: drop table if exists
abc
14/12/05
Hi,
I had to use Pig for some preprocessing and to generate Parquet files for
Spark to consume.
However, due to Pig's limitation, the generated schema contains Pig's
identifier
e.g.
sorted::id, sorted::cre_ts, ...
I tried to put the schema inside CREATE EXTERNAL TABLE, e.g.
create external
I think what changed is that core now has dependencies on other sub-projects.
OK... so I am forced to install stuff because maven cannot compile what is
needed. I will install.
On Fri, Dec 5, 2014 at 7:12 PM, Koert Kuipers ko...@tresata.com wrote:
i suddenly also run into the issue that maven is
Maven definitely compiles what is needed, but not if you tell it to
only compile one module alone. Unless you have previously built and
installed the other local snapshot artifacts it needs, that invocation
can't proceed because you have restricted it to build one module whose
dependencies don't
You can probably get around it with casting, but I ended up using
wrapRDD -- which is not a static method -- from another JavaRDD in
scope to address this more directly without casting or warnings. It's
not ideal but both should work, just a matter of which you think is
less hacky.
On Fri, Dec 5,
I've never used it, but reading the help it seems the -am option
might help here.
On Fri, Dec 5, 2014 at 4:47 PM, Sean Owen so...@cloudera.com wrote:
Maven definitely compiles what is needed, but not if you tell it to
only compile one module alone. Unless you have previously built and
I tried the following:
511 rm -rf
~/.m2/repository/org/apache/spark/spark-core_2.10/1.3.0-SNAPSHOT/
513 mvn -am -pl streaming package -DskipTests
[INFO] Reactor Summary:
[INFO]
[INFO] Spark Project Parent POM .. SUCCESS [4.976s]
[INFO] Spark Project Networking
Here's the solution I got after talking with Liancheng:
1) using backquotes `..` to wrap up all illegal characters:

val rdd = parquetFile(file)
val schema = rdd.schema.fields
  .map(f => s"`${f.name}` ${HiveMetastoreTypes.toMetastoreType(f.dataType)}")
  .mkString(",\n")
val ddl_13 = s
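The DDL construction is cut off above; the backquoting idea itself can be sketched without Spark (field names and Hive types below are illustrative):

```python
fields = [("sorted::id", "bigint"), ("sorted::cre_ts", "string")]

# Backquote every column name so identifiers like sorted::id, which are
# illegal as bare Hive identifiers, survive in the generated DDL.
columns = ",\n".join("`%s` %s" % (name, htype) for name, htype in fields)
ddl = "CREATE EXTERNAL TABLE t (\n%s\n)" % columns
print(ddl)
```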
I am trying to look at problems reading a data file over 4G. In my testing
I am trying to create such a file.
My plan is to create a fasta file (a simple format used in biology)
looking like
>1
TCCTTACGGAGTTCGGGTGTTTATCTTACTTATCGCGGTTCGCTGCCGCTCCGGGAGCCCGGATAGGCTGCGTTAATACCTAAGGAGCGCGTATTG
>2
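A sketch of generating such a file at a chosen size (random bases; the path, record length, and sizes are illustrative):

```python
import os
import random
import tempfile

def write_fasta(path, target_bytes, line_len=96, seed=0):
    """Write '>'-headed records of random bases until the file is at
    least target_bytes, to reproduce a large input locally."""
    rng = random.Random(seed)
    written = 0
    with open(path, "w") as out:
        rec = 1
        while written < target_bytes:
            seq = "".join(rng.choice("ACGT") for _ in range(line_len))
            chunk = ">%d\n%s\n" % (rec, seq)
            out.write(chunk)
            written += len(chunk)
            rec += 1

path = os.path.join(tempfile.gettempdir(), "test.fasta")
write_fasta(path, target_bytes=1024)  # scale target_bytes up past 4 GB
```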
Try the workaround for Windows found here:
http://qnalist.com/questions/4994960/run-spark-unit-test-on-windows-7.
This fixed the issue when calling rdd.saveAsTextFile(..) for me with Spark
v1.1.0 on windows 8.1 in local mode.
Summary of steps:
1) download compiled winutils.exe from
There are two exit calls in this code. If the args are wrong, spark-submit
will get the return code 101; but if the args are correct, spark-submit
cannot get the second return code, 100. What’s the difference between these
two exits? I was so confused.
I’m also confused. When I tried your code,
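What client mode gives you is an inspectable child exit status; a sketch with Python's subprocess, where a tiny child script stands in for the driver's two exit codes (101 for bad args, 100 otherwise):

```python
import subprocess
import sys

# Child stands in for the test driver: exit 101 on bad args, 100 otherwise.
child = "import sys; sys.exit(101 if len(sys.argv) != 2 else 100)"

bad = subprocess.run([sys.executable, "-c", child])              # no arg
good = subprocess.run([sys.executable, "-c", child, "BASE_DIR"])

# In client mode, spark-submit's exit status is observable the same way;
# the yarn-cluster problem above is that it came back 0 regardless.
print(bad.returncode, good.returncode)  # 101 100
```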
I'm trying to understand the conceptual difference between these two
configurations in term of performance (using Spark standalone cluster)
Case 1:
1 Node
60 cores
240G of memory
50G of data on local file system
Case 2:
6 Nodes
10 cores per node
40G of memory per node
50G of data on HDFS
nodes
It's an easy mistake to make... I wonder if an assertion could be
implemented that makes sure the type parameter is present.
We could use the NotNothing pattern
http://blog.evilmonkeylabs.com/2012/05/31/Forcing_Compiler_Nothing_checks/
but I wonder if it would just make the method signature
That depends! See inline. I am assuming that when you said replacing
local disk with HDFS in case 1, you are connected to a separate HDFS
cluster (like case 1) with a single 10G link. Also assuming that all
nodes (1 in case 1, and 6 in case 2) are the worker nodes, and the
spark application
Getting this on the home machine as well. Not referencing the spark cassandra
connector in libraryDependencies compiles.
I've recently updated IntelliJ to 14. Could that be causing an issue?
From: as...@live.com
To: yuzhih...@gmail.com
CC: user@spark.apache.org
Subject: RE: Adding Spark
Sorry for the delay in my response - for my spark calls for stand-alone and
YARN, I am using the --executor-memory and --total-executor-cores for the
submission. In standalone, my baseline query completes in ~40s while in
YARN, it completes in ~1800s. It does not appear from the RM web UI that
Okay, my bad for not testing out the documented arguments - once I use the
correct ones, the query completes in ~55s (I can probably make it
faster). Thanks for the help, eh?!
On Fri Dec 05 2014 at 10:34:50 PM Denny Lee denny.g@gmail.com wrote:
Sorry for the delay in my response
Hi -
I understand that one can use spark.deploy.defaultCores and spark.cores.max
to assign a fixed number of worker cores to different apps. However, instead of
statically assigning the cores, I would like Spark to dynamically assign the
cores to multiple apps. For example, when there is a