In my limited understanding, there must be a single leader master in the
cluster. If there are multiple leaders, it will lead to an unstable cluster,
as each master will keep scheduling independently. You should use ZooKeeper
for HA, so that standby masters can vote to find a new leader if the primary
one fails.
Thanks for the response.
But no, this does not answer the question.
The question was: Is there a way (via some API call) to query the number
and type of daemons currently running in the Spark cluster.
Regards
On Sun, Apr 26, 2015 at 10:12 AM, ayan guha guha.a...@gmail.com wrote:
In my
If I have 5 nodes and I wish to maintain 1 Master and 2 Workers on each
node, then in total I will have 5 Masters and 10 Workers.
Now, to maintain that setup, I would like to query Spark for the number of
Masters and Workers that are currently available using API calls, and then
take some action.
The Spark web UI offers a JSON interface with some of this information.
http://stackoverflow.com/a/29659630/877069
It's not an official API, so be warned that it may change unexpectedly
between versions, but you might find it helpful.
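(Not part of the original reply, but as an illustration: the standalone Master's web UI serves this JSON at the /json path. The host and port below are placeholders for your deployment, and the field names are the unofficial ones the UI happens to return.)

import scala.io.Source

object ClusterStatus {
  def main(args: Array[String]): Unit = {
    // Placeholder host/port; point this at your standalone Master's web UI.
    val json = Source.fromURL("http://spark-master:8080/json").mkString
    // The (unofficial) response includes fields such as "workers", "cores",
    // and "memory"; parse it with the JSON library of your choice.
    println(json)
  }
}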
Nick
On Sun, Apr 26, 2015 at 9:46 AM
Yes, you are totally right; I had mistaken the meaning of the method, and it
works out perfectly once I construct it as the transpose. I really appreciate
your help, thanks!
Date: Sat, 25 Apr 2015 20:57:04 -0700
Subject: Re: How can I retrieve item-pair after calculating similarity using
Not sure if there's a Spark-native way, but we've been using Consul for this.
M
On Apr 26, 2015, at 5:17 AM, James King jakwebin...@gmail.com wrote:
Thanks for the response.
But no, this does not answer the question.
The question was: Is there a way (via some API call) to query the
Hello All,
I'm trying to process a 3.5 GB file in standalone mode using Spark. I could
run my Spark job successfully on a 100 MB file and it works as expected. But
when I try to run it on the 3.5 GB file, I run into the error below:
15/04/26 12:45:50 INFO BlockManagerMaster: Updated info of block
I happened to hit the following issue that prevents me from using UDFs with
case classes: https://issues.apache.org/jira/browse/SPARK-6054.
The issue is already fixed in 1.3.1, but we are working on Amazon, and it
doesn't look like Amazon provides a Spark 1.3.1 deployment through their scripts.
Did someone
Understood.
On 26 Apr 2015 19:17, James King jakwebin...@gmail.com wrote:
Thanks for the response.
But no, this does not answer the question.
The question was: Is there a way (via some API call) to query the number
and type of daemons currently running in the Spark cluster.
Regards
On
I'm not sure what the expected performance should be for this amount of
data, but you could try to increase the timeout with the property
spark.akka.timeout to see if that helps.
Bryan
On Sun, Apr 26, 2015 at 6:57 AM, Deepak Gopalakrishnan dgk...@gmail.com
wrote:
Hello All,
I'm trying to
Hello,
Just to add a bit more context:
I have done that in the code, but I cannot see it change from 30 seconds in
the log.
.set("spark.executor.memory", "10g")
.set("spark.driver.memory", "20g")
.set("spark.akka.timeout", "6000")
PS: I understand that
Hi All,
I am trying to use a function within Spark SQL which accepts 2 to 4
arguments. I was able to get past the compilation errors; however, I see the
attached runtime exception when calling it from Spark SQL.
(refer attachment for the complete stacktrace- StackTraceFor_runTestInSQL)
The function
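(Not from the original message, since the actual function isn't shown. Below is a minimal sketch of registering and calling a three-argument UDF with the 1.3-era SQLContext API; the table, column names, UDF name, and function body are all made up for illustration.)

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object MultiArgUdfExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("multi-arg-udf").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // A toy table purely for illustration.
    sc.parallelize(Seq((1, 2, 10), (5, 7, 9))).toDF("a", "b", "c").registerTempTable("t")

    // Register a UDF taking three arguments (the body is made up).
    sqlContext.udf.register("clampAdd", (a: Int, b: Int, c: Int) => math.min(a + b, c))

    // Call it from SQL.
    sqlContext.sql("SELECT clampAdd(a, b, c) FROM t").show()

    sc.stop()
  }
}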
Hi Deepak - please direct this to the user@ list. This list is for
development of Spark itself.
On Sun, Apr 26, 2015 at 12:42 PM, Deepak Gopalakrishnan
dgk...@gmail.com wrote:
Hello All,
I'm trying to process a 3.5 GB file in standalone mode using Spark. I could
run my Spark job successfully on
Hello Patrick,
Sure. I've posted this on the user list as well. It would be cool to get a response.
Thanks
Deepak
On Mon, Apr 27, 2015 at 2:58 AM, Patrick Wendell pwend...@gmail.com wrote:
Hi Deepak - please direct this to the user@ list. This list is for
development of Spark itself.
On Sun, Apr 26,
What is the complexity of transformations and actions in Spark, such as
groupBy(), flatMap(), collect(), etc.?
What attributes (such as the number of partitions) do we need to factor in
while analyzing code that uses these operations?
You can work out the complexity of these operators by looking at RDD.scala.
There you will find, for example, what happens when you call map on an RDD:
it is a simple Scala map over an Iterator of type T. distinct has been
implemented with mapping and grouping on the
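(As an illustrative paraphrase of the point above, here is roughly how distinct is composed from a map and a key-based reduce in RDD.scala; the exact code differs slightly between Spark versions.)

import org.apache.spark.SparkContext._ // pair-RDD implicits, needed on older Spark versions
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

object DistinctSketch {
  // Simplified sketch of RDD.distinct: map keys each element by itself,
  // reduceByKey (one shuffle) collapses duplicate keys, and a final map unwraps.
  def distinctSketch[T: ClassTag](rdd: RDD[T]): RDD[T] =
    rdd.map(x => (x, null))
       .reduceByKey((x, _) => x)
       .map(_._1)
}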
Hi Huai,
I'm using Spark 1.3.1.
You're right. The dataset is not generated by Spark. It's generated by Pig
using Parquet 1.6.0rc7 jars.
Let me see if I can send a testing dataset to you...
Jianshi
On Sat, Apr 25, 2015 at 2:22 AM, Yin Huai yh...@databricks.com wrote:
oh, I missed that. It
Made some progress on this. Adding the Hive jars to the system classpath is
needed, but it looks like they need to come towards the end of the system
classes. Manually adding the Hive classpath in
Client.populateHadoopClasspath solved the issue. But a new issue has come
up. It looks like some Hive
Had an offline discussion with Jianshi; the dataset was generated by Pig.
Jianshi, could you please attach the output of parquet-schema
path-to-parquet-file? I guess this is a Parquet format
backwards-compatibility issue. Parquet hadn't standardized the
representation of LIST and MAP until
Hi,
I am wondering how we should understand the running time of Spark SQL
queries, for example the physical query plan and the running time of each
stage. Is there any guide that talks about this?
Thank you!
Best,
Wenlei
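(Not an answer from the thread, but as a pointer: since Spark 1.3 the query plan can be printed with DataFrame.explain, and per-stage running times then show up in the web UI's Stages tab. The app name, table, and columns below are made up for illustration.)

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object ExplainExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("explain-example").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Toy table purely for illustration.
    sc.parallelize(Seq(("eng", 1), ("eng", 2), ("ops", 3))).toDF("dept", "id").registerTempTable("employees")

    val query = sqlContext.sql("SELECT dept, COUNT(*) FROM employees GROUP BY dept")
    query.explain(true) // prints the parsed, analyzed, optimized, and physical plans
    query.show()        // per-stage running times then appear in the web UI's Stages tab

    sc.stop()
  }
}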
Hi Peter,
As for setting the parallelism, I would recommend setting it as early as
possible. Ideally, that would mean specifying the number of partitions when
loading the initial data rather than repartitioning later on (see the sketch
below).
In general, working with Vector columns should be better since the
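(A minimal sketch of the first point above, setting the partition count at load time instead of repartitioning later; the input path and the count of 200 are placeholders.)

import org.apache.spark.{SparkConf, SparkContext}

object LoadWithPartitions {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("load-partitioned"))
    // Ask for at least 200 partitions up front instead of calling repartition() later.
    val data = sc.textFile("hdfs:///path/to/input", minPartitions = 200)
    println(s"partitions: ${data.partitions.length}")
    sc.stop()
  }
}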
Unfortunately, the Pipelines API doesn't have multiclass logistic
regression yet, only binary. It's really a matter of modifying the current
implementation; I just added a JIRA for it:
https://issues.apache.org/jira/browse/SPARK-7159
You'll need to use the old LogisticRegression API to do
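(The message is cut off here, but for reference, a minimal sketch of multiclass training with the older spark.mllib API, whose LogisticRegressionWithLBFGS supports setNumClasses; the toy data is made up for illustration.)

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

object MulticlassLR {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("multiclass-lr").setMaster("local[*]"))

    // Toy three-class training data, purely for illustration.
    val training = sc.parallelize(Seq(
      LabeledPoint(0.0, Vectors.dense(0.0, 1.0)),
      LabeledPoint(1.0, Vectors.dense(1.0, 0.0)),
      LabeledPoint(2.0, Vectors.dense(1.0, 1.0))
    ))

    // The older mllib API supports multiclass via setNumClasses.
    val model = new LogisticRegressionWithLBFGS()
      .setNumClasses(3)
      .run(training)

    println(model.predict(Vectors.dense(1.0, 1.0)))
    sc.stop()
  }
}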
The configuration key should be spark.akka.askTimeout for this timeout.
The time unit is seconds.
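(For illustration, the earlier snippet from this thread with the key named above; the value is in seconds, and 600 is just an example.)

import org.apache.spark.SparkConf

object TimeoutConf {
  // Same style as the snippet earlier in the thread, but with the corrected key.
  val conf = new SparkConf()
    .setAppName("timeout-example")
    .set("spark.akka.askTimeout", "600") // seconds
}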
Best Regards,
Shixiong(Ryan) Zhu
2015-04-26 15:15 GMT-07:00 Deepak Gopalakrishnan dgk...@gmail.com:
Hello,
Just to add a bit more context:
I have done that in the code, but I cannot see it