Re: Querying Cluster State

2015-04-26 Thread ayan guha
In my limited understanding, there must be a single leader master in the cluster. If there are multiple leaders, the cluster becomes unstable, as each master will keep scheduling independently. You should use ZooKeeper for HA, so that standby masters can vote to elect a new leader if the primary
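
For reference, a minimal sketch of the ZooKeeper recovery setup this reply describes, using the documented standalone-HA properties (the ZooKeeper hosts are hypothetical), placed in conf/spark-env.sh on each master:

    SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER \
      -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181,zk3:2181 \
      -Dspark.deploy.zookeeper.dir=/spark"

With this in place, standby masters register with ZooKeeper and one is elected leader if the active master dies.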

Re: Querying Cluster State

2015-04-26 Thread James King
Thanks for the response, but no, this does not answer the question. The question was: is there a way (via some API call) to query the number and type of daemons currently running in the Spark cluster? Regards On Sun, Apr 26, 2015 at 10:12 AM, ayan guha guha.a...@gmail.com wrote: In my

Querying Cluster State

2015-04-26 Thread James King
If I have 5 nodes and I wish to maintain 1 Master and 2 Workers on each node, then in total I will have 5 Masters and 10 Workers. Now, to maintain that setup, I would like to query Spark regarding the number of Masters and Workers that are currently available using API calls, and then take some

Re: Querying Cluster State

2015-04-26 Thread Nicholas Chammas
The Spark web UI offers a JSON interface with some of this information. http://stackoverflow.com/a/29659630/877069 It's not an official API, so be warned that it may change unexpectedly between versions, but you might find it helpful. Nick On Sun, Apr 26, 2015 at 9:46 AM
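
A minimal sketch of hitting that endpoint, assuming a standalone master on the default web UI port (host name hypothetical):

    import scala.io.Source

    // Unofficial JSON view of cluster state: workers, cores, memory, running apps
    val state = Source.fromURL("http://master-host:8080/json").mkString
    println(state)

Each worker's web UI (port 8081 by default) exposes a similar /json endpoint.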

RE: How can I retrieve item-pair after calculating similarity using RowMatrix

2015-04-26 Thread Eric Zheng
Yes, you are totally right; I had mistaken the meaning of the method, and it works out perfectly when I construct it as the transpose. Really appreciate your help, thanks! Date: Sat, 25 Apr 2015 20:57:04 -0700 Subject: Re: How can I retrieve item-pair after calculating similarity using
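
For readers hitting the same confusion, a hedged sketch of the point being made: RowMatrix.columnSimilarities computes similarities between columns, so items must be laid out as columns (i.e. the transpose of a users-as-columns layout):

    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    val mat = new RowMatrix(rows)        // rows: RDD[Vector]; one row per user, one column per item
    val sims = mat.columnSimilarities()  // CoordinateMatrix of (item i, item j, cosine similarity)
    sims.entries.collect().foreach(e => println(s"items ${e.i}, ${e.j}: ${e.value}"))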

Re: Querying Cluster State

2015-04-26 Thread michal.klo...@gmail.com
Not sure if there's a Spark-native way, but we've been using Consul for this. M On Apr 26, 2015, at 5:17 AM, James King jakwebin...@gmail.com wrote: Thanks for the response. But no, this does not answer the question. The question was: is there a way (via some API call) to query the
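
A hedged sketch of the Consul approach (agent address hypothetical): Consul's HTTP catalog API lists the services each node has registered, which can stand in for a Spark-native daemon inventory:

    import scala.io.Source

    // JSON map of service name -> tags for everything registered with Consul
    val services = Source.fromURL("http://localhost:8500/v1/catalog/services").mkString
    println(services)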

Timeout Error

2015-04-26 Thread Deepak Gopalakrishnan
Hello All, I'm trying to process a 3.5GB file in standalone mode using Spark. I could run my Spark job successfully on a 100MB file and it works as expected. But when I try to run it on the 3.5GB file, I run into the below error: 15/04/26 12:45:50 INFO BlockManagerMaster: Updated info of block

SQL UDF returning object of case class; regression from 1.2.0

2015-04-26 Thread Ophir Cohen
I happened to hit the following issue, which prevents me from using UDFs with case classes: https://issues.apache.org/jira/browse/SPARK-6054. The issue is already fixed for 1.3.1, but we are working on Amazon and it looks like Amazon provides deployment of Spark 1.3.1 using their scripts. Did someone
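
For context, a minimal sketch of the pattern SPARK-6054 breaks (names hypothetical); this compiles but fails at query time on affected 1.3.0 builds:

    case class Point(x: Int, y: Int)

    sqlContext.udf.register("mkPoint", (x: Int, y: Int) => Point(x, y))
    sqlContext.sql("SELECT mkPoint(1, 2)").collect()  // throws on 1.3.0, fixed in 1.3.1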

Re: Querying Cluster State

2015-04-26 Thread ayan guha
Understood. On 26 Apr 2015 19:17, James King jakwebin...@gmail.com wrote: Thanks for the response. But no this does not answer the question. The question was: Is there a way (via some API call) to query the number and type of daemons currently running in the Spark cluster. Regards On

Re: Timeout Error

2015-04-26 Thread Bryan Cutler
I'm not sure what the expected performance should be for this amount of data, but you could try to increase the timeout with the property spark.akka.timeout to see if that helps. Bryan On Sun, Apr 26, 2015 at 6:57 AM, Deepak Gopalakrishnan dgk...@gmail.com wrote: Hello All, I'm trying to

Re: Timeout Error

2015-04-26 Thread Deepak Gopalakrishnan
Hello, Just to add a bit more context: I have done that in the code, but I cannot see it change from 30 seconds in the log. .set("spark.executor.memory", "10g") .set("spark.driver.memory", "20g") .set("spark.akka.timeout", "6000") PS: I understand that

Spark SQL - Registerfunction throwing MissingRequirementError in JavaMirror with primordial classloader

2015-04-26 Thread Sunita Arvind
Hi All, I am trying to use a function within Spark SQL which accepts 2-4 arguments. I was able to get through the compilation errors; however, I see the attached runtime exception when trying from Spark SQL (refer to the attachment StackTraceFor_runTestInSQL for the complete stack trace). The function
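
For comparison, a hedged sketch of a multi-argument UDF registration (names hypothetical; Spark 1.3 moved registerFunction to sqlContext.udf.register):

    // Three-argument UDF callable from SQL
    sqlContext.udf.register("inRange", (v: Long, lo: Long, hi: Long) => v >= lo && v <= hi)
    sqlContext.sql("SELECT * FROM events WHERE inRange(ts, 0, 100)")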

Re: Spark timeout issue

2015-04-26 Thread Patrick Wendell
Hi Deepak - please direct this to the user@ list. This list is for development of Spark itself. On Sun, Apr 26, 2015 at 12:42 PM, Deepak Gopalakrishnan dgk...@gmail.com wrote: Hello All, I'm trying to process a 3.5GB file on standalone mode using spark. I could run my spark job succesfully on

Re: Spark timeout issue

2015-04-26 Thread Deepak Gopalakrishnan
Hello Patrick, Sure. I've posted this on user as well. Would be cool to get a response. Thanks Deepak On Mon, Apr 27, 2015 at 2:58 AM, Patrick Wendell pwend...@gmail.com wrote: Hi Deepak - please direct this to the user@ list. This list is for development of Spark itself. On Sun, Apr 26,

Complexity of transformations in Spark

2015-04-26 Thread Vijayasarathy Kannan
What is the complexity of transformations and actions in Spark, such as groupBy(), flatMap(), collect(), etc.? What attributes (such as the number of partitions) do we need to factor in while analyzing code that uses these operations?

Re: Complexity of transformations in Spark

2015-04-26 Thread Zoltán Zvara
You can calculate the complexity of these operators by looking at RDD.scala. There you will find, for example, what happens when you call map on an RDD: it's a simple Scala map function on a simple Iterator of type T. Distinct has been implemented with mapping and grouping on the
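
To illustrate the point about reading RDD.scala, a simplified sketch of how distinct is built out of map and reduceByKey in the Spark source:

    import scala.reflect.ClassTag
    import org.apache.spark.SparkContext._
    import org.apache.spark.rdd.RDD

    // distinct = tag each element as a key, collapse duplicate keys, drop the tag
    def distinctSketch[T: ClassTag](rdd: RDD[T]): RDD[T] =
      rdd.map(x => (x, null)).reduceByKey((x, _) => x).map(_._1)

So the cost of distinct is roughly that of a shuffle over all elements, which is why the number of partitions is one of the attributes to factor in.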

Re: Parquet error reading data that contains array of structs

2015-04-26 Thread Jianshi Huang
Hi Huai, I'm using Spark 1.3.1. You're right. The dataset is not generated by Spark. It's generated by Pig using Parquet 1.6.0rc7 jars. Let me see if I can send a testing dataset to you... Jianshi On Sat, Apr 25, 2015 at 2:22 AM, Yin Huai yh...@databricks.com wrote: oh, I missed that. It

Re: sparksql - HiveConf not found during task deserialization

2015-04-26 Thread Manku Timma
Made some progress on this. Adding the Hive jars to the system classpath is needed, but it looks like they need to be towards the end of the system classes. Manually adding the Hive classpath into Client.populateHadoopClasspath solved the issue. But a new issue has come up: it looks like some Hive
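
For anyone who wants to avoid patching Spark internals, a hedged alternative using the standard classpath properties (jar paths hypothetical); note this is not the populateHadoopClasspath change described above, and the resulting classpath ordering may differ:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.driver.extraClassPath", "/opt/hive/lib/*")
      .set("spark.executor.extraClassPath", "/opt/hive/lib/*")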

Re: Parquet error reading data that contains array of structs

2015-04-26 Thread Cheng Lian
Had an offline discussion with Jianshi; the dataset was generated by Pig. Jianshi - Could you please attach the output of parquet-schema path-to-parquet-file? I guess this is a Parquet format backwards-compatibility issue; Parquet hadn't standardized the representation of LIST and MAP until
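
For readers following along, the standardized LIST layout from the parquet-format spec looks like the sketch below (field names illustrative); older writers, including Pig's Parquet storer, emitted different, now-legacy layouts, which is exactly what the backwards-compatibility rules have to handle:

    optional group my_list (LIST) {
      repeated group list {
        optional binary element (UTF8);
      }
    }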

Understand the running time of SparkSQL queries

2015-04-26 Thread Wenlei Xie
Hi, I am wondering how we should understand the running time of SparkSQL queries, for example the physical query plan and the running time of each stage? Is there any guide covering this? Thank you! Best, Wenlei
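
A hedged starting point, assuming the Spark 1.3 DataFrame API (table and query hypothetical): explain prints the logical and physical plans, and per-stage running times appear on the driver's web UI (port 4040 by default):

    val df = sqlContext.sql("SELECT dept, COUNT(*) FROM employees GROUP BY dept")
    df.explain(true)  // extended = true prints both logical and physical plans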

Re: [Ml][Dataframe] Ml pipeline dataframe repartitioning

2015-04-26 Thread Joseph Bradley
Hi Peter, As far as setting the parallelism goes, I would recommend setting it as early as possible. Ideally, that would mean specifying the number of partitions when loading the initial data (rather than repartitioning later on). In general, working with Vector columns should be better, since the
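
A minimal sketch of the "set parallelism at load time" advice, assuming text input (path hypothetical); the second argument is the minimum number of partitions:

    val training = sc.textFile("hdfs:///data/train.txt", 200)  // ~200 partitions from the start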

Re: Multiclass classification using Ml logisticRegression

2015-04-26 Thread Joseph Bradley
Unfortunately, the Pipelines API doesn't have multiclass logistic regression yet, only binary. It's really a matter of modifying the current implementation; I just added a JIRA for it: https://issues.apache.org/jira/browse/SPARK-7159 You'll need to use the old LogisticRegression API to do
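
A hedged sketch of the old MLlib API the reply points to; LogisticRegressionWithLBFGS supports more than two classes via setNumClasses:

    import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.rdd.RDD

    def trainMulticlass(data: RDD[LabeledPoint], numClasses: Int) =
      new LogisticRegressionWithLBFGS()
        .setNumClasses(numClasses)  // > 2 gives multinomial logistic regression
        .run(data)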

Re: Timeout Error

2015-04-26 Thread Shixiong Zhu
The configuration key for this timeout should be spark.akka.askTimeout. The time unit is seconds. Best Regards, Shixiong(Ryan) Zhu 2015-04-26 15:15 GMT-07:00 Deepak Gopalakrishnan dgk...@gmail.com: Hello, Just to add a bit more context: I have done that in the code, but I cannot see it
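
A minimal sketch of the corrected configuration from this thread (app name hypothetical; the value is in seconds):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("LargeFileJob")
      .set("spark.executor.memory", "10g")
      // note: driver memory generally must be set before the driver JVM starts
      // (e.g. via spark-submit), so setting it here may have no effect
      .set("spark.driver.memory", "20g")
      .set("spark.akka.askTimeout", "600")  // seconds, not milliseconds
    val sc = new SparkContext(conf)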