Repartition inefficient
I wonder if anyone has any tips for using repartition? It seems that when you call the repartition method, the entire RDD gets split up, shuffled, and redistributed. This is an extremely heavy task if you have a large HDFS dataset and all you want to do is make sure your RDD is balanced / data skew is minimal. I have tried coalesce(shuffle = false), but this seems to be somewhat ineffective at balancing the blocks. Care to share your experiences?

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Repartition-inefficient-tp13587.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
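For concreteness, here is the trade-off as I understand it (a sketch, not authoritative; `rdd` is any existing RDD and the partition counts are made up):

```scala
// repartition(n) is just coalesce(n, shuffle = true): every record is
// hash-partitioned and moved over the network -- a full shuffle, but the
// resulting partitions are evenly balanced.
val balanced = rdd.repartition(1000)

// coalesce(n, shuffle = false) only merges existing partitions that are
// co-located on the same executor; no data crosses the network, so it is
// cheap, but it can only reduce the partition count and it inherits
// whatever skew the input partitions already had.
val merged = rdd.coalesce(1000, shuffle = false)
```

So as far as I can tell, cheap and balanced are mutually exclusive here: if the data has to move to a different node to fix the skew, you pay for the shuffle.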
Re: Running spark-shell (or queries) over the network (not from master)
I think that should be possible. Make sure Spark is installed on your local machine and is the same version as on the cluster.
Re: heterogeneous cluster hardware
I've got a stack of Dell commodity servers -- RAM ~8 to 32 GB, single or dual quad-core processors per machine. I think I will have them loaded with CentOS. Eventually, I may want to add GPUs on the nodes to handle linear algebra operations. My idea has been: 1) to find a way to configure Spark to allocate different resources per machine, per job -- at least have a standard executor, and allow different machines to have different numbers of executors; 2) to make (using vanilla Spark) a pre-run optimization phase which benchmarks the throughput of each node (per hardware) and repartitions the dataset to use the hardware more efficiently, rather than rely on Spark speculation -- which has always seemed a suboptimal way to balance the load across several differing machines.
Re: heterogeneous cluster hardware
Jörn, thanks for the post... Unfortunately, I am stuck with the hardware I have and might not be able to get budget allocated for a new stack of servers when I've already got so many OK servers on hand. And even more unfortunately, a large subset of these machines are, shall we say, extremely humble in their CPUs and RAM. My group has exclusive access to the machines, and rarely do we need to run concurrent jobs -- what I really want is max capacity per job. The applications are massive machine-learning experiments, so I'm not sure about the feasibility of breaking them up into concurrent jobs. At this point, I am seriously considering dropping down to Akka-level programming. Why, oh why, doesn't Spark allow for allocating a variable number of worker threads per host? This would seem to be the correct point of abstraction to allow the construction of massive clusters using on-hand hardware (the scheduler probably wouldn't have to change at all).

On Thu, Aug 21, 2014 at 9:25 AM, Jörn Franke wrote:
Hi, well, you could use Mesos or YARN 2 to define resources per job -- you can give only as much resources (cores, memory, etc.) per machine as your worst machine has. The rest is done by Mesos or YARN. By doing this you avoid a per-machine resource assignment without any disadvantages. You can run other jobs in parallel without any problems, and older machines won't get overloaded. However, you should take care that your cluster does not get too heterogeneous. Best regards, Jörn

--
A N T H O N Y Ⓙ S C H U L T E
Re: heterogeneous cluster hardware
This is what I thought the simplest method would be, but I can't seem to figure out how to configure it. You can set SPARK_WORKER_INSTANCES to control the number of worker processes per node, and SPARK_WORKER_MEMORY to set how much total memory workers have to give executors (e.g. 1000m, 2g) -- but I believe the memory setting is shared across all workers! So when worker memory gets set by the master (I tried setting it in spark-env.sh on a worker, but it was overridden by the setting on the master), it is not multiplied by the number of workers? (Also, I'm not sure SPARK_WORKER_INSTANCES isn't also overridden by the master...) How would you suggest setting this up?
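For what it's worth, the per-node configuration I had in mind looks roughly like this (the values are illustrative, and whether the master respects per-slave settings is exactly my open question):

```shell
# spark-env.sh on a beefy node (e.g. 32 GB RAM, dual quad-core)
export SPARK_WORKER_INSTANCES=2   # two worker processes on this host
export SPARK_WORKER_CORES=4       # cores offered by EACH worker
export SPARK_WORKER_MEMORY=12g    # memory offered by EACH worker

# spark-env.sh on a humble node (e.g. 8 GB RAM, single quad-core) would
# instead set:
#   SPARK_WORKER_INSTANCES=1
#   SPARK_WORKER_CORES=4
#   SPARK_WORKER_MEMORY=6g
```

My reading of the standalone docs is that SPARK_WORKER_MEMORY is per worker process, so two workers at 12g would offer 24 GB total on the beefy node -- but that is precisely the behavior I could not confirm.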
Re: Cached RDD Block Size - Uneven Distribution
I am having a similar problem: I have a large dataset in HDFS and (for a few possible reasons, including a filter operation, and some of my computation nodes simply not being HDFS datanodes) have a large skew in my RDD blocks: the master node always has the most, while the worker nodes have few (and the non-HDFS nodes have none). What is the preferred way to rebalance this RDD across the cluster? Some of my nodes are very underutilized :( I have tried .coalesce(15000, shuffle = false), which helps a little, but things are still not evenly distributed...
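In case it helps anyone diagnosing the same thing, this is the sketch I've been using to quantify the skew before and after a rebalance (`rdd` here stands for whatever RDD you are inspecting):

```scala
// Count records in each partition without collecting the data itself.
// it.size consumes each partition's iterator once, on the executor.
val sizes = rdd
  .mapPartitionsWithIndex { case (i, it) => Iterator((i, it.size)) }
  .collect()

// Print the ten fattest partitions -- a healthy RDD should show roughly
// uniform counts here.
sizes.sortBy(-_._2).take(10).foreach { case (i, n) =>
  println(s"partition $i: $n records")
}
```

If the top few partitions dominate, coalesce(shuffle = false) won't fix it, since it never splits a partition.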
Re: All of the tasks have been completed but the Stage is still shown as Active?
Similarly, I am seeing tasks moved to the completed section which apparently haven't finished all their elements (succeeded/total < 1)... Is this related?
heterogeneous cluster hardware
I'm sure this must be a fairly common use case for Spark, yet I have not found a satisfactory discussion of it on the Spark website or forum: I work at a company with a lot of previous-generation server hardware sitting idle -- I want to add this hardware to my Spark cluster to increase performance! BUT: it is unclear whether the Spark master will be able to properly apportion jobs to the slaves if they have differing hardware specs. As I understand it, the default Spark launch scripts are incompatible with per-node hardware configurations, but it seems I could compose a custom spark-env.sh file for each slave to fully utilize its hardware. Would the master take these per-node configurations into consideration when allocating work? Or would the cluster necessarily fall to the lowest-common-hardware-denominator? Is this an area which needs development? I might be willing to look into introducing this functionality if it is lacking.
Re: Can't see any thing one the storage panel of application UI
I am (not) seeing this also... No items in the storage UI page, using 1.0 with HDFS...
Re: Can't see any thing one the storage panel of application UI
Good idea, Andrew... Using this feature allowed me to debug that my app wasn't caching properly -- the UI is working as designed for me in 1.0. It might be a good idea to say "no cached blocks" instead of showing an empty page... just a thought.

On Mon, Aug 4, 2014 at 1:17 PM, Andrew Or wrote:
Hi all, could you check with `sc.getExecutorStorageStatus` to see if the blocks are in fact present? This returns a list of StorageStatus objects, and you can check whether each status' `blocks` is non-empty. If the blocks do exist, then this is likely a bug in the UI. There have been a couple of UI fixes since 1.0. Could you check if this is still a problem on the latest master: https://github.com/apache/spark
Andrew
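Spelled out, the check Andrew describes looks roughly like this (a sketch against the 1.0-era API; `sc` is your live SparkContext):

```scala
// Verify cached blocks exist independently of what the storage UI shows.
val statuses = sc.getExecutorStorageStatus

// One StorageStatus per block manager; `blocks` maps BlockId -> BlockStatus.
statuses.foreach { status =>
  println(s"${status.blockManagerId.host}: ${status.blocks.size} cached blocks")
}

val anyCached = statuses.exists(_.blocks.nonEmpty)
if (!anyCached)
  println("Nothing is cached -- was .cache() followed by an action?")
```

In my case every `blocks` map was empty, which is how I found out the empty UI page was telling the truth.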
SparkSQL- saveAsParquetFile
Hi all: I am attempting to execute a simple test of the SparkSQL capability of persisting to Parquet files... My code is:

val conf = new SparkConf()
  .setMaster("local[1]")
  .setAppName("test")
implicit val sc = new SparkContext(conf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._

case class Trivial(trivial: String = "trivial")

val rdd = sc.parallelize(Seq(Trivial("s"), Trivial("T")))
rdd.saveAsParquetFile("trivial.parquet")

When this code executes, a trivial.parquet directory is created, and a _temporary subdirectory, but there is no content in these files... only directories. Is there an obvious mistake in my code which would cause this execution to fail? Thank you -- Tony
SparkSQL- Nested CaseClass Parquet failure
Hello all: I am attempting to persist a Parquet file comprised of a SchemaRDD of nested case classes... Creating the SchemaRDD object seems to work fine, but an exception is thrown when I attempt to persist it to a Parquet file... My code:

case class Trivial(trivial: String = "trivial", lt: LessTrivial)
case class LessTrivial(i: Int = 1)

val conf = new SparkConf()
  .setMaster("local[1]")
  .setAppName("test")
implicit val sc = new SparkContext(conf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._

val rdd = sqlContext.createSchemaRDD(
  sc.parallelize(Seq(Trivial("s", LessTrivial(1)), Trivial("T", LessTrivial(2))))) // no exceptions

rdd.saveAsParquetFile("trivial.parquet1")
// exception: java.lang.RuntimeException: Unsupported datatype
//   StructType(List(StructField(i,IntegerType,true)))

Is persisting SchemaRDDs containing nested case classes supported for Parquet files?
Re: SparkSQL- Nested CaseClass Parquet failure
Thanks. That might be a good note to add to the official Programming Guide...

On Thu, Jun 26, 2014 at 5:05 PM, Michael Armbrust wrote:
Nested parquet is not supported in 1.0, but is part of the upcoming 1.0.1 release.

--
A N T H O N Y Ⓙ S C H U L T E
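Until 1.0.1 lands, the workaround I ended up with was flattening the nesting by hand before saving (a sketch building on the snippet earlier in the thread; the flattened class is mine, not from the thread):

```scala
// Hypothetical workaround for Spark 1.0: flatten the nested case class so
// the schema contains only primitive columns, which 1.0's Parquet writer
// handles. Assumes sc, sqlContext, and `import sqlContext._` as above.
case class Trivial(trivial: String = "trivial", lt: LessTrivial)
case class LessTrivial(i: Int = 1)
case class TrivialFlat(trivial: String, lt_i: Int) // primitives only

val data = sc.parallelize(Seq(Trivial("s", LessTrivial(1)),
                              Trivial("T", LessTrivial(2))))
val flat = sqlContext.createSchemaRDD(
  data.map(t => TrivialFlat(t.trivial, t.lt.i)))

flat.saveAsParquetFile("trivial_flat.parquet") // no nested StructType in the schema
```

Clunky, but it avoids the Unsupported datatype StructType error entirely.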