Repartition inefficient

2014-09-05 Thread anthonyjschu...@gmail.com
I wonder if anyone has any tips for using repartition? It seems that when you call the repartition method, the entire RDD gets split up, shuffled, and redistributed... This is an extremely heavy task if you have a large HDFS dataset and all you want to do is make sure your RDD is balanced/ data
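
A minimal Scala sketch of the trade-off described above (the HDFS path and partition counts are placeholders): repartition always performs a full shuffle of every record, while coalesce only merges existing partitions without a shuffle, so it is cheaper but cannot fix skew.

  import org.apache.spark.{SparkConf, SparkContext}

  object RepartitionSketch {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(new SparkConf().setAppName("repartition-sketch"))

      // Hypothetical HDFS path; the input may arrive with skewed partition sizes.
      val lines = sc.textFile("hdfs:///data/large-dataset")

      // repartition(n) shuffles every record to produce n evenly sized partitions.
      val rebalanced = lines.repartition(200)

      // coalesce(n) merges existing partitions without a shuffle; it can shrink
      // the partition count but does not redistribute data evenly.
      val narrowed = lines.coalesce(50)

      println(s"rebalanced: ${rebalanced.partitions.length}, narrowed: ${narrowed.partitions.length}")
      sc.stop()
    }
  }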

Re: Running spark-shell (or queries) over the network (not from master)

2014-09-05 Thread anthonyjschu...@gmail.com
I think that should be possible. Make sure Spark is installed on your local machine and is the same version as on the cluster.
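
A hedged illustration of the setup being suggested (the hostname and port are placeholders): a driver running on the local machine points its SparkConf at the cluster's standalone master, assuming the local Spark install matches the cluster's version.

  import org.apache.spark.{SparkConf, SparkContext}

  object RemoteDriverSketch {
    def main(args: Array[String]): Unit = {
      // "spark-master.example.com:7077" is a placeholder for the cluster's master URL.
      val conf = new SparkConf()
        .setAppName("remote-driver-sketch")
        .setMaster("spark://spark-master.example.com:7077")

      val sc = new SparkContext(conf)
      println(sc.parallelize(1 to 100).count())  // trivial job to confirm connectivity
      sc.stop()
    }
  }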

Re: heterogeneous cluster hardware

2014-08-21 Thread anthonyjschu...@gmail.com
I've got a stack of Dell commodity servers -- RAM ~(8 to 32 GB), single or dual quad-core processors per machine. I think I will have them loaded with CentOS. Eventually, I may want to add GPUs on the nodes to handle linear algebra operations... My idea has been: 1) to find a way to configure

Re: heterogeneous cluster hardware

2014-08-21 Thread anthonyjschu...@gmail.com
Jörn, thanks for the post... Unfortunately, I am stuck with the hardware I have and might not be able to get budget allocated for a new stack of servers when I've already got so many ok servers on hand... And even more unfortunately, a large subset of these machines are... shall we say...

Re: heterogeneous cluster hardware

2014-08-21 Thread anthonyjschu...@gmail.com
This is what I thought the simplest method would be, but I can't seem to figure out how to configure it -- you set SPARK_WORKER_INSTANCES to control the number of worker processes per node, and SPARK_WORKER_MEMORY to set how much total memory the workers have to give executors (e.g.
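
A hedged conf/spark-env.sh sketch of those two settings on one hypothetical 32 GB, 8-core node; because each node reads its own copy of this file, the values can differ per machine, which is what makes mixed hardware workable:

  # conf/spark-env.sh on a hypothetical 32 GB / 8-core node (values are illustrative)
  export SPARK_WORKER_INSTANCES=2   # worker processes started on this node
  export SPARK_WORKER_MEMORY=12g    # memory each worker may hand out to executors
  export SPARK_WORKER_CORES=4       # cores each worker may hand out to executors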

Re: Cached RDD Block Size - Uneven Distribution

2014-08-13 Thread anthonyjschu...@gmail.com
I am having a similar problem: I have a large dataset in HDFS and (for a few possible reasons, including a filter operation and some of my computation nodes simply not being HDFS datanodes) have a large skew on my RDD blocks: the master node always has the most, while the worker nodes have few...
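
A hedged Scala sketch of one common mitigation for this kind of skew (the path and partition count are placeholders): explicitly repartition after the filter and before caching, so the surviving records are spread across all executors instead of sitting where the original HDFS blocks happened to live.

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.storage.StorageLevel

  object CacheBalanceSketch {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(new SparkConf().setAppName("cache-balance-sketch"))

      val filtered = sc.textFile("hdfs:///data/events")   // hypothetical path
        .filter(_.contains("ERROR"))                       // a filter can leave most partitions nearly empty

      // Shuffle the survivors evenly across the cluster before caching, so cached
      // blocks are not concentrated on the nodes that held the original HDFS blocks.
      val balanced = filtered.repartition(128).persist(StorageLevel.MEMORY_ONLY)
      println(balanced.count())
      sc.stop()
    }
  }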

Re: All of the tasks have been completed but the Stage is still shown as Active?

2014-08-07 Thread anthonyjschu...@gmail.com
Similarly, I am seeing tasks moved to the completed section which apparently haven't finished all elements... (succeeded/total 1)... Is this related?

heterogeneous cluster hardware

2014-08-06 Thread anthonyjschu...@gmail.com
I'm sure this must be a fairly common use-case for Spark, yet I have not found a satisfactory discussion of it on the Spark website or forum: I work at a company with a lot of previous-generation server hardware sitting idle -- I want to add this hardware to my Spark cluster to increase

Re: Can't see any thing one the storage panel of application UI

2014-08-04 Thread anthonyjschu...@gmail.com
I am (not) seeing this also... No items in the storage UI page. Using 1.0 with HDFS...

Re: Can't see any thing one the storage panel of application UI

2014-08-04 Thread anthonyjschu...@gmail.com
Good idea, Andrew... Using this feature allowed me to debug that my app wasn't caching properly -- the UI is working as designed for me in 1.0. It might be a good idea to show "no cached blocks" instead of an empty page... just a thought...
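
For reference, a minimal Scala sketch (hypothetical path) of the point above: the Storage page only lists blocks after an RDD has been marked for caching and an action has actually materialized it.

  import org.apache.spark.{SparkConf, SparkContext}

  object StorageUiSketch {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(new SparkConf().setAppName("storage-ui-sketch"))
      val data = sc.textFile("hdfs:///data/sample").cache()  // hypothetical path
      data.count()  // caching is lazy: the Storage tab stays empty until an action runs
      sc.stop()
    }
  }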

SparkSQL- saveAsParquetFile

2014-06-26 Thread anthonyjschu...@gmail.com
Hi all: I am attempting to execute a simple test of the SparkSQL system's capability of persisting to Parquet files... My code is: val conf = new SparkConf() .setMaster("local[1]") .setAppName("test") implicit val sc = new SparkContext(conf) val sqlContext = new
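
For comparison, a hedged, complete version of the kind of test being described (the class name and output path are placeholders), following the Spark 1.0 SQL pattern of converting an RDD of case classes to a SchemaRDD and persisting it with saveAsParquetFile:

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.sql.SQLContext

  case class Record(key: Int, value: String)

  object ParquetSaveSketch {
    def main(args: Array[String]): Unit = {
      val conf = new SparkConf().setMaster("local[1]").setAppName("parquet-save-sketch")
      val sc = new SparkContext(conf)
      val sqlContext = new SQLContext(sc)
      import sqlContext.createSchemaRDD   // implicit conversion: RDD[case class] -> SchemaRDD

      val records = sc.parallelize(1 to 10).map(i => Record(i, s"row-$i"))
      records.saveAsParquetFile("/tmp/records.parquet")   // hypothetical output path
      sc.stop()
    }
  }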

SparkSQL- Nested CaseClass Parquet failure

2014-06-26 Thread anthonyjschu...@gmail.com
Hello all: I am attempting to persist a SchemaRDD of nested case classes to a Parquet file... Creating the SchemaRDD object seems to work fine, but an exception is thrown when I attempt to persist this object to a Parquet file... my code: case class Trivial(trivial: String = "trivial",
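
A hedged sketch of the schema shape in question (names are placeholders): a case class whose field is itself a case class. On Spark 1.0 this nesting is what saveAsParquetFile rejects; per the reply below, nested Parquet support lands in 1.0.1.

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.sql.SQLContext

  case class Inner(trivial: String = "trivial")
  case class Outer(id: Int, inner: Inner)   // the nested case-class field

  object NestedParquetSketch {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(new SparkConf().setMaster("local[1]").setAppName("nested-parquet-sketch"))
      val sqlContext = new SQLContext(sc)
      import sqlContext.createSchemaRDD

      val rows = sc.parallelize(1 to 3).map(i => Outer(i, Inner()))
      // Throws on Spark 1.0; nested Parquet support is expected in 1.0.1.
      rows.saveAsParquetFile("/tmp/nested.parquet")   // hypothetical output path
      sc.stop()
    }
  }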

Re: SparkSQL- Nested CaseClass Parquet failure

2014-06-26 Thread anthonyjschu...@gmail.com
Thanks. That might be a good note to add to the official Programming Guide... On Thu, Jun 26, 2014 at 5:05 PM, Michael Armbrust [via Apache Spark User List] ml-node+s1001560n8382...@n3.nabble.com wrote: Nested Parquet is not supported in 1.0, but is part of the upcoming 1.0.1 release.