Repartition inefficient

2014-09-05 Thread anthonyjschu...@gmail.com
I wonder if anyone has any tips for using repartition?

It seems that when you call the repartition method, the entire RDD gets
split up, shuffled, and redistributed... This is an extremely heavy task if
you have a large HDFS dataset and all you want to do is make sure your RDD
is balanced / data skew is minimal...

I have tried coalesce(shuffle=false), but this seems to be somewhat
ineffective at balancing the blocks.
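
For concreteness, what I am comparing is roughly this (the path and
partition count are made up):

  val raw = sc.textFile("hdfs:///path/to/large/dataset")

  // full shuffle: evenly sized partitions, but the whole dataset moves
  val evenButExpensive = raw.repartition(1000)

  // no shuffle: cheap, but only merges existing partitions, so skew can remain
  val cheapButSkewed = raw.coalesce(1000, shuffle = false)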

Care to share your experiences?



Re: Running spark-shell (or queries) over the network (not from master)

2014-09-05 Thread anthonyjschu...@gmail.com
I think that should be possible. Make sure Spark is installed on your local
machine and that it is the same version as the one on the cluster.
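
If you are building a SparkContext yourself (rather than using spark-shell),
it is roughly the following -- the master host name is made up; 7077 is the
default standalone master port:

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf()
    .setMaster("spark://your-master-host:7077")
    .setAppName("remote-test")
  val sc = new SparkContext(conf)

  sc.parallelize(1 to 1000).count()  // quick end-to-end check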



Re: heterogeneous cluster hardware

2014-08-21 Thread anthonyjschu...@gmail.com
I've got a stack of Dell commodity servers -- RAM ranging from 8 to 32 GB, with a
single or dual quad-core processor per machine. I think I will have them loaded with
CentOS. Eventually, I may want to add GPUs on the nodes to handle linear
algebra operations...

My idea has been:

1) to find a way to configure Spark to allocate different resources
per machine, per job -- at least have a standard executor... and allow
different machines to have different numbers of executors.

2) to make (using vanilla Spark) a pre-run optimization phase which benchmarks
the throughput of each node (per its hardware) and repartitions the dataset to
use the hardware more efficiently, rather than relying on Spark speculation --
which has always seemed a suboptimal way to balance the load across several
differing machines.
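
To make idea (2) concrete, the kind of thing I have in mind is a hand-rolled
Partitioner that sends proportionally more keys to partitions hosted on the
faster boxes. A rough sketch only -- the weights are made up; in practice they
would come from the benchmark phase:

  import org.apache.spark.Partitioner

  // One weight per target partition; a bigger weight means that partition
  // should receive proportionally more keys.
  class WeightedPartitioner(weights: Array[Double]) extends Partitioner {
    private val total = weights.sum
    // cumulative boundaries in [0, 1], e.g. Array(0.4, 0.8, 1.0)
    private val bounds = weights.scanLeft(0.0)(_ + _).tail.map(_ / total)

    override def numPartitions: Int = weights.length

    override def getPartition(key: Any): Int = {
      // map the key's hash into [0, 1] and find the bucket it falls in
      val u = (key.hashCode.toDouble / Int.MaxValue + 1.0) / 2.0
      val i = bounds.indexWhere(u <= _)
      if (i < 0) weights.length - 1 else i
    }
  }

  // usage on a pair RDD (weights made up):
  // val rebalanced = pairRdd.partitionBy(new WeightedPartitioner(Array(4.0, 4.0, 1.0, 1.0)))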




Re: heterogeneous cluster hardware

2014-08-21 Thread anthonyjschu...@gmail.com
Jörn, thanks for the post...

Unfortunately, I am stuck with the hardware I have and might not be
able to get budget allocated for a new stack of servers when I've
already got so many OK servers on hand... And even more
unfortunately, a large subset of these machines are... shall we say...
extremely humble in their CPUs and RAM. My group has exclusive access
to the machines and rarely do we need to run concurrent jobs -- what I
really want is max capacity per job. The applications are massive
machine-learning experiments, so I'm not sure about the feasibility of
breaking them up into concurrent jobs. At this point, I am seriously
considering dropping down to Akka-level programming. Why, oh why,
doesn't Spark allow for allocating a variable number of worker threads
per host? That would seem to be the correct point of abstraction to allow
the construction of massive clusters out of on-hand hardware (the
scheduler probably wouldn't have to change at all).

On Thu, Aug 21, 2014 at 9:25 AM, Jörn Franke [via Apache Spark User
List] ml-node+s1001560n1258...@n3.nabble.com wrote:
 Hi,

 Well, you could use Mesos or YARN 2 to define resources per job -- you can
 give each job only as many resources (cores, memory, etc.) per machine as your
 worst machine has. The rest is done by Mesos or YARN. By doing this you avoid
 per-machine resource assignment without any disadvantages. You can run
 other jobs in parallel without any problems, and older machines won't get
 overloaded.

 However, you should take care that your cluster does not get too
 heterogeneous.

 Best regards,
 Jörn




-- 
A  N  T  H  O  N  Y   Ⓙ   S  C  H  U  L  T  E





Re: heterogeneous cluster hardware

2014-08-21 Thread anthonyjschu...@gmail.com
This is what I thought the simplest method would be, but I can't seem to
figure out how to configure it.

You can set:

SPARK_WORKER_INSTANCES, to set the number of worker processes per node

but when you set

SPARK_WORKER_MEMORY, to set how much total memory workers have to give
executors (e.g. 1000m, 2g)

I believe that memory is shared across all the workers! So when the worker
memory gets set by the master (I tried setting it in spark-env.sh on a worker,
but it was overridden by the setting on the master), it is not multiplied by
the number of workers?

(Also, I'm not sure SPARK_WORKER_INSTANCES isn't likewise overridden by the
master...)

How would you suggest setting this up?
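
For concreteness, this is the kind of per-node spark-env.sh I was hoping would
work on one of the bigger boxes (values made up, and my reading that the
memory/core settings apply per worker instance may well be wrong):

  # hypothetical spark-env.sh on a larger node
  export SPARK_WORKER_INSTANCES=2
  export SPARK_WORKER_CORES=4
  export SPARK_WORKER_MEMORY=8g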



Re: Cached RDD Block Size - Uneven Distribution

2014-08-13 Thread anthonyjschu...@gmail.com
I am having a similar problem:

I have a large dataset in HDFS and (for a few possible reasons, including a
filter operation and some of my computation nodes simply not being HDFS
datanodes) have a large skew on my RDD blocks: the master node always has
the most, while the worker nodes have few... (and the non-HDFS nodes have
none).

What is the preferred way to rebalance this RDD across the cluster? Some of
my nodes are very underutilized :( I have tried:

.coalesce(15000, shuffle = false)

which helps a little, but things are still not evenly distributed...
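
The heavier alternative I have been avoiding (because of the shuffle cost)
looks roughly like this -- skewedRdd stands in for my input RDD, and the
partition count is a guess:

  import org.apache.spark.storage.StorageLevel

  val rebalanced = skewedRdd.repartition(sc.defaultParallelism * 4)  // full shuffle
  rebalanced.persist(StorageLevel.MEMORY_ONLY)
  rebalanced.count()  // materialize so the blocks actually get cached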



Re: All of the tasks have been completed but the Stage is still shown as Active?

2014-08-07 Thread anthonyjschu...@gmail.com
Similarly, I am seeing tasks moved to the completed section which
apparently haven't finished all of their elements... (succeeded/total < 1)... is
this related?



heterogeneous cluster hardware

2014-08-06 Thread anthonyjschu...@gmail.com
I'm sure this must be a fairly common use case for Spark, yet I have not
found a satisfactory discussion of it on the Spark website or forum:

I work at a company with a lot of previous-generation server hardware
sitting idle -- I want to add this hardware to my Spark cluster to increase
performance! BUT: it is unclear whether the Spark master will be able
to properly apportion jobs to the slaves if they have differing hardware
specs.

As I understand it, the default Spark launch scripts are incompatible with
per-node hardware configurations, but it seems I could compose custom
spark-conf.sh files for each slave to fully utilize its hardware.

Would the master take these per-node configurations into consideration when
allocating work? Or would the cluster necessarily fall to the
lowest-common-hardware denominator?

Is this an area which needs development? I might be willing to look into
attempting to introduce this functionality if it is lacking.



Re: Can't see any thing one the storage panel of application UI

2014-08-04 Thread anthonyjschu...@gmail.com
I am (not) seeing this also... No items on the storage UI page, using 1.0
with HDFS...



Re: Can't see any thing one the storage panel of application UI

2014-08-04 Thread anthonyjschu...@gmail.com
Good idea, Andrew... Using this feature allowed me to work out that my app
wasn't caching properly -- the UI is working as designed for me in 1.0. It
might be a good idea to show "no cached blocks" instead of an empty page...
just a thought...
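
For anyone else hitting this, what I ran was roughly the following (assuming
getExecutorStorageStatus behaves as Andrew describes below):

  sc.getExecutorStorageStatus.foreach { status =>
    println(status.blockManagerId.host + ": " + status.blocks.size + " blocks")
  }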


On Mon, Aug 4, 2014 at 1:17 PM, Andrew Or-2 [via Apache Spark User List] 
ml-node+s1001560n11362...@n3.nabble.com wrote:

 Hi all,

 Could you check with `sc.getExecutorStorageStatus` to see if the blocks
 are in fact present? This returns a list of StorageStatus objects, and you
 can check whether each status' `blocks` is non-empty. If the blocks do
 exist, then this is likely a bug in the UI. There have been a couple of UI
 fixes since 1.0. Could you check if this is still a problem on the latest
 master: https://github.com/apache/spark

 Andrew






-- 
A  N  T  H  O  N  Y   Ⓙ   S  C  H  U  L  T  E





SparkSQL- saveAsParquetFile

2014-06-26 Thread anthonyjschu...@gmail.com
Hi all:
I am attempting to execute a simple test of SparkSQL's capability
of persisting to Parquet files...

My code is:

  val conf = new SparkConf()
    .setMaster("local[1]")
    .setAppName("test")

  implicit val sc = new SparkContext(conf)

  val sqlContext = new org.apache.spark.sql.SQLContext(sc)
  import sqlContext._

  case class Trivial(trivial: String = "trivial")
  val rdd = sc.parallelize(Seq(Trivial("s"), Trivial("T")))
  rdd.saveAsParquetFile("trivial.parquet")

When this code executes, a trivial.parquet directory is created, along with a
_temporary subdirectory, but there is no content in them... only
directories. Is there an obvious mistake in my code which would cause this
execution to fail?
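
In case it helps, the read-back check I was planning is along these lines
(assuming sqlContext.parquetFile is the right API for that):

  val readBack = sqlContext.parquetFile("trivial.parquet")
  println(readBack.collect().mkString("\n"))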

Thank you--

Tony





SparkSQL- Nested CaseClass Parquet failure

2014-06-26 Thread anthonyjschu...@gmail.com
Hello all:
I am attempting to persist a Parquet file composed of a SchemaRDD of nested
case classes...

Creating the SchemaRDD object seems to work fine, but an exception is thrown
when I attempt to persist this object to a Parquet file...

my code:

  case class Trivial(trivial: String = "trivial", lt: LessTrivial)
  case class LessTrivial(i: Int = 1)

  val conf = new SparkConf()
    .setMaster("local[1]")
    .setAppName("test")

  implicit val sc = new SparkContext(conf)
  val sqlContext = new org.apache.spark.sql.SQLContext(sc)

  import sqlContext._

  val rdd = sqlContext.createSchemaRDD(sc.parallelize(Seq(
    Trivial("s", LessTrivial(1)), Trivial("T", LessTrivial(2))))) // no exceptions

  rdd.saveAsParquetFile("trivial.parquet1")
  // exception:
  // java.lang.RuntimeException: Unsupported datatype
  // StructType(List(StructField(i,IntegerType,true)))


Is persisting SchemaRDDs containing nested case classes supported for
Parquet files?





Re: SparkSQL- Nested CaseClass Parquet failure

2014-06-26 Thread anthonyjschu...@gmail.com
Thanks. That might be a good note to add to the official Programming
Guide...
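
In the meantime, the workaround I am considering is to flatten the nesting by
hand before saving -- a sketch only (FlatTrivial is my own helper, and this
assumes the same import sqlContext._ as in my original code):

  case class FlatTrivial(trivial: String, i: Int)

  val flat = sc.parallelize(Seq(Trivial("s", LessTrivial(1)), Trivial("T", LessTrivial(2))))
    .map(t => FlatTrivial(t.trivial, t.lt.i))

  flat.saveAsParquetFile("trivial_flat.parquet")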


On Thu, Jun 26, 2014 at 5:05 PM, Michael Armbrust [via Apache Spark User
List] ml-node+s1001560n8382...@n3.nabble.com wrote:

 Nested parquet is not supported in 1.0, but is part of the upcoming 1.0.1
 release.






-- 
A  N  T  H  O  N  Y   Ⓙ   S  C  H  U  L  T  E



