I wonder if anyone has any tips for using repartition?
It seems that when you call the repartition method, the entire RDD gets
split up, shuffled, and redistributed... This is an extremely heavy
operation if you have a large HDFS dataset and all you want to do is make
sure your RDD is balanced, i.e. that the data is spread evenly across
partitions.
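For reference, a minimal sketch of repartition versus its cheaper cousin
coalesce, as I understand them (the HDFS path and partition counts are just
illustrative):

    // repartition(n) always performs a full shuffle of the entire RDD
    val rebalanced = sc.textFile("hdfs:///some/large/dataset").repartition(64)

    // coalesce(n) with shuffle = false only merges existing partitions and
    // avoids the full shuffle, but it can only decrease the partition count
    val merged = sc.textFile("hdfs:///some/large/dataset").coalesce(8, shuffle = false)

So if all I want is fewer, more even partitions, coalesce looks cheaper, but
it can't split up an already skewed partition.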
I think that should be possible. Make sure Spark is installed on your local
machine and is the same version as on the cluster.
I've got a stack of Dell commodity servers-- RAM ~8 to 32 GB, with a single
or dual quad-core processor per machine. I think I will have them loaded
with CentOS. Eventually, I may want to add GPUs on the nodes to handle
linear algebra operations...
My idea has been:
1) to find a way to configure
Jörn, thanks for the post...
Unfortunately, I am stuck with the hardware I have and might not be
able to get budget allocated for a new stack of servers when I've
already got so many ok servers on hand... And even more
unfortunately, a large subset of these machines are... shall we say...
This is what I thought the simplest method would be, but I can't seem to
figure out how to configure it--
You can set:
SPARK_WORKER_INSTANCES, to set the number of worker processes per node,
and
SPARK_WORKER_MEMORY, to set how much total memory the workers have to give
to executors (e.g. ...)
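For concreteness, this is the sort of thing I am trying in conf/spark-env.sh
on each node in standalone mode (the values are just illustrative, not
recommendations):

    # conf/spark-env.sh
    SPARK_WORKER_INSTANCES=2   # number of worker processes to run on this node
    SPARK_WORKER_MEMORY=8g     # total memory each worker can give to executors
    SPARK_WORKER_CORES=4       # total cores each worker is allowed to hand out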
I am having a similar problem:
I have a large dataset in HDFS and (for a few possible reasons, including a
filter operation, and some of my compute nodes simply not being HDFS
datanodes) I have a large skew in my RDD blocks: the master node always has
the most, while the worker nodes have few...
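In case it helps, this is roughly how I am measuring the skew (a quick
sketch; myRdd is a placeholder for the RDD loaded from HDFS):

    // count the records in each partition to see how uneven they are
    val sizes = myRdd.mapPartitions(it => Iterator(it.size)).collect()
    sizes.zipWithIndex.foreach { case (n, i) =>
      println("partition " + i + ": " + n + " records")
    }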
Similarly, I am seeing tasks moved to the completed section that apparently
haven't finished all of their elements... (succeeded/total 1)... Is this
related?
I'm sure this must be a fairly common use-case for Spark, yet I have not
found a satisfactory discussion of it on the Spark website or forum:
I work at a company with a lot of previous-generation server hardware
sitting idle-- I want to add this hardware to my Spark cluster to increase
I am (not) seeing this also... no items on the storage page of the
application UI, using 1.0 with HDFS...
Good idea, Andrew... Using this feature allowed me to debug that my app
wasn't caching properly-- the UI is working as designed for me in 1.0. It
might be a good idea to show "no cached blocks" instead of an empty page...
just a thought...
On Mon, Aug 4, 2014 at 1:17 PM, Andrew Or-2 [via Apache Spark User List]
wrote:
Hi all:
I am attempting to execute a simple test of SparkSQL's ability to persist
to Parquet files...
My code is:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val conf = new SparkConf()
  .setMaster("local[1]")
  .setAppName("test")
implicit val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
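The persisting step then looks something like this (a sketch; the Person
case class and the output path are placeholders, and this assumes the
sqlContext above):

    case class Person(name: String, age: Int)

    // the implicit createSchemaRDD conversion turns an RDD of case classes
    // into a SchemaRDD, which can then be written out as Parquet
    import sqlContext.createSchemaRDD
    val people = sc.parallelize(Seq(Person("alice", 30), Person("bob", 25)))
    people.saveAsParquetFile("people.parquet")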
Hello all:
I am attempting to persist a Parquet file comprised of a SchemaRDD of
nested case classes...
Creating the SchemaRDD object seems to work fine, but an exception is
thrown when I attempt to persist this object to a Parquet file...
my code:
case class Trivial(trivial: String = "trivial", ...
Thanks. That might be a good note to add to the official Programming
Guide...
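For anyone searching later, the nested shape in question is essentially this
(hypothetical case classes, reusing the sc/sqlContext setup from earlier in
the thread; per Michael's note below, this should only work as of the 1.0.1
release):

    case class Inner(value: String)
    case class Outer(name: String, inner: Inner)  // a case class nested in a case class

    import sqlContext.createSchemaRDD
    sc.parallelize(Seq(Outer("a", Inner("x")))).saveAsParquetFile("outer.parquet")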
On Thu, Jun 26, 2014 at 5:05 PM, Michael Armbrust [via Apache Spark User
List] wrote:
Nested parquet is not supported in 1.0, but is part of the upcoming 1.0.1
release.