Thanks Mandar, I couldn't see anything under the 'Storage' section, but under
Executors it shows 3.1 GB:
Executors (1)
Memory: 0.0 B Used (3.1 GB Total)
I keep running out of memory on the driver when I call df.show().
Can anyone let me know how to estimate the size of the dataframe?
Thanks!
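For what it's worth, one rough approach is to collect a small sample and extrapolate from its serialized size. A minimal sketch, assuming you've already pulled a sample via something like df.sample(False, 0.001).collect(); the tuples below stand in for Row objects and the numbers are illustrative:

```python
import pickle

def estimate_df_bytes(sample_rows, total_count):
    """Extrapolate total serialized size from a small sample of rows.

    sample_rows: a list of already-collected rows
    total_count: the full row count, e.g. df.count()
    """
    if not sample_rows:
        return 0
    sample_bytes = sum(len(pickle.dumps(r)) for r in sample_rows)
    avg_row = sample_bytes / len(sample_rows)
    return int(avg_row * total_count)

# Toy usage with plain tuples standing in for Row objects:
rows = [("page1", 42), ("page2", 7)]
print(estimate_df_bytes(rows, 1000000))
```

This estimates serialized size, not the JVM in-memory footprint, so treat it as a lower bound when sizing driver memory for show()/collect().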
Thanks Mandar for the clarification.
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Need-clarification-regd-deploy-mode-client-tp26719p26725.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
I'm running pyspark in client deploy mode on YARN with dynamic
allocation:
pyspark --master yarn --deploy-mode client --executor-memory 6g
--executor-cores 4 --driver-memory 4g
The node where I'm running pyspark has 4 GB of memory, but I keep running out
of memory on this node. If using yarn,
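One thing worth noting: the command above asks for a 4g driver heap on a node with only 4 GB of physical memory, leaving no headroom for the OS or JVM overhead. A hedged variant (the 2g figure is illustrative, tune to your workload):

```shell
# Leave headroom for the OS and non-heap JVM overhead on a 4 GB node
pyspark --master yarn --deploy-mode client \
  --executor-memory 6g --executor-cores 4 \
  --driver-memory 2g
```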
I need to save the dataframe in Parquet format and need some input on
choosing an appropriate block size to help efficiently parallelize/localize
the data to the executors. Should I be tuning the Parquet block size or the
HDFS block size, and what is the optimal block size on a 100-node cluster?
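A common rule of thumb is to keep the Parquet block (row group) size equal to the HDFS block size, so a row group never straddles two HDFS blocks and each task reads locally. The split count then falls out of simple arithmetic; the 128 MB figures below are common defaults, not measured on this cluster:

```python
def num_splits(total_bytes, hdfs_block_bytes):
    """Rough count of input splits (and hence map tasks) for a dataset."""
    return -(-total_bytes // hdfs_block_bytes)  # ceiling division

HDFS_BLOCK = 128 * 1024 * 1024    # common HDFS default
PARQUET_BLOCK = HDFS_BLOCK        # rule of thumb: match the HDFS block size

# e.g. a 10 GB dataset at 128 MB blocks yields 80 splits
print(num_splits(10 * 1024**3, HDFS_BLOCK))  # 80
```

If 80 splits is too few to keep 100 nodes busy, a smaller block size (or more output files at write time) raises the parallelism at the cost of more, smaller row groups.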
Just getting started with Spark: I'm attempting to use HiveContext from
spark-shell to interact with existing Hive tables on my CDH cluster, but I keep
running into errors (please see below) when I run hiveContext.sql("show
tables"). Wanted to know which JARs need to be included to have this working.
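In many cases the missing piece is less about extra JARs than about Hive's configuration being visible to Spark. A hedged sketch for a Spark 1.x build (the hive-site.xml path is CDH-typical, not verified for your install):

```shell
# Make the existing Hive metastore visible to spark-shell (path illustrative)
cp /etc/hive/conf/hive-site.xml $SPARK_HOME/conf/

# Spark must be built with Hive support (-Phive); the datanucleus JARs that
# HiveContext needs ship with such a build and are typically added to the
# classpath by the launch scripts.
./bin/spark-shell
```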
We are looking at consuming a Kafka stream with Spark Streaming, transforming
it into various subsets (applying some transformation, de-normalizing some
fields, etc.), and feeding the results back into Kafka as a different topic
for downstream consumers.
Wanted to know if there are any existing patterns for this.
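One common pattern is to keep the per-record transformation as a plain function, apply it inside the stream (e.g. via map), and produce the results to the output topic with a producer created once per partition in foreachRDD/foreachPartition. A minimal sketch of the transformation side; the field names and lookup table are hypothetical:

```python
import json

def denormalize(raw, user_names):
    """Parse one Kafka message, de-normalize it by joining in a user name."""
    event = json.loads(raw)
    event["user_name"] = user_names.get(event["user_id"], "unknown")
    return json.dumps(event)

# In the streaming job you would apply this per record, e.g.:
#   out = stream.map(lambda msg: denormalize(msg, user_names))
# and write `out` to the downstream topic inside foreachRDD/foreachPartition.
users = {7: "ada"}
print(denormalize('{"user_id": 7, "page": "/home"}', users))
```

Keeping the transform pure like this makes it unit-testable outside the streaming job, which is useful when the same logic feeds several downstream topics.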
I'm running into this error when I attempt to launch spark-shell passing in
the algebird-core jar:
$ ./bin/spark-shell --jars algebird-core_2.9.2-0.1.11.jar
scala> import com.twitter.algebird._
import com.twitter.algebird._
scala> import HyperLogLog._
import HyperLogLog._
scala>
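The `_2.9.2` in the artifact name is the Scala binary version the JAR was compiled against; if your Spark build targets a different Scala version (for example 2.10), mixing the two is a likely source of errors. A hedged fix is to pass an algebird artifact matching Spark's Scala version (the exact artifact version is an assumption to verify on Maven Central):

```shell
# Use an algebird build whose Scala suffix matches your Spark build's Scala version
./bin/spark-shell --jars algebird-core_2.10-0.8.1.jar
```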
I built the latest Spark project, and I'm running into these errors when
attempting to run the streaming examples locally on my Mac. How do I fix
them?
java.lang.UnsatisfiedLinkError: no snappyjava in java.library.path
at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1886)
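The UnsatisfiedLinkError means the JVM cannot load the native snappy library bundled in snappy-java. One hedged workaround (the codec name comes from the Spark configuration docs; whether it cures this particular build is an assumption) is to side-step native snappy entirely by switching Spark's internal compression codec to the pure-JVM LZF implementation:

```shell
# conf/spark-defaults.conf — avoid the native snappy dependency
echo "spark.io.compression.codec lzf" >> conf/spark-defaults.conf
```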
I'm using KafkaUtils.createStream for the input stream to pull messages from
Kafka, which returns a ReceiverInputDStream. I do not see
saveAsNewAPIHadoopFile available on ReceiverInputDStream, and as expected I run
into this error:
saveAsNewAPIHadoopFile is not a member of
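The save method lives on the RDD, not on the DStream, so the usual pattern is to drop down to the underlying micro-batch RDDs with foreachRDD and call the save there. A toy stand-in to show the call shape (ToyDStream is illustrative; in Spark you would call rdd.saveAsNewAPIHadoopFile(...) inside the function):

```python
class ToyDStream:
    """Stand-in for a DStream: just replays a fixed list of micro-batches."""
    def __init__(self, batches):
        self.batches = batches

    def foreachRDD(self, fn):
        # Spark Streaming invokes fn once per micro-batch RDD
        for rdd in self.batches:
            fn(rdd)

saved = []
stream = ToyDStream([["a", "b"], ["c"]])
# In Spark Streaming the lambda body would be:
#   rdd.saveAsNewAPIHadoopFile(path, keyClass, valueClass, outputFormatClass)
stream.foreachRDD(lambda rdd: saved.append(list(rdd)))
print(saved)  # [['a', 'b'], ['c']]
```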
I have a similar requirement to export data to MySQL. Just wanted to know
what the best approach is after the research you have done.
Currently thinking of saving to HDFS and using Sqoop to handle the export. Is
that the best approach, or is there another way to write to MySQL? Thanks!
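Besides the HDFS-plus-Sqoop route, a common in-job alternative is to write from the executors with foreachPartition: one connection and batched inserts per partition. The database plumbing below is faked so the batching logic is visible; the table and column names are hypothetical, and in a real job the cursor would come from a MySQL driver:

```python
def write_partition(rows, cursor, batch_size=500):
    """Insert rows in batches via executemany; one call per full batch."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) >= batch_size:
            cursor.executemany(
                "INSERT INTO pageviews (page, user_id) VALUES (%s, %s)", batch)
            batch = []
    if batch:  # flush the final partial batch
        cursor.executemany(
            "INSERT INTO pageviews (page, user_id) VALUES (%s, %s)", batch)

class FakeCursor:
    """Records batch sizes instead of talking to a real database."""
    def __init__(self):
        self.calls = []
    def executemany(self, sql, rows):
        self.calls.append(len(rows))

cur = FakeCursor()
write_partition([("p", i) for i in range(1200)], cur, batch_size=500)
print(cur.calls)  # [500, 500, 200]
```

In Spark this would be invoked as df.rdd.foreachPartition(lambda it: write_partition(it, open_cursor())), opening the connection inside the closure so it is created on the executor, not the driver.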
Thanks, will give that a try.
I see the number of partitions requested is 8 (through HashPartitioner(8)).
If I have a 40-node cluster, what's the recommended number of partitions?
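A common rule of thumb from Spark's tuning guide is 2-3 tasks per CPU core in the cluster, so the answer depends on cores per node rather than node count alone. Quick arithmetic; the 8 cores/node figure is an assumption:

```python
def recommended_partitions(nodes, cores_per_node, tasks_per_core=3):
    """Rule of thumb: a few tasks per core across the whole cluster."""
    return nodes * cores_per_node * tasks_per_core

print(recommended_partitions(40, 8))  # 960
```

By that yardstick, HashPartitioner(8) badly underuses a 40-node cluster; something in the hundreds keeps all cores busy and smooths over skewed or slow tasks.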
Thanks Daniel for the detailed information. Since the RDD is already
partitioned, there is no need to worry about repartitioning.
I've got ~500 tab-delimited log files, 25 GB each, containing the page name
and the userId who viewed the page, along with a timestamp.
I'm trying to build a basic Spark app to get the unique visitors per page. I
was able to achieve this using Spark SQL by registering the RDD of a case
class and running a select
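For reference, the same computation without Spark SQL reduces to a distinct count of userIds per page; in Spark you would typically map to (page, userId) pairs, call distinct(), then count per key. A plain-Python sketch of that logic, with the field layout assumed from the description (page, userId, timestamp):

```python
from collections import defaultdict

def unique_visitors(lines):
    """Count distinct userIds per page from tab-delimited log lines."""
    visitors = defaultdict(set)
    for line in lines:
        page, user_id, _ts = line.rstrip("\n").split("\t")
        visitors[page].add(user_id)
    return {page: len(users) for page, users in visitors.items()}

logs = [
    "/home\tu1\t1404000000",
    "/home\tu1\t1404000050",   # repeat view, same visitor
    "/home\tu2\t1404000100",
    "/about\tu1\t1404000200",
]
print(unique_visitors(logs))  # {'/home': 2, '/about': 1}
```

The set-per-page approach mirrors what distinct() does in Spark; at 500 files of 25 GB the per-key sets are exactly why you'd let Spark distribute it rather than run this on one machine.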
Forgot to mention: I'm using Spark 1.0.0 and running against a 40-node
yarn-cluster.