Re: How to estimate the size of dataframe using pyspark?

2016-04-09 Thread bdev
Thanks Mandar, I couldn't see anything under the 'Storage' section, but under Executors I noticed 3.1 GB: Executors (1) Memory: 0.0 B Used (3.1 GB Total)
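
The Storage tab only shows something once a dataset has actually been cached and materialized. A minimal PySpark sketch of that, assuming a DataFrame named df and an active pyspark session:

    # Cache the DataFrame and force materialization so its size shows up
    # under the Storage tab of the Spark UI (names here are illustrative).
    df.persist()
    df.count()  # triggers the computation; Storage now reports "Size in Memory"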

How to estimate the size of dataframe using pyspark?

2016-04-09 Thread bdev
I keep running out of memory on the driver when I attempt to do df.show(). Can anyone let me know how to estimate the size of the dataframe? Thanks!
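
One rough way to estimate the size without pulling the whole DataFrame onto the driver is to measure a small sample and extrapolate. A hedged sketch, assuming a DataFrame named df; len(str(row)) is only a crude proxy for serialized size, so treat the result as an order-of-magnitude figure:

    # Estimate total size from a 1% sample and scale up.
    fraction = 0.01
    sample_bytes = (df.sample(withReplacement=False, fraction=fraction)
                      .rdd
                      .map(lambda row: len(str(row)))
                      .sum())
    estimated_total_bytes = sample_bytes / fraction
    print("Approximate size: %.1f MB" % (estimated_total_bytes / (1024.0 * 1024.0)))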

Re: Need clarification regd deploy-mode client

2016-04-08 Thread bdev
Thanks Mandar for the clarification.

Need clarification regd deploy-mode client

2016-04-08 Thread bdev
I'm running pyspark in client deploy mode on YARN with dynamic allocation: pyspark --master yarn --deploy-mode client --executor-memory 6g --executor-cores 4 --driver-memory 4g. The node where I'm running pyspark has 4 GB of memory, but I keep running out of memory on this node. If using yarn,
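
In client mode the driver runs on the node where pyspark is launched, so --driver-memory 4g on a machine with only 4 GB of RAM leaves little for the OS and the Python process itself. A small sketch for checking what the running session actually got (assumes the pyspark shell, where sc is already defined):

    # Inspect the effective memory settings of the current session; lowering
    # spark.driver.memory (e.g., to 2g) or launching from a larger node are
    # the usual ways out in client mode.
    conf = sc.getConf()
    print(conf.get("spark.driver.memory", "default"))
    print(conf.get("spark.executor.memory", "default"))
    print(conf.get("spark.dynamicAllocation.enabled", "false"))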

Dataframe to parquet using hdfs or parquet block size

2016-04-07 Thread bdev
I need to save the dataframe to Parquet format and need some input on choosing an appropriate block size to help efficiently parallelize/localize the data to the executors. Should I be using the Parquet block size or the HDFS block size, and what is the optimal block size to use on a 100-node cluster?
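
For what it's worth, the Parquet row-group ("block") size can be set on the Hadoop configuration before writing; keeping it at or below the HDFS block size is the usual rule of thumb so a row group never straddles two HDFS blocks. A minimal sketch, assuming a DataFrame df and SparkContext sc (the 128 MB figure and the output path are illustrative, not recommendations):

    # parquet.block.size is the standard Parquet writer property; dfs.blocksize
    # sets the HDFS block size for the newly written files.
    block_size = 128 * 1024 * 1024
    hadoop_conf = sc._jsc.hadoopConfiguration()  # internal handle, works in practice
    hadoop_conf.setInt("parquet.block.size", block_size)
    hadoop_conf.setInt("dfs.blocksize", block_size)

    df.write.parquet("hdfs:///path/to/output")  # placeholder path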

HiveContext throws org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient

2015-07-07 Thread bdev
Just trying to get started with Spark and attempting to use HiveContext from spark-shell to interact with existing Hive tables on my CDH cluster, but I keep running into the errors below when I do hiveContext.sql("show tables"). Wanted to know what all JARs need to be included to have this
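
That particular error often means Spark cannot see the cluster's hive-site.xml and is falling back to a local Derby metastore. With the Hive config on the driver classpath (e.g., copied or symlinked into $SPARK_HOME/conf), the Spark 1.x HiveContext finds the existing tables. A hedged PySpark-flavoured sketch of the same check the spark-shell session was doing:

    # Assumes an active SparkContext `sc` (as in the pyspark shell) and that
    # hive-site.xml from the CDH cluster is on the driver classpath, so the
    # existing Hive metastore is used instead of a local Derby one.
    from pyspark.sql import HiveContext

    hive_ctx = HiveContext(sc)
    for row in hive_ctx.sql("show tables").collect():
        print(row)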

Any patterns for multiplexing the streaming data

2014-11-06 Thread bdev
We are looking at consuming the Kafka stream using Spark Streaming, transforming it into various subsets (applying some transformation, de-normalizing some fields, etc.), and feeding it back into Kafka as a different topic for downstream consumers. Wanted to know if there are any existing patterns
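
One pattern that comes up often is to do the transformation on the DStream and publish the reshaped records to a new topic from inside foreachRDD/foreachPartition with a plain Kafka client. A rough sketch under those assumptions (kafka-python as the client; stream, denormalize, the broker, and the topic name are all placeholders):

    # Fan the transformed stream back into Kafka as a new topic; one producer
    # per partition keeps connection overhead manageable.
    from kafka import KafkaProducer  # assumption: kafka-python available on executors

    def publish_partition(records):
        producer = KafkaProducer(bootstrap_servers="broker1:9092")  # placeholder broker
        for rec in records:
            producer.send("downstream-topic", rec.encode("utf-8"))  # placeholder topic
        producer.flush()
        producer.close()

    transformed = stream.map(denormalize)  # `stream` and `denormalize` assumed to exist
    transformed.foreachRDD(lambda rdd: rdd.foreachPartition(publish_partition))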

Algebird using spark-shell

2014-10-29 Thread bdev
I'm running into this error when I attempt to launch spark-shell passing in the algebird-core jar:

$ ./bin/spark-shell --jars algebird-core_2.9.2-0.1.11.jar
scala> import com.twitter.algebird._
import com.twitter.algebird._
scala> import HyperLogLog._
import HyperLogLog._
scala>

Error while running Streaming examples - no snappyjava in java.library.path

2014-10-19 Thread bdev
I built the latest Spark project and I'm running into these errors when attempting to run the streaming examples locally on my Mac. How do I fix them?

java.lang.UnsatisfiedLinkError: no snappyjava in java.library.path
        at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1886)

How to save ReceiverInputDStream to Hadoop using saveAsNewAPIHadoopFile

2014-10-09 Thread bdev
I'm using KafkaUtils.createStream for the input stream to pull messages from Kafka, which seems to return a ReceiverInputDStream. I do not see saveAsNewAPIHadoopFile available on ReceiverInputDStream and obviously run into this error: saveAsNewAPIHadoopFile is not a member of
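
The save methods live on the RDDs inside the stream rather than on the DStream itself, so the usual route is foreachRDD. A hedged PySpark-style sketch of that shape (the output path, the Hadoop classes, and the existing `stream` are assumptions; the Scala API follows the same foreachRDD pattern):

    # Write each micro-batch via the underlying RDD; saveAsNewAPIHadoopFile is
    # an RDD method, not a DStream method. The batch time keeps output paths
    # from colliding across batches.
    def save_batch(time, rdd):
        if not rdd.isEmpty():
            rdd.saveAsNewAPIHadoopFile(
                "hdfs:///data/kafka-out/%s" % time.strftime("%Y%m%d-%H%M%S"),  # placeholder path
                "org.apache.hadoop.mapreduce.lib.output.TextOutputFormat",
                keyClass="org.apache.hadoop.io.Text",
                valueClass="org.apache.hadoop.io.Text")

    stream.foreachRDD(save_batch)  # `stream` is the (key, value) DStream from KafkaUtils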

RE: Save an RDD to a SQL Database

2014-08-27 Thread bdev
I have a similar requirement to export data to MySQL. Just wanted to know what the best approach is so far, after the research you guys have done. Currently I'm thinking of saving to HDFS and using Sqoop to handle the export. Is that the best approach, or is there any other way to write to MySQL? Thanks!
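
Sqoop over files in HDFS certainly works; the other common route is to write straight from the workers with foreachPartition and a regular MySQL client, one connection and one batch per partition. A sketch under those assumptions (PyMySQL as the driver; the table, columns, host, and credentials are placeholders):

    # Push each partition to MySQL in a single batch insert. Opening the
    # connection inside the function keeps it on the worker side.
    import pymysql  # assumption: PyMySQL installed on the executors

    def write_partition(rows):
        conn = pymysql.connect(host="mysql-host", user="user",
                               password="secret", database="analytics")
        try:
            with conn.cursor() as cur:
                cur.executemany(
                    "INSERT INTO page_views (page, visitors) VALUES (%s, %s)",
                    list(rows))
            conn.commit()
        finally:
            conn.close()

    result_rdd.foreachPartition(write_partition)  # result_rdd: RDD of (page, visitors) tuples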

Re: Ways to partition the RDD

2014-08-14 Thread bdev
Thanks, will give that a try. I see the number of partitions requested is 8 (through HashPartitioner(8)). If I have a 40-node cluster, what's the recommended number of partitions?
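
A common rule of thumb (and what the Spark tuning guide suggests) is roughly 2-3 tasks per CPU core in the cluster, so the answer depends on cores per node rather than on 8 being special. A small sketch of sizing the partition count that way, with the per-node core count as an assumption:

    # Size the partition count from the cluster, not a constant.
    # 40 nodes x 8 cores is an assumption; adjust to the real hardware.
    nodes, cores_per_node, tasks_per_core = 40, 8, 2
    num_partitions = nodes * cores_per_node * tasks_per_core  # 640 here

    repartitioned = pair_rdd.partitionBy(num_partitions)  # pair_rdd: RDD of (key, value)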

Re: Ways to partition the RDD

2014-08-14 Thread bdev
Thanks Daniel for the detailed information. Since the RDD is already partitioned, there is no need to worry about repartitioning.

Ways to partition the RDD

2014-08-13 Thread bdev
I've got ~500 tab-delimited log files, 25 GB each, with the page name, the userId who viewed the page, and a timestamp. I'm trying to build a basic Spark app to get unique visitors per page. I was able to achieve this using Spark SQL by registering the RDD of a case class and running a select
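
For reference, the same thing can be done with plain RDD operations: dedupe the (page, userId) pairs, then count per page. A sketch assuming the column order is page, userId, timestamp (tab-separated) and a placeholder input path:

    # Unique visitors per page: distinct (page, user) pairs, then a count per page.
    # Field positions are an assumption about the log layout.
    lines = sc.textFile("hdfs:///logs/pageviews/*")  # placeholder path

    page_user = lines.map(lambda line: line.split("\t")) \
                     .map(lambda fields: (fields[0], fields[1]))

    unique_visitors = (page_user.distinct()
                                .map(lambda pu: (pu[0], 1))
                                .reduceByKey(lambda a, b: a + b))

    for page, count in unique_visitors.take(10):
        print("%s\t%d" % (page, count))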

Re: Ways to partition the RDD

2014-08-13 Thread bdev
Forgot to mention, I'm using Spark 1.0.0 and running against a 40-node yarn-cluster.