Cleanup after Spark SQL job with window aggregation takes a long time

2016-08-29 Thread Jestin Ma
After a Spark SQL job appending a few columns using window aggregation functions, then performing a join and some data massaging, I find that the cleanup after the job finishes saving the result data to disk takes as long as, if not longer than, the job itself. I am currently performing window aggregation on
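
A minimal sketch of this kind of window aggregation, for context (column names and the chosen functions are illustrative, not from the original job; assumes an existing DataFrame df):

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{avg, row_number}

    // Partition by an id column, order by a time column (hypothetical schema).
    val w = Window.partitionBy("id").orderBy("time")

    // Append columns computed over the window, as described above.
    val withAggs = df
      .withColumn("rank", row_number().over(w))
      .withColumn("avg_value", avg("value").over(w))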

Re: Converting DataFrame's int column to Double

2016-08-25 Thread Jestin Ma
How about this: df.withColumn("doubles", col("ints").cast("double")).drop("ints") On Thu, Aug 25, 2016 at 2:09 PM, Marco Mistroni wrote: > hi all > i might be stuck in old code, but this is what i am doing to convert a > DF int column to Double > > val
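
For reference, a self-contained version of that one-liner (toy data; assumes an existing SparkSession named spark):

    import org.apache.spark.sql.functions.col
    import spark.implicits._

    val df = Seq(1, 2, 3).toDF("ints")

    // Cast into a new double column, then drop the original int column.
    val doubled = df.withColumn("doubles", col("ints").cast("double")).drop("ints")
    doubled.printSchema()  // root |-- doubles: double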

Caching broadcasted DataFrames?

2016-08-25 Thread Jestin Ma
I have a DataFrame d1 that I would like to join with two separate DataFrames. Since d1 is small enough, I broadcast it. What I understand about cache vs broadcast is that cache leads to each executor storing the partitions it's assigned in memory (cluster-wide in-memory). Broadcast leads to each
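
A sketch of the two approaches for the join itself (big1 and big2 stand in for the two DataFrames being joined against and are hypothetical names):

    import org.apache.spark.sql.functions.broadcast

    // cache(): each executor keeps the d1 partitions it's assigned in memory.
    d1.cache()

    // broadcast(): hint that d1 is small enough to ship whole to every
    // executor, letting both joins avoid shuffling the large side.
    val joined1 = big1.join(broadcast(d1), "id")
    val joined2 = big2.join(broadcast(d1), "id")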

Re:

2016-08-14 Thread Jestin Ma
Have you tried doing the join in two parts (id == 0 and id != 0) and >> then >> > doing a union of the results? It is possible that with this technique, >> that >> > the join which only contains skewed data would be filtered enough to >> allow >> > broadcasting of o
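
A sketch of that split-and-union technique (assumes DataFrames df1 and df2 with a non-null id column and skew concentrated at id = 0):

    import org.apache.spark.sql.functions.{broadcast, col}

    val df1Skew = df1.filter(col("id") === 0)
    val df1Rest = df1.filter(col("id") =!= 0)
    val df2Skew = df2.filter(col("id") === 0)
    val df2Rest = df2.filter(col("id") =!= 0)

    // Non-skewed keys join normally; the skewed key joins separately, where
    // the filtered small side may now be small enough to broadcast.
    val rest = df1Rest.join(df2Rest, Seq("id"), "outer")
    val skew = df1Skew.join(broadcast(df2Skew), Seq("id"), "outer")

    val result = rest.union(skew)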

[no subject]

2016-08-14 Thread Jestin Ma
Hi, I'm currently trying to perform an outer join between two DataFrames/Datasets on a column, id; one is ~150 GB and the other ~50 GB. df1.id is skewed in that there are many 0's, the rest being unique IDs. df2.id is not skewed. If I filter df1.id != 0, then the join works well. If I don't, then the

DataFrameWriter saving DataFrame timestamps in weird format

2016-08-11 Thread Jestin Ma
When I load in a timestamp column and try to save it immediately without any transformations, the output time is UNIX time padded with 0's to 16 digits. For example, loading in a time of August 3, 2016, 00:36:25 GMT, which is 1470184585 in UNIX time, saves as 147018458500. When
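
As a possible workaround sketch (not from the thread): render the timestamp as an explicit string before saving, so the writer's numeric encoding never enters the picture. The column name and format are illustrative:

    import org.apache.spark.sql.functions.{col, date_format}

    // Assumes a DataFrame df with a timestamp column named event_time.
    val out = df.withColumn("event_time",
      date_format(col("event_time"), "yyyy-MM-dd HH:mm:ss"))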

Changing Spark configuration midway through application.

2016-08-10 Thread Jestin Ma
If I run an application, for example with 3 joins: [join 1] [join 2] [join 3] [final join and save to disk] Could I change Spark properties in between each join? [join 1] [change properties] [join 2] [change properties] ... Or would I have to create a separate application with different
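
For what it's worth, in Spark 2.0 runtime SQL configs can be changed between actions via spark.conf, while settings baked into the SparkContext at startup (executor memory, cores, etc.) cannot. A sketch with hypothetical DataFrame names:

    // Assumes an existing SparkSession spark and DataFrames dfA, dfB, dfC.
    val j1 = dfA.join(dfB, "id")

    // Runtime SQL properties may change mid-application...
    spark.conf.set("spark.sql.shuffle.partitions", "400")

    val j2 = j1.join(dfC, "id")
    // ...but core settings are fixed once the SparkContext has started.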

Spark 2.0.1 / 2.1.0 on Maven

2016-08-09 Thread Jestin Ma
If we want to use versions of Spark beyond the official 2.0.0 release, specifically on Maven + Java, what steps should we take to upgrade? I can't find the newer versions on Maven central. Thank you! Jestin
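
For reference, releases are published to Maven Central under the coordinates below, while nightly builds go to the Apache snapshots repository; a hedged pom.xml sketch (the version shown is whichever snapshot or release you target):

    <repositories>
      <repository>
        <id>apache-snapshots</id>
        <url>https://repository.apache.org/snapshots/</url>
      </repository>
    </repositories>

    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-sql_2.11</artifactId>
      <version>2.0.1-SNAPSHOT</version>
    </dependency>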

Using Kryo for DataFrames (Dataset)?

2016-08-07 Thread Jestin Ma
When using DataFrames (Dataset), there's no option for an Encoder. Does that mean DataFrames (since they build on top of RDDs) use Java serialization? Does using Kryo make sense as an optimization here? If not, what's the difference between Java/Kryo serialization, Tungsten, and Encoders?
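
A small sketch contrasting the two encoder styles (the Event class is hypothetical; assumes an existing SparkSession named spark):

    import org.apache.spark.sql.Encoders
    import spark.implicits._

    case class Event(id: Long, name: String)

    // Case classes get a generated ExpressionEncoder via the implicits,
    // storing fields column-wise in Tungsten's compact binary row format.
    val ds1 = Seq(Event(1L, "a")).toDS()

    // Encoders.kryo falls back to Kryo: each object becomes one opaque
    // binary blob, with no per-column layout or predicate pushdown.
    val ds2 = spark.createDataset(Seq(Event(1L, "a")))(Encoders.kryo[Event])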

Re: Tuning level of Parallelism: Increase or decrease?

2016-08-01 Thread Jestin Ma
in that way, what can minimize transporting temporary > information between worker-nodes. > > Try google in this way "Data locality in Hadoop" > > > 2016-08-01 4:41 GMT+03:00 Jestin Ma <jestinwith.a...@gmail.com>: > >> It seems that the number of task

Re: Tuning level of Parallelism: Increase or decrease?

2016-07-31 Thread Jestin Ma
f memory errors. > > On Jul 29, 2016, at 9:02 AM, Jestin Ma <jestinwith.a...@gmail.com> wrote: > > I am processing ~2 TB of hdfs data using DataFrames. The size of a task is > equal to the block size specified by hdfs, which happens to be 128 MB, > leading to about 15000 task

Tuning level of Parallelism: Increase or decrease?

2016-07-29 Thread Jestin Ma
I am processing ~2 TB of hdfs data using DataFrames. The size of a task is equal to the block size specified by hdfs, which happens to be 128 MB, leading to about 15000 tasks. I'm using 5 worker nodes with 16 cores each and ~25 GB RAM. I'm performing groupBy, count, and an outer-join with another
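
A sketch of the usual knobs here (values are illustrative, not recommendations; assumes an existing SparkSession spark and DataFrame df):

    // Shuffle parallelism for DataFrame groupBy/join stages (default 200).
    spark.conf.set("spark.sql.shuffle.partitions", "240")  // e.g. ~3x total cores

    // Or reshape one specific DataFrame directly.
    val repartitioned = df.repartition(240)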

Spark Standalone Cluster: Having a master and worker on the same node

2016-07-27 Thread Jestin Ma
Hi, I'm doing performance testing and currently have 1 master node and 4 worker nodes and am submitting in client mode from a 6th cluster node. I know we can have a master and worker on the same node. Speaking in terms of performance and practicality, is it possible/suggested to have another

Spark 2.0 SparkSession, SparkConf, SparkContext

2016-07-27 Thread Jestin Ma
I know that SparkSession is replacing SQLContext and HiveContext, but what about SparkConf and SparkContext? Are those still relevant in our programs? Thank you! Jestin
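
They are still there underneath; a minimal Spark 2.0 sketch:

    import org.apache.spark.sql.SparkSession

    // SparkSession subsumes SQLContext/HiveContext, but it wraps a
    // SparkContext (and its SparkConf) rather than replacing them.
    val spark = SparkSession.builder()
      .appName("example")
      .config("spark.sql.shuffle.partitions", "200")  // illustrative setting
      .getOrCreate()

    val sc = spark.sparkContext  // the underlying SparkContext
    val conf = sc.getConf        // its SparkConf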

Re: Spark Web UI port 4040 not working

2016-07-26 Thread Jestin Ma
I did netstat -apn | grep 4040 on machine 6, and I see: tcp 0 0 :::4040 :::* LISTEN 30597/java What does this mean? On Tue, Jul 26, 2016 at 6:47 AM, Jestin Ma <jestinwith.a...@gmail.com> wrote: > I do not deploy using cluster mode and I don't use E

Re: Spark Web UI port 4040 not working

2016-07-26 Thread Jestin Ma
figure out where the driver runs and use the machine's IP. > > Pozdrawiam, > Jacek Laskowski > > https://medium.com/@jaceklaskowski/ > Mastering Apache Spark http://bit.ly/mastering-apache-spark > Follow me at https://twitter.com/jaceklaskowski > > > On Tue, Jul

Re: Spark Web UI port 4040 not working

2016-07-26 Thread Jestin Ma
current job. > or you can check in master node by using netstat -apn | grep 4040 > > > > > On Jul 26, 2016, at 8:21 AM, Jestin Ma <jestinwith.a...@gmail.com> > wrote: > > > > Hello, when running spark jobs, I can access the master UI (port 8080 > one) no problem

Spark Web UI port 4040 not working

2016-07-25 Thread Jestin Ma
Hello, when running Spark jobs, I can access the master UI (the port 8080 one) no problem. However, I'm confused as to how to access the web UI to see jobs/tasks/stages/etc. The master UI is at http://<master>:8080, but port 4040 gives me a "connection cannot be reached". Is the web UI http://

Choosing RDD/DataFrame/DataSet and Cluster Tuning

2016-07-23 Thread Jestin Ma
Hello, Right now I'm using DataFrames to perform a df1.groupBy(key).count() on one DataFrame and join with another, df2. The first, df1, is very large (many gigabytes) compared to df2 (250 MB). I'm running this on a cluster of 5 nodes, 16 cores each, 90 GB RAM each. It is taking me
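
Since df2 is only ~250 MB, a broadcast join is one candidate optimization; a sketch (the join key name is assumed):

    import org.apache.spark.sql.functions.broadcast

    val counts = df1.groupBy("key").count()

    // Explicit hint: ship df2 to every executor instead of shuffling
    // the large aggregated side.
    val joined = counts.join(broadcast(df2), "key")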

Can Spark Dataframes preserve order when joining?

2016-06-29 Thread Jestin Ma
If it’s not too much trouble, could I get some pointers/help on this? (see link) http://stackoverflow.com/questions/38085801/can-dataframe-joins-in-spark-preserve-order Also, as a side question, do
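
On the ordering question: a shuffled join makes no ordering guarantee, so the usual approach is to re-impose order explicitly afterwards; a one-line sketch (column names illustrative):

    // Order is not preserved across a shuffle join; sort afterwards.
    val joined = df1.join(df2, "id").orderBy("time")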