Left Join Yields Results And Not Results

2016-09-25 Thread Aaron Jackson
Hi, I'm using pyspark (1.6.2) to do a little bit of ETL and have noticed a very odd situation. I have two dataframes, base and updated. The "updated" dataframe contains a constrained subset of data from "base" that I wish to exclude. Something like this. updated = base.where(base.X =
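The original snippet is truncated, but a minimal sketch of the described setup in Spark 1.6 might look like the following (the key column, filter condition, and toy data are assumptions, not from the thread); one common way to exclude the "updated" subset from "base" is a left outer join followed by a null filter, since the left anti join type is not available until Spark 2.0:

from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.functions import col

sc = SparkContext(appName="left-join-exclusion-sketch")
sqlContext = SQLContext(sc)

# Toy stand-ins for the real "base" and "updated" frames (hypothetical data).
base = sqlContext.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["id", "X"])
updated = base.where(base.X == "b")  # the constrained subset to exclude

# Left outer join on the key, then keep only rows with no match in "updated".
remaining = (base.join(updated.select(updated.id.alias("u_id")),
                       base.id == col("u_id"), "left_outer")
             .where(col("u_id").isNull())
             .drop("u_id"))
remaining.show()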

Heavy Stage Concentration - Ends With Failure

2016-07-19 Thread Aaron Jackson
Hi, I have a cluster with 15 nodes, of which 5 are HDFS nodes. I kick off a job that creates some 120 stages. Eventually, the active and pending stages reduce down to a small bottleneck, and it never fails... the 10 (or so) running tasks are always allocated to the same
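Not from the thread, but when running tasks pile up on the same few nodes like this, data locality is one possible cause, and the scheduler's locality wait settings are among the usual knobs to experiment with. A hedged sketch, with illustrative values only:

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("locality-tuning-sketch")
        # How long the scheduler waits for a data-local slot before falling
        # back to a less-local one; shortening it can spread tasks across nodes.
        .set("spark.locality.wait", "1s")
        .set("spark.locality.wait.node", "1s")
        .set("spark.locality.wait.rack", "1s"))
sc = SparkContext(conf=conf)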

S3A Creating Task Per Byte (pyspark / 1.6.1)

2016-05-12 Thread Aaron Jackson
I'm using spark 1.6.1 (hadoop-2.6) and I'm trying to load a file that's in S3. I've done this previously with spark 1.5 with no issue. Attempting to load and count a single file as follows: dataFrame = sqlContext.read.text('s3a://bucket/path-to-file.csv') dataFrame.count() But when it
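For context, a minimal sketch of the same read with the S3A filesystem configured through the Hadoop configuration (the bucket path and credential values are placeholders; the config keys are the standard hadoop-aws settings for this era):

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="s3a-read-sketch")

# S3A settings go through the Hadoop configuration; values are placeholders.
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
hadoop_conf.set("fs.s3a.access.key", "YOUR_ACCESS_KEY")
hadoop_conf.set("fs.s3a.secret.key", "YOUR_SECRET_KEY")

sqlContext = SQLContext(sc)
dataFrame = sqlContext.read.text("s3a://bucket/path-to-file.csv")
print(dataFrame.count())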

Re: Best way to determine # of workers

2016-03-25 Thread Aaron Jackson

Re: Best way to determine # of workers

2016-03-24 Thread Aaron Jackson
Well that's unfortunate; it just means I have to scrape the web UI for that information. As to why: I have a cluster that is being increased in size to accommodate the processing requirements of a large set of jobs. It's useful to know when the new workers have joined the Spark cluster. In my
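As a rough sketch of the "scrape the web UI" approach, the standalone master serves a JSON view of its status alongside the HTML page; the host, port, and field names below should be checked against your version, and the example assumes a standalone deployment:

import json
import urllib2

def worker_count(master_host="master", port=8080):
    # The standalone master exposes its status as JSON at /json.
    resp = urllib2.urlopen("http://%s:%d/json" % (master_host, port))
    status = json.load(resp)
    # Count only workers currently reporting state ALIVE.
    return sum(1 for w in status.get("workers", []) if w.get("state") == "ALIVE")

print(worker_count())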

Re: Passing parameters to spark SQL

2015-12-28 Thread Aaron Jackson
Yeah, that's what I thought. In this specific case, I'm porting over some scripts from an existing RDBMS platform. I had been porting them (slowly) to in-code notation with Python or Scala. However, to expedite my efforts (and presumably theirs, since I'm not doing this forever), I went down the
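Since Spark SQL in the 1.x line has no bind-parameter syntax, the usual workaround when porting parameterized RDBMS scripts is plain Python string substitution before calling sqlContext.sql(). A minimal sketch (table, columns, and values are hypothetical):

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="sql-params-sketch")
sqlContext = SQLContext(sc)

df = sqlContext.createDataFrame(
    [(1, "2015-12-01"), (2, "2015-12-28")], ["id", "load_date"])
df.registerTempTable("events")

# Substitute parameters into the SQL text before submitting it.
params = {"cutoff": "2015-12-15"}
query = "SELECT id FROM events WHERE load_date >= '{cutoff}'".format(**params)
sqlContext.sql(query).show()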

Increasing memory usage on batch job (pyspark)

2015-12-01 Thread Aaron Jackson
Greetings, I am processing a "batch" of files and have structured an iterative process around them. Each batch is processed by first loading the data with spark-csv, performing some minor transformations, and then writing back out as parquet. Absolutely no caching or shuffle should occur with
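A minimal sketch of the loop shape being described, using the external spark-csv package that accompanied Spark 1.6 (paths, the header option, and the transformation are placeholders, not details from the thread):

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="batch-csv-to-parquet-sketch")
sqlContext = SQLContext(sc)

# Placeholder batch paths; the real job iterates over its own file list.
input_paths = ["/data/in/batch_%03d.csv" % i for i in range(3)]

for path in input_paths:
    df = (sqlContext.read
          .format("com.databricks.spark.csv")
          .option("header", "true")
          .load(path))
    # Placeholder for the minor transformations mentioned above.
    transformed = df.withColumnRenamed("old_name", "new_name")
    out_path = path.replace("/in/", "/out/") + ".parquet"
    transformed.write.mode("overwrite").parquet(out_path)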