Below is a very simple application. It runs very slowly, and it does not look
like I am getting much parallel execution. I imagine this is a very common
workflow: periodically I want to run some standard summary statistics
across several different data sets.

Any suggestions would be greatly appreciated.

Andy

Overview 
All the data sets use the same format. The data is Twitter tweets stored as
JSON. The JSON is very complicated, and each record can be as large as 4 KB.
The data is collected using Spark Streaming. Every mini batch is stored in S3
as a separate object, e.g. s3n://buckName/date/timestampMS/parts*. I only
select one column from the data frame. The column is a top-level key in the
JSON structure.

The program is simple

For each data set:
1. Find all the part files
2. Load them into a data frame
3. Calculate the summary stat and print it
4. Free the memory

In my example below the data sets are not very big.

# fullPath is a list of part files.
sqlContext.read.format('json').load(fullPath).select("body") #.cache()


%%timeit -n 1 -r 1
# %timeit # line magic
# %%timeit # cell magic
# -n 1 -r 1 # run cell once

for prefix in districtDataSets:
    dataSet = [name for name in constituentDataSets if name.startswith(prefix)]
    # print(dataSets)
    # would be nice if we could have this loop run in parallel
    constituentDFS = getDataFrames(dataSet) # returns a dictionary
    # we could union but would probably be slower
    total = 0
    for name in constituentDFS:
        c = constituentDFS[name].count()
        total = total + c
    print("{} {:15,}".format(prefix, total))
    # free memory
    del constituentDFS
ne-2 110169
fl-8 12
mi-1 2552
ny-19 27657
ny-24 59739
pa-8 42867
wi-8 7352
ny-3 51136
ny-1 105296
ny-22 5671287
mn-2 34834
tx-8 5246
il-6 12772
co-6 24700
1 loop, best of 1: 2h 41min 8s per loop
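
The cell above notes that it would be nice to run the per-district loop in
parallel. One idea I have been toying with, just a sketch and untested (it
assumes getDataFrames and count() are safe to call from several driver
threads at once), is to submit the count jobs from a small thread pool so the
cluster has more than one job to schedule at a time:

from concurrent.futures import ThreadPoolExecutor

def countDistrict(prefix):
    # same per-district work as the cell above
    dataSet = [name for name in constituentDataSets if name.startswith(prefix)]
    constituentDFS = getDataFrames(dataSet)  # returns a dictionary
    return prefix, sum(constituentDFS[name].count() for name in constituentDFS)

# the threads only submit Spark jobs; the counting itself still runs on the executors
with ThreadPoolExecutor(max_workers=4) as pool:
    for prefix, total in pool.map(countDistrict, districtDataSets):
        print("{} {:15,}".format(prefix, total))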
Environment

I am using spark-1.6.1

My app is using
10 cores, 
6GB per node
5 executors
1 driver

Each executor has at most 2 active tasks


Overall, the resources do not seem to be utilized well. I do not think
adding machines would improve performance.
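
As a quick sanity check on how much parallelism the scheduler actually has to
work with, I think something like this should do (assuming sc and one of the
loaded data frames, here called df, are in scope):

# total task slots the scheduler sees across the cluster
print(sc.defaultParallelism)
# partitions in one loaded data frame, i.e. the most tasks this read can run at once
print(df.rdd.getNumPartitions())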

I launch the notebook server as follows

#
# allow notebooks to use all available resources
#
export PYSPARK_PYTHON=python3.4
export PYSPARK_DRIVER_PYTHON=python3.4
export IPYTHON_OPTS="notebook --no-browser --port=7000 --log-level=WARN"

$SPARK_ROOT/bin/pyspark \
    --master $MASTER_URL \
    --driver-memory 2G \
    --executor-memory 6G \
    $extraPkgs \
    $*



All my data is being read from S3.
- Is there an easy way to figure out how much time I am spending reading?
- I am guessing S3 is really slow. I have a lot of objects to read.
- I imagine copying the data to HDFS would run faster; however, I have not
found an easy way to copy the data. I am using EC2. It looks like I would
have to copy from S3 to a file partition in my cluster and then copy to
HDFS.


Looking at the stages, it does not look like shuffle is a major problem.




