Hi everyone, Between Spark 1.0.2 and Spark 1.1.1, I have noticed that rdd.take() consistently has a slower execution time on the later release. I was wondering if anyone else has had similar observations.
I have two setups where this reproduces. The first is a local test. I launched a spark cluster with 4 worker JVMs on my Mac, and launched a Spark-Shell. I retrieved the text file and immediately called rdd.take(N) on it, where N varied. The RDD is a plaintext CSV, 4GB in size, split over 8 files, which ends up having 128 partitions, and a total of 80000000 rows. The numbers I discovered between Spark 1.0.2 and Spark 1.1.1 are, with all numbers being in seconds: 10000 items Spark 1.0.2: 0.069281, 0.012261, 0.011083 Spark 1.1.1: 0.11577, 0.097636, 0.11321 40000 items Spark 1.0.2: 0.023751, 0.069365, 0.023603 Spark 1.1.1: 0.224287, 0.229651, 0.158431 100000 items Spark 1.0.2: 0.047019, 0.049056, 0.042568 Spark 1.1.1: 0.353277, 0.288965, 0.281751 400000 items Spark 1.0.2: 0.216048, 0.198049, 0.796037 Spark 1.1.1: 1.865622, 2.224424, 2.037672 This small test suite indicates a consistently reproducible performance regression. I also notice this on a larger scale test. The cluster used is on EC2: ec2 instance type: m2.4xlarge 10 slaves, 1 master ephemeral storage 70 cores, 50 GB/box In this case, I have a 100GB dataset split into 78 files totally 350 million items, and I take the first 50,000 items from the RDD. In this case, I have tested this on different formats of the raw data. With plaintext files: Spark 1.0.2: 0.422s, 0.363s, 0.382s Spark 1.1.1: 4.54s, 1.28s, 1.221s, 1.13s With snappy-compressed Avro files: Spark 1.0.2: 0.73s, 0.395s, 0.426s Spark 1.1.1: 4.618s, 1.81s, 1.158s, 1.333s Again demonstrating a reproducible performance regression. I was wondering if anyone else observed this regression, and if so, if anyone would have any idea what could possibly have caused it between Spark 1.0.2 and Spark 1.1.1? Thanks, -Matt Cheah
smime.p7s
Description: S/MIME cryptographic signature