Re: pyspark sc.textFile uses only 4 out of 32 threads per node

2015-01-20 Thread Nicholas Chammas
Mapper so that the CPU threads
> >> can't be fully utilized.
> >>
> >> -Original Message-
> >> From: Gautham [mailto:gautham.a...@gmail.com]
> >> Sent: Wednesday, December 10, 2014 3:00 AM
> >> To: u...@spark.incubator.apache.org
>

Re: pyspark sc.textFile uses only 4 out of 32 threads per node

2015-01-16 Thread Gautham Anil
split for processing by MapReduce. A single
>> gz file can only be processed by a single Mapper so that the CPU threads
>> can't be fully utilized.
>>
>> -Original Message-
>> From: Gautham [mailto:gautham.a...@gmail.com]
>> Sent: Wednesday, Decem

Re: pyspark sc.textFile uses only 4 out of 32 threads per node

2014-12-17 Thread Nicholas Chammas
mailto:gautham.a...@gmail.com]
> Sent: Wednesday, December 10, 2014 3:00 AM
> To: u...@spark.incubator.apache.org
> Subject: pyspark sc.textFile uses only 4 out of 32 threads per node
>
> I am having an issue with pyspark launched in ec2 (using spark-ec2) with 5
> r3.4xlarge machine

RE: pyspark sc.textFile uses only 4 out of 32 threads per node

2014-12-16 Thread Sun, Rui
- From: Gautham [mailto:gautham.a...@gmail.com]
Sent: Wednesday, December 10, 2014 3:00 AM
To: u...@spark.incubator.apache.org
Subject: pyspark sc.textFile uses only 4 out of 32 threads per node

I am having an issue with pyspark launched in ec2 (using spark-ec2) with 5 r3.4xlarge machines where e

Re: pyspark sc.textFile uses only 4 out of 32 threads per node

2014-12-16 Thread Sebastián Ramírez
Are you reading the file from your driver (main / master) program? Is your file in a distributed file system like HDFS, available to all your nodes? It might be due to the laziness of transformations: http://spark.apache.org/docs/latest/programming-guide.html#rdd-operations "Transformations" are lazy

pyspark sc.textFile uses only 4 out of 32 threads per node

2014-12-09 Thread Gautham
I am having an issue with pyspark launched in EC2 (using spark-ec2) with 5 r3.4xlarge machines, where each has 32 threads and 240GB of RAM. When I use sc.textFile to load data from a number of gz files, it does not progress as fast as expected. When I log in to a child node and run top, I see only 4
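
Editor's sketch: one reply quoted in this thread says a single gz file can only be processed by a single Mapper. The standard-library Python below illustrates why, assuming nothing about Spark itself: a gzip stream can be decoded from its start, but a reader handed only the second half of the bytes has no header and no decoder state. The sample data and the variable names are hypothetical and do not come from the thread.

```python
import gzip
import zlib

# Hypothetical sample data standing in for one input file.
raw = b"".join(f"record {i}\n".encode() for i in range(10000))
compressed = gzip.compress(raw)

# A reader that starts at byte 0 can decode the whole stream.
assert gzip.decompress(compressed) == raw

# A second reader that is handed only the bytes from the midpoint onward
# cannot decode them, so the file cannot be split between two workers.
midpoint = len(compressed) // 2
try:
    gzip.decompress(compressed[midpoint:])
    second_half_readable = True
except (OSError, EOFError, zlib.error):
    second_half_readable = False

print("second half readable on its own:", second_half_readable)
```

Because each gz file is read by one task, the number of input files bounds the parallelism of the read. A common workaround, which the truncated replies above do not confirm, is to call `repartition` on the RDD returned by `sc.textFile` so that later stages use more cores, or to store the input as a larger number of smaller gz files.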