>> gz files can't be split for processing by MapReduce. A single gz file
>> can only be processed by a single Mapper, so the CPU threads can't be
>> fully utilized.
>>
>> -----Original Message-----
>> From: Gautham [mailto:gautham.a...@gmail.com]
>> Sent: Wednesday, December 10, 2014 3:00 AM
>> To: u...@spark.incubator.apache.org
>> Subject: pyspark sc.textFile uses only 4 out of 32 threads per node
Are you reading the file from your driver (main / master) program?
Is your file in a distributed file system like HDFS, available to all your nodes?
It might be due to the laziness of transformations:
http://spark.apache.org/docs/latest/programming-guide.html#rdd-operations
"Transformations" are lazy; nothing executes until an action is called.
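The laziness point can be illustrated outside Spark with a plain-Python generator pipeline (an analogy only; the names below are made up for the demo): a "transformation" merely builds a plan, and no input is read until something consumes the result, just as `sc.textFile` does no I/O until an action such as `count()` runs.

```python
# Analogy for lazy RDD transformations: a generator pipeline does no work
# until it is consumed, the way Spark defers execution until an action.
log = []

def read_lines():
    # stands in for an expensive file read
    log.append("read")
    for line in ["a", "bb", "ccc"]:
        yield line

lines = read_lines()                # "transformation": nothing runs yet
lengths = (len(l) for l in lines)   # still lazy: just builds the pipeline

assert log == []                    # no input has been read so far
total = sum(lengths)                # "action": the pipeline executes now
assert log == ["read"]
assert total == 6
```

So if the job seems stuck at `sc.textFile`, the slow step may actually be a later action that triggers the whole read.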
I am having an issue with pyspark launched in EC2 (using spark-ec2) with 5
r3.4xlarge machines, where each has 32 threads and 240 GB of RAM. When I do
sc.textFile to load data from a number of gz files, it does not progress as
fast as expected. When I log in to a child node and run top, I see only 4
of the 32 threads busy.
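As the reply above notes, gz files are not splittable, so each file becomes a single task. This can be demonstrated with Python's own `gzip` module, independent of Spark: a gzip stream has no sync markers, so it can only be decoded from its very first byte.

```python
# Why one .gz file maps to one task: a gzip stream can only be decompressed
# from the start, so a hypothetical second mapper starting mid-file could
# not decode its share of the data.
import gzip

payload = b"some log line\n" * 1000
raw = gzip.compress(payload)

# From the start, decompression succeeds.
assert gzip.decompress(raw) == payload

# From any later offset there is no valid gzip header, so decoding fails.
try:
    gzip.decompress(raw[1:])
    splittable = True
except OSError:  # gzip.BadGzipFile on Python 3.8+
    splittable = False
assert splittable is False
```

The usual workarounds are to supply many gz files (at least one per available core), repartition after loading (e.g. `sc.textFile(path).repartition(n)` — `n` chosen to match your total core count), or store the data uncompressed or in a splittable compression format.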