From: Pat Ferrel [mailto:p...@occamsmachete.com]
Sent: Thursday, April 23, 2015 5:51 PM
To: user@spark.apache.org
Subject: Tasks run only on one machine
I am using Spark Streaming to create a large volume of small nano-batch input
files, ~4k per file, thousands of ‘part-x’ files. When reading the nano-batch
files and doing a distributed calculation, my tasks run only on the machine
where the job was launched. I’m launching in “yarn-client” mode. The
Sure
var columns = mc.textFile(source).map { line => line.split(delimiter) }
Here “source” is a comma delimited list of files or directories. Both the
textFile and .map tasks happen only on the machine they were launched from.
Later other distributed operations happen but I suspect if I can
Argh, I looked and there really isn’t that much data yet. There will be
thousands but starting small.
I bet the total data size is simply too small to need all the workers. Sorry,
never mind.
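
One way to check the "not enough data" suspicion is to look at how many partitions the input actually produces, since Spark runs one task per partition. The following is only a sketch under assumptions: the paths and delimiter are hypothetical placeholders (the thread does not give them), "mc" simply mirrors the context name used in the snippet above, and the repartition step is one possible way to spread a small input across executors, not something the thread confirms was needed or used.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PartitionCheck {
  def main(args: Array[String]): Unit = {
    // Hypothetical values: the thread does not state the real paths or delimiter.
    val source = "hdfs:///path/to/batch-a,hdfs:///path/to/batch-b"
    val delimiter = ","

    // The master (e.g. yarn-client) is expected to come from spark-submit.
    val conf = new SparkConf().setAppName("partition-check")
    val mc = new SparkContext(conf)

    // textFile accepts a comma-separated list of files or directories.
    val lines = mc.textFile(source)

    // One task is scheduled per partition, so a handful of partitions
    // can easily be served by a single executor.
    println(s"input partitions:    ${lines.partitions.length}")
    println(s"default parallelism: ${mc.defaultParallelism}")

    // Assumption: if the input has too few partitions, a shuffle via
    // repartition redistributes the records before the map runs.
    val columns = lines
      .repartition(mc.defaultParallelism)
      .map(line => line.split(delimiter))

    println(s"partitions after repartition: ${columns.partitions.length}")

    mc.stop()
  }
}
```

If the first number printed is small compared with the default parallelism, the job has too few tasks to occupy every worker, which would match the conclusion above. Repartitioning costs a shuffle, so it is probably only worthwhile when the later stages are expensive.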
On Apr 23, 2015, at 10:30 AM, Pat Ferrel p...@occamsmachete.com wrote:
They are in HDFS so available on
Where are the file splits? Meaning, is it possible they were (only)
available on one node, and that node was also your driver?
On Thu, Apr 23, 2015 at 1:21 PM, Pat Ferrel p...@occamsmachete.com wrote:
Sure
var columns = mc.textFile(source).map { line => line.split(delimiter) }
Here “source”
Physically? Not sure, they were written using the nano-batch RDDs in a
streaming job that is in a separate driver. The job is a Kafka consumer.
Would that affect all derived RDDs? If so, is there something I can do to mix it
up, or does Spark know best about execution speed here?
On Apr 23,
Will you be able to paste code here?
On 23 April 2015 at 22:21, Pat Ferrel p...@occamsmachete.com wrote:
Using Spark streaming to create a large volume of small nano-batch input
files, ~4k per file, thousands of 'part-x' files. When reading the
nano-batch files and doing a distributed
They are in HDFS so available on all workers
On Apr 23, 2015, at 10:29 AM, Pat Ferrel p...@occamsmachete.com wrote:
Physically? Not sure, they were written using the nano-batch RDDs in a
streaming job that is in a separate driver. The job is a Kafka consumer.
Would that affect all derived