All,

I'm using the Spark shell to interact with a small test deployment of
Spark, built from the current master branch. I'm processing a dataset
comprising a few thousand objects on Google Cloud Storage, split across
half a dozen directories. My code constructs an object (call it the
Dataset object) that defines a distinct RDD for each directory. The
constructor of the object only defines the RDDs; it does not actually
evaluate them, so I would expect it to return very quickly. Indeed, the
logging code in the constructor prints a line signaling the completion of
the code almost immediately after invocation, but the Spark shell does not
show the prompt right away. Instead, it spends a few minutes seemingly
frozen, eventually producing the following output:

14/12/18 05:52:49 INFO mapred.FileInputFormat: Total input paths to process : 9
14/12/18 05:54:15 INFO mapred.FileInputFormat: Total input paths to process : 759
14/12/18 05:54:40 INFO mapred.FileInputFormat: Total input paths to process : 228
14/12/18 06:00:11 INFO mapred.FileInputFormat: Total input paths to process : 3076
14/12/18 06:02:02 INFO mapred.FileInputFormat: Total input paths to process : 1013
14/12/18 06:02:21 INFO mapred.FileInputFormat: Total input paths to process : 156

This stage is inexplicably slow. What could be happening?
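For reference, the constructor follows roughly this pattern (a minimal sketch; the class name, bucket, and directory list are illustrative, not my actual code):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// One RDD per input directory. textFile only *defines* the RDD;
// no job is submitted here, so I expect the constructor to return fast.
class Dataset(sc: SparkContext, dirs: Seq[String]) {
  val rdds: Map[String, RDD[String]] =
    dirs.map(dir => dir -> sc.textFile(s"gs://my-bucket/$dir")).toMap

  // This line prints almost immediately after invocation...
  println("Dataset constructed")
  // ...yet the shell prompt does not return for several minutes.
}
```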

Thanks.


Alex
