I'm still bumping up against this issue: Spark (and Shark) are breaking my inputs into 64MB-sized splits. Anyone know where/how to configure Spark so that it either doesn't split the inputs, or at least uses a much larger split size? (E.g., 512MB.)
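
To make the question concrete, here's roughly the knob I mean from the plain-Spark side. This is just a spark-shell sketch; the S3 path is made up, 512MB is only an example value, and I'm setting both the old and new Hadoop property names to be safe:

// spark-shell sketch: push a larger minimum split size into the Hadoop conf
// that Spark's input formats consult when planning splits.
// `sc` is the SparkContext that spark-shell already provides.
val minSplitBytes = 512L * 1024 * 1024  // example value only

// new-API and old-API names for the same setting
sc.hadoopConfiguration.setLong("mapreduce.input.fileinputformat.split.minsize", minSplitBytes)
sc.hadoopConfiguration.setLong("mapred.min.split.size", minSplitBytes)

// hypothetical path; with the larger minimum split size this should yield
// far fewer partitions than the default 64MB chunking
val lines = sc.textFile("s3n://my-bucket/path/to/tsv/")
println(lines.partitions.length)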

Thanks,

DR

On 07/15/2014 05:58 PM, David Rosenstrauch wrote:
Got a Spark/Shark cluster up and running recently, and have been kicking
the tires on it.  However, I've been wrestling with an issue on it that I'm
not quite sure how to solve.  (Or, at least, not quite sure about the
correct way to solve it.)

I ran a simple Hive query (select count ...) against a dataset of .tsv
files stored in S3, and then ran the same query on Shark for comparison.
But the Shark query took 3x as long.

After a bit of digging, I was able to find out what was happening:
apparently with the Hive query each map task was reading an input split
consisting of 2 entire files from the dataset (approximately 180MB
each), while with Shark each task was reading an input split consisting
of a 64MB chunk from one of the files.  That explained the slowdown:
each ~180MB file gets carved into roughly three 64MB splits, so the
Shark query had to open each S3 file 3 separate times (and run roughly
3x as many tasks), and it made sense that it took much longer.

After much experimentation I was finally able to work around this issue
by overriding the value of mapreduce.input.fileinputformat.split.minsize
in my hive-site.xml file.  (Bumping it up to 512MB; the exact property
is shown below, after the list.)  However, I'm feeling like this isn't
really the "right" way to solve the issue:

a) That param normally defaults to 1.  It doesn't seem right that I should
need to override it, let alone set it to a value as large as 512MB.

b) We only seem to experience this issue on an existing Hadoop cluster
that we've deployed Spark/Shark onto.  When we run the same query on a
new cluster launched via the Spark EC2 scripts, the number of splits
seems to get calculated correctly, without the need for overriding that
param.  This leads me to believe we may just have something misconfigured
on our existing cluster.

c) This seems like an error-prone way to overcome the issue.  512MB is
an arbitrary value, and should I happen to run a query against files
that are larger than 512MB, I'll run into the chunking issue again.
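
For reference, the workaround I described above boils down to a property
along these lines in hive-site.xml (512MB expressed in bytes):

<!-- current workaround: force a 512MB minimum split size;
     536870912 = 512 * 1024 * 1024 -->
<property>
  <name>mapreduce.input.fileinputformat.split.minsize</name>
  <value>536870912</value>
</property>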

So my gut tells me there's a better way to solve this - i.e.,
somehow configuring Spark so that the input splits it generates won't
chunk the input files.  Anyone know how to accomplish this, or what I
might have misconfigured?
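
In case it helps anyone suggest what to check: the split-related settings
can be dumped from a spark-shell on each cluster and diffed. The property
list below is just my guess at the relevant ones, not an exhaustive set:

// spark-shell sketch: print the effective Hadoop settings that feed into
// split size calculation, so the existing cluster can be compared against
// the EC2-script-launched one.  `sc` is the shell's SparkContext.
val propsToCheck = Seq(
  "mapreduce.input.fileinputformat.split.minsize",
  "mapreduce.input.fileinputformat.split.maxsize",
  "mapred.min.split.size",
  "mapred.max.split.size",
  "fs.s3n.block.size",
  "dfs.blocksize"
)
propsToCheck.foreach { key =>
  println(key + " = " + Option(sc.hadoopConfiguration.get(key)).getOrElse("<unset>"))
}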

Thanks,

DR

