Sandy, Ryan, Andrew
Thanks very much. I think I now understand it better.
Jeff
From: ryan.blake.willi...@gmail.com
Date: Thu, 19 Nov 2015 06:00:30 +
Subject: Re: SequenceFile and object reuse
To: sandy.r...@cloudera.com; jeffsar...@hotmail.com
CC: user@spark.apache.org
Hey Jeff, in addition to what Sandy said, there are two more reasons that
this might not be as bad as it seems; I may be incorrect in my
understanding, though.
First, the "additional step" you're referring to is unlikely to add
any overhead; the "extra map" is really just materializing the
Hi Jeff,
Many access patterns simply take the result of hadoopFile and use it to
create some other object, and thus have no need for each input record to
refer to a different object. In those cases, the current API is more
performant than an alternative that would create an object for each record.
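To illustrate the access pattern Sandy describes, here is a minimal sketch with no Spark or Hadoop dependency: the `MutableText` class and the loop stand in for a reused Writable (e.g. `org.apache.hadoop.io.Text`) and the per-record conversion, and are our invention for this example.

```java
import java.util.ArrayList;
import java.util.List;

public class ImmediateConvertDemo {
    // Stand-in for a reused Writable such as org.apache.hadoop.io.Text.
    static class MutableText {
        private final StringBuilder buf = new StringBuilder();
        void set(String s) { buf.setLength(0); buf.append(s); }
        @Override public String toString() { return buf.toString(); }
    }

    public static void main(String[] args) {
        List<String> rows = List.of("x", "y", "z");
        MutableText reused = new MutableText(); // one object for ALL records

        // Derive a NEW immutable value from each record right away
        // (here via toString()), so it never matters that the underlying
        // Writable is the same instance every time.
        List<String> out = new ArrayList<>();
        for (String r : rows) {
            reused.set(r);              // the "reader" mutates in place
            out.add(reused.toString()); // immediate conversion -> safe
        }
        System.out.println(out); // [x, y, z]
    }
}
```

Because each record is converted before the next one overwrites the shared object, no defensive copy is ever needed in this pattern.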
So we tried reading a SequenceFile in Spark and realized that all our records
ended up being the same.
Then one of us found this:
Note: Because Hadoop's RecordReader class re-uses the same Writable object for
each record, directly caching the returned RDD or directly passing it to an
aggregation or shuffle operation will create many references to the same
object. If you plan to directly cache, sort, or aggregate Writable objects,
you should first copy them using a map function.
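The failure mode and the documented fix can be sketched without a Spark cluster. In the snippet below, the shared `Row` instance plays the role of Hadoop's reused Writable; the class name and loop are ours, and the Spark equivalent of the fix would be something like `rdd.map(r => copyOf(r))` before caching.

```java
import java.util.ArrayList;
import java.util.List;

public class WritableReuseDemo {
    // Stand-in for a reused Writable record.
    static class Row {
        String value;
        Row(String v) { value = v; }
    }

    public static void main(String[] args) {
        List<String> data = List.of("a", "b", "c");

        // Broken path: cache references to the single reused object.
        Row shared = new Row("");
        List<Row> cached = new ArrayList<>();
        for (String s : data) {
            shared.value = s;   // "RecordReader" mutates the same instance
            cached.add(shared); // caching the reference, not a copy
        }
        // Every cached entry now shows the LAST record's value.
        System.out.println(cached.get(0).value); // c

        // Fix: copy each record before caching, as the docs advise
        // (in Spark: a map that clones each Writable).
        Row shared2 = new Row("");
        List<Row> copied = new ArrayList<>();
        for (String s : data) {
            shared2.value = s;
            copied.add(new Row(shared2.value)); // defensive copy
        }
        System.out.println(copied.get(0).value); // a
    }
}
```

This is exactly why all of your records appeared identical: the RDD held many references to one mutable object, so every element reflected whatever record was read last.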