[ https://issues.apache.org/jira/browse/SPARK-7263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14619100#comment-14619100 ]

Matt Massie commented on SPARK-7263:
------------------------------------

The Spark shuffle manager APIs, in their current state, don't support a 
standalone shuffle implementation. If you like, I can split my pull request 
into two parts: (a) changes to Spark, e.g. [serializing class 
info|https://github.com/massie/spark/commit/fc03c0bd29fa71ff390b86a8f6fd31c1cbef960f],
 making APIs public, etc., and (b) the new Parquet implementation.

I think your comment that "we're creating a whole new shuffle subsystem for one 
data type" is technically correct, but it misses the bigger point. The currently 
supported data type, {{IndexedRecord}}, is the base type for all Avro objects 
and includes three methods -- {{get}}, {{put}} and {{getSchema}} -- the 
primitives necessary for describing, storing and building objects. Since 
Parquet supports Thrift and Protobuf too, it would be straightforward to add 
their base types, which serve similar functions, as well.
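As a concrete (if contrived) illustration of that three-method surface, here is 
a minimal sketch using Avro's generic API ({{GenericData.Record}} is one of the 
many {{IndexedRecord}} implementations); the schema and field names are just 
examples:

{code:scala}
import org.apache.avro.SchemaBuilder
import org.apache.avro.generic.{GenericData, IndexedRecord}

// An illustrative Avro schema: a simple (key, value) pair record.
val schema = SchemaBuilder.record("Pair").fields()
  .requiredString("key")
  .requiredLong("value")
  .endRecord()

// getSchema describes the object, put builds it, get reads fields back
// by position -- everything a shuffle writer/reader needs.
val record: IndexedRecord = new GenericData.Record(schema)
record.put(0, "word")
record.put(1, 42L)
assert(record.get(1) == 42L)
assert(record.getSchema == schema)
{code}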

I reached out to Michael Armbrust and looked at the Spark SQL code in depth 
before I wrote this. I had hoped to piggyback on the Spark SQL work but found 
that it wasn't a good match. If you like, I can list all the issues that I 
found.

I'd like to know why you think this would be a maintenance nightmare. I think 
otherwise, but then, I wrote it. Can you be more specific about your 
maintenance concerns?

> Add new shuffle manager which stores shuffle blocks in Parquet
> --------------------------------------------------------------
>
>                 Key: SPARK-7263
>                 URL: https://issues.apache.org/jira/browse/SPARK-7263
>             Project: Spark
>          Issue Type: New Feature
>          Components: Block Manager
>            Reporter: Matt Massie
>
> I have a working prototype of this feature that can be viewed at
> https://github.com/apache/spark/compare/master...massie:parquet-shuffle?expand=1
> Setting the "spark.shuffle.manager" to "parquet" enables this shuffle manager.
> The dictionary support that Parquet provides appreciably reduces the amount of
> memory that objects use; however, once Parquet data is shuffled, all the
> dictionary information is lost and the column-oriented data is written to 
> shuffle
> blocks in a record-oriented fashion. This shuffle manager addresses this issue
> by reading and writing all shuffle blocks in the Parquet format.
> If the shuffle objects are Avro records, then the Avro schema is converted to
> a Parquet schema and used directly; otherwise, the Parquet schema is generated
> via reflection. Currently, the only non-Avro keys supported are primitive
> types. The reflection code can be improved (or replaced) to support complex
> records.
> The ParquetShufflePair class allows the shuffle key and value to be stored in
> Parquet blocks as a single record with a single schema.
> This commit adds the following new Spark configuration options:
> "spark.shuffle.parquet.compression" - sets the Parquet compression codec
> "spark.shuffle.parquet.blocksize" - sets the Parquet block size
> "spark.shuffle.parquet.pagesize" - set the Parquet page size
> "spark.shuffle.parquet.enabledictionary" - turns dictionary encoding on/off
> Parquet does not (and has no plans to) support a streaming API. Metadata
> sections are scattered throughout a Parquet file, making a streaming API
> difficult. As such, the ShuffleBlockFetcherIterator has been modified to fetch
> the entire contents of map outputs into temporary blocks before loading the
> data into the reducer.
> Interesting future asides:
> o There is no need to define a data serializer (although Spark requires one)
> o Parquet supports predicate pushdown and projection, which could be used
>   between shuffle stages to improve performance in the future
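For anyone who wants to try the prototype, here is a minimal sketch of wiring 
it up from an application. The configuration keys come from the issue 
description above; the codec and size values are illustrative placeholders, 
not the prototype's defaults:

{code:scala}
import org.apache.spark.{SparkConf, SparkContext}

// Enable the Parquet shuffle manager and tune its knobs. The keys are
// from the issue description; the values are placeholder choices.
val conf = new SparkConf()
  .setAppName("parquet-shuffle-example")
  .set("spark.shuffle.manager", "parquet")
  .set("spark.shuffle.parquet.compression", "gzip")
  .set("spark.shuffle.parquet.blocksize", (128 * 1024 * 1024).toString)
  .set("spark.shuffle.parquet.pagesize", (1024 * 1024).toString)
  .set("spark.shuffle.parquet.enabledictionary", "true")

val sc = new SparkContext(conf)

// Any shuffle-producing operation now reads and writes its shuffle
// blocks in the Parquet format.
val counts = sc.textFile("input.txt")
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1L))
  .reduceByKey(_ + _)
{code}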


