sanha opened a new pull request #5: [NEMO-27] Element Wise Block Write
URL: https://github.com/apache/incubator-nemo/pull/5
 
 
   JIRA: [NEMO-27: Element Wise Block 
Write](https://issues.apache.org/jira/browse/NEMO-27)
   
   **Major changes:**
   - Changed interface of `Partitioner` to accept data element (instead of 
`Iterable`) and return it's key.
   
   - Made `OutputWriter` write data to `BlockStore` element-by-element.
       - When a writer start to write data (element-wisely) to a `Block`, 
`Block` manages `Partition` according to the writing process. If an element 
with key which appears first time, the `Block` create a `Partition` having 
corresponding key. If else, the element will be accumulated to existing 
`Partition`.
       - When the writer finish the writing process and declare that the 
`Block` is committed, the `Block` commit all `Partition`s and prohibit any 
further write. 
       - The previous partition-level writing method also will be used for 
inter-`BlockStore` data movement. (For example, when we spill data from 
`SerializedMemoryStore` to `LocalFileStore`, we have to be able to fetch and 
write data in units of `Partition` to avoid extra (de)serialization.)
   
   **Minor changes to note:**
   - Adapted `DataSkewRuntimePass` and it's metric collecting process for the 
change.
       - The sizes of `Partition`s in a `Block` will be collected when the 
`Block` is committed, but not each `Partition` is written.
       - The type of partition size metric is changed from `Long` to 
`Pair<Integer, Long>`.
           - Before this pr, all partitions in `HashRange` were created by 
`Partitioner` even if the `Partition` is empty. Because of this, `List<Long>` 
was enough to represent the integer key and size of each `Partition`.
           - However, after this pr, only `Partition`s with actual data will be 
created, so the metric have to contain the key of each `Partition`.
   
   - Changed the output files created during ITCases to be deleted even if the 
test fails.
   
   **Tests for the changes:**
   
   - Changed interface of `Partitioner`
       - `DataTransferTest` and all ITCases cover the change.
   - Element-wise write to `BlockStore`
       - Refactored `BlockStoreTest` to cover this change.
   - Modified `DataSkewRuntimePass`
      - `DataSkewRuntimePassTest` and `MapReduceITCase#testDataSkew` cover this 
change.
   
   **Other comments:**
   - At now, [Intra-Stage 
pipelining](https://issues.apache.org/jira/projects/NEMO/issues/NEMO-7) cannot 
fully exploit it's advantage because all output data produced by a `TaskGroup` 
have to be collected in `OutputWriter` until the `TaskGroup` is completed. If 
we support this element-wise write to `BlockStore`, these data can be 
serialized early, and it can reduce memory pressure.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

Reply via email to