[
https://issues.apache.org/jira/browse/FLINK-838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14054250#comment-14054250
]
Artem Tsikiridis edited comment on FLINK-838 at 7/7/14 10:32 PM:
-----------------------------------------------------------------
Hello,
this week
1) I implemented sorting for {{mapred}}. One can now specify a key comparator in
the job's {{JobConf}}, and it alters Flink's sorting behavior. I reworked
{{WritableComparator}} so that a custom comparator takes priority over the
default. A descending-order comparator test works fine. I'm dealing with some
final issues: the goal is to pass a full secondary-sort test, which will prove
that the partitioner, the sorting, and the grouping of values before reducing
all work. It doesn't yet pass in all the cases I have in mind, unfortunately;
hopefully tomorrow.
WIP branch : https://github.com/atsikiridis/incubator-flink/tree/sorting-hadoop
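For reference, in the {{mapred}} API the custom comparator is the one registered via {{JobConf.setOutputKeyComparatorClass()}}. The descending-order semantics that comparator provides can be illustrated in plain Java, without any Hadoop dependency (class and variable names here are purely illustrative, not from the WIP branch):

```java
import java.util.Arrays;
import java.util.Comparator;

public class DescendingSortSketch {
    // Mirrors what a descending key comparator does in the shuffle:
    // invert the natural order so the reducer sees keys largest-first.
    static final Comparator<Integer> DESCENDING =
            (a, b) -> Integer.compare(b, a);

    public static void main(String[] args) {
        Integer[] keys = {3, 1, 2};
        Arrays.sort(keys, DESCENDING);
        System.out.println(Arrays.toString(keys)); // [3, 2, 1]
    }
}
```

A full secondary sort additionally needs a grouping comparator ({{JobConf.setOutputValueGroupingComparator()}} in {{mapred}}) so that records with the same natural key reach one {{reduce()}} call even though the sort key is composite.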
2) I started working on the mapping of the {{DistributedCache}}. I passed a txt
file from Hadoop's DistributedCache to Flink's. The process seems
straightforward for {{mapred}} and should also be done this week. A lookup-table
test case would be nice here.
Unfortunately, it is a deprecated interface
(https://hadoop.apache.org/docs/r2.3.0/api/org/apache/hadoop/filecache/DistributedCache.html)
and {{mapreduce}} is suggested instead. I'll keep it for now, since we started
with {{mapred}}. The {{mapreduce}} equivalent seems similar (though probably
more complex, as the API got richer :) ), and one can implement it using the
{{mapred}} one as a reference. In GSoC terms: maybe the last week, if there is
time. After this week I would like to focus on the complete mapping of
{{mapred}}; later, someone (or me :) ?) can use it as a reference to implement
{{mapreduce}}.
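For the lookup-table test case mentioned above: on the Hadoop side the file would be registered with {{DistributedCache.addCacheFile()}} and located with {{DistributedCache.getLocalCacheFiles()}}; the consuming task then parses it into an in-memory map. A minimal plain-Java sketch of that parsing step (the tab-separated format and all names are illustrative assumptions, not from the actual test):

```java
import java.util.HashMap;
import java.util.Map;

public class LookupTableSketch {
    // Parse "key<TAB>value" lines, as a cached text file might contain,
    // into an in-memory lookup table for use inside a map() function.
    static Map<String, String> parse(String[] lines) {
        Map<String, String> table = new HashMap<>();
        for (String line : lines) {
            String[] parts = line.split("\t", 2);
            if (parts.length == 2) {
                table.put(parts[0], parts[1]);
            }
        }
        return table;
    }

    public static void main(String[] args) {
        Map<String, String> table =
                parse(new String[] {"de\tGermany", "gr\tGreece"});
        System.out.println(table.get("gr")); // Greece
    }
}
```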
3) I have done some experiments on improving the grouping of values before the
reducer. The current implementation is very naive, but this is closely related
to the sorting work, and hopefully the two will glue together nicely in the
next few days.
So, I would like to finish everything concerning sorting and the
DistributedCache by the end of the week; then there will be one last big chunk
of work: the full mapping of the {{JobClient}} and the Configuration. It is
good to think of it as one big thing; it keeps you more focused! Of course, I
will come back to the current work in response to any feedback comments.
That's it mostly.
PS: I've mentioned in the hangouts that I may rework some {{runtime}} stuff and
open a PR. The thing is, my requirements change a bit as I go, so I will make
the changes as needed, and you can review them in my final PR. That would be
better, I think.
Cheers!
> GSoC Summer Project: Implement full Hadoop Compatibility Layer for
> Stratosphere
> -------------------------------------------------------------------------------
>
> Key: FLINK-838
> URL: https://issues.apache.org/jira/browse/FLINK-838
> Project: Flink
> Issue Type: Improvement
> Reporter: GitHub Import
> Labels: github-import
> Fix For: pre-apache
>
>
> This is a meta issue for tracking @atsikiridis progress with implementing a
> full Hadoop Compatibility Layer for Stratosphere.
> Some documentation can be found in the Wiki:
> https://github.com/stratosphere/stratosphere/wiki/%5BGSoC-14%5D-A-Hadoop-abstraction-layer-for-Stratosphere-(Project-Map-and-Notes)
> As well as the project proposal:
> https://github.com/stratosphere/stratosphere/wiki/GSoC-2014-Project-Proposal-Draft-by-Artem-Tsikiridis
> Most importantly, there is the following **schedule**:
> *19 May - 27 June (Midterm)*
> 1) Work on the Hadoop tasks, their Context and the mapping of Hadoop's
> Configuration to the one of Stratosphere. By successfully bridging the Hadoop
> tasks with Stratosphere, we already cover the most basic Hadoop Jobs. This
> can be determined by running some popular Hadoop examples on Stratosphere
> (e.g. WordCount, k-means, join) (4 - 5 weeks)
> 2) Understand how the running of these jobs works (e.g. the command line
> interface) for the wrapper. Implement how the user will run them. (1 - 2
> weeks).
> *27 June - 11 August*
> 1) Continue wrapping more "advanced" Hadoop Interfaces (Comparators,
> Partitioners, Distributed Cache etc.) There are quite a few interfaces and it
> will be a challenge to support all of them. (5 full weeks)
> 2) Profiling of the application and optimizations (if applicable)
> *11 August - 18 August*
> Write documentation on code, write a README with care and add more
> unit-tests. (1 week)
> ---------------- Imported from GitHub ----------------
> Url: https://github.com/stratosphere/stratosphere/issues/838
> Created by: [rmetzger|https://github.com/rmetzger]
> Labels: core, enhancement, parent-for-major-feature,
> Milestone: Release 0.7 (unplanned)
> Created at: Tue May 20 10:11:34 CEST 2014
> State: open
--
This message was sent by Atlassian JIRA
(v6.2#6252)