[
https://issues.apache.org/jira/browse/FLINK-838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14054250#comment-14054250
]
Artem Tsikiridis edited comment on FLINK-838 at 7/7/14 10:32 PM:
-----------------------------------------------------------------
Hello,
this week
1) I implemented sorting for {{mapred}}. One can now specify a key comparator in
the job's {{JobConf}}, and it alters Flink's sorting behavior. I reworked
{{WritableComparator}} so that a custom comparator takes priority over the
default. A descending-order comparator test works fine. I'm dealing with some
final issues: the goal is to pass a full secondary-sort test, which will prove
that the partitioner, the sorting, and the grouping of values before reducing
all work. It doesn't yet pass in all the cases I have in mind, unfortunately;
hopefully tomorrow.
WIP branch : https://github.com/atsikiridis/incubator-flink/tree/sorting-hadoop
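For reference, in the {{mapred}} API the custom comparator is the one registered via {{JobConf.setOutputKeyComparatorClass()}}. The descending-order semantics that comparator provides can be illustrated in plain Java, without any Hadoop dependency (class and variable names here are purely illustrative, not from the WIP branch):

```java
import java.util.Arrays;
import java.util.Comparator;

public class DescendingSortSketch {
    // Mirrors what a descending key comparator does in the shuffle:
    // invert the natural order so the reducer sees keys largest-first.
    static final Comparator<Integer> DESCENDING =
            (a, b) -> Integer.compare(b, a);

    public static void main(String[] args) {
        Integer[] keys = {3, 1, 2};
        Arrays.sort(keys, DESCENDING);
        System.out.println(Arrays.toString(keys)); // [3, 2, 1]
    }
}
```

A full secondary sort additionally needs a grouping comparator ({{JobConf.setOutputValueGroupingComparator()}} in {{mapred}}) so that records with the same natural key reach one {{reduce()}} call even though the sort key is composite.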
2) I started working on the mapping of the {{DistributedCache}}. I passed a txt
file from Hadoop's DistributedCache to Flink's. The process seems
straightforward for {{mapred}} and should also be done this week. A lookup-table
test case would be nice here.
Unfortunately, it is a deprecated interface
(https://hadoop.apache.org/docs/r2.3.0/api/org/apache/hadoop/filecache/DistributedCache.html)
and {{mapreduce}} is suggested instead. I'll keep it for now, since we started
with {{mapred}}. The {{mapreduce}} equivalent seems similar (though probably
more complex, as the API got richer :) ), and one can implement it using the
{{mapred}} one as a reference. In GSoC terms: maybe the last week, if there is
time. After this week I would like to focus on the complete mapping of
{{mapred}}; later, someone (or me :) ?) can use it as a reference to implement
{{mapreduce}}.
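For the lookup-table test case mentioned above: on the Hadoop side the file would be registered with {{DistributedCache.addCacheFile()}} and located with {{DistributedCache.getLocalCacheFiles()}}; the consuming task then parses it into an in-memory map. A minimal plain-Java sketch of that parsing step (the tab-separated format and all names are illustrative assumptions, not from the actual test):

```java
import java.util.HashMap;
import java.util.Map;

public class LookupTableSketch {
    // Parse "key<TAB>value" lines, as a cached text file might contain,
    // into an in-memory lookup table for use inside a map() function.
    static Map<String, String> parse(String[] lines) {
        Map<String, String> table = new HashMap<>();
        for (String line : lines) {
            String[] parts = line.split("\t", 2);
            if (parts.length == 2) {
                table.put(parts[0], parts[1]);
            }
        }
        return table;
    }

    public static void main(String[] args) {
        Map<String, String> table =
                parse(new String[] {"de\tGermany", "gr\tGreece"});
        System.out.println(table.get("gr")); // Greece
    }
}
```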
3) I have done some experiments on improving the grouping of values before the
reducer. The current implementation is very naive, but this is closely related
to the sorting work, and hopefully the two will glue together nicely in the
next few days.
So, I would like to finish everything concerning sorting and the
DistributedCache by the end of the week; then there will be one last big chunk
of work: the full mapping of the {{JobClient}} and the Configuration. It is
good to think of it as one big thing; it keeps you more focused! Of course, I
will come back to the current work in response to any feedback comments.
That's it mostly.
PS: I've mentioned in the hangouts that I may rework some {{runtime}} stuff and
open a PR. The thing is, my requirements change a bit as I go, so I will make
the changes as needed, and you can review them in my final PR. That would be
better, I think.
Cheers!
> GSoC Summer Project: Implement full Hadoop Compatibility Layer for
> Stratosphere
> -------------------------------------------------------------------------------
>
> Key: FLINK-838
> URL: https://issues.apache.org/jira/browse/FLINK-838
> Project: Flink
> Issue Type: Improvement
> Reporter: GitHub Import
> Labels: github-import
> Fix For: pre-apache
>
>
> This is a meta issue for tracking @atsikiridis progress with implementing a
> full Hadoop Compatibility Layer for Stratosphere.
> Some documentation can be found in the Wiki:
> https://github.com/stratosphere/stratosphere/wiki/%5BGSoC-14%5D-A-Hadoop-abstraction-layer-for-Stratosphere-(Project-Map-and-Notes)
> As well as the project proposal:
> https://github.com/stratosphere/stratosphere/wiki/GSoC-2014-Project-Proposal-Draft-by-Artem-Tsikiridis
> Most importantly, there is the following **schedule**:
> *19 May - 27 June (Midterm)*
> 1) Work on the Hadoop tasks, their Context and the mapping of Hadoop's
> Configuration to the one of Stratosphere. By successfully bridging the Hadoop
> tasks with Stratosphere, we already cover the most basic Hadoop Jobs. This
> can be determined by running some popular Hadoop examples on Stratosphere
> (e.g. WordCount, k-means, join) (4 - 5 weeks)
> 2) Understand how the running of these jobs works (e.g. the command line
> interface) for the wrapper. Implement how the user will run them. (1 - 2
> weeks).
> *27 June - 11 August*
> 1) Continue wrapping more "advanced" Hadoop Interfaces (Comparators,
> Partitioners, Distributed Cache etc.) There are quite a few interfaces and it
> will be a challenge to support all of them. (5 full weeks)
> 2) Profiling of the application and optimizations (if applicable)
> *11 August - 18 August*
> Write documentation on code, write a README with care and add more
> unit-tests. (1 week)
> ---------------- Imported from GitHub ----------------
> Url: https://github.com/stratosphere/stratosphere/issues/838
> Created by: [rmetzger|https://github.com/rmetzger]
> Labels: core, enhancement, parent-for-major-feature,
> Milestone: Release 0.7 (unplanned)
> Created at: Tue May 20 10:11:34 CEST 2014
> State: open
--
This message was sent by Atlassian JIRA
(v6.2#6252)