Re: [jira] Updated: (HADOOP-1284) clean up the protocol between stream mapper/reducer and the framework

Arkady Borkovsky Wed, 25 Apr 2007 12:44:14 -0700

Runping,

as we discussed yesterday, it may be better to implement more complete functionality that would allow any combination of fields to be specified for partitioning and for sorting. This can easily be implemented on top of the functionality this specific patch provides (by having the "streaming mapper" class prepend the actual keys, and stripping them in the "streaming reducer" class before feeding the records to the streaming reducer command provided by the user).

However, at the user level, I'd suggest you export the "complete" functionality, rather than limiting it by requiring the keys to be at the beginning of the record.
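
A minimal sketch of the prepend-and-strip idea described above, written in Python for illustration only. The function names, the zero-based field indices, and the tab separator are assumptions made for this example; they are not part of the patch or of any Hadoop API.

```python
def prepend_key(record, partition_fields, sort_fields, separator="\t"):
    """Mapper side: put the chosen fields in front of the unchanged record.

    partition_fields and sort_fields are zero-based indices (an assumption
    of this sketch). The framework would then partition on the first
    len(partition_fields) fields and sort on the whole prepended key.
    """
    fields = record.split(separator)
    key_fields = [fields[i] for i in partition_fields]
    key_fields += [fields[i] for i in sort_fields]
    return separator.join(key_fields + [record])


def strip_key(line, num_key_fields, separator="\t"):
    """Reducer side: drop the prepended key fields and recover the record."""
    # Splitting at most num_key_fields times leaves the original record,
    # separators included, in the last element.
    return line.split(separator, num_key_fields)[num_key_fields]
```

With this arrangement the user's reducer command sees exactly the records the user's mapper command produced, while partitioning and sorting can use fields from anywhere in the record.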


On Apr 25, 2007, at 11:13 AM, Runping Qi (JIRA) wrote:

[https://issues.apache.org/jira/browse/HADOOP-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Runping Qi updated HADOOP-1284:
-------------------------------

    Description:
Right now, the protocol between stream mapper/reducer and the framework is very inflexible.
The mapper/reducer generates line-oriented output. The framework picks it up line by line, and splits each line into a key/value pair. By default, the substring up to the first tab char is the key, and the
substring after the first tab char is the value.
However, in many cases, the application wants some control over how the pair is split.
Here, I'd like to introduce the following configuration variables for that:
1. "streaming.output.field.separator": the default value is the tab char.
But the user can specify a different one (e.g. ':', or ', ', etc.)
A map output line can be considered as a list of fields separated by the separator.
2. "streaming.num.fields.for.mapout.key": the number of leading fields that will be used as the map output key
(and for sorting on the reduce side).
The default value is 1.
The rest of the fields will be used as the value. For example, I can specify the first 5 fields as my mapout key.
3. "streaming.num.fields.for.partitioning": Sometimes, I want to use fewer fields for partitioning to
achieve the "primary/secondary" composite
key effect as proposed in HADOOP-485. The default value is 1.
For example, I can set "streaming.num.fields.for.partitioning" to 3
and "streaming.num.fields.for.mapout.key" to 5.
This effectively amounts to saying that fields 4 and 5 are my secondary key.
With the above default values, it is compatible with the current behavior
while introducing a new desirable feature in a clean way.
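
To make the three settings above concrete, here is a small Python sketch of how a line could be split and partitioned under them. It is illustrative only: the function names are hypothetical, the handling of lines with too few fields is an assumption (the proposal does not specify it), and zlib.crc32 merely stands in for whatever hash the framework actually uses.

```python
import zlib


def split_key_value(line, separator="\t", num_key_fields=1):
    """Split one output line into (key, value).

    The first num_key_fields fields form the key; the rest form the value.
    """
    fields = line.split(separator)
    if len(fields) <= num_key_fields:
        # Assumption of this sketch: a line with too few fields is all key.
        return line, ""
    key = separator.join(fields[:num_key_fields])
    value = separator.join(fields[num_key_fields:])
    return key, value


def partition_for(key, num_partitions, separator="\t", num_partition_fields=1):
    """Pick a reduce partition from the first num_partition_fields key fields."""
    prefix = separator.join(key.split(separator)[:num_partition_fields])
    # crc32 is a stand-in hash, chosen here because it is deterministic.
    return zlib.crc32(prefix.encode("utf-8")) % num_partitions
```

With 5 key fields and 3 partitioning fields, two keys that agree on their first three fields land in the same partition, and fields 4 and 5 only affect the sort order within it, which is the secondary-key effect described above.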

Thoughts?




  was:
Right now, the protocol between stream mapper/reducer and the framework is very inflexible.
The mapper/reducer generates line-oriented output. The framework picks it up line by line, and splits each line into a key/value pair. By default, the substring up to the first tab char is the key, and the
substring after the first tab char is the value.
However, in many cases, the application wants some control over how the pair is split.
Here, I'd like to introduce the following configuration variables for that:
1. "streaming.output.field.separator": the default value is the tab char. But the user can specify a different one (e.g. '|', or '', etc.) A map output line can be considered as a list of fields separated by the separator.
2. "streaming.num.fields.for.mapout.key": the number of leading fields that will be used as the map output key (and for sorting on the reduce side).
The default value is 1.
The rest of the fields will be used as the value. For example, I can specify the first 5 fields as my mapout key.
3. "streaming.num.fields.for.partitioning": Sometimes, I want to use fewer fields for partitioning to achieve the "primary/secondary" composite key effect as proposed in HADOOP-485. The default value is 1. For example, I can set "streaming.num.fields.for.partitioning" to 3 and "streaming.num.fields.for.mapout.key" to 5. This effectively amounts to saying that fields 4 and 5 are my secondary key.
With the above default values, it is compatible with the current behavior while introducing a new desirable feature in a clean way.
Thoughts?





This patch implements the proposed protocol.
With this patch, the streaming user can specify a field separator for the mapper's output and/or a field separator
for the reducer's output. The default is the tab char.
The user can also specify how many fields in the output constitute the key. The default is 1.
The rest of the line will be the value.
A partitioner class, KeyFieldBasedPartitioner in mapred.lib, is also implemented.
The user can specify how many of the fields in the map output key
will be used for partitioning.
Also a utility class, FieldSelectionMapReduce in mapred.lib, is added.
This class allows the user to create map/reduce jobs that manipulate text data like the Unix cut utility.
The user can specify the field separator (the delimiter for cut) and specify which fields to select, and
by which fields to partition/sort.
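
As a rough illustration of the cut-like field selection mentioned above, the following Python sketch picks fields from a delimited line by position. It does not reproduce the interface of FieldSelectionMapReduce; the function name, the zero-based indices, and the choice to skip out-of-range indices are assumptions made for this example.

```python
def select_fields(line, indices, separator="\t"):
    """Return the fields at the given zero-based positions, rejoined.

    Indices beyond the end of the line are skipped (an assumption of
    this sketch, not documented behavior of the patch).
    """
    fields = line.split(separator)
    return separator.join(fields[i] for i in indices if i < len(fields))
```

For instance, select_fields("a:b:c:d", [0, 2], ":") picks the first and third fields, much as `cut -d: -f1,3` would.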

Two unit tests are introduced.
All the unit tests passed.
(Comment by Runping Qi, 25/Apr/07 11:07 AM.)
clean up the protocol between stream mapper/reducer and the framework
---------------------------------------------------------------------

                Key: HADOOP-1284
                URL: https://issues.apache.org/jira/browse/HADOOP-1284
            Project: Hadoop
         Issue Type: Improvement
           Reporter: Runping Qi
        Assigned To: Runping Qi
        Attachments: patch-1284.txt
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
