[jira] Commented: (HADOOP-2030) Some changes to Record I/O interfaces

Vivek Ratan (JIRA) Wed, 07 Nov 2007 03:20:14 -0800

    [ 
https://issues.apache.org/jira/browse/HADOOP-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12540718
 ]


Vivek Ratan commented on HADOOP-2030:
-------------------------------------

There are a couple of different issues here, so let me address them separately. 

While we currently do not need to pass in any information, such as field name, 
to RecordOutput or RecordInput, it's possible that we'll need something in the 
future. In that case, instead of a string, I suggest passing in some class that 
can contain whatever information you need to pass. See my writeup in the 
original comment. A string tag is too specific. 

Regarding top-level records, there are two issues with our current setup: 
1. There are two public serialize() methods in Record, one that takes a tag and 
one that doesn't. The user should really call the one that doesn't take a tag. 
The other one is called by the serialize method of the top-level record. 
2. Absence or presence of a tag is a bad way to indicate whether we're 
de/serializing a top level record or not. 
There should really only be one serialize() method available to the user, and 
this is the one that the user calls for a top-level records. This method should 
take in a RecordOutput. When serializing nested records, our generated code 
should call some other method, which would indicate that the call is to a 
nested record. This method should not be accessible to the user (using 
protected/private or some such thing, or perhaps using a different method 
name). 

I'll play with the code and see if this can be done in Java and C++. 

Regardless, the 'tag' parameter should be dispensed with, as both these issues 
can be handled in a better way without it. 


> Some changes to Record I/O interfaces
> -------------------------------------
>
>                 Key: HADOOP-2030
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2030
>             Project: Hadoop
>          Issue Type: Improvement
>            Reporter: Vivek Ratan
>
> I wanted to suggest some changes to the Record I/O interfaces. 
> Under org.apache.hadoop.record, _RecordInput_ and _RecordOutput_ are the 
> interfaces to serialize and deserialize basic types for Java-generated stubs. 
> All the methods in _RecordInput_ and _RecordOutput_ take a parameter, a 
> string, called 'tag'. As far as I can see, this tag is used only for 
> XML-based serialization, to write out the name of the field that is being 
> serialized.A lot of the  methods ignore it. My proposal is to eliminate this 
> parameter, for a number of reasons: 
> - We don't need to write the name of a field when serializing in XML. None of 
> the other serializers (for binary or CSV) write out the name of a field - we 
> only write the field value. The generated stubs know which field is 
> associated with which value (and now, with type information support, the 
> field name is part of the type information and is not required to be 
> serialized along with the field data). In fact, even in XML, I don't see the 
> field name being read back in, so it serves no purpose whatsoever. 
> - The tag is used occasionally in the error message, but again this can be 
> handled better by the caller of _RecordInput_ and _RecordOutput_. 
> - The tag is also used to detect whether a record is nested or not. In CSV, 
> we wrap nested records with "s{}". We also want to know whether a record is 
> nested or the top-most, so that we add a newline at the end of a top-most 
> record. If a tag is empty, it is assumed that the record is the top-most. 
> This is using the tag parameter to mean something else. It's far more 
> readable to just pass in a boolean to _startRecord()_ and _endRecord()_ which 
> directly indicates whether the record is nested or not. Or, add two 
> additional methods to _RecordOutput_ and _RecordInput_: _start()_ and 
> _stop()_, which are called at the beginning and end of every top-most record 
> while _startRecord()_ and _endRecord()_ are used only for nested records. The 
> former's slightly better, IMO, but each method is much better than using an 
> empty tag to indicate a top-level record.
> The issue with tags brings up a related issue. Sometimes, we may need to pass 
> in additional information to _RecordInput_ or _RecordOutput_. For example, 
> suppose we do need to write the field name along with the field value. We can 
> think of such a requirement in two ways. A) Such decisions of what to 
> serialize/deserialize are independent of the format/protocol that the data is 
> serialized in. If we want to write something else, that should be written 
> separately by the stub. So, if we want to serialize the field name before a 
> field value, a stub should call _RecordOutput.writeString(<field name>)_ 
> first, followed by _RecordOutput.writeInt(<field value>)_. The methods in 
> _RecordInput_ and _RecordOutput_  are the lowest level methods and they 
> should just be concerned with writing individual types.  B) What if a 
> protocol wants to write things differently? For example, we may want to write 
> the field name before the field value for XML only (for debugging sake, or 
> for whatever else). Or it may be that the field name and field value need to 
> be enclosed in certain tags that can't happen if you write them separately. 
> In these cases, methods in _RecordInput_ and _RecordOutput_ need to be passed 
> additional information. This can be done by providing an optional parameter 
> for these methods. Maybe a structure/class containing field information, or a 
> reference to the field itself (the Tag parameter was meant to serve a similar 
> purpose, but just passing in a String may be inadequate). For now, there is 
> no real need for either of these situations, so we should be OK with getting 
> rid of the tag parameter. 
> Similar changes need to be done to the C++ side, where we have _OArchive_ and 
> _IArchive_: 
> - The tag parameter needs to be removed
> - _startRecord()_ and _endRecord()_ in _OArchive_ and _IArchive_ need to take 
> a boolean parameter that indicates whether the record is nested or not
> - Currently, both _startRecord()_ and _endRecord()_ in  _IArchive_ take an 
> additional parameter, a reference to a hadoop record. This is never used 
> anywhere not required (the corresponding methods in _RecordInput_ and 
> _RecordOutput_ don't take any parameters, which is the right thing to do), 
> and should be removed. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-2030) Some changes to Record I/O interfaces

Reply via email to