[ 
https://issues.apache.org/jira/browse/HADOOP-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12541092
 ] 

Milind Bhandarkar commented on HADOOP-2030:
-------------------------------------------

Vivek, could you outline how the RecordInput and RecordOutput classes will look 
like (after your changes) for current XML serializer (which emits field names) 
and future JSON serializer ?

> Some changes to Record I/O interfaces
> -------------------------------------
>
>                 Key: HADOOP-2030
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2030
>             Project: Hadoop
>          Issue Type: Improvement
>            Reporter: Vivek Ratan
>
> I wanted to suggest some changes to the Record I/O interfaces. 
> Under org.apache.hadoop.record, _RecordInput_ and _RecordOutput_ are the 
> interfaces to serialize and deserialize basic types for Java-generated stubs. 
> All the methods in _RecordInput_ and _RecordOutput_ take a parameter, a 
> string, called 'tag'. As far as I can see, this tag is used only for 
> XML-based serialization, to write out the name of the field that is being 
> serialized.A lot of the  methods ignore it. My proposal is to eliminate this 
> parameter, for a number of reasons: 
> - We don't need to write the name of a field when serializing in XML. None of 
> the other serializers (for binary or CSV) write out the name of a field - we 
> only write the field value. The generated stubs know which field is 
> associated with which value (and now, with type information support, the 
> field name is part of the type information and is not required to be 
> serialized along with the field data). In fact, even in XML, I don't see the 
> field name being read back in, so it serves no purpose whatsoever. 
> - The tag is used occasionally in the error message, but again this can be 
> handled better by the caller of _RecordInput_ and _RecordOutput_. 
> - The tag is also used to detect whether a record is nested or not. In CSV, 
> we wrap nested records with "s{}". We also want to know whether a record is 
> nested or the top-most, so that we add a newline at the end of a top-most 
> record. If a tag is empty, it is assumed that the record is the top-most. 
> This is using the tag parameter to mean something else. It's far more 
> readable to just pass in a boolean to _startRecord()_ and _endRecord()_ which 
> directly indicates whether the record is nested or not. Or, add two 
> additional methods to _RecordOutput_ and _RecordInput_: _start()_ and 
> _stop()_, which are called at the beginning and end of every top-most record 
> while _startRecord()_ and _endRecord()_ are used only for nested records. The 
> former's slightly better, IMO, but each method is much better than using an 
> empty tag to indicate a top-level record.
> The issue with tags brings up a related issue. Sometimes, we may need to pass 
> in additional information to _RecordInput_ or _RecordOutput_. For example, 
> suppose we do need to write the field name along with the field value. We can 
> think of such a requirement in two ways. A) Such decisions of what to 
> serialize/deserialize are independent of the format/protocol that the data is 
> serialized in. If we want to write something else, that should be written 
> separately by the stub. So, if we want to serialize the field name before a 
> field value, a stub should call _RecordOutput.writeString(<field name>)_ 
> first, followed by _RecordOutput.writeInt(<field value>)_. The methods in 
> _RecordInput_ and _RecordOutput_  are the lowest level methods and they 
> should just be concerned with writing individual types.  B) What if a 
> protocol wants to write things differently? For example, we may want to write 
> the field name before the field value for XML only (for debugging sake, or 
> for whatever else). Or it may be that the field name and field value need to 
> be enclosed in certain tags that can't happen if you write them separately. 
> In these cases, methods in _RecordInput_ and _RecordOutput_ need to be passed 
> additional information. This can be done by providing an optional parameter 
> for these methods. Maybe a structure/class containing field information, or a 
> reference to the field itself (the Tag parameter was meant to serve a similar 
> purpose, but just passing in a String may be inadequate). For now, there is 
> no real need for either of these situations, so we should be OK with getting 
> rid of the tag parameter. 
> Similar changes need to be done to the C++ side, where we have _OArchive_ and 
> _IArchive_: 
> - The tag parameter needs to be removed
> - _startRecord()_ and _endRecord()_ in _OArchive_ and _IArchive_ need to take 
> a boolean parameter that indicates whether the record is nested or not
> - Currently, both _startRecord()_ and _endRecord()_ in  _IArchive_ take an 
> additional parameter, a reference to a hadoop record. This is never used 
> anywhere not required (the corresponding methods in _RecordInput_ and 
> _RecordOutput_ don't take any parameters, which is the right thing to do), 
> and should be removed. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply via email to