[ https://issues.apache.org/jira/browse/HADOOP-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12541092 ]
Milind Bhandarkar commented on HADOOP-2030:
-------------------------------------------
Vivek, could you outline what the RecordInput and RecordOutput classes will look
like (after your changes) for the current XML serializer (which emits field
names) and a future JSON serializer?
> Some changes to Record I/O interfaces
> -------------------------------------
>
> Key: HADOOP-2030
> URL: https://issues.apache.org/jira/browse/HADOOP-2030
> Project: Hadoop
> Issue Type: Improvement
> Reporter: Vivek Ratan
>
> I wanted to suggest some changes to the Record I/O interfaces.
> Under org.apache.hadoop.record, _RecordInput_ and _RecordOutput_ are the
> interfaces to serialize and deserialize basic types for Java-generated stubs.
> All the methods in _RecordInput_ and _RecordOutput_ take a parameter, a
> string, called 'tag'. As far as I can see, this tag is used only for
> XML-based serialization, to write out the name of the field that is being
> serialized. A lot of the methods ignore it. My proposal is to eliminate this
> parameter, for a number of reasons:
> - We don't need to write the name of a field when serializing in XML. None of
> the other serializers (for binary or CSV) write out the name of a field - we
> only write the field value. The generated stubs know which field is
> associated with which value (and now, with type information support, the
> field name is part of the type information and is not required to be
> serialized along with the field data). In fact, even in XML, I don't see the
> field name being read back in, so it serves no purpose whatsoever.
> - The tag is used occasionally in the error message, but again this can be
> handled better by the caller of _RecordInput_ and _RecordOutput_.
> - The tag is also used to detect whether a record is nested or not. In CSV,
> we wrap nested records with "s{}". We also want to know whether a record is
> nested or the top-most, so that we add a newline at the end of a top-most
> record. If a tag is empty, it is assumed that the record is the top-most.
> This is using the tag parameter to mean something else. It's far more
> readable to just pass in a boolean to _startRecord()_ and _endRecord()_ which
> directly indicates whether the record is nested or not. Or, add two
> additional methods to _RecordOutput_ and _RecordInput_: _start()_ and
> _stop()_, which are called at the beginning and end of every top-most record
> while _startRecord()_ and _endRecord()_ are used only for nested records. The
> former is slightly better, IMO, but either approach is much better than using
> an empty tag to indicate a top-level record.
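To make the boolean variant concrete, here is a minimal sketch of a CSV-style writer whose _startRecord()_ and _endRecord()_ take a nested flag instead of a tag. The class name CsvSketchOutput and the result() helper are hypothetical, and string escaping is left out; this is not the existing Hadoop CSV serializer, only an illustration of the proposed signatures.

```java
/** Hypothetical CSV-style writer; illustrates the proposed boolean flag only. */
class CsvSketchOutput {
    private final StringBuilder out = new StringBuilder();
    private boolean isFirst = true; // true when no field has been written at this level yet

    // Emit a comma before every field except the first one at the current level.
    private void separator() {
        if (!isFirst) {
            out.append(',');
        }
        isFirst = false;
    }

    public void writeInt(int i) {
        separator();
        out.append(i);
    }

    // Assumption: no quoting or escaping, to keep the sketch short.
    public void writeString(String s) {
        separator();
        out.append(s);
    }

    // A nested record is wrapped in "s{" ... "}"; a top-level record is not wrapped.
    public void startRecord(boolean nested) {
        if (nested) {
            separator();
            out.append("s{");
            isFirst = true;
        }
    }

    // A top-level record ends with a newline; a nested one just closes the brace.
    public void endRecord(boolean nested) {
        if (nested) {
            out.append('}');
            isFirst = false;
        } else {
            out.append('\n');
            isFirst = true;
        }
    }

    public String result() {
        return out.toString();
    }
}
```

With this shape, a generated stub would call startRecord(false) and endRecord(false) for the outermost record and pass true for every record it contains, so the writer never has to infer nesting from an empty tag.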
> The issue with tags brings up a related issue. Sometimes, we may need to pass
> in additional information to _RecordInput_ or _RecordOutput_. For example,
> suppose we do need to write the field name along with the field value. We can
> think of such a requirement in two ways. A) Such decisions of what to
> serialize/deserialize are independent of the format/protocol that the data is
> serialized in. If we want to write something else, that should be written
> separately by the stub. So, if we want to serialize the field name before a
> field value, a stub should call _RecordOutput.writeString(<field name>)_
> first, followed by _RecordOutput.writeInt(<field value>)_. The methods in
> _RecordInput_ and _RecordOutput_ are the lowest level methods and they
> should just be concerned with writing individual types. B) What if a
> protocol wants to write things differently? For example, we may want to write
> the field name before the field value for XML only (for debugging's sake, or
> for whatever else). Or it may be that the field name and field value need to
> be enclosed in certain tags that can't happen if you write them separately.
> In these cases, methods in _RecordInput_ and _RecordOutput_ need to be passed
> additional information. This can be done by providing an optional parameter
> for these methods. Maybe a structure/class containing field information, or a
> reference to the field itself (the Tag parameter was meant to serve a similar
> purpose, but just passing in a String may be inadequate). For now, there is
> no real need for either of these situations, so we should be OK with getting
> rid of the tag parameter.
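The two options (A and B) can be compared side by side in a short sketch. All names here (FieldInfo, SketchOutput, PlainOutput, TaggedOutput, StubSketch) are hypothetical and are not part of org.apache.hadoop.record; the tagged format is only XML-like and does no escaping. In option A the stub writes the field name as an ordinary value; in option B the stub passes optional field information and the format decides what to do with it.

```java
/** Hypothetical carrier for per-field metadata (option B). */
final class FieldInfo {
    final String name;
    FieldInfo(String name) { this.name = name; }
}

/** Hypothetical low-level output; info may be null when the caller has nothing to add. */
interface SketchOutput {
    void writeString(String s, FieldInfo info);
    void writeInt(int i, FieldInfo info);
}

/** A format that writes values only and ignores any field information. */
final class PlainOutput implements SketchOutput {
    final StringBuilder out = new StringBuilder();
    public void writeString(String s, FieldInfo info) { out.append(s).append(';'); }
    public void writeInt(int i, FieldInfo info) { out.append(i).append(';'); }
}

/** A format that encloses the field name and value in one element (no escaping). */
final class TaggedOutput implements SketchOutput {
    final StringBuilder out = new StringBuilder();

    private void emit(String type, String value, FieldInfo info) {
        out.append('<').append(type);
        if (info != null) {
            out.append(" name=\"").append(info.name).append('"');
        }
        out.append('>').append(value).append("</").append(type).append('>');
    }

    public void writeString(String s, FieldInfo info) { emit("string", s, info); }
    public void writeInt(int i, FieldInfo info) { emit("int", Integer.toString(i), info); }
}

/** Stand-in for a generated stub with a single int field called "count". */
final class StubSketch {
    // Option A: the stub serializes the field name itself, as a separate string.
    static void serializeA(SketchOutput o, int count) {
        o.writeString("count", null);
        o.writeInt(count, null);
    }

    // Option B: the stub hands field information to the format along with the value.
    static void serializeB(SketchOutput o, int count) {
        o.writeInt(count, new FieldInfo("count"));
    }
}
```

Option A keeps the interface minimal but produces the name and value as two unrelated elements in the tagged format; option B lets that format put them in one element while the plain format simply ignores the extra argument.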
> Similar changes need to be made on the C++ side, where we have _OArchive_ and
> _IArchive_:
> - The tag parameter needs to be removed
> - _startRecord()_ and _endRecord()_ in _OArchive_ and _IArchive_ need to take
> a boolean parameter that indicates whether the record is nested or not
> - Currently, both _startRecord()_ and _endRecord()_ in _IArchive_ take an
> additional parameter, a reference to a hadoop record. This is never used
> anywhere and is not required (the corresponding methods in _RecordInput_ and
> _RecordOutput_ don't take any parameters, which is the right thing to do),
> and should be removed.