[
https://issues.apache.org/jira/browse/HADOOP-941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12475832
]
David Bowen commented on HADOOP-941:
------------------------------------
Milind,
Excuse my newbie-ness. I didn't realize that readVLong etc were old code from
WritableUtils. Are these methods now duplicated in record.Utils simply to
facilitate using the o.a.h.r package stand-alone? That would seem unfortunate.
I still can't see that these methods are correct. I see the sign bit removed
from negative numbers, but I don't see where it is put back. In any case, it
would seem logical for writeVLong to use fewer than 8 bytes for small negative
values, and it does not appear to do that.
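For illustration, one well-known way to make small negative values encode
compactly is a zigzag mapping, which interleaves negative and non-negative
numbers so that values near zero (of either sign) get small unsigned codes
that a vint scheme can then store in few bytes. This is only a sketch of the
general idea, not the scheme WritableUtils actually uses:

```java
// Sketch: zigzag mapping between signed longs and small unsigned codes.
// -1 -> 1, 0 -> 0, 1 -> 2, -2 -> 3, ... so small magnitudes stay small.
public class ZigZagSketch {
    static long encode(long n) { return (n << 1) ^ (n >> 63); }
    static long decode(long z) { return (z >>> 1) ^ -(z & 1); }

    public static void main(String[] args) {
        System.out.println(encode(-1));                    // 1
        System.out.println(decode(encode(-123456789L)));   // -123456789
    }
}
```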
On a separate topic: it might be worth considering a different approach to code
generation for record Comparators. E.g. a generated record could have an
additional method to return its "legend", like this:
private static final byte[] legend = { TYPE_BOOL, TYPE_FLOAT, TYPE_LONG,
TYPE_USTRING };
public byte[] getLegend() { return legend; }
where the TYPE_* constants are static final bytes. Then you could have a
single Comparator that knows how to compare the binary forms of any records
that expose a legend - it just iterates over the legend, using a switch
statement to do the right thing for each type.
I think there is a maintenance benefit in keeping the generated code as small
and as simple as possible. Performance-wise, this adds the overhead of a
for-loop and a switch statement dispatch, but I don't think that would be
significant.
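Here is a rough sketch of what such a legend-driven comparator might look
like. The TYPE_* codes and the assumed serialized layout (plain DataOutput
encoding: 1-byte boolean, 4-byte float, 8-byte long, length-prefixed UTF
string) are hypothetical, not the actual o.a.h.record wire format:

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;

// Sketch: compare two serialized records field-by-field, driven by a legend.
public class LegendComparator {
    static final byte TYPE_BOOL = 0, TYPE_FLOAT = 1, TYPE_LONG = 2, TYPE_USTRING = 3;

    public static int compare(byte[] legend, byte[] a, byte[] b) throws IOException {
        DataInputStream da = new DataInputStream(new ByteArrayInputStream(a));
        DataInputStream db = new DataInputStream(new ByteArrayInputStream(b));
        for (byte type : legend) {
            int c;
            switch (type) {
                case TYPE_BOOL:    c = Boolean.compare(da.readBoolean(), db.readBoolean()); break;
                case TYPE_FLOAT:   c = Float.compare(da.readFloat(), db.readFloat()); break;
                case TYPE_LONG:    c = Long.compare(da.readLong(), db.readLong()); break;
                case TYPE_USTRING: c = da.readUTF().compareTo(db.readUTF()); break;
                default: throw new IOException("unknown type code " + type);
            }
            if (c != 0) return c;   // first differing field decides the order
        }
        return 0;
    }
}
```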
- David
> Make Hadoop Record I/O Easier to use outside Hadoop
> ---------------------------------------------------
>
> Key: HADOOP-941
> URL: https://issues.apache.org/jira/browse/HADOOP-941
> Project: Hadoop
> Issue Type: Improvement
> Components: record
> Affects Versions: 0.10.1
> Environment: All
> Reporter: Milind Bhandarkar
> Assigned To: Milind Bhandarkar
> Attachments: jute-patch.txt
>
>
> Hadoop record I/O can be used effectively outside of Hadoop. Its utility
> would increase if developers could use it without having to import Hadoop
> classes or depend on Hadoop jars. The following changes to the current
> translator and runtime are proposed.
> Proposed Changes:
> 1. Use java.lang.String as a native type for ustring (instead of Text.)
> 2. Provide a Buffer class as a native Java type for buffer (instead of
> BytesWritable), so that later BytesWritable could be implemented as following
> DDL:
> module org.apache.hadoop.io {
>     record BytesWritable {
>         buffer value;
>     }
> }
> 3. Member names in generated classes should not have the prefix 'm' before
> their names. In the above example, the private member would be named 'value',
> not 'mvalue' as it is now.
> 4. Convert getters and setters to CamelCase. e.g. in the above example
> the getter will be:
> public Buffer getValue();
> 5. Provide a 'swiggable' C binding, so that processing the generated C code
> with swig allows it to be used in scripting languages such as Python and Perl.
> 6. The default --language="java" target would generate record classes that
> do not depend on Hadoop's WritableComparable interface, but instead have
> "implements Record, Comparable". (i.e. they will not have write() and
> readFields() methods.) An additional option "--writable" will need to be
> specified on the rcc command line to generate classes that "implements
> Record, WritableComparable".
> 7. Optimize generated write() and readFields() methods, so that they do not
> have to create BinaryOutputArchive or BinaryInputArchive every time these
> methods are called on a record.
> 8. Implement ByteInStream and ByteOutStream for C++ runtime, as they will be
> needed for using Hadoop Record I/O with forthcoming C++ MapReduce framework
> (currently, only FileStreams are provided.)
> 9. Generate clone() methods for records in Java i.e. the generated classes
> should implement Cloneable.
> 10. As part of Hadoop build process, produce a tar bundle for Record I/O
> alone. This tar bundle will contain the translator classes and ant task
> (lib/rcc.jar), translator script (bin/rcc), Java runtime (recordio.jar) that
> includes org.apache.hadoop.record.*, sources for the java runtime (src/java),
> and c/c++ runtime sources with Makefiles (src/c++, src/c).
> 11. Make the generated Java code for maps and vectors use Java generics.
> These are the proposed user-visible changes. Internally, the translator will
> be restructured so that it is easier to plug in translators for different
> targets.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.