I use newAPIHadoopRDD with AccumuloInputFormat. It produces a PairRDD using Accumulo's Key and Value classes, both of which extend Writable. Works like a charm. I use the same InputFormat for all my MR jobs.
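In outline it looks roughly like this (a minimal sketch; it assumes sc is an existing SparkContext and that conf already carries the Accumulo connection settings, i.e. instance, zookeepers, table, and credentials, set through AccumuloInputFormat's configuration methods, which are omitted here):

    import org.apache.accumulo.core.client.mapreduce.AccumuloInputFormat
    import org.apache.accumulo.core.data.{Key, Value}
    import org.apache.hadoop.conf.Configuration
    import org.apache.spark.SparkContext

    // conf must already carry the Accumulo connection settings.
    // Key and Value extend Writable, so convert them to plain types
    // before any shuffle, cache, or collect (see the conversion
    // sketch at the end of this thread).
    def accumuloRdd(sc: SparkContext, conf: Configuration) =
      sc.newAPIHadoopRDD(conf, classOf[AccumuloInputFormat],
        classOf[Key], classOf[Value])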
-Russ

On Wed, Sep 24, 2014 at 9:33 AM, Steve Lewis <lordjoe2...@gmail.com> wrote:

> I tried newAPIHadoopFile and it works, except that my original InputFormat
> extends InputFormat<Text, Text> and has a RecordReader<Text, Text>. This
> throws a NotSerializableException on Text; changing the type to
> InputFormat<StringBuffer, StringBuffer> works with minor code changes.
> I do not, however, believe that Hadoop can use an InputFormat with types
> not derived from Writable.
> What were you using, and was it able to work with Hadoop?
>
> On Tue, Sep 23, 2014 at 5:52 PM, Liquan Pei <liquan...@gmail.com> wrote:
>
>> Hi Steve,
>>
>> Did you try newAPIHadoopFile? That worked for us.
>>
>> Thanks,
>> Liquan
>>
>> On Tue, Sep 23, 2014 at 5:43 PM, Steve Lewis <lordjoe2...@gmail.com> wrote:
>>
>>> Well, I had one and tried that; my message tells what I found:
>>> 1) Spark only accepts org.apache.hadoop.mapred.InputFormat<K,V>, not
>>> org.apache.hadoop.mapreduce.InputFormat<K,V>.
>>> 2) Hadoop expects K and V to be Writables. I always use Text; Text is
>>> not Serializable and will not work with Spark. StringBuffer will work
>>> with Spark but not (as far as I know) with Hadoop.
>>> Telling me what the documentation SAYS is all well and good, but I just
>>> tried it and want to hear from people with real working examples.
>>>
>>> On Tue, Sep 23, 2014 at 5:29 PM, Liquan Pei <liquan...@gmail.com> wrote:
>>>
>>>> Hi Steve,
>>>>
>>>> Here is my understanding: as long as you implement InputFormat, you
>>>> should be able to use the hadoopFile API in SparkContext to create an
>>>> RDD. Suppose you have a custom InputFormat, CustomizedInputFormat<K, V>,
>>>> where K is the key type and V is the value type. Let sc denote the
>>>> SparkContext variable and path the path to a file readable by
>>>> CustomizedInputFormat; then
>>>>
>>>> val rdd: RDD[(K, V)] = sc.hadoopFile(path,
>>>>   classOf[CustomizedInputFormat], classOf[K], classOf[V])
>>>>
>>>> creates an RDD of (K, V) with CustomizedInputFormat.
>>>>
>>>> Hope this helps,
>>>> Liquan
>>>>
>>>> On Tue, Sep 23, 2014 at 5:13 PM, Steve Lewis <lordjoe2...@gmail.com> wrote:
>>>>
>>>>> When I experimented with an InputFormat I had used in Hadoop for a
>>>>> long time, I found:
>>>>> 1) It must extend org.apache.hadoop.mapred.FileInputFormat (the
>>>>> deprecated class), not
>>>>> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.
>>>>> 2) initialize needs to be called in the constructor.
>>>>> 3) The type (mine extends FileInputFormat<Text, Text>) must not be a
>>>>> Hadoop Writable; those are not serializable. Extending
>>>>> FileInputFormat<StringBuffer, StringBuffer> does work, but I don't
>>>>> think that is allowed in Hadoop.
>>>>>
>>>>> Are these statements correct? If so, it seems like most Hadoop
>>>>> InputFormats, certainly the custom ones I create, require serious
>>>>> modifications to work. Does anyone have samples of using a Hadoop
>>>>> InputFormat?
>>>>>
>>>>> Since I am working with problems where a directory with multiple
>>>>> files is processed, and some files are many gigabytes in size with
>>>>> multiline complex records, an input format is a requirement.
>>>>
>>>> --
>>>> Liquan Pei
>>>> Department of Physics
>>>> University of Massachusetts Amherst
>>>
>>> --
>>> Steven M. Lewis PhD
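The pattern that resolves the Text-vs-Serializable tension discussed above is to keep the InputFormat's Writable types (which Hadoop requires) and convert to plain Scala types immediately after the RDD is created, before anything shuffles, caches, or collects the data. A minimal sketch, assuming the stock new-API TextInputFormat (the method name and path are placeholders):

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
    import org.apache.spark.SparkContext

    // The InputFormat keeps its Writable key/value types, which Hadoop
    // requires; the map converts each pair to Serializable types so the
    // rest of the Spark job never sees a Writable.
    def readLines(sc: SparkContext, path: String) =
      sc.newAPIHadoopFile(path, classOf[TextInputFormat],
          classOf[LongWritable], classOf[Text])
        .map { case (offset, line) => (offset.get, line.toString) }

Converting right away has a second benefit: Hadoop record readers reuse the same Writable instances across records, so copying the values out immediately also avoids seeing one repeated object if the RDD is cached.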