I tried newAPIHadoopFile and it works, except that my original InputFormat extends InputFormat<Text, Text> and has a RecordReader<Text, Text>. This throws a NotSerializableException on Text. Changing the type to InputFormat<StringBuffer, StringBuffer> works with minor code changes. I do not, however, believe that Hadoop can use an InputFormat whose types are not derived from Writable. What were you using, and was it able to work with Hadoop?
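For what it's worth, the workaround I am leaning toward is to keep Text in the InputFormat, so it stays legal for Hadoop, and convert to String in a map immediately after the read, so Spark never has to serialize Text. A minimal sketch, assuming Hadoop 2.x; the stock new-API KeyValueTextInputFormat stands in here for my custom Text/Text format:

    import org.apache.hadoop.io.Text
    import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat
    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    // Read with a Writable-typed InputFormat (legal for Hadoop), then copy
    // each Text into a String right away so downstream stages only ever see
    // serializable types. KeyValueTextInputFormat is a stand-in; any
    // FileInputFormat[Text, Text] should slot in the same way.
    def readAsStrings(sc: SparkContext, path: String): RDD[(String, String)] = {
      sc.newAPIHadoopFile(path,
          classOf[KeyValueTextInputFormat], classOf[Text], classOf[Text])
        .map { case (k, v) => (k.toString, v.toString) } // copies out of the reused Writables
    }

Since the conversion happens before any shuffle or cache, the non-serializable Text objects never leave the reading stage.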
On Tue, Sep 23, 2014 at 5:52 PM, Liquan Pei <liquan...@gmail.com> wrote:

> Hi Steve,
>
> Did you try the newAPIHadoopFile? That worked for us.
>
> Thanks,
> Liquan
>
> On Tue, Sep 23, 2014 at 5:43 PM, Steve Lewis <lordjoe2...@gmail.com> wrote:
>
>> Well, I had one and tried that - my message tells what I found:
>> 1) Spark only accepts org.apache.hadoop.mapred.InputFormat<K,V>,
>> not org.apache.hadoop.mapreduce.InputFormat<K,V>.
>> 2) Hadoop expects K and V to be Writables - I always use Text. Text is
>> not Serializable and will not work with Spark - StringBuffer will work
>> with Spark but not (as far as I know) with Hadoop.
>> Telling me what the documentation SAYS is all well and good, but I just
>> tried it and want to hear from people with real working examples.
>>
>> On Tue, Sep 23, 2014 at 5:29 PM, Liquan Pei <liquan...@gmail.com> wrote:
>>
>>> Hi Steve,
>>>
>>> Here is my understanding: as long as you implement InputFormat, you
>>> should be able to use the hadoopFile API in SparkContext to create an
>>> RDD. Suppose you have a customized InputFormat, which we call
>>> CustomizedInputFormat<K, V>, where K is the key type and V is the value
>>> type. You can create an RDD with CustomizedInputFormat in the following
>>> way:
>>>
>>> Let sc denote the SparkContext variable and path denote the path to a
>>> file for CustomizedInputFormat; we use
>>>
>>> val rdd: RDD[(K, V)] = sc.hadoopFile(path,
>>>   classOf[CustomizedInputFormat], classOf[K], classOf[V])
>>>
>>> to create an RDD of (K, V) with CustomizedInputFormat.
>>>
>>> Hope this helps,
>>> Liquan
>>>
>>> On Tue, Sep 23, 2014 at 5:13 PM, Steve Lewis <lordjoe2...@gmail.com> wrote:
>>>
>>>> When I experimented with using an InputFormat I had used in Hadoop for
>>>> a long time, I found:
>>>> 1) It must extend org.apache.hadoop.mapred.FileInputFormat (the
>>>> deprecated class), not org.apache.hadoop.mapreduce.lib.input.FileInputFormat.
>>>> 2) initialize needs to be called in the constructor.
>>>> 3) The types - mine extends FileInputFormat<Text, Text> - must not be
>>>> Hadoop Writables - those are not serializable - but extends
>>>> FileInputFormat<StringBuffer, StringBuffer> does work. I don't think
>>>> that is allowed in Hadoop.
>>>>
>>>> Are these statements correct? If so, it seems like most Hadoop
>>>> InputFormats - certainly the custom ones I create - require serious
>>>> modifications to work. Does anyone have samples of using a Hadoop
>>>> InputFormat with Spark?
>>>>
>>>> Since I am working with problems where a directory with multiple files
>>>> is processed, and some files are many gigabytes in size with multiline
>>>> complex records, an input format is a requirement.

--
Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell)
Skype lordjoe_com
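P.S. Here is a concrete, compiling version of the hadoopFile call Liquan described above, assuming tab-separated key/value text so the stock old-API (mapred) KeyValueTextInputFormat applies; a custom InputFormat<Text, Text> would substitute in the same way:

    import org.apache.hadoop.io.Text
    import org.apache.hadoop.mapred.KeyValueTextInputFormat
    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    // Old-API (org.apache.hadoop.mapred) variant: same shape as the snippet
    // in the quoted thread, with the Text -> String copy appended so nothing
    // unserializable escapes the reading stage.
    def readKeyValues(sc: SparkContext, path: String): RDD[(String, String)] = {
      sc.hadoopFile(path,
          classOf[KeyValueTextInputFormat], classOf[Text], classOf[Text])
        .map { case (k, v) => (k.toString, v.toString) }
    }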