Hi Steve,
Did you try newAPIHadoopFile? That worked for us.

Thanks,
Liquan

On Tue, Sep 23, 2014 at 5:43 PM, Steve Lewis <lordjoe2...@gmail.com> wrote:

> Well, I had one and tried that - my message tells what I found:
> 1) Spark only accepts org.apache.hadoop.mapred.InputFormat<K,V>,
> not org.apache.hadoop.mapreduce.InputFormat<K,V>.
> 2) Hadoop expects K and V to be Writables. I always use Text - Text is
> not Serializable and will not work with Spark - StringBuffer will work
> with Spark but not (as far as I know) with Hadoop.
> Telling me what the documentation SAYS is all well and good, but I just
> tried it and want to hear from people with real working examples.
>
> On Tue, Sep 23, 2014 at 5:29 PM, Liquan Pei <liquan...@gmail.com> wrote:
>
>> Hi Steve,
>>
>> Here is my understanding: as long as you implement InputFormat, you
>> should be able to use the hadoopFile API in SparkContext to create an
>> RDD. Suppose you have a customized InputFormat, which we call
>> CustomizedInputFormat<K, V>, where K is the key type and V is the value
>> type. You can create an RDD with CustomizedInputFormat in the following
>> way:
>>
>> Let sc denote the SparkContext variable and path denote the path to a
>> file readable by CustomizedInputFormat. We use
>>
>> val rdd: RDD[(K, V)] = sc.hadoopFile(path,
>>   classOf[CustomizedInputFormat], classOf[K], classOf[V])
>>
>> to create an RDD of (K, V) with CustomizedInputFormat.
>>
>> Hope this helps,
>> Liquan
>>
>> On Tue, Sep 23, 2014 at 5:13 PM, Steve Lewis <lordjoe2...@gmail.com>
>> wrote:
>>
>>> When I experimented with an InputFormat I had used in Hadoop for a
>>> long time, I found:
>>> 1) It must extend org.apache.hadoop.mapred.FileInputFormat (the
>>> deprecated class), not
>>> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.
>>> 2) initialize needs to be called in the constructor.
>>> 3) The key and value types - mine extended FileInputFormat<Text, Text>
>>> - must not be Hadoop Writables, since those are not serializable, but
>>> extending FileInputFormat<StringBuffer, StringBuffer> does work. I
>>> don't think that is allowed in Hadoop.
>>>
>>> Are these statements correct? If so, it seems like most Hadoop
>>> InputFormats - certainly the custom ones I create - require serious
>>> modifications to work. Does anyone have samples of using a Hadoop
>>> InputFormat?
>>>
>>> Since I am working with problems where a directory with multiple files
>>> is processed, and some files are many gigabytes in size with multiline
>>> complex records, an input format is a requirement.
>>>
>>
>> --
>> Liquan Pei
>> Department of Physics
>> University of Massachusetts Amherst
>
>
> --
> Steven M. Lewis PhD
> 4221 105th Ave NE
> Kirkland, WA 98033
> 206-384-1340 (cell)
> Skype lordjoe_com

--
Liquan Pei
Department of Physics
University of Massachusetts Amherst
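[Editor's sketch] The two issues Steve raises - sc.hadoopFile wanting the old org.apache.hadoop.mapred API, and Writable keys/values not being java.io.Serializable - can both be handled without modifying the InputFormat. A minimal sketch, using the stock old-API TextInputFormat as a stand-in for a custom format (the input path is a placeholder), maps the Writables to plain types immediately after reading, before any shuffle or collect can try to serialize them:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat

object OldApiSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("old-api-sketch").setMaster("local[*]"))

    // sc.hadoopFile takes an old-API (org.apache.hadoop.mapred) InputFormat.
    // LongWritable and Text are Writables, not java.io.Serializable, so
    // convert them to plain Long/String in the very first map.
    val lines = sc
      .hadoopFile("/path/to/input", classOf[TextInputFormat],
        classOf[LongWritable], classOf[Text])
      .map { case (offset, line) => (offset.get(), line.toString) }

    lines.take(5).foreach(println)
    sc.stop()
  }
}
```

The same pattern applies to a custom FileInputFormat<Text, Text>: keep the Hadoop-standard Writable types in the format itself and do the conversion in Spark, rather than switching the format to non-Writable types like StringBuffer.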
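[Editor's sketch] For an InputFormat written against the newer org.apache.hadoop.mapreduce classes, Liquan's newAPIHadoopFile suggestion avoids porting it to the deprecated API. A sketch under the same assumptions as above (stock new-API TextInputFormat standing in for a custom format, placeholder path):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

object NewApiSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("new-api-sketch").setMaster("local[*]"))

    // sc.newAPIHadoopFile accepts a new-API (org.apache.hadoop.mapreduce)
    // InputFormat, so the non-deprecated base class works as-is.
    // The Writable-to-plain-type conversion is still needed.
    val lines = sc
      .newAPIHadoopFile("/path/to/input", classOf[TextInputFormat],
        classOf[LongWritable], classOf[Text])
      .map { case (offset, line) => (offset.get(), line.toString) }

    lines.take(5).foreach(println)
    sc.stop()
  }
}
```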