I tried newAPIHadoopFile and it works, except that my original InputFormat extends InputFormat<Text, Text> and has a RecordReader<Text, Text>. This throws a NotSerializableException on Text. Changing the type to InputFormat<StringBuffer, StringBuffer> works with minor code changes. I do not, however, believe that Hadoop can use an InputFormat whose types are not derived from Writable. What were you using, and was it able to work with Hadoop?
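For what it's worth, the workaround I am leaning toward is to keep Text in the InputFormat, so it stays legal for Hadoop, and convert to String in a map immediately after the read, so Spark never has to serialize Text. A minimal sketch, assuming Hadoop 2.x; the stock new-API KeyValueTextInputFormat stands in here for my custom Text/Text format:

    import org.apache.hadoop.io.Text
    import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat
    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    // Read with a Writable-typed InputFormat (legal for Hadoop), then copy
    // each Text into a String right away so downstream stages only ever see
    // serializable types. KeyValueTextInputFormat is a stand-in; any
    // FileInputFormat[Text, Text] should slot in the same way.
    def readAsStrings(sc: SparkContext, path: String): RDD[(String, String)] = {
      sc.newAPIHadoopFile(path,
          classOf[KeyValueTextInputFormat], classOf[Text], classOf[Text])
        .map { case (k, v) => (k.toString, v.toString) } // copies out of the reused Writables
    }

Since the conversion happens before any shuffle or cache, the non-serializable Text objects never leave the reading stage.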
On Tue, Sep 23, 2014 at 5:52 PM, Liquan Pei <liquan...@gmail.com> wrote:

> Hi Steve,
>
> Did you try the newAPIHadoopFile? That worked for us.
>
> Thanks,
> Liquan
>
> On Tue, Sep 23, 2014 at 5:43 PM, Steve Lewis <lordjoe2...@gmail.com> wrote:
>
>> Well, I had one and tried that - my message tells what I found:
>> 1) Spark only accepts org.apache.hadoop.mapred.InputFormat<K,V>,
>> not org.apache.hadoop.mapreduce.InputFormat<K,V>.
>> 2) Hadoop expects K and V to be Writables - I always use Text. Text is
>> not Serializable and will not work with Spark - StringBuffer will work
>> with Spark but not (as far as I know) with Hadoop.
>> Telling me what the documentation SAYS is all well and good, but I just
>> tried it and want to hear from people with real working examples.
>>
>> On Tue, Sep 23, 2014 at 5:29 PM, Liquan Pei <liquan...@gmail.com> wrote:
>>
>>> Hi Steve,
>>>
>>> Here is my understanding: as long as you implement InputFormat, you
>>> should be able to use the hadoopFile API in SparkContext to create an
>>> RDD. Suppose you have a customized InputFormat, which we call
>>> CustomizedInputFormat<K, V>, where K is the key type and V is the value
>>> type. You can create an RDD with CustomizedInputFormat in the following
>>> way:
>>>
>>> Let sc denote the SparkContext variable and path denote the path to a
>>> file for CustomizedInputFormat; we use
>>>
>>> val rdd: RDD[(K, V)] = sc.hadoopFile(path,
>>>   classOf[CustomizedInputFormat], classOf[K], classOf[V])
>>>
>>> to create an RDD of (K, V) with CustomizedInputFormat.
>>>
>>> Hope this helps,
>>> Liquan
>>>
>>> On Tue, Sep 23, 2014 at 5:13 PM, Steve Lewis <lordjoe2...@gmail.com> wrote:
>>>
>>>> When I experimented with using an InputFormat I had used in Hadoop for
>>>> a long time, I found:
>>>> 1) It must extend org.apache.hadoop.mapred.FileInputFormat (the
>>>> deprecated class), not org.apache.hadoop.mapreduce.lib.input.FileInputFormat.
>>>> 2) initialize needs to be called in the constructor.
>>>> 3) The types - mine extends FileInputFormat<Text, Text> - must not be
>>>> Hadoop Writables - those are not serializable - but extends
>>>> FileInputFormat<StringBuffer, StringBuffer> does work. I don't think
>>>> that is allowed in Hadoop.
>>>>
>>>> Are these statements correct? If so, it seems like most Hadoop
>>>> InputFormats - certainly the custom ones I create - require serious
>>>> modifications to work. Does anyone have samples of using a Hadoop
>>>> InputFormat with Spark?
>>>>
>>>> Since I am working with problems where a directory with multiple files
>>>> is processed, and some files are many gigabytes in size with multiline
>>>> complex records, an input format is a requirement.

--
Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell)
Skype lordjoe_com
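P.S. Here is a concrete, compiling version of the hadoopFile call Liquan described above, assuming tab-separated key/value text so the stock old-API (mapred) KeyValueTextInputFormat applies; a custom InputFormat<Text, Text> would substitute in the same way:

    import org.apache.hadoop.io.Text
    import org.apache.hadoop.mapred.KeyValueTextInputFormat
    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    // Old-API (org.apache.hadoop.mapred) variant: same shape as the snippet
    // in the quoted thread, with the Text -> String copy appended so nothing
    // unserializable escapes the reading stage.
    def readKeyValues(sc: SparkContext, path: String): RDD[(String, String)] = {
      sc.hadoopFile(path,
          classOf[KeyValueTextInputFormat], classOf[Text], classOf[Text])
        .map { case (k, v) => (k.toString, v.toString) }
    }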