I use newAPIHadoopRDD with AccumuloInputFormat. It produces a PairRDD using Accumulo's Key and Value classes, both of which extend Writable. Works like a charm. I use the same InputFormat for all my MR jobs.
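In outline it looks roughly like this (a minimal sketch; it assumes sc is an existing SparkContext and that conf already carries the Accumulo connection settings, i.e. instance, zookeepers, table, and credentials, set through AccumuloInputFormat's configuration methods, which are omitted here):

    import org.apache.accumulo.core.client.mapreduce.AccumuloInputFormat
    import org.apache.accumulo.core.data.{Key, Value}
    import org.apache.hadoop.conf.Configuration
    import org.apache.spark.SparkContext

    // conf must already carry the Accumulo connection settings.
    // Key and Value extend Writable, so convert them to plain types
    // before any shuffle, cache, or collect (see the conversion
    // sketch at the end of this thread).
    def accumuloRdd(sc: SparkContext, conf: Configuration) =
      sc.newAPIHadoopRDD(conf, classOf[AccumuloInputFormat],
        classOf[Key], classOf[Value])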
-Russ

On Wed, Sep 24, 2014 at 9:33 AM, Steve Lewis <lordjoe2...@gmail.com> wrote:

> I tried newAPIHadoopFile and it works, except that my original InputFormat
> extends InputFormat<Text, Text> and has a RecordReader<Text, Text>. This
> throws a NotSerializableException on Text; changing the type to
> InputFormat<StringBuffer, StringBuffer> works with minor code changes.
> I do not, however, believe that Hadoop can use an InputFormat with types
> not derived from Writable.
> What were you using, and was it able to work with Hadoop?
>
> On Tue, Sep 23, 2014 at 5:52 PM, Liquan Pei <liquan...@gmail.com> wrote:
>
>> Hi Steve,
>>
>> Did you try newAPIHadoopFile? That worked for us.
>>
>> Thanks,
>> Liquan
>>
>> On Tue, Sep 23, 2014 at 5:43 PM, Steve Lewis <lordjoe2...@gmail.com> wrote:
>>
>>> Well, I had one and tried that; my message tells what I found:
>>> 1) Spark only accepts org.apache.hadoop.mapred.InputFormat<K,V>, not
>>> org.apache.hadoop.mapreduce.InputFormat<K,V>.
>>> 2) Hadoop expects K and V to be Writables. I always use Text; Text is
>>> not Serializable and will not work with Spark. StringBuffer will work
>>> with Spark but not (as far as I know) with Hadoop.
>>> Telling me what the documentation SAYS is all well and good, but I just
>>> tried it and want to hear from people with real working examples.
>>>
>>> On Tue, Sep 23, 2014 at 5:29 PM, Liquan Pei <liquan...@gmail.com> wrote:
>>>
>>>> Hi Steve,
>>>>
>>>> Here is my understanding: as long as you implement InputFormat, you
>>>> should be able to use the hadoopFile API in SparkContext to create an
>>>> RDD. Suppose you have a custom InputFormat, CustomizedInputFormat<K, V>,
>>>> where K is the key type and V is the value type. Let sc denote the
>>>> SparkContext variable and path the path to a file readable by
>>>> CustomizedInputFormat; then
>>>>
>>>> val rdd: RDD[(K, V)] = sc.hadoopFile(path,
>>>>   classOf[CustomizedInputFormat], classOf[K], classOf[V])
>>>>
>>>> creates an RDD of (K, V) with CustomizedInputFormat.
>>>>
>>>> Hope this helps,
>>>> Liquan
>>>>
>>>> On Tue, Sep 23, 2014 at 5:13 PM, Steve Lewis <lordjoe2...@gmail.com> wrote:
>>>>
>>>>> When I experimented with an InputFormat I had used in Hadoop for a
>>>>> long time, I found:
>>>>> 1) It must extend org.apache.hadoop.mapred.FileInputFormat (the
>>>>> deprecated class), not
>>>>> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.
>>>>> 2) initialize needs to be called in the constructor.
>>>>> 3) The type (mine extends FileInputFormat<Text, Text>) must not be a
>>>>> Hadoop Writable; those are not serializable. Extending
>>>>> FileInputFormat<StringBuffer, StringBuffer> does work, but I don't
>>>>> think that is allowed in Hadoop.
>>>>>
>>>>> Are these statements correct? If so, it seems like most Hadoop
>>>>> InputFormats, certainly the custom ones I create, require serious
>>>>> modifications to work. Does anyone have samples of using a Hadoop
>>>>> InputFormat?
>>>>>
>>>>> Since I am working with problems where a directory with multiple
>>>>> files is processed, and some files are many gigabytes in size with
>>>>> multiline complex records, an input format is a requirement.
>>>>
>>>> --
>>>> Liquan Pei
>>>> Department of Physics
>>>> University of Massachusetts Amherst
>>>
>>> --
>>> Steven M. Lewis PhD
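The pattern that resolves the Text-vs-Serializable tension discussed above is to keep the InputFormat's Writable types (which Hadoop requires) and convert to plain Scala types immediately after the RDD is created, before anything shuffles, caches, or collects the data. A minimal sketch, assuming the stock new-API TextInputFormat (the method name and path are placeholders):

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
    import org.apache.spark.SparkContext

    // The InputFormat keeps its Writable key/value types, which Hadoop
    // requires; the map converts each pair to Serializable types so the
    // rest of the Spark job never sees a Writable.
    def readLines(sc: SparkContext, path: String) =
      sc.newAPIHadoopFile(path, classOf[TextInputFormat],
          classOf[LongWritable], classOf[Text])
        .map { case (offset, line) => (offset.get, line.toString) }

Converting right away has a second benefit: Hadoop record readers reuse the same Writable instances across records, so copying the values out immediately also avoids seeing one repeated object if the RDD is cached.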