Hi Steve,
Did you try newAPIHadoopFile? That worked for us.

Thanks,
Liquan

On Tue, Sep 23, 2014 at 5:43 PM, Steve Lewis <lordjoe2...@gmail.com> wrote:

> Well, I had one and tried that - my message tells what I found:
> 1) Spark only accepts org.apache.hadoop.mapred.InputFormat<K,V>,
> not org.apache.hadoop.mapreduce.InputFormat<K,V>.
> 2) Hadoop expects K and V to be Writables. I always use Text - Text is
> not Serializable and will not work with Spark - StringBuffer will work
> with Spark but not (as far as I know) with Hadoop.
> Telling me what the documentation SAYS is all well and good, but I just
> tried it and want to hear from people with real working examples.
>
> On Tue, Sep 23, 2014 at 5:29 PM, Liquan Pei <liquan...@gmail.com> wrote:
>
>> Hi Steve,
>>
>> Here is my understanding: as long as you implement InputFormat, you
>> should be able to use the hadoopFile API in SparkContext to create an
>> RDD. Suppose you have a customized InputFormat, which we call
>> CustomizedInputFormat<K, V>, where K is the key type and V is the value
>> type. You can create an RDD with CustomizedInputFormat in the following
>> way:
>>
>> Let sc denote the SparkContext variable and path denote the path to a
>> file readable by CustomizedInputFormat. We use
>>
>> val rdd: RDD[(K, V)] = sc.hadoopFile(path,
>>   classOf[CustomizedInputFormat], classOf[K], classOf[V])
>>
>> to create an RDD of (K, V) with CustomizedInputFormat.
>>
>> Hope this helps,
>> Liquan
>>
>> On Tue, Sep 23, 2014 at 5:13 PM, Steve Lewis <lordjoe2...@gmail.com>
>> wrote:
>>
>>> When I experimented with an InputFormat I had used in Hadoop for a
>>> long time, I found:
>>> 1) It must extend org.apache.hadoop.mapred.FileInputFormat (the
>>> deprecated class), not
>>> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.
>>> 2) initialize needs to be called in the constructor.
>>> 3) The key and value types - mine extended FileInputFormat<Text, Text>
>>> - must not be Hadoop Writables, since those are not serializable, but
>>> extending FileInputFormat<StringBuffer, StringBuffer> does work. I
>>> don't think that is allowed in Hadoop.
>>>
>>> Are these statements correct? If so, it seems like most Hadoop
>>> InputFormats - certainly the custom ones I create - require serious
>>> modifications to work. Does anyone have samples of using a Hadoop
>>> InputFormat?
>>>
>>> Since I am working with problems where a directory with multiple files
>>> is processed, and some files are many gigabytes in size with multiline
>>> complex records, an input format is a requirement.
>>>
>>
>> --
>> Liquan Pei
>> Department of Physics
>> University of Massachusetts Amherst
>
>
> --
> Steven M. Lewis PhD
> 4221 105th Ave NE
> Kirkland, WA 98033
> 206-384-1340 (cell)
> Skype lordjoe_com

--
Liquan Pei
Department of Physics
University of Massachusetts Amherst
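[Editor's sketch] The two issues Steve raises - sc.hadoopFile wanting the old org.apache.hadoop.mapred API, and Writable keys/values not being java.io.Serializable - can both be handled without modifying the InputFormat. A minimal sketch, using the stock old-API TextInputFormat as a stand-in for a custom format (the input path is a placeholder), maps the Writables to plain types immediately after reading, before any shuffle or collect can try to serialize them:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat

object OldApiSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("old-api-sketch").setMaster("local[*]"))

    // sc.hadoopFile takes an old-API (org.apache.hadoop.mapred) InputFormat.
    // LongWritable and Text are Writables, not java.io.Serializable, so
    // convert them to plain Long/String in the very first map.
    val lines = sc
      .hadoopFile("/path/to/input", classOf[TextInputFormat],
        classOf[LongWritable], classOf[Text])
      .map { case (offset, line) => (offset.get(), line.toString) }

    lines.take(5).foreach(println)
    sc.stop()
  }
}
```

The same pattern applies to a custom FileInputFormat<Text, Text>: keep the Hadoop-standard Writable types in the format itself and do the conversion in Spark, rather than switching the format to non-Writable types like StringBuffer.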
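[Editor's sketch] For an InputFormat written against the newer org.apache.hadoop.mapreduce classes, Liquan's newAPIHadoopFile suggestion avoids porting it to the deprecated API. A sketch under the same assumptions as above (stock new-API TextInputFormat standing in for a custom format, placeholder path):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

object NewApiSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("new-api-sketch").setMaster("local[*]"))

    // sc.newAPIHadoopFile accepts a new-API (org.apache.hadoop.mapreduce)
    // InputFormat, so the non-deprecated base class works as-is.
    // The Writable-to-plain-type conversion is still needed.
    val lines = sc
      .newAPIHadoopFile("/path/to/input", classOf[TextInputFormat],
        classOf[LongWritable], classOf[Text])
      .map { case (offset, line) => (offset.get(), line.toString) }

    lines.take(5).foreach(println)
    sc.stop()
  }
}
```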