Re: Does anyone have experience with using Hadoop InputFormats?
I tried newAPIHadoopFile and it works, except that my original InputFormat extends InputFormat&lt;Text, Text&gt; and has a RecordReader&lt;Text, Text&gt;. That throws a NotSerializableException on Text. Changing the type to InputFormat&lt;StringBuffer, StringBuffer&gt; works with minor code changes. I do not, however, believe that Hadoop could use an InputFormat whose types are not derived from Writable. What were you using, and was it able to work with Hadoop?

On Tue, Sep 23, 2014 at 5:52 PM, Liquan Pei liquan...@gmail.com wrote:

Hi Steve,

Did you try newAPIHadoopFile? That worked for us.

Thanks,
Liquan

On Tue, Sep 23, 2014 at 5:43 PM, Steve Lewis lordjoe2...@gmail.com wrote:

Well, I had one and tried that - my message tells what I found:
1) Spark only accepts org.apache.hadoop.mapred.InputFormat&lt;K, V&gt;, not org.apache.hadoop.mapreduce.InputFormat&lt;K, V&gt;.
2) Hadoop expects K and V to be Writables. I always use Text, but Text is not Serializable and will not work with Spark. StringBuffer will work with Spark but not (as far as I know) with Hadoop.
Telling me what the documentation SAYS is all well and good, but I just tried it and want to hear from people with real working examples.

On Tue, Sep 23, 2014 at 5:29 PM, Liquan Pei liquan...@gmail.com wrote:

Hi Steve,

Here is my understanding: as long as you implement InputFormat, you should be able to use the hadoopFile API in SparkContext to create an RDD. Suppose you have a customized InputFormat, which we call CustomizedInputFormat&lt;K, V&gt;, where K is the key type and V is the value type. Let sc denote the SparkContext variable and path the path to a file readable by CustomizedInputFormat. Then

    val rdd: RDD[(K, V)] = sc.hadoopFile(path, classOf[CustomizedInputFormat[K, V]], classOf[K], classOf[V])

creates an RDD of (K, V) with CustomizedInputFormat.

Hope this helps,
Liquan

On Tue, Sep 23, 2014 at 5:13 PM, Steve Lewis lordjoe2...@gmail.com wrote: ...

--
Liquan Pei
Department of Physics
University of Massachusetts Amherst

--
Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell) Skype lordjoe_com
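[Editor's note: to make the working path concrete, here is a minimal Java sketch of the call that succeeded. It assumes the MGFInputFormat class posted later in this thread; the input path is a placeholder, and local mode mirrors Steve's guaranteeSparkMaster.]

    import org.apache.hadoop.conf.Configuration;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class MGFReadSketch {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("MGFRead").setMaster("local");
            JavaSparkContext sc = new JavaSparkContext(conf);
            // newAPIHadoopFile is the entry point for org.apache.hadoop.mapreduce formats
            JavaPairRDD<StringBuffer, StringBuffer> spectra =
                sc.newAPIHadoopFile("/path/to/mgf/dir",   // placeholder path
                                    com.lordjoe.distributed.input.MGFInputFormat.class,
                                    StringBuffer.class, StringBuffer.class,
                                    new Configuration());
            System.out.println("records: " + spectra.count());
            sc.stop();
        }
    }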
Re: Does anyone have experience with using Hadoop InputFormats?
I use newAPIHadoopRDD with AccumuloInputFormat. It produces a PairRDD using Accumulo's Key and Value classes, both of which extend Writable. Works like a charm. I use the same InputFormat for all my MR jobs.

-Russ

On Wed, Sep 24, 2014 at 9:33 AM, Steve Lewis lordjoe2...@gmail.com wrote: ...
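[Editor's note: a minimal Java sketch of the pattern Russ describes. The instance, ZooKeeper, credential, and table setup are deliberately left as a comment because AccumuloInputFormat's static setters vary by Accumulo version.]

    import org.apache.accumulo.core.client.mapreduce.AccumuloInputFormat;
    import org.apache.accumulo.core.data.Key;
    import org.apache.accumulo.core.data.Value;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class AccumuloRead {
        public static JavaPairRDD<Key, Value> rows(JavaSparkContext sc) {
            Configuration conf = new Configuration();
            // Configure instance name, zookeepers, credentials and the input table
            // here using AccumuloInputFormat's static setters for your Accumulo version.
            return sc.newAPIHadoopRDD(conf, AccumuloInputFormat.class, Key.class, Value.class);
        }
    }

Note that Key and Value extend Writable but not Serializable, so - as the next messages discuss - they need converting before anything that serializes records.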
Re: Does anyone have experience with using Hadoop InputFormats?
Do your custom Writable classes implement Serializable? I think that is the only real issue - my code uses vanilla Text
Re: Does anyone have experience with using Hadoop InputFormats?
No, they do not implement Serializable. There are a couple of places where I've had to do a Text-to-String conversion, but generally it hasn't been a problem.

-Russ

On Wed, Sep 24, 2014 at 10:27 AM, Steve Lewis lordjoe2...@gmail.com wrote:

Do your custom Writable classes implement Serializable? I think that is the only real issue - my code uses vanilla Text
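[Editor's note: a common way to do the conversion Russ describes - hypothetical method name, but the calls are the standard Spark Java API with Java 8 lambdas - is to map the Writable pairs to plain strings immediately after reading, so nothing non-serializable survives into shuffles or collects.]

    import org.apache.hadoop.io.Text;
    import org.apache.spark.api.java.JavaPairRDD;
    import scala.Tuple2;

    // pairs comes straight from newAPIHadoopFile / newAPIHadoopRDD;
    // copy each Text into a String before any operation that serializes records
    public static JavaPairRDD<String, String> toStrings(JavaPairRDD<Text, Text> pairs) {
        return pairs.mapToPair(t -> new Tuple2<>(t._1().toString(), t._2().toString()));
    }

This also sidesteps the documented caveat that Hadoop RecordReaders may reuse the same Writable instance for every record, so copying the contents out early is the safe pattern anyway.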
Re: Does anyone have experience with using Hadoop InputFormats?
Hmmm - I have only tested in local mode, but I got:

    java.io.NotSerializableException: org.apache.hadoop.io.Text
        at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1180)
        at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1528)
        at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1493)

Here are two classes - one will work and one will not. The mgf file is what they read, showPairRDD simply prints the text read, and guaranteeSparkMaster calls sparkConf.setMaster("local") if there is no master defined. Perhaps I need to convert Text somewhere else, but I certainly don't see where.

    package com.lordjoe.distributed.input;

    /**
     * com.lordjoe.distributed.input.MGFInputFormat
     * User: Steve
     * Date: 9/24/2014
     */

    import org.apache.hadoop.conf.*;
    import org.apache.hadoop.fs.*;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.io.compress.*;
    import org.apache.hadoop.mapreduce.*;
    import org.apache.hadoop.mapreduce.lib.input.*;
    import org.apache.hadoop.util.*;

    import java.io.*;

    /**
     * org.systemsbiology.hadoop.MGFInputFormat
     * Splitter that reads mgf files,
     * nice enough to put the begin and end tags on separate lines
     */
    public class MGFInputFormat extends FileInputFormat<StringBuffer, StringBuffer> implements Serializable {

        private String m_Extension = "mgf";

        public MGFInputFormat() {
        }

        @SuppressWarnings("UnusedDeclaration")
        public String getExtension() {
            return m_Extension;
        }

        @SuppressWarnings("UnusedDeclaration")
        public void setExtension(final String pExtension) {
            m_Extension = pExtension;
        }

        @Override
        public RecordReader<StringBuffer, StringBuffer> createRecordReader(InputSplit split, TaskAttemptContext context) {
            return new MGFFileReader();
        }

        @Override
        protected boolean isSplitable(JobContext context, Path file) {
            final String lcName = file.getName().toLowerCase();
            //noinspection RedundantIfStatement
            if (lcName.endsWith("gz"))
                return false;
            return true;
        }

        /**
         * Custom RecordReader which returns the entire file as a
         * single value with the name as a key
         * Value is the entire file
         * Key is the file name
         */
        public class MGFFileReader extends RecordReader<StringBuffer, StringBuffer> implements Serializable {

            private CompressionCodecFactory compressionCodecs = null;
            private long m_Start;
            private long m_End;
            private long current;
            private LineReader m_Input;
            FSDataInputStream m_RealFile;
            private StringBuffer key = null;
            private StringBuffer value = null;
            private Text buffer; // must be

            public Text getBuffer() {
                if (buffer == null)
                    buffer = new Text();
                return buffer;
            }

            public void initialize(InputSplit genericSplit, TaskAttemptContext context) throws IOException {
                FileSplit split = (FileSplit) genericSplit;
                Configuration job = context.getConfiguration();
                m_Start = split.getStart();
                m_End = m_Start + split.getLength();
                final Path file = split.getPath();
                compressionCodecs = new CompressionCodecFactory(job);
                boolean skipFirstLine = false;
                final CompressionCodec codec = compressionCodecs.getCodec(file);

                // open the file and seek to the m_Start of the split
                FileSystem fs = file.getFileSystem(job);
                m_RealFile = fs.open(split.getPath());
                if (codec != null) {
                    CompressionInputStream inputStream = codec.createInputStream(m_RealFile);
                    m_Input = new LineReader(inputStream);
                    m_End = Long.MAX_VALUE;
                }
                else {
                    if (m_Start != 0) {
                        skipFirstLine = true;
                        --m_Start;
                        m_RealFile.seek(m_Start);
                    }
                    m_Input = new LineReader(m_RealFile);
                }
                // not at the beginning so go to first line
                if (skipFirstLine) {
                    // skip first line and re-establish m_Start
                    m_Start += m_Input.readLine(getBuffer());
                }
                current = m_Start;
                if (key == null) {
                    key = new StringBuffer();
                }
                else {
                    key.setLength(0);
                }
                key.append(split.getPath().getName());
                if (value == null) {
                    value = new StringBuffer();
                }
                current = 0;
            }

            /**
             * look for a scan tag then read until it closes
             *
             * @return true if there is data
             * @throws java.io.IOException
             */
            public boolean nextKeyValue() throws
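[Editor's note: the archived message breaks off mid-signature above. Purely as a hedged sketch - not Steve's actual code - here is one way the remaining RecordReader methods could be completed, assuming MGF records are delimited by BEGIN IONS / END IONS lines as the class comments suggest.]

            public boolean nextKeyValue() throws IOException {
                value.setLength(0);
                Text line = getBuffer();
                boolean inScan = false;
                int bytesRead;
                // read lines until one complete BEGIN IONS ... END IONS block is captured
                while ((bytesRead = m_Input.readLine(line)) > 0) {
                    current += bytesRead;
                    String text = line.toString().trim();
                    if (!inScan && "BEGIN IONS".equals(text))
                        inScan = true;
                    if (inScan) {
                        value.append(text);
                        value.append('\n');
                    }
                    if (inScan && "END IONS".equals(text))
                        return true; // one complete scan captured
                }
                return false; // end of input, no further scan
            }

            @Override
            public StringBuffer getCurrentKey() {
                return key;
            }

            @Override
            public StringBuffer getCurrentValue() {
                return value;
            }

            @Override
            public float getProgress() {
                if (m_End == m_Start)
                    return 0.0f;
                return Math.min(1.0f, (current - m_Start) / (float) (m_End - m_Start));
            }

            @Override
            public void close() throws IOException {
                if (m_Input != null)
                    m_Input.close();
            }
        }
    }

Returning StringBuffer rather than Text is what keeps Spark's Java serializer happy here, at the cost of the format no longer being usable as a conventional Hadoop InputFormat.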
Does anyone have experience with using Hadoop InputFormats?
When I experimented with an InputFormat I had used in Hadoop for a long time, I found:
1) it must extend org.apache.hadoop.mapred.FileInputFormat (the deprecated class), not org.apache.hadoop.mapreduce.lib.input.FileInputFormat
2) initialize needs to be called in the constructor
3) The type - mine was extends FileInputFormat&lt;Text, Text&gt; - must not be a Hadoop Writable, since those are not Serializable, but extends FileInputFormat&lt;StringBuffer, StringBuffer&gt; does work. I don't think this is allowed in Hadoop.
Are these statements correct? If so, it seems like most Hadoop InputFormats - certainly the custom ones I create - require serious modifications to work. Does anyone have samples of using a Hadoop InputFormat? Since I am working with problems where a directory with multiple files is processed, and some files are many gigabytes in size with multiline complex records, an input format is a requirement.
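[Editor's note: on point 1, the replies above show that both Hadoop APIs are reachable from Spark, just through different entry points - hadoopFile for org.apache.hadoop.mapred formats and newAPIHadoopFile for org.apache.hadoop.mapreduce formats. A minimal Java sketch with placeholder paths; the stock TextInputFormat stands in for a custom format.]

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class TwoApis {
        public static void demo(JavaSparkContext sc) {
            // old (mapred) API entry point
            JavaPairRDD<LongWritable, Text> oldApi =
                sc.hadoopFile("/path/to/dir",                   // placeholder
                              org.apache.hadoop.mapred.TextInputFormat.class,
                              LongWritable.class, Text.class);

            // new (mapreduce) API entry point
            JavaPairRDD<LongWritable, Text> newApi =
                sc.newAPIHadoopFile("/path/to/dir",             // placeholder
                                    org.apache.hadoop.mapreduce.lib.input.TextInputFormat.class,
                                    LongWritable.class, Text.class, new Configuration());
        }
    }

Either way, the Writable caveat discussed in this thread still applies: convert Text keys and values to Strings before anything that serializes records.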