Re: Does anyone have experience with using Hadoop InputFormats?

2015-08-01 Thread Antsy.Rao


Sent from my iPad

On 2014-9-24, at 8:13 AM, Steve Lewis lordjoe2...@gmail.com wrote:

 When I experimented with using an InputFormat I had used in Hadoop for a
 long time, I found:
 1) it must extend org.apache.hadoop.mapred.FileInputFormat (the deprecated
 class), not org.apache.hadoop.mapreduce.lib.input.FileInputFormat
 2) initialize needs to be called in the constructor
 3) The type parameters - mine was extends FileInputFormat<Text, Text> - must not be
 Hadoop Writables - those are not serializable - but extends
 FileInputFormat<StringBuffer, StringBuffer> does work. I don't think this is
 allowed in Hadoop.
 
 Are these statements correct? If so, it seems like most Hadoop InputFormats
 - certainly the custom ones I create - require serious modifications to work.
 Does anyone have samples of using a Hadoop InputFormat with Spark?
 
 Since I am working with problems where a directory with multiple files is
 processed, and some of the files are many gigabytes in size with multi-line
 complex records, an input format is a requirement.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Does anyone have experience with using Hadoop InputFormats?

2014-09-24 Thread Steve Lewis
I tried newAPIHadoopFile and it works, except that my original InputFormat
extends InputFormat<Text, Text> and has a RecordReader<Text, Text>.
This throws a not-Serializable exception on Text - changing the types to
InputFormat<StringBuffer, StringBuffer> works with minor code changes.
I do not, however, believe that Hadoop can use an InputFormat with types
not derived from Writable.
What were you using, and was it able to work with Hadoop?
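
For reference, a minimal sketch of one way to keep Text out of Java serialization
entirely: read with the Writable types the format declares, then copy each record
to plain Strings before anything is shuffled or collected. The path is a
placeholder, and the stock KeyValueTextInputFormat stands in here for a custom
FileInputFormat<Text, Text>:

import org.apache.hadoop.io.Text
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat
import org.apache.spark.{SparkConf, SparkContext}

object TextToStringSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("text-to-string").setMaster("local"))

    // Read with the Writable key/value types the InputFormat declares...
    val raw = sc.newAPIHadoopFile(
      "data/input.txt",                  // placeholder path
      classOf[KeyValueTextInputFormat],  // stand-in for a custom FileInputFormat<Text, Text>
      classOf[Text],
      classOf[Text])

    // ...then copy to Strings while records are still task-local, so Text never
    // has to pass through Java serialization in a shuffle or collect.
    val strings = raw.map { case (k, v) => (k.toString, v.toString) }
    strings.take(5).foreach(println)
    sc.stop()
  }
}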

On Tue, Sep 23, 2014 at 5:52 PM, Liquan Pei liquan...@gmail.com wrote:

 Hi Steve,

 Did you try the newAPIHadoopFile? That worked for us.

 Thanks,
 Liquan

 On Tue, Sep 23, 2014 at 5:43 PM, Steve Lewis lordjoe2...@gmail.com
 wrote:

 Well I had one and tried that - my message tells what I found:
 1) Spark only accepts org.apache.hadoop.mapred.InputFormat<K,V>,
 not org.apache.hadoop.mapreduce.InputFormat<K,V>
 2) Hadoop expects K and V to be Writables - I always use Text - Text is
 not Serializable and will not work with Spark - StringBuffer will work with
 Spark but not (as far as I know) with Hadoop
 - Telling me what the documentation SAYS is all well and good, but I just
 tried it and want to hear from people with real working examples

 On Tue, Sep 23, 2014 at 5:29 PM, Liquan Pei liquan...@gmail.com wrote:

 Hi Steve,

 Here is my understanding: as long as you implement InputFormat, you
 should be able to use the hadoopFile API in SparkContext to create an RDD.
 Suppose you have a customized InputFormat which we call
 CustomizedInputFormat<K, V>, where K is the key type and V is the value
 type. You can create an RDD with CustomizedInputFormat in the following way:

 Let sc denote the SparkContext variable and path denote the path to a file
 for CustomizedInputFormat; we use

 val rdd: RDD[(K, V)] = sc.hadoopFile(path,
   classOf[CustomizedInputFormat[K, V]], classOf[K], classOf[V])

 to create an RDD of (K,V) with CustomizedInputFormat.

 Hope this helps,
 Liquan

 On Tue, Sep 23, 2014 at 5:13 PM, Steve Lewis lordjoe2...@gmail.com
 wrote:

 When I experimented with using an InputFormat I had used in Hadoop for a
 long time, I found:
 1) it must extend org.apache.hadoop.mapred.FileInputFormat (the deprecated
 class), not org.apache.hadoop.mapreduce.lib.input.FileInputFormat
 2) initialize needs to be called in the constructor
 3) The type parameters - mine was extends FileInputFormat<Text, Text> - must not be
 Hadoop Writables - those are not serializable - but extends
 FileInputFormat<StringBuffer, StringBuffer> does work. I don't think this is
 allowed in Hadoop.

 Are these statements correct? If so, it seems like most Hadoop InputFormats
 - certainly the custom ones I create - require serious modifications to work.
 Does anyone have samples of using a Hadoop InputFormat with Spark?

 Since I am working with problems where a directory with multiple files is
 processed, and some of the files are many gigabytes in size with multi-line
 complex records, an input format is a requirement.




 --
 Liquan Pei
 Department of Physics
 University of Massachusetts Amherst








-- 
Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell)
Skype lordjoe_com


Re: Does anyone have experience with using Hadoop InputFormats?

2014-09-24 Thread Russ Weeks
I use newAPIHadoopRDD with AccumuloInputFormat. It produces a PairRDD using
Accumulo's Key and Value classes, both of which extend Writable. Works like
a charm. I use the same InputFormat for all my MR jobs.

-Russ
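
A generic sketch of that newAPIHadoopRDD call, shown here with a stock Hadoop
TextInputFormat rather than AccumuloInputFormat (the Accumulo-specific job
configuration is omitted, and the input path is a placeholder):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.{SparkConf, SparkContext}

object NewApiRddSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("newAPIHadoopRDD").setMaster("local"))

    // With newAPIHadoopRDD the input path lives in the Hadoop Configuration
    // rather than being a method argument.
    val conf = new Configuration()
    conf.set("mapreduce.input.fileinputformat.inputdir", "data/input")  // placeholder path

    val rdd = sc.newAPIHadoopRDD(
      conf,
      classOf[TextInputFormat],  // any new-API InputFormat fits here, e.g. AccumuloInputFormat
      classOf[LongWritable],
      classOf[Text])

    // The Writable keys/values are fine while they stay inside a single task;
    // convert to plain types before shuffling or collecting.
    rdd.map { case (_, line) => line.toString }.take(3).foreach(println)
    sc.stop()
  }
}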





Re: Does anyone have experience with using Hadoop InputFormats?

2014-09-24 Thread Steve Lewis
Do your custom Writable classes implement Serializable? I think that is
the only real issue - my code uses vanilla Text.


Re: Does anyone have experience with using Hadoop InputFormats?

2014-09-24 Thread Russ Weeks
No, they do not implement Serializable. There are a couple of places where
I've had to do a Text-to-String conversion, but generally it hasn't been a
problem.
-Russ

On Wed, Sep 24, 2014 at 10:27 AM, Steve Lewis lordjoe2...@gmail.com wrote:

 Do your custom Writable classes implement Serializable? I think that is
 the only real issue - my code uses vanilla Text.




Re: Does anyone have experience with using Hadoop InputFormats?

2014-09-24 Thread Steve Lewis
Hmmm - I have only tested in local mode, but I got a
java.io.NotSerializableException: org.apache.hadoop.io.Text
 at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1180)
 at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1528)
 at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1493)
Here are two classes - one will work, one will not.
The mgf file is what they read.

showPairRDD simply prints the text read.
guaranteeSparkMaster calls sparkConf.setMaster("local") if there is no
master defined.

Perhaps I need to convert Text somewhere else, but I certainly don't see
where.
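
For reference, this is roughly how the class below gets driven from Spark; the
input path is a placeholder and this assumes MGFInputFormat compiles as posted:

import com.lordjoe.distributed.input.MGFInputFormat
import org.apache.spark.{SparkConf, SparkContext}

object MGFDriverSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("mgf").setMaster("local"))
    val spectra = sc.newAPIHadoopFile(
      "data/spectra.mgf",        // placeholder path
      classOf[MGFInputFormat],
      classOf[StringBuffer],
      classOf[StringBuffer])
    // StringBuffer is java.io.Serializable, so these pairs survive take/collect,
    // which is exactly where the Text-keyed variant blows up.
    spectra.take(2).foreach { case (name, body) => println(name + "\n" + body) }
    sc.stop()
  }
}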
package com.lordjoe.distributed.input;

/**
 * com.lordjoe.distributed.input.MGFInputFormat
 * User: Steve
 * Date: 9/24/2014
 */

import org.apache.hadoop.conf.*;
import org.apache.hadoop.fs.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.io.compress.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.*;
import org.apache.hadoop.util.*;

import java.io.*;

/**
 * org.systemsbiology.hadoop.MGFInputFormat
 * Splitter that reads mgf files
 * nice enough to put the begin and end tags on separate lines
 */
public class MGFInputFormat extends FileInputFormat<StringBuffer, StringBuffer> implements Serializable {

    private String m_Extension = "mgf";

    public MGFInputFormat() {
    }

    @SuppressWarnings("UnusedDeclaration")
    public String getExtension() {
        return m_Extension;
    }

    @SuppressWarnings("UnusedDeclaration")
    public void setExtension(final String pExtension) {
        m_Extension = pExtension;
    }

    @Override
    public RecordReader<StringBuffer, StringBuffer> createRecordReader(InputSplit split,
                                                                       TaskAttemptContext context) {
        return new MGFFileReader();
    }

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        final String lcName = file.getName().toLowerCase();
        //noinspection RedundantIfStatement
        if (lcName.endsWith("gz"))
            return false;
        return true;
    }

    /**
     * Custom RecordReader which returns the entire file as a
     * single value with the name as a key
     * Value is the entire file
     * Key is the file name
     */
    public class MGFFileReader extends RecordReader<StringBuffer, StringBuffer> implements Serializable {

        private CompressionCodecFactory compressionCodecs = null;
        private long m_Start;
        private long m_End;
        private long current;
        private LineReader m_Input;
        FSDataInputStream m_RealFile;
        private StringBuffer key = null;
        private StringBuffer value = null;
        private Text buffer; // must be Text - LineReader.readLine fills a Text

        public Text getBuffer() {
            if (buffer == null)
                buffer = new Text();
            return buffer;
        }

        public void initialize(InputSplit genericSplit,
                               TaskAttemptContext context) throws IOException {
            FileSplit split = (FileSplit) genericSplit;
            Configuration job = context.getConfiguration();
            m_Start = split.getStart();
            m_End = m_Start + split.getLength();
            final Path file = split.getPath();
            compressionCodecs = new CompressionCodecFactory(job);
            boolean skipFirstLine = false;
            final CompressionCodec codec = compressionCodecs.getCodec(file);

            // open the file and seek to the m_Start of the split
            FileSystem fs = file.getFileSystem(job);
            m_RealFile = fs.open(split.getPath());
            if (codec != null) {
                CompressionInputStream inputStream = codec.createInputStream(m_RealFile);
                m_Input = new LineReader(inputStream);
                m_End = Long.MAX_VALUE;
            }
            else {
                if (m_Start != 0) {
                    skipFirstLine = true;
                    --m_Start;
                    m_RealFile.seek(m_Start);
                }
                m_Input = new LineReader(m_RealFile);
            }
            // not at the beginning so go to the first line
            if (skipFirstLine) {  // skip first line and re-establish m_Start
                m_Start += m_Input.readLine(getBuffer());
            }

            current = m_Start;
            if (key == null) {
                key = new StringBuffer();
            }
            else {
                key.setLength(0);
            }
            key.append(split.getPath().getName());
            if (value == null) {
                value = new StringBuffer();
            }

            current = 0;
        }

        /**
         * look for a scan tag then read until it closes
         *
         * @return true if there is data
         * @throws java.io.IOException
         */
        public boolean nextKeyValue() throws 

Does anyone have experience with using Hadoop InputFormats?

2014-09-23 Thread Steve Lewis
When I experimented with using an InputFormat I had used in Hadoop for a
long time, I found:
1) it must extend org.apache.hadoop.mapred.FileInputFormat (the deprecated
class), not org.apache.hadoop.mapreduce.lib.input.FileInputFormat
2) initialize needs to be called in the constructor
3) The type parameters - mine was extends FileInputFormat<Text, Text> - must not be
Hadoop Writables - those are not serializable - but extends
FileInputFormat<StringBuffer, StringBuffer> does work. I don't think this
is allowed in Hadoop.

Are these statements correct? If so, it seems like most Hadoop
InputFormats - certainly the custom ones I create - require serious
modifications to work. Does anyone have samples of using a Hadoop
InputFormat with Spark?

Since I am working with problems where a directory with multiple files is
processed, and some of the files are many gigabytes in size with multi-line
complex records, an input format is a requirement.
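
On point 1: Spark exposes both Hadoop APIs, but through different SparkContext
methods, so which FileInputFormat parent is needed depends on the call used.
A minimal sketch with stock TextInputFormats and a placeholder path:

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.spark.{SparkConf, SparkContext}

object OldVsNewApiSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("api-demo").setMaster("local"))

    // The old "mapred" API goes through hadoopFile...
    val oldApi = sc.hadoopFile(
      "data/input",  // placeholder path
      classOf[org.apache.hadoop.mapred.TextInputFormat],
      classOf[LongWritable],
      classOf[Text])

    // ...while the new "mapreduce" API goes through newAPIHadoopFile.
    val newApi = sc.newAPIHadoopFile(
      "data/input",
      classOf[org.apache.hadoop.mapreduce.lib.input.TextInputFormat],
      classOf[LongWritable],
      classOf[Text])

    println(s"${oldApi.count()} / ${newApi.count()}")
    sc.stop()
  }
}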