What is the best way to terminate a Map job without it being retried
I have a collection of dirty data files, which I can detect during the setup() phase of my map task. It would be best if I could quit the task and prevent it from being retried. What is the best practice for doing this? Thanks in advance.
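One hedged sketch of a way to do this (assuming the dirty check can run in setup(); looksDirty() below is a hypothetical placeholder for that check): since setup() runs before any records are processed, set a flag there and make map() a no-op. The task then completes successfully, so the framework never schedules a retry.

```java
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Sketch, not a drop-in class: skip dirty inputs instead of failing,
// so no retry is ever triggered.
public class SkippingMapper extends Mapper<LongWritable, Text, Text, Text> {
    private boolean skip = false;

    @Override
    protected void setup(Context context) throws IOException {
        Path file = ((FileSplit) context.getInputSplit()).getPath();
        skip = looksDirty(file);            // hypothetical dirty-file check
        if (skip) {
            // record the skip so it is visible in the job counters
            context.getCounter("audit", "dirty-files").increment(1);
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context) {
        if (skip) return;                   // consume records, emit nothing
        // ... normal processing ...
    }

    private boolean looksDirty(Path file) { return false; /* placeholder */ }
}
```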
Re: Parallel out-of-order Map capability?
How about batching up calls from n lines, making a synchronous call to the server with the batch, getting the batch results, and going through the result set one by one? This assumes the server returns batched results in order.
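The batching idea above can be sketched in plain Java; callServer here is a stand-in for the real synchronous batched RPC, which is assumed to return one result per input line, in order:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

public class LineBatcher {
    // Groups lines into batches of at most n, sends each batch through one
    // synchronous call, and flattens the per-batch results in order.
    public static List<String> processInBatches(
            List<String> lines, int n,
            Function<List<String>, List<String>> callServer) {
        List<String> results = new ArrayList<>();
        for (int i = 0; i < lines.size(); i += n) {
            List<String> batch = lines.subList(i, Math.min(i + n, lines.size()));
            results.addAll(callServer.apply(batch));  // one round trip per batch
        }
        return results;
    }
}
```

Larger n means fewer round trips but more latency before the first result; the right batch size depends on the server.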
Re: Access Error
This could be caused by different user accounts. Is the user "hadoop" when running the job on the master and "bhardy" on the remote client?
Re: How to config Map only job to read .gz input files and output result in .lzo
Thanks, Ed. It works like a charm.
How to config Map only job to read .gz input files and output result in .lzo
We have TBs worth of XML data in .gz format, where each file is about 20 MB. This dataset is not expected to change. My goal is to write a map-only job that reads in one .gz file at a time and outputs the result in .lzo format. Since there are a large number of .gz files, map parallelism is expected to be maximized. I am using Kevin Weil's LZO distribution, and there does not seem to be an LzoTextOutputFormat. When I got lzo to work before, I set the input format class to LzoTextInputFormat.class and the map output got lzo compressed automatically. What does one configure for LZO output? The current job configuration code listed below does not work. XmlInputFormat is my custom input format for reading XML files.

job.setInputFormatClass(XmlInputFormat.class);
job.setMapperClass(XmlAnalyzer.XmlAnalyzerMapper.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);

String mapredOutputCompress = conf.get("mapred.output.compress");
if ("true".equals(mapredOutputCompress))
    // this reads input and writes output in lzo format
    job.setInputFormatClass(LzoTextInputFormat.class);
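For anyone hitting the same question: in Kevin Weil's distribution there is indeed no LzoTextOutputFormat; output compression is configured on the output format instead. A minimal sketch (class names per hadoop-lzo; verify against your build):

```java
// Keep TextOutputFormat and tell it to compress with the lzop codec.
job.setOutputFormatClass(TextOutputFormat.class);
FileOutputFormat.setCompressOutput(job, true);
FileOutputFormat.setOutputCompressorClass(job,
    com.hadoop.compression.lzo.LzopCodec.class);
// Note: LzopCodec (not LzoCodec) writes .lzo files that lzop can read
// and that can later be indexed for splitting; LzoCodec writes
// .lzo_deflate files instead.
```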
Re: Best way to reduce an 8-node cluster in half and get hdfs to come out of safe mode
Thanks Allen for your advice.
Best way to reduce an 8-node cluster in half and get hdfs to come out of safe mode
As part of our experimentation, the plan is to pull 4 slave nodes out of an 8-slave/1-master cluster. With the replication factor set to 3, I thought losing half of the cluster might be too much for hdfs to recover from, so I copied all relevant data out of hdfs to local disk and reconfigured the cluster. The 4 slave nodes started okay, but hdfs never left safe mode. The nn.log has the following lines. What is the best way to deal with this? Shall I restart the cluster with 8 nodes and then delete /data/hadoop-hadoop/mapred/system? Or shall I reformat hdfs? Thanks.

2010-08-05 22:28:12,921 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: ugi=hadoop,hadoop ip=/10.128.135.100 cmd=listStatus src=/data/hadoop-hadoop/mapred/system dst=null perm=null
2010-08-05 22:28:12,923 INFO org.apache.hadoop.ipc.Server: IPC Server handler 0 on 9000, call delete(/data/hadoop-hadoop/mapred/system, true) from 10.128.135.100:52368: error: org.apache.hadoop.hdfs.server.namenode.SafeModeException: Cannot delete /data/hadoop-hadoop/mapred/system. Name node is in safe mode. The reported blocks 64 needs additional 3 blocks to reach the threshold 0.9990 of total blocks 68. Safe mode will be turned off automatically.
org.apache.hadoop.hdfs.server.namenode.SafeModeException: Cannot delete /data/hadoop-hadoop/mapred/system. Name node is in safe mode. The reported blocks 64 needs additional 3 blocks to reach the threshold 0.9990 of total blocks 68. Safe mode will be turned off automatically.
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.deleteInternal(FSNamesystem.java:1741)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.delete(FSNamesystem.java:1721)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.delete(NameNode.java:565)
        at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:512)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:968)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:964)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:962)
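If the four missing blocks only ever had replicas on the removed nodes, the namenode will wait in safe mode indefinitely. One way out (a sketch; forcing safe mode off accepts that those blocks are lost for good) is:

```shell
# Check safe mode status, then force the namenode out of safe mode:
hadoop dfsadmin -safemode get
hadoop dfsadmin -safemode leave

# List files with missing/corrupt blocks so they can be deleted or restored:
hadoop fsck / -files -blocks -locations
```

Since /data/hadoop-hadoop/mapred/system only holds transient jobtracker state, deleting it after leaving safe mode should be harmless.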
Re: Problem with large .lzo files
Splitting the problem file into two smaller ones allowed the process to succeed. This, I believe, points to a bug in LzoDecompressor.
Re: Problem with large .lzo files
Here is the ArrayIndexOutOfBoundsException from LzoDecompressor. I have 30 .lzo files of about 800 MB each. 16 of them were processed by the mapper successfully, with around 60 splits each. File part-r-00012.lzo failed at split 45, so it's clear the failure was not size related. I have copied out part-r-00012.lzo, split it in half, and lzo'ed the smaller files. I am now running the same set of input but with the smaller files instead to see if the process completes. Below is the abridged stack trace.

2010-02-21 11:15:34,718 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=MAP, sessionId=
2010-02-21 11:15:34,909 INFO org.apache.hadoop.mapred.MapTask: io.sort.mb = 100
2010-02-21 11:15:34,967 INFO org.apache.hadoop.mapred.MapTask: data buffer = 79691776/99614720
2010-02-21 11:15:34,967 INFO org.apache.hadoop.mapred.MapTask: record buffer = 262144/327680
2010-02-21 11:15:34,972 INFO com.hadoop.compression.lzo.GPLNativeCodeLoader: Loaded native gpl library
2010-02-21 11:15:34,977 INFO com.hadoop.compression.lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev fatal: Not a git repository (or any of the parent directories): .git]
2010-02-21 11:15:35,124 INFO com.foo.bar.mapreduce.baz.BazFaz: InputFile=hdfs://foo-ndbig03.lax1.foo.com:9000/user/hadoop/baz/faz_old/part-r-00012.lzo
2010-02-21 11:15:35,124 INFO com.foo.bar.mapreduce.baz.BazFaz: maxFazCount =1000
2010-02-21 11:15:38,438 INFO org.apache.hadoop.mapred.MapTask: Spilling map output: record full = true
2010-02-21 11:15:38,438 INFO org.apache.hadoop.mapred.MapTask: bufstart = 0; bufend = 18821149; bufvoid = 99614720
2010-02-21 11:15:38,438 INFO org.apache.hadoop.mapred.MapTask: kvstart = 0; kvend = 262144; length = 327680
2010-02-21 11:15:39,125 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new compressor
2010-02-21 11:15:39,738 INFO org.apache.hadoop.mapred.MapTask: Finished spill 0
2010-02-21 11:15:41,917 INFO org.apache.hadoop.mapred.MapTask: Spilling map output: record full = true
2010-02-21 11:15:41,917 INFO org.apache.hadoop.mapred.MapTask: bufstart = 18821149; bufend = 37722479; bufvoid = 99614720
2010-02-21 11:15:41,918 INFO org.apache.hadoop.mapred.MapTask: kvstart = 262144; kvend = 196607; length = 327680
2010-02-21 11:15:42,642 INFO org.apache.hadoop.mapred.MapTask: Finished spill 1
2010-02-21 11:15:44,564 INFO org.apache.hadoop.mapred.MapTask: Spilling map output: record full = true
2010-02-21 11:15:44,565 INFO org.apache.hadoop.mapred.MapTask: bufstart = 37722479; bufend = 56518058; bufvoid = 99614720
2010-02-21 11:15:44,565 INFO org.apache.hadoop.mapred.MapTask: kvstart = 196607; kvend = 131070; length = 327680
2010-02-21 11:15:45,485 INFO org.apache.hadoop.mapred.MapTask: Finished spill 2
...
2010-02-21 11:17:33,280 INFO org.apache.hadoop.mapred.MapTask: Spilling map output: record full = true
2010-02-21 11:17:33,280 INFO org.apache.hadoop.mapred.MapTask: bufstart = 32585253; bufend = 51221187; bufvoid = 99614720
2010-02-21 11:17:33,280 INFO org.apache.hadoop.mapred.MapTask: kvstart = 65501; kvend = 327645; length = 327680
2010-02-21 11:17:34,056 INFO org.apache.hadoop.mapred.MapTask: Finished spill 44
2010-02-21 11:17:35,914 INFO org.apache.hadoop.mapred.MapTask: Spilling map output: record full = true
2010-02-21 11:17:35,914 INFO org.apache.hadoop.mapred.MapTask: bufstart = 51221187; bufend = 70104763; bufvoid = 99614720
2010-02-21 11:17:35,914 INFO org.apache.hadoop.mapred.MapTask: kvstart = 327645; kvend = 262108; length = 327680
2010-02-21 11:17:36,731 INFO org.apache.hadoop.mapred.MapTask: Finished spill 45
2010-02-21 11:17:37,802 WARN org.apache.hadoop.mapred.TaskTracker: Error running child
java.lang.ArrayIndexOutOfBoundsException
        at com.hadoop.compression.lzo.LzoDecompressor.setInput(LzoDecompressor.java:200)
        at com.hadoop.compression.lzo.LzopDecompressor.setInput(LzopDecompressor.java:98)
        at com.hadoop.compression.lzo.LzopInputStream.getCompressedData(LzopInputStream.java:297)
        at com.hadoop.compression.lzo.LzopInputStream.decompress(LzopInputStream.java:232)
        at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:74)
        at java.io.InputStream.read(InputStream.java:85)
        at org.apache.hadoop.util.LineReader.readLine(LineReader.java:134)
        at org.apache.hadoop.util.LineReader.readLine(LineReader.java:187)
        at com.hadoop.mapreduce.LzoLineRecordReader.nextKeyValue(LzoLineRecordReader.java:126)
        at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:423)
        at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:583)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
        at org.apache.hadoop.mapred.Child.main(Child.java:170)
2010-02-21 11:17:37,809 INFO org.apache.hadoop.mapred.TaskRunner: Runnning cleanup for the task
Re: Problem with large .lzo files
> You should be able to downcast the InputSplit to FileSplit, if you're using the new API. From there you can get the start and length of the split.

Cool, let me give it a shot.

> Interesting. If you can somehow make a reproducible test case I'd be happy to look into this.

This sounds great. As the input file is 1G, let me do some work on my side to see if I can pinpoint it so as not to have to transfer a 1G file around. Thanks.
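For reference, the downcast suggested above might look like this under the new (o.a.h.mapreduce) API, placed wherever the task has a Context, e.g. in setup():

```java
// Log which byte range of which file this task is reading, so a failing
// split can be pinpointed.
FileSplit split = (FileSplit) context.getInputSplit();
System.err.println("file=" + split.getPath()
    + " start=" + split.getStart()
    + " length=" + split.getLength());
```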
Re: Problem with large .lzo files
On Sun, Feb 14, 2010 at 12:46 PM, Todd Lipcon wrote:
> Hi Steve,
>
> I'm not sure here whether you mean that the DistributedLzoIndexer job is failing, or if the job on the resulting split file is failing. Could you clarify?

The DistributedLzoIndexer job did complete successfully. It was one of the jobs on the resulting split file that always failed while the other splits succeeded. By the way, if all files have already been indexed, DistributedLzoIndexer does not detect that, and hadoop throws an exception complaining that the input dir (or file) does not exist. I work around this by catching the exception.

> > - It's possible to sacrifice parallelism by having hadoop work on each .lzo file without indexing. This worked well until the file size exceeded 30G, when an array indexing exception got thrown. Apparently the code processed the file in chunks and stored the references to the chunks in an array. When the number of chunks was greater than a certain number (around 256 was my recollection), the exception was thrown.
> > - My current workaround is to increase the number of reducers to keep the .lzo file size low.
> >
> > I would like to get advice on how people handle large .lzo files. Any pointers on the cause of the stack trace below and the best way to resolve it are greatly appreciated.
>
> Is this reproducible every time? If so, is it always at the same point in the LZO file that it occurs?

It's at the same point. Do you know how to print out the lzo index for the task? I only print out the input file now.

> Would it be possible to download that lzo file to your local box and use lzop -d to see if it decompresses successfully? That way we can isolate whether it's a compression bug or decompression.

Both the java LzoDecompressor and lzop -d were able to decompress the file correctly. As a matter of fact, my job does not index .lzo files now but processes each as a whole, and it works
Problem with large .lzo files
I am running a hadoop job that combines daily results with results from previous days. The reduce output is lzo compressed and growing in size daily.

- DistributedLzoIndexer is used to index the lzo files to provide parallelism. When the lzo files were small, everything went well. As the .lzo files grow, the chance that one of the partitions does not complete increases. The exception I got for one such case is listed at the end of the post.
- It's possible to sacrifice parallelism by having hadoop work on each .lzo file without indexing. This worked well until the file size exceeded 30G, when an array indexing exception got thrown. Apparently the code processed the file in chunks and stored the references to the chunks in an array. When the number of chunks was greater than a certain number (around 256 was my recollection), the exception was thrown.
- My current workaround is to increase the number of reducers to keep the .lzo file size low.

I would like to get advice on how people handle large .lzo files. Any pointers on the cause of the stack trace below and the best way to resolve it are greatly appreciated.
Task Logs: 'attempt_201002041203_0032_m_000134_3' syslog logs

2010-02-13 19:59:51,839 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=MAP, sessionId=
2010-02-13 19:59:52,045 INFO org.apache.hadoop.mapred.MapTask: io.sort.mb = 100
2010-02-13 19:59:52,094 INFO org.apache.hadoop.mapred.MapTask: data buffer = 79691776/99614720
2010-02-13 19:59:52,094 INFO org.apache.hadoop.mapred.MapTask: record buffer = 262144/327680
2010-02-13 19:59:52,099 INFO com.hadoop.compression.lzo.GPLNativeCodeLoader: Loaded native gpl library
2010-02-13 19:59:52,111 INFO com.hadoop.compression.lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev fatal: Not a git repository (or any of the parent directories): .git]
2010-02-13 19:59:52,144 INFO com.baz.facetmr.mapreduce.foo.FooBar: InputFile=hdfs://ndbig03:9000/user/hadoop/foo/basr_old/part-r-0.lzo
2010-02-13 19:59:52,144 INFO com.baz.facetmr.mapreduce.foo.FooBar: maxBarCount =1000
2010-02-13 19:59:55,267 INFO org.apache.hadoop.mapred.MapTask: Spilling map output: record full = true
2010-02-13 19:59:55,268 INFO org.apache.hadoop.mapred.MapTask: bufstart = 0; bufend = 18641936; bufvoid = 99614720
2010-02-13 19:59:55,268 INFO org.apache.hadoop.mapred.MapTask: kvstart = 0; kvend = 262144; length = 327680
2010-02-13 19:59:55,958 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new compressor
2010-02-13 19:59:56,730 INFO org.apache.hadoop.mapred.MapTask: Finished spill 0
2010-02-13 19:59:58,567 INFO org.apache.hadoop.mapred.MapTask: Spilling map output: record full = true
2010-02-13 19:59:58,567 INFO org.apache.hadoop.mapred.MapTask: bufstart = 18641936; bufend = 37532762; bufvoid = 99614720
2010-02-13 19:59:58,567 INFO org.apache.hadoop.mapred.MapTask: kvstart = 262144; kvend = 196607; length = 327680
2010-02-13 19:59:59,487 INFO org.apache.hadoop.mapred.MapTask: Finished spill 1
2010-02-13 20:00:01,382 INFO org.apache.hadoop.mapred.MapTask: Spilling map output: record full = true
2010-02-13 20:00:01,382 INFO org.apache.hadoop.mapred.MapTask: bufstart = 37532762; bufend = 56161807; bufvoid = 99614720
2010-02-13 20:00:01,382 INFO org.apache.hadoop.mapred.MapTask: kvstart = 196607; kvend = 131070; length = 327680
2010-02-13 20:00:02,282 INFO org.apache.hadoop.mapred.MapTask: Finished spill 2
2010-02-13 20:00:04,100 INFO org.apache.hadoop.mapred.MapTask: Spilling map output: record full = true
2010-02-13 20:00:04,100 INFO org.apache.hadoop.mapred.MapTask: bufstart = 56161807; bufend = 74935116; bufvoid = 99614720
2010-02-13 20:00:04,100 INFO org.apache.hadoop.mapred.MapTask: kvstart = 131070; kvend = 65533; length = 327680
2010-02-13 20:00:05,100 INFO org.apache.hadoop.mapred.MapTask: Finished spill 3
2010-02-13 20:00:06,177 FATAL org.apache.hadoop.mapred.TaskTracker: Error running child : java.lang.InternalError: lzo1x_decompress_safe returned:
        at com.hadoop.compression.lzo.LzoDecompressor.decompressBytesDirect(Native Method)
        at com.hadoop.compression.lzo.LzoDecompressor.decompress(LzoDecompressor.java:303)
        at com.hadoop.compression.lzo.LzopDecompressor.decompress(LzopDecompressor.java:104)
        at com.hadoop.compression.lzo.LzopInputStream.decompress(LzopInputStream.java:223)
        at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:74)
        at java.io.InputStream.read(InputStream.java:85)
        at org.apache.hadoop.util.LineReader.readLine(LineReader.java:134)
        at org.apache.hadoop.util.LineReader.readLine(LineReader.java:187)
        at com.hadoop.mapreduce.LzoLineRecordReader.nextKeyValue(LzoLineRecordReader.java:126)
        at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:423)
        at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
        at org.
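The workaround of raising the reducer count to cap .lzo file sizes can be made systematic by deriving the count from the expected output size. A small sketch (the size target is an assumption to tune; the poster saw failures past roughly 30G per file, so a much smaller cap leaves headroom):

```java
public class ReducerSizer {
    // Picks a reducer count so each reducer's output stays under a target
    // size, assuming output is spread roughly evenly across reducers.
    public static int reducersFor(long totalOutputBytes, long maxBytesPerFile) {
        long n = (totalOutputBytes + maxBytesPerFile - 1) / maxBytesPerFile; // ceiling division
        return (int) Math.max(1, n);
    }
}
```

The result would then be passed to job.setNumReduceTasks(...) when configuring the job.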
Re: isSplitable() deprecated
Ted,

You may want to consider LZO compression, which allows a compressed file to be split for map tasks. Gzip, on the other hand, is not splittable. Check out these links.

http://www.cloudera.com/blog/2009/11/17/hadoop-at-twitter-part-1-splittable-lzo-compression/
http://wiki.apache.org/hadoop/UsingLzoCompression

On Fri, Jan 8, 2010 at 1:13 PM, Ted Yu wrote:
> The input file is in .gz format
> FYI
>
> On Fri, Jan 8, 2010 at 11:08 AM, Ted Yu wrote:
> > My current project processes input file of size 02161 bytes.
> > What I plan to do is to split the file into equal size pieces (and on blank line boundaries) to improve performance.
> >
> > I found 12 classes in the 0.20.1 source code which implement InputSplit.
> >
> > If someone has written code similar to what I plan to do, please share some hints.
> >
> > Thanks
> >
> > On Fri, Jan 8, 2010 at 2:27 AM, Amogh Vasekar wrote:
> > > Hi,
> > > The deprecation is due to the new evolving mapreduce (o.a.h.mapreduce) APIs. Old APIs are supported for available distributions. The equivalent of TextInputFormat is available in the new API:
> > >
> > > http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapreduce/lib/input/TextInputFormat.html
> > >
> > > Thanks,
> > > Amogh
> > >
> > > On 1/8/10 3:47 AM, "Ted Yu" wrote:
> > > According to:
> > > http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/TextInputFormat.html#isSplitable%28org.apache.hadoop.fs.FileSystem,%20org.apache.hadoop.fs.Path%29
> > > isSplitable() is deprecated.
> > > Which method should I use to replace it?
> > > Thanks
Re: Cannot pass dynamic values by Configuration.Set()
There seems to have been a change between the 0.19 and 0.20 APIs in that 0.20 no longer sets "map.input.file". config.set(), as far as I can tell, should work. I, however, use the following to pass parameters:

String[] params = new String[] { "-D", "tag1=string_value", ... };
ToolRunner.run(new Configuration(), new SomeJob(), params);

On Mon, Jan 4, 2010 at 9:52 AM, Farhan Husain wrote:
> Hello all,
>
> I am using hadoop-0.20.1. I need to know the input file name in my map processes and pass an integer and a string to my reducer processes. I used the following method calls for that:
>
> config.set("tag1", "string_value");
> config.setInt("tag2", int_value);
>
> In the setup() method of the mapper:
> String filename = context.getConfiguration().get("map.input.file"); // returns null
>
> In the setup() method of the reducer:
> String val = context.getConfiguration().get("tag1"); // returns null
> int n = context.getConfiguration().getInt("tag2", def_val); // returns def_val
>
> Can anyone please tell me what may be wrong with this code or anything related to it? Is it a bug in this version of Hadoop? Is there any alternative way to accomplish the same objective? I have been stuck with this problem for about one week. I would appreciate it if someone would shed some light on it.
>
> Thanks,
> Farhan
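One common cause of config.set() values "disappearing" (an assumption worth checking against the poster's code, since the snippet doesn't show where the Job is created) is setting them after the Job object is constructed; Job takes a copy of the Configuration at construction time, so later changes to the original conf are not seen by the tasks. A sketch:

```java
// Set values on the Configuration BEFORE creating the Job, because
// Job snapshots the configuration in its constructor.
Configuration conf = new Configuration();
conf.set("tag1", "string_value");
conf.setInt("tag2", 42);
Job job = new Job(conf, "my job");   // copy of conf taken here

// In the reducer's setup(), these are then visible:
//   String val = context.getConfiguration().get("tag1");
//   int n = context.getConfiguration().getInt("tag2", -1);
// For the input file name under 0.20's new API, read it from the split
// instead of "map.input.file":
//   Path p = ((FileSplit) context.getInputSplit()).getPath();
```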
Re: How to ensure LzoTextInputFormat is used to generate input splits for .lzo files
Digging around the new Job API with a rested brain came up with

job.setInputFormatClass(LzoTextInputFormat.class);

which solved the problem.

On Thu, Dec 31, 2009 at 9:53 AM, Steve Kuo wrote:
> I have followed http://www.cloudera.com/blog/2009/11/17/hadoop-at-twitter-part-1-splittable-lzo-compression/ and http://code.google.com/p/hadoop-gpl-compression/wiki/FAQ to build the requisite hadoop-lzo jar and native .so files. (The jar and .so files were built from Kevin Weil's git repository. Thanks Kevin.) I have configured core-site.xml and mapred-site.xml as instructed to enable lzo for both map and reduce output. The creation of the lzo index also worked. The last step was to replace TextInputFormat with LzoTextInputFormat. As I only have
>
> FileInputFormat.addInputPath(jobConf, new Path(inputPath));
>
> it was replaced with
>
> LzoTextInputFormat.addInputPath(job, new Path(inputPath));
>
> When I ran my MR job, I noticed that the new code was able to read in .lzo input files and decompressed them fine. The output was also lzo compressed. However, only one map task was created for each input .lzo file, indicating that input splitting was not done by LzoTextInputFormat but more likely by its parent, such as FileInputFormat. There must be a way to ensure LzoTextInputFormat is used in the map task. How can this be done?
>
> Thanks in advance.
How to ensure LzoTextInputFormat is used to generate input splits for .lzo files
I have followed http://www.cloudera.com/blog/2009/11/17/hadoop-at-twitter-part-1-splittable-lzo-compression/ and http://code.google.com/p/hadoop-gpl-compression/wiki/FAQ to build the requisite hadoop-lzo jar and native .so files. (The jar and .so files were built from Kevin Weil's git repository. Thanks Kevin.) I have configured core-site.xml and mapred-site.xml as instructed to enable lzo for both map and reduce output. The creation of the lzo index also worked. The last step was to replace TextInputFormat with LzoTextInputFormat. As I only have

FileInputFormat.addInputPath(jobConf, new Path(inputPath));

it was replaced with

LzoTextInputFormat.addInputPath(job, new Path(inputPath));

When I ran my MR job, I noticed that the new code was able to read in .lzo input files and decompressed them fine. The output was also lzo compressed. However, only one map task was created for each input .lzo file, indicating that input splitting was not done by LzoTextInputFormat but more likely by its parent, such as FileInputFormat. There must be a way to ensure LzoTextInputFormat is used in the map task. How can this be done? Thanks in advance.