>> So, considering this situation of loading a mix of good and corrupted ".gz"
>> files, how can I still get the expected results?
Try setting mapred.max.map.failures.percent to the percentage of input files
you expect to be corrupted, i.e. the fraction of data you are willing to skip.
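For example, if you expect up to a fifth of the files to be truncated, you could put something like this at the top of the script (assuming your Pig version supports passing Hadoop job properties via the set command; the property name is from the old mapred API, and 20 is just an illustrative threshold):

```pig
-- Allow up to 20% of map tasks to fail without failing the whole job,
-- so maps that hit corrupted .gz splits do not kill the run.
set mapred.max.map.failures.percent 20;

a = LOAD 'foldera/*';
```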

Amogh

On 2/21/10 7:17 AM, "jiang licht" <licht_ji...@yahoo.com> wrote:

I think I found what caused the problem. The folder loaded into 'a' contains 
all ".gz" files, and some of those .gz files are corrupted. That is why 
"java.io.EOFException: Unexpected end of ZLIB input stream" was thrown.

I ran the following test: I truncated a ".gz" file and named it "corrupted.gz", 
loaded only this file into 'a', and executed the same remaining script. This 
caused exactly the same error message as in my first post. The same thing 
happens when loading both this file and other good .gz files.
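The same failure mode can be reproduced outside Hadoop by truncating a gzip file and reading it back; this is a minimal sketch in Python (file names and the truncation point are illustrative):

```python
import gzip
import os
import tempfile

tmpdir = tempfile.mkdtemp()
good = os.path.join(tmpdir, "good.gz")
corrupted = os.path.join(tmpdir, "corrupted.gz")

# Incompressible random payload, so the compressed file is comfortably
# larger than the truncation point below.
with gzip.open(good, "wb") as f:
    f.write(os.urandom(4096))

# Keep only the first 100 bytes to simulate a truncated/corrupted .gz file.
with open(good, "rb") as src, open(corrupted, "wb") as dst:
    dst.write(src.read()[:100])

# Reading the truncated copy fails before the gzip end-of-stream marker,
# the Python analogue of Hadoop's "Unexpected end of ZLIB input stream".
error = None
try:
    with gzip.open(corrupted, "rb") as f:
        f.read()
except EOFError as e:
    error = e
print("truncated file raised:", error)
```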

My guess is that such corrupted files are not loaded (since the above 
exception is thrown), but data from the good .gz files still gets loaded. Then 
why is an empty result (0-sized part-####) generated? So, considering this 
situation of loading a mix of good and corrupted ".gz" files, how can I still 
get the expected results?

Thanks!


Michael

--- On Sat, 2/20/10, Ashutosh Chauhan <ashutosh.chau...@gmail.com> wrote:

From: Ashutosh Chauhan <ashutosh.chau...@gmail.com>
Subject: Re: Unexpected empty result problem (zero-sized part-### files)?
To: common-user@hadoop.apache.org
Date: Saturday, February 20, 2010, 7:29 PM

A log file with a name like pig_1234567890.log must be sitting in the
directory from which you launched your pig script. Can you send its contents?

Ashutosh

On Sat, Feb 20, 2010 at 16:41, jiang licht <licht_ji...@yahoo.com> wrote:

> I have a pig script as follows (see far below). It loads two data sets,
> performs some filtering, then joins the two sets. Finally, it counts
> occurrences of a combination of fields and writes the results to HDFS.
>
> --load raw data
> a = LOAD 'foldera/*';
> b = LOAD 'somefile';
>
> --choose rows and columns
> a_filtered = FILTER a BY somecondition;
> a_filtered_shortened = FOREACH a_filtered GENERATE somefields;
> a_filtered_shortened_unique = DISTINCT a_filtered_shortened PARALLEL #;
>
> --join a & b and count occurrences of a combination of fields
> ab = JOIN a_filtered_shortened_unique BY somefield, b BY somefield PARALLEL #;
> ab_shortened = FOREACH ab GENERATE somefields;
> ab_shortened_grouped = GROUP ab_shortened BY ($0, $1) PARALLEL #;
>
> --c will contain: fields, counts
> c = FOREACH ab_shortened_grouped GENERATE FLATTEN($0), COUNT(ab_shortened);
>
> --save results
> STORE c INTO 'MYRESULTS' USING PigStorage();
>
> The PROBLEM is that empty result sets (empty part-#### files) were
> generated, but a non-empty result is expected. For example, if I choose to
> load a single file (instead of all files in the folder) into 'a', quite a
> number of tuples are created (non-empty part-#### files).
>
> The logic in the script seems good to me, and it generates correct results
> for any randomly selected file anyway. So I am wondering what could cause
> this empty-result problem?
>
> FYI, I ran the same script multiple times and every run gave me empty
> part-#### files. In the output I did repeatedly see error messages similar
> to the ones below, which show that one result failed to be produced (these
> are the last lines of the job output). Could this be the problem? How do I
> locate the cause? Thanks!
>
> ...
> 2010-02-20 16:21:37,737 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> - 86% complete
> 2010-02-20 16:21:38,239 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> - 87% complete
> 2010-02-20 16:21:39,265 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> - 88% complete
> 2010-02-20 16:21:44,286 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> - 93% complete
> 2010-02-20 16:21:46,931 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> - 95% complete
> 2010-02-20 16:21:47,432 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> - 99% complete
> 2010-02-20 16:21:54,005 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> - 100% complete
> 2010-02-20 16:21:54,005 [main] ERROR
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> - 1 map reduce job(s) failed!
> 2010-02-20 16:21:54,008 [main] ERROR
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> - Failed to produce result in:
> "hdfs://hostA:50001/tmp/temp829697187/tmp-531977953"
> 2010-02-20 16:21:54,008 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> - Successfully stored result in:
> "hdfs://hostA:50001/tmp/temp829697187/tmp504533728"
> 2010-02-20 16:21:54,023 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> - Successfully stored result in: "hdfs://hostA:50001/user/root/MYRESULTS"
> 2010-02-20 16:21:54,056 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> - Records written : 0
> 2010-02-20 16:21:54,056 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> - Bytes written : 0
> 2010-02-20 16:21:54,056 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> - Some jobs have failed!