'AS' is almost always dangerous. The loader already has a schema. Use a projection if you want to rename the fields.
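To make that advice concrete, here is a minimal sketch (reusing the $SEQFILE_LOADER and converter declarations that appear later in this thread; the `logs` alias is illustrative, not from the original scripts): let the loader supply its own schema, then rename and cast the fields explicitly with a FOREACH projection rather than an AS clause.

```pig
-- Load without AS so the loader's own schema is used as-is.
raw_logs = LOAD '$INPUT_LOCATION'
    USING $SEQFILE_LOADER ('-c $NULL_CONVERTER', '-c $TEXT_CONVERTER');

-- Rename (and cast, where safe) via projection instead of forcing a schema with AS.
logs = FOREACH raw_logs GENERATE $0 AS key, (chararray) $1 AS value;
```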
On Fri, May 18, 2012 at 4:07 PM, Chris Diehl <cpdi...@gmail.com> wrote:
> With a little bit of luck, we managed to find an answer.
>
> Turns out we needed to remove the cast from key and run the script in Pig
> 0.10. I was running the script with Pig 0.8.1 up until today.
>
> raw_logs = LOAD '$INPUT_LOCATION'
>     USING $SEQFILE_LOADER ('-c $NULL_CONVERTER', '-c $TEXT_CONVERTER')
>     AS (key, value: chararray);
>
> Chris
>
> On Fri, May 18, 2012 at 2:27 PM, Chris Diehl <cpdi...@gmail.com> wrote:
>
> > Hi Andy,
> >
> > Here's what is in the log file.
> >
> > Pig Stack Trace
> > ---------------
> > ERROR 2244: Job failed, hadoop does not return any error message
> >
> > org.apache.pig.backend.executionengine.ExecException: ERROR 2244: Job
> > failed, hadoop does not return any error message
> >         at org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:119)
> >         at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:172)
> >         at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:144)
> >         at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:90)
> >         at org.apache.pig.Main.run(Main.java:500)
> >         at org.apache.pig.Main.main(Main.java:107)
> > ================================================================================
> >
> > I am running it on the cluster. I could not find any additional
> > information on the job tracker.
> >
> > The keys in the sequence files are all null. The values are all JSON
> > strings. Given that information, I tried configuring the
> > SequenceFileLoader this way, to no avail.
> >
> > %declare SEQFILE_LOADER 'com.twitter.elephantbird.pig.load.SequenceFileLoader';
> > %declare TEXT_CONVERTER 'com.twitter.elephantbird.pig.util.TextConverter';
> > %declare NULL_CONVERTER 'com.twitter.elephantbird.pig.util.NullWritableConverter';
> >
> > raw_logs = LOAD '$INPUT_LOCATION'
> >     USING $SEQFILE_LOADER ('-c $NULL_CONVERTER', '-c $TEXT_CONVERTER')
> >     AS (key: chararray, value: chararray);
> >
> > Is there another way I should be configuring it?
> >
> > Chris
> >
> > On Fri, May 18, 2012 at 11:24 AM, Andy Schlaikjer
> > <andrew.schlaik...@gmail.com> wrote:
> >
> >> Chris, the console output mentions file
> >> "/opt/shared_storage/log_analysis_pig_python_scripts/pig_1337299061301.log".
> >> Does this contain any kind of stack trace? Were you running the script
> >> in local mode or on a cluster? If the latter, there should be at least
> >> map task log output someplace that may also have some clues.
> >>
> >> Does path
> >> '/logs/jive/internal/raw/2012/05/07/2012050795652.0627-720078349.seq'
> >> contain SequenceFile<Text, Text> data? If not, you'll have to configure
> >> SequenceFileLoader further to properly deserialize the key-value pairs.
> >>
> >> Andy
> >>
> >> On Thu, May 17, 2012 at 5:07 PM, Chris Diehl <cpdi...@gmail.com> wrote:
> >>
> >> > Andy,
> >> >
> >> > Here's what I'm seeing when I run the following script. There's no
> >> > information beyond what is here in the log file.
> >> >
> >> > Chris
> >> >
> >> > REGISTER '/opt/shared_storage/elephant-bird/build/elephant-bird-2.2.3-SNAPSHOT.jar';
> >> > %declare SEQFILE_LOADER 'com.twitter.elephantbird.pig.load.SequenceFileLoader';
> >> > %declare TEXT_CONVERTER 'com.twitter.elephantbird.pig.util.TextConverter';
> >> > %declare NULL_CONVERTER 'com.twitter.elephantbird.pig.util.NullWritableConverter';
> >> >
> >> > rmf /data/SearchLogJSON;
> >> >
> >> > -- Load raw log data
> >> > raw_logs = LOAD '/logs/jive/internal/raw/2012/05/07/2012050795652.0627-720078349.seq'
> >> >     USING $SEQFILE_LOADER ();
> >> >
> >> > -- Store the JSON
> >> > STORE raw_logs INTO '/data/SearchLogJSON/';
> >> >
> >> > -------------------
> >> >
> >> > -sh-3.2$ pig dump_log_json.pig
> >> > 2012-05-17 23:57:41,304 [main] INFO  org.apache.pig.Main - Logging error messages to:
> >> >   /opt/shared_storage/log_analysis_pig_python_scripts/pig_1337299061301.log
> >> > 2012-05-17 23:57:41,586 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine -
> >> >   Connecting to hadoop file system at: XXX
> >> > 2012-05-17 23:57:41,932 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine -
> >> >   Connecting to map-reduce job tracker at: XXX
> >> > 2012-05-17 23:57:42,204 [main] INFO  org.apache.pig.tools.pigstats.ScriptState -
> >> >   Pig features used in the script: UNKNOWN
> >> > 2012-05-17 23:57:42,204 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine -
> >> >   pig.usenewlogicalplan is set to true. New logical plan will be used.
> >> > 2012-05-17 23:57:42,301 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine -
> >> >   (Name: raw_logs: Store(/data/SearchLogJSON:org.apache.pig.builtin.PigStorage) - scope-1 Operator Key: scope-1)
> >> > 2012-05-17 23:57:42,317 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler -
> >> >   File concatenation threshold: 100 optimistic? false
> >> > 2012-05-17 23:57:42,349 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer -
> >> >   MR plan size before optimization: 1
> >> > 2012-05-17 23:57:42,349 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer -
> >> >   MR plan size after optimization: 1
> >> > 2012-05-17 23:57:42,529 [main] INFO  org.apache.pig.tools.pigstats.ScriptState -
> >> >   Pig script settings are added to the job
> >> > 2012-05-17 23:57:42,545 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler -
> >> >   mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
> >> > 2012-05-17 23:57:44,706 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler -
> >> >   Setting up single store job
> >> > 2012-05-17 23:57:44,734 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher -
> >> >   1 map-reduce job(s) waiting for submission.
> >> > 2012-05-17 23:57:45,053 [Thread-4] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat -
> >> >   Total input paths to process : 1
> >> > 2012-05-17 23:57:45,057 [Thread-4] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil -
> >> >   Total input paths (combined) to process : 1
> >> > 2012-05-17 23:57:45,236 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher -
> >> >   0% complete
> >> > 2012-05-17 23:57:45,849 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher -
> >> >   HadoopJobId: job_201205170527_0003
> >> > 2012-05-17 23:57:45,849 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher -
> >> >   More information at: XXX
> >> > 2012-05-17 23:58:25,816 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher -
> >> >   job job_201205170527_0003 has failed! Stop running all dependent jobs
> >> > 2012-05-17 23:58:25,821 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher -
> >> >   100% complete
> >> > 2012-05-17 23:58:25,824 [main] ERROR org.apache.pig.tools.pigstats.PigStatsUtil - 1 map reduce job(s) failed!
> >> > 2012-05-17 23:58:25,825 [main] INFO  org.apache.pig.tools.pigstats.PigStats - Script Statistics:
> >> >
> >> > HadoopVersion   PigVersion    UserId       StartedAt            FinishedAt           Features
> >> > 0.20.2-cdh3u2   0.8.1-cdh3u2  chris.diehl  2012-05-17 23:57:42  2012-05-17 23:58:25  UNKNOWN
> >> >
> >> > Failed!
> >> >
> >> > Failed Jobs:
> >> > JobId                  Alias     Feature   Message              Outputs
> >> > job_201205170527_0003  raw_logs  MAP_ONLY  Message: Job failed!
> >> > Error - NA  /data/SearchLogJSON,
> >> >
> >> > Input(s):
> >> > Failed to read data from
> >> > "/logs/jive/internal/raw/2012/05/07/2012050795652.0627-720078349.seq"
> >> >
> >> > Output(s):
> >> > Failed to produce result in "/data/SearchLogJSON"
> >> >
> >> > Counters:
> >> > Total records written : 0
> >> > Total bytes written : 0
> >> > Spillable Memory Manager spill count : 0
> >> > Total bags proactively spilled: 0
> >> > Total records proactively spilled: 0
> >> >
> >> > Job DAG:
> >> > job_201205170527_0003
> >> >
> >> > 2012-05-17 23:58:25,825 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher -
> >> >   Failed!
> >> > 2012-05-17 23:58:25,831 [main] ERROR org.apache.pig.tools.grunt.GruntParser -
> >> >   ERROR 2244: Job failed, hadoop does not return any error message
> >> > Details at logfile:
> >> >   /opt/shared_storage/log_analysis_pig_python_scripts/pig_1337299061301.log
> >> >
> >> > On Thu, May 17, 2012 at 1:20 PM, Andy Schlaikjer
> >> > <andrew.schlaik...@gmail.com> wrote:
> >> >
> >> > > Chris, could you send us any of your error logs? What kind of
> >> > > failures are you running into?
> >> > >
> >> > > Andy
> >> > >
> >> > > On Wed, May 16, 2012 at 11:47 AM, Chris Diehl <cpdi...@gmail.com> wrote:
> >> > >
> >> > > > Hi All,
> >> > > >
> >> > > > I'm attempting to load sequence files for the first time using
> >> > > > Elephant Bird's sequence file loader and having absolutely no luck.
> >> > > >
> >> > > > I did a hadoop fs -text on one of the sequence files and noticed
> >> > > > all the keys are (null). Not sure if that is throwing things off here.
> >> > > >
> >> > > > Here are the various approaches I've tried; all of them have failed.
> >> > > >
> >> > > > REGISTER '/opt/shared_storage/elephant-bird/build/elephant-bird-2.2.3-SNAPSHOT.jar';
> >> > > > %declare SEQFILE_LOADER 'com.twitter.elephantbird.pig.load.SequenceFileLoader';
> >> > > > %declare TEXT_CONVERTER 'com.twitter.elephantbird.pig.util.TextConverter';
> >> > > > %declare NULL_CONVERTER 'com.twitter.elephantbird.pig.util.NullWritableConverter';
> >> > > >
> >> > > > raw_logs = LOAD '/logs/jive/internal/raw/2012/05/07/2012050795652.0627-720078349.seq'
> >> > > >     USING $SEQFILE_LOADER ('-c $NULL_CONVERTER', '-c $TEXT_CONVERTER')
> >> > > >     AS (key: bytearray, value: chararray);
> >> > > > -- raw_logs = LOAD '/logs/jive/internal/raw/2012/05/07/2012050795652.0627-720078349.seq'
> >> > > > --     USING $SEQFILE_LOADER ('-c $TEXT_CONVERTER', '-c $TEXT_CONVERTER')
> >> > > > --     AS (key: chararray, value: chararray);
> >> > > > -- raw_logs = LOAD '/logs/jive/internal/raw/2012/05/07/2012050795652.0627-720078349.seq'
> >> > > > --     USING $SEQFILE_LOADER ();
> >> > > >
> >> > > > STORE raw_logs INTO '/data/SearchLogJSON/';
> >> > > >
> >> > > > Any thoughts on what might be the problem? Anything else I should
> >> > > > try? I'm totally out of ideas.
> >> > > >
> >> > > > Appreciate any pointers!
> >> > > >
> >> > > > Chris
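[For reference, the combination that ultimately worked at the top of this thread (Pig 0.10, no cast on the key) reduces to a sketch like the following; since the keys are all null, the key field can simply be projected away before storing. The `values_only` alias is illustrative, not part of the original script.]

```pig
REGISTER '/opt/shared_storage/elephant-bird/build/elephant-bird-2.2.3-SNAPSHOT.jar';
%declare SEQFILE_LOADER 'com.twitter.elephantbird.pig.load.SequenceFileLoader';
%declare TEXT_CONVERTER 'com.twitter.elephantbird.pig.util.TextConverter';
%declare NULL_CONVERTER 'com.twitter.elephantbird.pig.util.NullWritableConverter';

-- Leave the key untyped; casting the NullWritable-backed key was what failed on 0.8.1.
raw_logs = LOAD '/logs/jive/internal/raw/2012/05/07/2012050795652.0627-720078349.seq'
    USING $SEQFILE_LOADER ('-c $NULL_CONVERTER', '-c $TEXT_CONVERTER')
    AS (key, value: chararray);

-- The keys are all null, so keep only the JSON values.
values_only = FOREACH raw_logs GENERATE value;
STORE values_only INTO '/data/SearchLogJSON/';
```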