Hi,
I would like to use Pig to work with Wikipedia dump files. It works
successfully with an input file of around 8 GB, as long as the content of
the individual XML elements is not too large.
In my current case I would like to use the file
"enwiki-latest-pages-meta-history1.xml-p000000010p000002162.bz2" (around
2 GB compressed), which can be found here:
http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-meta-history1.xml-p000000010p000002162.bz2
Is it possible that the XMLLoader of Piggybank has problems loading
elements split by <page> because the content of a <page></page> XML
element can become very large (several GB, for instance)?
Hopefully somebody can help me with this.
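One way to check this suspicion outside of Pig would be to measure the largest <page> element in the dump. The following is only a minimal Python sketch, not something I have run against the full file: it assumes that the <page> and </page> tags each sit on their own line (as they appear to in the Wikipedia dumps), and the function names are made up for illustration.

```python
import bz2


def largest_element_size(lines, tag="page"):
    """Return the size in bytes of the largest <tag>...</tag> block.

    Assumption: the opening and closing tags each appear on their own
    line. Only a running byte counter is kept, not the element content.
    """
    open_tag = f"<{tag}>".encode()
    close_tag = f"</{tag}>".encode()
    largest = 0
    current = None  # None means we are currently outside an element
    for line in lines:
        if current is None and open_tag in line:
            current = 0
        if current is not None:
            current += len(line)
            if close_tag in line:
                largest = max(largest, current)
                current = None
    return largest


def largest_page_in_dump(path):
    # bz2.open decompresses lazily, so the dump is read line by line
    # instead of being loaded into memory as a whole.
    with bz2.open(path, "rb") as stream:
        return largest_element_size(stream)
```

Calling largest_page_in_dump() with the local path of the .bz2 file would give the uncompressed size of the biggest page. If that number is close to or above the heap available to a map task, it would support the idea that a single <page> element is what exhausts the memory.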
I've tried to run the following Pig Latin script:
=========
register piggybank.jar;
pages = load '/user/herbert/enwiki-latest-pages-meta-history1.xml-p000000010p000002162.bz2'
    using org.apache.pig.piggybank.storage.XMLLoader('page') as (page:chararray);
pages = limit pages 1;
dump pages;
=========
and always get the following error (the generated logfile is attached):
=========
2012-03-28 14:49:54,695 [main] INFO org.apache.pig.Main - Apache Pig
version 0.11.0-SNAPSHOT (rexported) compiled Mrz 28 2012, 08:21:45
2012-03-28 14:49:54,696 [main] INFO org.apache.pig.Main - Logging error
messages to:
/Users/herbert/Documents/workspace/pig-wikipedia/pig_1332938994693.log
2012-03-28 14:49:54,936 [main] INFO org.apache.pig.impl.util.Utils -
Default bootup file /Users/herbert/.pigbootup not found
2012-03-28 14:49:55,189 [main] INFO
org.apache.pig.backend.hadoop.executionengine.HExecutionEngine -
Connecting to hadoop file system at: hdfs://localhost:9000
2012-03-28 14:49:55,403 [main] INFO
org.apache.pig.backend.hadoop.executionengine.HExecutionEngine -
Connecting to map-reduce job tracker at: localhost:9001
2012-03-28 14:49:55,845 [main] INFO
org.apache.pig.tools.pigstats.ScriptState - Pig features used in the
script: LIMIT
2012-03-28 14:49:56,021 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler
- File concatenation threshold: 100 optimistic? false
2012-03-28 14:49:56,067 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
- MR plan size before optimization: 1
2012-03-28 14:49:56,067 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
- MR plan size after optimization: 1
2012-03-28 14:49:56,171 [main] INFO
org.apache.pig.tools.pigstats.ScriptState - Pig script settings are
added to the job
2012-03-28 14:49:56,187 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler
- mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
2012-03-28 14:49:56,274 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler
- creating jar file Job5733074907123320640.jar
2012-03-28 14:49:59,720 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler
- jar file Job5733074907123320640.jar created
2012-03-28 14:49:59,736 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler
- Setting up single store job
2012-03-28 14:49:59,795 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
- 1 map-reduce job(s) waiting for submission.
****hdfs://localhost:9000/user/herbert/enwiki-latest-pages-meta-history1.xml-p000000010p000002162.bz2
2012-03-28 14:50:00,152 [Thread-11] INFO
org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input
paths to process : 1
2012-03-28 14:50:00,169 [Thread-11] INFO
org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total
input paths (combined) to process : 35
2012-03-28 14:50:00,299 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
- 0% complete
2012-03-28 14:50:01,277 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
- HadoopJobId: job_201203281105_0009
2012-03-28 14:50:01,278 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
- More information at:
http://localhost:50030/jobdetails.jsp?jobid=job_201203281105_0009
2012-03-28 14:50:23,145 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
- 1% complete
2012-03-28 14:50:29,206 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
- 2% complete
2012-03-28 14:50:38,288 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
- 4% complete
2012-03-28 14:53:17,686 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
- 7% complete
2012-03-28 14:53:41,529 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
- 9% complete
2012-03-28 14:55:05,775 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
- 10% complete
2012-03-28 14:55:32,685 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
- 12% complete
2012-03-28 14:56:21,754 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
- 13% complete
2012-03-28 14:58:36,797 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
- job job_201203281105_0009 has failed! Stop running all dependent jobs
2012-03-28 14:58:36,799 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
- 100% complete
2012-03-28 14:58:36,850 [main] ERROR
org.apache.pig.tools.pigstats.SimplePigStats - ERROR 2997: Unable to
recreate exception from backed error: Error: Java heap space
2012-03-28 14:58:36,850 [main] ERROR
org.apache.pig.tools.pigstats.PigStatsUtil - 1 map reduce job(s) failed!
2012-03-28 14:58:36,854 [main] INFO
org.apache.pig.tools.pigstats.SimplePigStats - Script Statistics:
HadoopVersion PigVersion UserId StartedAt FinishedAt Features
1.0.1 0.11.0-SNAPSHOT herbert 2012-03-28 14:49:56 2012-03-28 14:58:36
LIMIT
Failed!
Failed Jobs:
JobId Alias Feature Message Outputs
job_201203281105_0009 pages Message: Job failed! Error - # of failed
Map Tasks exceeded allowed limit. FailedCount: 1. LastFailedTask:
task_201203281105_0009_m_000003
hdfs://localhost:9000/tmp/temp1813558187/tmp250990633,
Input(s):
Failed to read data from
"/user/herbert/enwiki-latest-pages-meta-history1.xml-p000000010p000002162.bz2"
Output(s):
Failed to produce result in
"hdfs://localhost:9000/tmp/temp1813558187/tmp250990633"
Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
Job DAG:
job_201203281105_0009
2012-03-28 14:58:36,855 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
- Failed!
2012-03-28 14:58:36,891 [main] ERROR org.apache.pig.tools.grunt.Grunt -
ERROR 2997: Unable to recreate exception from backed error: Error: Java
heap space
Details at logfile:
/Users/herbert/Documents/workspace/pig-wikipedia/pig_1332938994693.log
pig wiki.pig 8,48s user 2,72s system 2% cpu 8:46,07 total
=========
Thank you very much and kind regards,
Herbert
Backend error message
---------------------
Error: Java heap space
Pig Stack Trace
---------------
ERROR 2997: Unable to recreate exception from backed error: Error: Java heap
space
org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open
iterator for alias pages. Backend error : Unable to recreate exception from
backed error: Error: Java heap space
at org.apache.pig.PigServer.openIterator(PigServer.java:848)
at
org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:657)
at
org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:305)
at
org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:190)
at
org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:166)
at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:84)
at org.apache.pig.Main.run(Main.java:601)
at org.apache.pig.Main.main(Main.java:153)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 2997:
Unable to recreate exception from backed error: Error: Java heap space
at
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher.getErrorMessages(Launcher.java:217)
at
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher.getStats(Launcher.java:149)
at
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:383)
at org.apache.pig.PigServer.launchPlan(PigServer.java:1271)
at
org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1256)
at org.apache.pig.PigServer.storeEx(PigServer.java:953)
at org.apache.pig.PigServer.store(PigServer.java:920)
at org.apache.pig.PigServer.openIterator(PigServer.java:833)
... 12 more
================================================================================