Hi,

I would like to use Pig to work with Wikipedia dump files. It works with an input file of around 8 GB in which the individual XML elements are not too large.

In my current case I would like to use the file "enwiki-latest-pages-meta-history1.xml-p000000010p000002162.bz2" (around 2GB of compressed size) which can be found here:

http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-meta-history1.xml-p000000010p000002162.bz2

Is it possible that the XMLLoader of Piggybank has problems loading elements split by <page> because the content of a <page></page> XML element can become very large (several GB, for instance)?
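
One rough way to check this suspicion would be to stream the dump and measure the largest <page> element. The following is only a minimal sketch in Python, not something I have run against the full dump; the function name largest_page_bytes is made up for illustration, and it assumes that the <page> and </page> tags each sit on their own line, as they normally do in MediaWiki XML dumps.

```python
import bz2


def largest_page_bytes(path):
    """Return the size in bytes of the largest <page>...</page> element.

    Hypothetical helper. Assumes the opening and closing tags each
    appear alone on a line (ignoring surrounding whitespace).
    """
    largest = 0
    current = 0
    inside = False
    # bz2.open decompresses lazily, so the dump is never held in memory whole.
    with bz2.open(path, "rb") as dump:
        for line in dump:
            stripped = line.strip()
            if stripped == b"<page>":
                inside = True
                current = 0
            if inside:
                current += len(line)
            if stripped == b"</page>":
                inside = False
                largest = max(largest, current)
    return largest
```

If the largest element turned out to be close to or above the heap available to a map task, that would support the suspicion.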

Hopefully somebody can help me with this.

I've tried to run the following Pig Latin script:

=========
register piggybank.jar;

pages = load '/user/herbert/enwiki-latest-pages-meta-history1.xml-p000000010p000002162.bz2' using org.apache.pig.piggybank.storage.XMLLoader('page') as (page:chararray);
pages = limit pages 1;
dump pages;
=========

and always get the following error (the generated logfile is attached):

=========

2012-03-28 14:49:54,695 [main] INFO org.apache.pig.Main - Apache Pig version 0.11.0-SNAPSHOT (rexported) compiled Mrz 28 2012, 08:21:45
2012-03-28 14:49:54,696 [main] INFO org.apache.pig.Main - Logging error messages to: /Users/herbert/Documents/workspace/pig-wikipedia/pig_1332938994693.log
2012-03-28 14:49:54,936 [main] INFO org.apache.pig.impl.util.Utils - Default bootup file /Users/herbert/.pigbootup not found
2012-03-28 14:49:55,189 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://localhost:9000
2012-03-28 14:49:55,403 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: localhost:9001
2012-03-28 14:49:55,845 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: LIMIT
2012-03-28 14:49:56,021 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
2012-03-28 14:49:56,067 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
2012-03-28 14:49:56,067 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
2012-03-28 14:49:56,171 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added to the job
2012-03-28 14:49:56,187 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
2012-03-28 14:49:56,274 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - creating jar file Job5733074907123320640.jar
2012-03-28 14:49:59,720 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - jar file Job5733074907123320640.jar created
2012-03-28 14:49:59,736 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job
2012-03-28 14:49:59,795 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
****hdfs://localhost:9000/user/herbert/enwiki-latest-pages-meta-history1.xml-p000000010p000002162.bz2
2012-03-28 14:50:00,152 [Thread-11] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2012-03-28 14:50:00,169 [Thread-11] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 35
2012-03-28 14:50:00,299 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
2012-03-28 14:50:01,277 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_201203281105_0009
2012-03-28 14:50:01,278 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - More information at: http://localhost:50030/jobdetails.jsp?jobid=job_201203281105_0009
2012-03-28 14:50:23,145 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1% complete
2012-03-28 14:50:29,206 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 2% complete
2012-03-28 14:50:38,288 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 4% complete
2012-03-28 14:53:17,686 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 7% complete
2012-03-28 14:53:41,529 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 9% complete
2012-03-28 14:55:05,775 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 10% complete
2012-03-28 14:55:32,685 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 12% complete
2012-03-28 14:56:21,754 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 13% complete
2012-03-28 14:58:36,797 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - job job_201203281105_0009 has failed! Stop running all dependent jobs
2012-03-28 14:58:36,799 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2012-03-28 14:58:36,850 [main] ERROR org.apache.pig.tools.pigstats.SimplePigStats - ERROR 2997: Unable to recreate exception from backed error: Error: Java heap space
2012-03-28 14:58:36,850 [main] ERROR org.apache.pig.tools.pigstats.PigStatsUtil - 1 map reduce job(s) failed!
2012-03-28 14:58:36,854 [main] INFO org.apache.pig.tools.pigstats.SimplePigStats - Script Statistics:

HadoopVersion   PigVersion      UserId  StartedAt       FinishedAt      Features
1.0.1   0.11.0-SNAPSHOT herbert 2012-03-28 14:49:56     2012-03-28 14:58:36     LIMIT

Failed!

Failed Jobs:
JobId   Alias   Feature Message Outputs
job_201203281105_0009 pages Message: Job failed! Error - # of failed Map Tasks exceeded allowed limit. FailedCount: 1. LastFailedTask: task_201203281105_0009_m_000003 hdfs://localhost:9000/tmp/temp1813558187/tmp250990633,

Input(s):
Failed to read data from "/user/herbert/enwiki-latest-pages-meta-history1.xml-p000000010p000002162.bz2"

Output(s):
Failed to produce result in "hdfs://localhost:9000/tmp/temp1813558187/tmp250990633"

Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0

Job DAG:
job_201203281105_0009


2012-03-28 14:58:36,855 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Failed!
2012-03-28 14:58:36,891 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2997: Unable to recreate exception from backed error: Error: Java heap space
Details at logfile: /Users/herbert/Documents/workspace/pig-wikipedia/pig_1332938994693.log
pig wiki.pig  8,48s user 2,72s system 2% cpu 8:46,07 total

=========

Thank you very much and kind regards,
Herbert
Backend error message
---------------------
Error: Java heap space

Pig Stack Trace
---------------
ERROR 2997: Unable to recreate exception from backed error: Error: Java heap space

org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open iterator for alias pages. Backend error : Unable to recreate exception from backed error: Error: Java heap space
        at org.apache.pig.PigServer.openIterator(PigServer.java:848)
        at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:657)
        at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:305)
        at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:190)
        at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:166)
        at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:84)
        at org.apache.pig.Main.run(Main.java:601)
        at org.apache.pig.Main.main(Main.java:153)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 2997: Unable to recreate exception from backed error: Error: Java heap space
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher.getErrorMessages(Launcher.java:217)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher.getStats(Launcher.java:149)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:383)
        at org.apache.pig.PigServer.launchPlan(PigServer.java:1271)
        at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1256)
        at org.apache.pig.PigServer.storeEx(PigServer.java:953)
        at org.apache.pig.PigServer.store(PigServer.java:920)
        at org.apache.pig.PigServer.openIterator(PigServer.java:833)
        ... 12 more
================================================================================
