Piggybank MultiStorage does not scale when processing around 7k records per bucket
----------------------------------------------------------------------------------
                 Key: PIG-1547
                 URL: https://issues.apache.org/jira/browse/PIG-1547
             Project: Pig
          Issue Type: Bug
    Affects Versions: 0.7.0
            Reporter: Viraj Bhat


I am trying to use the MultiStorage piggybank UDF:

{code}
register pig-svn/trunk/contrib/piggybank/java/piggybank.jar;

A = load '/user/viraj/largebucketinput.txt' using PigStorage('\u0001') as (a, b, c);

STORE A INTO '/user/viraj/multistore' USING
    org.apache.pig.piggybank.storage.MultiStorage('/user/viraj/multistore', '1', 'none', '\u0001');
{code}

The file "largebucketinput.txt" is around 85 MB in size. Column "b" takes 512 distinct values (0 through 511), and each value of "b" (that is, each bucket) contains about 7k records.

a) On a multi-node Hadoop installation:

The above Pig script, which spawns a single map-only job, does not succeed: the task is killed by the TaskTracker for running above the memory limit.

== Message ==
TaskTree [pid=24584,tipID=attempt_201008110143_101976_m_000000_0] is running beyond memory-limits. Current usage : 1661034496bytes. Limit : 1610612736bytes.
== Message ==

We tried increasing the map slots, but the job still does not succeed.
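For what it's worth, this behavior is consistent with MultiStorage holding one output stream open per distinct value of the split field for the lifetime of the task (note the per-bucket output path "/444/444-1" in the logs below). The following is a minimal, self-contained sketch of that pattern, not the piggybank source; the class name PerKeyWriterSketch, the writerFor helper, and the /tmp paths are all invented for illustration:

{code:java}
import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical sketch (not the piggybank source; all names invented) of a
 * one-writer-per-key store: the first time a value of the split field is
 * seen, an output file for that key is opened and kept open until the task
 * finishes. Open-stream count and buffered memory therefore grow with the
 * number of distinct keys, not with the input size.
 */
public class PerKeyWriterSketch {

    // One open stream per distinct key. Against HDFS, each entry would be a
    // DFSOutputStream holding client-side buffers plus a datanode pipeline
    // connection, rather than a cheap local file handle.
    private final Map<String, OutputStream> writers = new HashMap<>();

    private OutputStream writerFor(String key) throws IOException {
        OutputStream out = writers.get(key);
        if (out == null) {
            out = new BufferedOutputStream(
                    new FileOutputStream("/tmp/multistore-" + key));
            writers.put(key, out);
        }
        return out;
    }

    public void write(String key, String record) throws IOException {
        writerFor(key).write((record + "\n").getBytes(StandardCharsets.UTF_8));
    }

    public void close() throws IOException {
        for (OutputStream out : writers.values()) {
            out.close();
        }
    }

    public static void main(String[] args) throws IOException {
        PerKeyWriterSketch sink = new PerKeyWriterSketch();
        // 512 buckets x 7k records, as in the report: all 512 streams stay
        // open at once for the whole lifetime of the single map task.
        for (int record = 0; record < 7000; record++) {
            for (int bucket = 0; bucket < 512; bucket++) {
                sink.write(String.valueOf(bucket), bucket + "\u0001" + record);
            }
        }
        sink.close();
    }
}
{code}

If each of those 512 streams were a DFSOutputStream rather than a local file, each would also hold client-side packet buffers and a pipeline connection to a datanode. That could plausibly account both for memory growing with the bucket count rather than the input size, and for the single-node datanode below eventually failing to serve new blocks.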
b) On a single-node Hadoop installation:

The Pig script fails with the following messages in the mappers:

{code}
2010-08-17 16:37:24,597 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.EOFException
2010-08-17 16:37:24,597 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_7687609983190239805_126509
2010-08-17 16:37:30,601 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.EOFException
2010-08-17 16:37:30,601 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_2734778934507357565_126509
2010-08-17 16:37:36,606 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.EOFException
2010-08-17 16:37:36,606 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-1293917224803067377_126509
2010-08-17 16:37:42,611 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.EOFException
2010-08-17 16:37:42,611 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-2272713260404734116_126509
2010-08-17 16:37:48,614 WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception: java.io.IOException: Unable to create new block.
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2781)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2046)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2232)
2010-08-17 16:37:48,614 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_-2272713260404734116_126509 bad datanode[0] nodes == null
2010-08-17 16:37:48,614 WARN org.apache.hadoop.hdfs.DFSClient: Could not get block locations. Source file "/user/viraj/multistore/_temporary/_attempt_201005141440_0178_m_000001_0/444/444-1" - Aborting...
2010-08-17 16:37:48,619 WARN org.apache.hadoop.mapred.TaskTracker: Error running child
java.io.EOFException
        at java.io.DataInputStream.readByte(DataInputStream.java:250)
        at org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:298)
        at org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:319)
        at org.apache.hadoop.io.Text.readString(Text.java:400)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2837)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2762)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2046)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2232)
2010-08-17 16:37:48,622 INFO org.apache.hadoop.mapred.TaskRunner: Runnning cleanup for the task
{code}

Need to investigate more.

Viraj

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.