*I am loading 3 data sources, with schemas like:*
Source 1 (main data source):
id, id1, userId, type

Source 2 (supporting source used for filtering):
partition_number, id, id1

Source 3 (static source containing the set of all allowed types):
type
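
For illustration only, made-up rows matching these schemas might look like:

Source 1: 7, 12, user42, DESKTOP
Source 2: 1, 7, 12
Source 3: DESKTOP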


*I am using the following Pig script to count the unique userIds by type:*
grains = CROSS source2, source3; -- every source2 row paired with every allowed type

users = JOIN
    grains BY (source2::id, source2::id1, source3::type) LEFT OUTER,
    source1 BY (id, id1, type);

usersGrouped = GROUP users
    BY (grains::source2::partition_number,
        grains::source2::id,
        grains::source2::id1,
        grains::source3::type)
    PARTITION BY MyCustomPartitioner PARALLEL 32;

counts = FOREACH usersGrouped {
    userCount = DISTINCT users.(source1::userId);
    GENERATE FLATTEN(group), COUNT(userCount);
}

STORE counts INTO 'output';


*My partitioner is quite simple (it just maps the input partition number, 1 to
32, to an HDFS partition, 0 to 31):*

import static com.google.common.base.Preconditions.checkState; // assuming Guava

import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.pig.impl.io.NullableTuple;
import org.apache.pig.impl.io.PigNullableWritable;

public class MyCustomPartitioner extends Partitioner<PigNullableWritable, NullableTuple> {

    @Override
    public int getPartition(PigNullableWritable partitionWritable, NullableTuple valueWritable, int numPartitions) {
        // The group key serializes as a tuple string such as "(1,...)";
        // the first field is the 1-based input partition number.
        String partition = partitionWritable.getValueAsPigType().toString();
        int inputPartitionNum = Integer.parseInt(partition.substring(1, partition.indexOf(",")));

        // Map the 1-based input partition to the 0-based output partition.
        int hdfsPartitionNum = inputPartitionNum - 1;
        checkState(hdfsPartitionNum >= 0 && hdfsPartitionNum < numPartitions,
                "Invalid partition chosen: " + hdfsPartitionNum);
        return hdfsPartitionNum;
    }
}
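
For completeness, a variant I am considering that avoids the string parsing
altogether (a sketch only, assuming getValueAsPigType() hands back the 4-field
group key as a Pig Tuple; MyTuplePartitioner is just an illustrative name):

import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.pig.backend.executionengine.ExecException;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.io.NullableTuple;
import org.apache.pig.impl.io.PigNullableWritable;

public class MyTuplePartitioner extends Partitioner<PigNullableWritable, NullableTuple> {

    @Override
    public int getPartition(PigNullableWritable partitionWritable, NullableTuple valueWritable, int numPartitions) {
        try {
            // Assumption: the key is the 4-field group tuple and field 0 is
            // partition_number (1 to 32).
            Tuple key = (Tuple) partitionWritable.getValueAsPigType();
            int inputPartitionNum = ((Number) key.get(0)).intValue();
            return inputPartitionNum - 1; // 0-based output partition
        } catch (ExecException e) {
            throw new RuntimeException("Could not read partition number from group key", e);
        }
    }
}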


So, data from input partition 1 should always end up in the part0000 file,
partition 2 data in part0001, and so on. But sometimes partition 1 data lands
in part0005 (or some other seemingly random partition). This does not happen
for all data sets, only for some, and apparently at random.
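
To narrow this down, I plan to log exactly what the framework hands the
partitioner when it misroutes. PartitionDebug below is a hypothetical
debug-only helper that I would call just before the return in getPartition:

import org.apache.pig.impl.io.PigNullableWritable;

// Hypothetical debug-only helper: print the raw key, its multi-query index,
// and the partition count the framework actually passes in.
public final class PartitionDebug {

    private PartitionDebug() {}

    public static void log(PigNullableWritable key, int numPartitions, int chosen) {
        System.err.println("getPartition: key=" + key.getValueAsPigType()
                + " index=" + key.getIndex()
                + " numPartitions=" + numPartitions
                + " -> chosen=" + chosen);
    }
}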

I am using Hadoop 2.3 with Pig 0.13. What could be the issue here?


Thanks,

Shakti
