loss of data type following ORDER BY

Bockhorst, Joseph P Wed, 27 Aug 2014 10:40:53 -0700

Hi all,

Type information apparently gets lost following ORDER BY in certain situations. 
Am I doing something wrong or is this a bug? I am a Pig newbie.


I'm running Apache Pig version 0.12.1.2.1.1.0-385 (rexported) in local mode 
(i.e., pig -x local).

I have a small test file with two fields:

grunt> ls /data
file:/data/.pig_schema<r 1>     242
file:/data/.pig_header<r 1>     11
file:/data/part-m-00000<r 1>    33
file:/data/_SUCCESS<r 1>        0
grunt> cat /data/part-m-00000
foo,3.0
bar,4.0
foo,10.0
hi,12.0

I load it, project a column, and order by that column, and the type gets lost:

grunt> A = LOAD '/data' USING PigStorage(',', '-schema');
grunt> DESCRIBE A;
A: {name: chararray,value: double}
grunt> B = foreach A generate $1 as value:double;
grunt> DESCRIBE B;
B: {value: double}
grunt> C = ORDER B by value;
grunt> DESCRIBE C;
C: {value: double}
grunt> DUMP C;

Pig fails during DUMP C:

java.lang.Exception: java.lang.ClassCastException: org.apache.pig.data.DataByteArray cannot be cast to java.lang.Double
        at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:522)
Caused by: java.lang.ClassCastException: org.apache.pig.data.DataByteArray cannot be cast to java.lang.Double
        at org.apache.pig.backend.hadoop.HDataType.getWritableComparableTypes(HDataType.java:109)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Map.collect(PigGenericMapReduce.java:111)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:284)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:277)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:64)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
        at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:243)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)

The DESCRIBE right before the DUMP reports the type as 'double', but during 
the DUMP the value is actually a DataByteArray (i.e., 'bytearray').

Also,
- DUMP B; following the foreach statement works fine.
- Without the foreach the script works fine:

--
-- This script works
--
A = LOAD '/data' USING PigStorage(',', '-schema');
C = ORDER A by value;
DUMP C;
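
For comparison, here is the failing sequence from above as a script, with one
variant that may help narrow the problem down. The explicit (double) cast is
an assumption about where the type gets lost; it is untested and not a
confirmed workaround. The aliases B2 and C2 are hypothetical names used only
for this illustration.

```pig
--
-- Failing case from the grunt session above (run with: pig -x local)
--
-- A = LOAD '/data' USING PigStorage(',', '-schema');
-- B = FOREACH A GENERATE $1 AS value:double;
-- C = ORDER B BY value;
-- DUMP C;   -- ClassCastException: DataByteArray cannot be cast to Double

--
-- Untested variant (assumption): cast the projected column explicitly
-- instead of only declaring its type in the AS clause.
--
A  = LOAD '/data' USING PigStorage(',', '-schema');
B2 = FOREACH A GENERATE (double)$1 AS value;
C2 = ORDER B2 BY value;
DESCRIBE C2;
DUMP C2;
```

If the variant runs cleanly, that would suggest the declared type in the AS
clause is not being applied to the data at run time after the projection.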

Thanks!

Joe

