[ 
https://issues.apache.org/jira/browse/PIG-2613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13238198#comment-13238198
 ] 

Daniel Dai commented on PIG-2613:
---------------------------------

Can you attach your input?
                
> Pig substitutes/mangles "upper ASCII" characters (values > 127)
> ---------------------------------------------------------------
>
>                 Key: PIG-2613
>                 URL: https://issues.apache.org/jira/browse/PIG-2613
>             Project: Pig
>          Issue Type: Bug
>          Components: data, parser
>    Affects Versions: 0.8.1
>         Environment: linux
>            Reporter: Leo Heska
>
> Create a small dummy input file that contains ASCII 254 (decimal) characters. 
> These are often rendered as the thorn character. A sample line looks like 
> this:
>    1þ4þaaaþbbbþcccþdddþ7þ8þ9
> but your browser may not render that correctly. Hex representation of that 
> sample line:
>    31FE34FE616161FE626262FE636363FE646464FE37FE38FE390D0A
> or, with spaces added for your convenience in reading:
>    31 FE 34 FE 61 61 61 FE 62 62 62 FE 63 63 63 FE 64 64 64 FE 37 FE 38 FE 39 0D 0A
> You can see that this is just a sample line of plain ASCII numerals and 
> lower-case letters, separated by the FE (hex) or 254 (decimal) code point.
> Now load, like this:
>    dummyts = load '/test/DummyDataTS.txt' using PigStorage(',') as (line:chararray);
> A dump 
>    dump dummyts;
>    
> shows this:
>    (1�4�aaa�bbb�ccc�ddd�7�8�9)
> The problem does not seem to be with the dump. I have written a UDF that 
> counts characters in the line and returns TRUE if the character count is 
> correct. When I do this:
>  
>    fd = filter dummyts by CountRight(line, 254, 8);
> which says "validate that there are 8 instances of the ASCII 254 code 
> point/character", I get no results. When I do this:
>    fd1 = filter dummyts by CountRight(line, 97, 3);
> which says "validate that there are three instances of the 'a' (ASCII 97) 
> character", the results are perfect.
> It looks like something in Pig's load is changing instances of ASCII 254 to 
> the following three characters:
>    �
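
The symptoms in the description are consistent with the loader decoding the
input bytes as UTF-8, although the report itself does not confirm this. Under
that assumption, the sketch below shows that the lone byte FE is not valid
UTF-8, so each one decodes to the replacement character U+FFFD, which takes
three bytes (EF BF BD) when written back out as UTF-8. The function
count_code_point is a hypothetical stand-in for the reporter's CountRight UDF,
whose source is not shown.

```python
# Sample line from the report, rebuilt from the hex dump in the description.
RAW = bytes.fromhex("31FE34FE616161FE626262FE636363FE646464FE37FE38FE390D0A")


def count_code_point(line: str, code_point: int) -> int:
    """Hypothetical stand-in for CountRight: count occurrences of one code point."""
    return sum(1 for ch in line if ord(ch) == code_point)


# Assumption: the loader decodes bytes as UTF-8 and substitutes U+FFFD
# for any byte sequence that is not valid UTF-8.
as_utf8 = RAW.decode("utf-8", errors="replace")

# For comparison: ISO-8859-1 maps every byte to the code point of the same value.
as_latin1 = RAW.decode("iso-8859-1")

print(count_code_point(as_utf8, 254))      # code point 254 after UTF-8 decoding
print(count_code_point(as_utf8, 0xFFFD))   # replacement characters after UTF-8 decoding
print(count_code_point(as_latin1, 254))    # code point 254 after ISO-8859-1 decoding
print(count_code_point(as_utf8, 97))       # 'a' survives either decoding
print("\ufffd".encode("utf-8").hex())      # bytes of U+FFFD when re-encoded as UTF-8
```

If this is what happens in Pig, a filter on code point 254 would match nothing
while a filter on 'a' would still work, which matches the two CountRight
results reported above.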

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

