[ https://issues.apache.org/jira/browse/PIG-2613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Leo Heska updated PIG-2613: --------------------------- Summary: Pig substitutes/mangles "upper ASCII" characters (values > 127) (was: Pig substitutes/mangels "upper ASCII" characters (values > 127)) > Pig substitutes/mangles "upper ASCII" characters (values > 127) > --------------------------------------------------------------- > > Key: PIG-2613 > URL: https://issues.apache.org/jira/browse/PIG-2613 > Project: Pig > Issue Type: Bug > Components: data, parser > Affects Versions: 0.8.1 > Environment: linux > Reporter: Leo Heska > > Create small/dummy input file that contains ASCII 254 (decimal) characters. > These are often represented as the Thorn character. A sample line looks like > this: > 1þ4þaaaþbbbþcccþdddþ7þ8þ9 > but your browser may not render that correctly. Hex representation of that > sample line: > 31FE34FE616161FE626262FE636363FE646464FE37FE38FE390D0A > or, with spaces added for your convenience in reading: > 31 FE 34 FE 61 61 61 FE 62 62 62 FE 63 63 63 FE 64 64 64 FE 37 FE 38 FE 39 > 0D 0A > You can see that this is just a sample line of plain ASCII numerals and > lower-case letters, separated by the FE (hex) or 254 (decimal) code point. > Now load, like this: > dummyts = load '/test/DummyDataTS.txt' using PigStorage(',') as > (line:chararray); > A dump > dump dummyts; > > shows this: > (1�4�aaa�bbb�ccc�ddd�7�8�9) > The problem does not seem to be with the dump. I have written a UDF that > counts characters in the line and returns TRUE if the character count is > correct. When I do this: > > fd = filter dummyts by CountRight(line, 254, 8); > which is saying "validate that there are 8 instances of the ASCII 254 code > point/character" I get no results. When I do this: > fd1 = filter dummyts by CountRight(line, 97, 3); > which says "validate that there are three instances of the 'a' (ASCII 97) > character the results are perfect. > It looks like something in Pig's load is changing instances of ASCII 254 to > the following three characters: > � -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira