Re: sc.textFile problem due to newlines within a CSV record
Thanks Xiangrui. This file already exists without escapes. I could probably try to preprocess it and add the escaping.

On Fri, Sep 12, 2014 at 9:38 PM, Xiangrui Meng <men...@gmail.com> wrote:
> I wrote an input format for Redshift tables unloaded with the UNLOAD command's ESCAPE option: https://github.com/mengxr/redshift-input-format , which can recognize multi-line records. Redshift puts a backslash before any in-record `\\`, `\r`, `\n`, and the delimiter character. You can apply the same escaping before calling saveAsTextFile, then use the input format to load the records back.
>
> Xiangrui
>
> On Fri, Sep 12, 2014 at 7:43 PM, Mohit Jaggi <mohitja...@gmail.com> wrote:
>> Folks, I think this might be due to the default TextInputFormat in Hadoop. Any pointers to solutions would be much appreciated.
>>
>> More powerfully, you can define your own InputFormat implementations to format the input to your programs however you want. For example, the default TextInputFormat reads lines of text files. The key it emits for each record is the byte offset of the line read (as a LongWritable), and the value is the contents of the line up to the terminating '\n' character (as a Text object). If you have multi-line records, each separated by a '$' character, you could write your own InputFormat that parses files into records split on this character instead.
>>
>> Thanks,
>> Mohit
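The Redshift-style escaping Xiangrui describes can be sketched in plain Java. The helper names below (`escapeRecord`/`unescapeRecord`) are hypothetical, not part of the linked redshift-input-format project; the sketch only illustrates the scheme of putting a backslash before `\\`, `\r`, `\n`, and the delimiter before writing records out:

```java
// Sketch of the Redshift-style escaping described above: prefix a backslash
// to any backslash, carriage return, newline, or delimiter character found
// inside a record. escapeRecord/unescapeRecord are hypothetical helper names.
public final class RecordEscaper {

    // Escape one record so an escape-aware reader can tell in-record
    // newlines apart from record-terminating newlines.
    public static String escapeRecord(String s, char delimiter) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '\\' || c == '\r' || c == '\n' || c == delimiter) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    // Reverse the escaping when loading the records back.
    public static String unescapeRecord(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '\\' && i + 1 < s.length()) {
                i++;                 // drop the escape, keep the next char
                c = s.charAt(i);
            }
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String field = "line one\nline two, with a comma";
        String escaped = escapeRecord(field, ',');
        System.out.println(unescapeRecord(escaped).equals(field)); // prints true
    }
}
```

Note the escaped output still contains the literal newline character (now preceded by a backslash), which is why a matching escape-aware input format is needed on the read side.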
Re: sc.textFile problem due to newlines within a CSV record
I wrote an input format for Redshift tables unloaded with the UNLOAD command's ESCAPE option: https://github.com/mengxr/redshift-input-format , which can recognize multi-line records. Redshift puts a backslash before any in-record `\\`, `\r`, `\n`, and the delimiter character. You can apply the same escaping before calling saveAsTextFile, then use the input format to load the records back.

Xiangrui

On Fri, Sep 12, 2014 at 7:43 PM, Mohit Jaggi <mohitja...@gmail.com> wrote:
> Folks, I think this might be due to the default TextInputFormat in Hadoop. Any pointers to solutions would be much appreciated.
>
> More powerfully, you can define your own InputFormat implementations to format the input to your programs however you want. For example, the default TextInputFormat reads lines of text files. The key it emits for each record is the byte offset of the line read (as a LongWritable), and the value is the contents of the line up to the terminating '\n' character (as a Text object). If you have multi-line records, each separated by a '$' character, you could write your own InputFormat that parses files into records split on this character instead.
>
> Thanks,
> Mohit
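The record-splitting idea behind such a custom InputFormat can be sketched as a minimal single-process reader that treats '$' instead of '\n' as the record separator. This is only an illustration: a real Hadoop RecordReader must also handle records that straddle HDFS split boundaries, which this sketch ignores.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of a '$'-delimited record reader: accumulate characters
// until the separator, so embedded newlines stay inside the record.
public final class DollarRecordReader {

    public static List<String> readRecords(Reader in) throws IOException {
        List<String> records = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int c;
        while ((c = in.read()) != -1) {
            if (c == '$') {               // record separator
                records.add(current.toString());
                current.setLength(0);
            } else {
                current.append((char) c); // '\n' is just another in-record char
            }
        }
        if (current.length() > 0) {       // trailing record without separator
            records.add(current.toString());
        }
        return records;
    }

    public static void main(String[] args) throws IOException {
        String data = "id,desc\nrow1$id,desc\nrow2$";
        System.out.println(readRecords(new StringReader(data)).size()); // prints 2
    }
}
```

If your Hadoop version supports it, setting the `textinputformat.record.delimiter` configuration property may give you a custom record separator without writing a full InputFormat; check the docs for your Hadoop release before relying on it.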