I wrote an input format for Redshift tables unloaded via the UNLOAD command with the ESCAPE option: https://github.com/mengxr/redshift-input-format , which can recognize multi-line records.
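For illustration, here is a minimal Python sketch of the matching escape step described below (the helper name and the `|` delimiter are assumptions for the example, not part of the linked input format):

```python
def redshift_escape(record, delimiter="|"):
    """Backslash-escape a record the way Redshift's UNLOAD ... ESCAPE does:
    prefix backslash, carriage return, newline, and the delimiter with '\\'.
    Newlines stay as literal newline characters, just preceded by a
    backslash, so records may span multiple physical lines."""
    # Escape the backslash first so later escapes are not doubled up.
    out = record.replace("\\", "\\\\")
    out = out.replace("\r", "\\\r")
    out = out.replace("\n", "\\\n")
    out = out.replace(delimiter, "\\" + delimiter)
    return out
```

With an RDD of strings this could then be applied as `rdd.map(redshift_escape).saveAsTextFile(path)` before loading the files back through the input format.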
Redshift puts a backslash before any in-record `\\`, `\r`, `\n`, and the delimiter character. You can apply the same escaping before calling saveAsTextFile, then use the input format to load the records back.

Xiangrui

On Fri, Sep 12, 2014 at 7:43 PM, Mohit Jaggi <mohitja...@gmail.com> wrote:
> Folks,
> I think this might be due to the default TextInputFormat in Hadoop. Any
> pointers to solutions much appreciated.
>>>
> More powerfully, you can define your own InputFormat implementations to
> format the input to your programs however you want. For example, the default
> TextInputFormat reads lines of text files. The key it emits for each record
> is the byte offset of the line read (as a LongWritable), and the value is
> the contents of the line up to the terminating '\n' character (as a Text
> object). If you have multi-line records each separated by a '$' character, you
> could write your own InputFormat that parses files into records split on
> this character instead.
>>>
>
> Thanks,
> Mohit

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
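As a footnote to the custom-InputFormat suggestion quoted above: the record-reader behavior it describes can be sketched in plain Python (a toy model only, not the Hadoop API; the '$' separator is taken from the quoted example, and the integer keys mimic TextInputFormat's byte-offset LongWritable keys):

```python
def split_records(text, delimiter="$"):
    """Toy model of a record reader that splits on an arbitrary
    single-character delimiter instead of '\n', emitting
    (byte offset, record) pairs like TextInputFormat does."""
    records = []
    offset = 0
    for chunk in text.split(delimiter):
        records.append((offset, chunk))
        # Advance past the record and the delimiter that followed it.
        offset += len(chunk) + len(delimiter)
    return records
```

A real implementation would subclass Hadoop's FileInputFormat and handle records that straddle split boundaries, which this sketch ignores.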