Re: sc.textFile problem due to newlines within a CSV record

2014-09-13 Thread Mohit Jaggi
Thanks Xiangrui. This file already exists without escapes. I could probably
preprocess it to add the escaping.
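
Something like this minimal Scala sketch is what I have in mind (hypothetical:
`rows` as an RDD[Seq[String]] of already-parsed records and '|' as the field
delimiter are placeholders, not the real data):

val delimiter = '|' // assumed field delimiter
// Backslash-escape in-record backslashes, CR, LF, and the delimiter,
// mirroring the convention Redshift's UNLOAD ... ESCAPE uses.
def escapeField(s: String): String = s.flatMap {
  case c @ ('\\' | '\r' | '\n') => "\\" + c
  case c if c == delimiter      => "\\" + c
  case c                        => c.toString
}

def escapeRecord(fields: Seq[String]): String =
  fields.map(escapeField).mkString(delimiter.toString)

// rows: RDD[Seq[String]] (however the records get parsed upstream)
// rows.map(escapeRecord).saveAsTextFile("hdfs:///path/to/escaped") // placeholder path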

On Fri, Sep 12, 2014 at 9:38 PM, Xiangrui Meng men...@gmail.com wrote:

 I wrote an input format for Redshift tables unloaded via UNLOAD with the
 ESCAPE option: https://github.com/mengxr/redshift-input-format , which
 can recognize multi-line records.

 Redshift puts a backslash before any in-record `\\`, `\r`, `\n`, and
 the delimiter character. You can apply the same escaping before
 calling saveAsTextFile, then use the input format to load them back.
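
 Conceptually, the read side just splits records at unescaped newlines. A
 toy sketch of that logic in Scala (not the actual code in the repo):

 def splitRecords(text: String): Seq[String] = {
   val records = scala.collection.mutable.ArrayBuffer(new StringBuilder)
   var i = 0
   while (i < text.length) {
     text(i) match {
       case '\\' if i + 1 < text.length =>
         records.last.append(text(i + 1)) // unescape: keep the char after the backslash
         i += 2
       case '\n' =>
         records += new StringBuilder // an unescaped newline is a record boundary
         i += 1
       case c =>
         records.last.append(c)
         i += 1
     }
   }
   records.map(_.toString).toSeq
 }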

 Xiangrui

 On Fri, Sep 12, 2014 at 7:43 PM, Mohit Jaggi mohitja...@gmail.com wrote:
  Folks,
  I think this might be due to the default TextInputFormat in Hadoop. Any
  pointers to solutions much appreciated.
 
  More powerfully, you can define your own InputFormat implementations to
  format the input to your programs however you want. For example, the
  default TextInputFormat reads lines of text files. The key it emits for
  each record is the byte offset of the line read (as a LongWritable), and
  the value is the contents of the line up to the terminating '\n'
  character (as a Text object). If you have multi-line records each
  separated by a '$' character, you could write your own InputFormat that
  parses files into records split on this character instead.
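
  If it comes to that, Hadoop's new-API TextInputFormat looks like it can
  already split on a custom delimiter via textinputformat.record.delimiter,
  so a custom InputFormat may not even be needed for the '$' case. A sketch
  of what I'd try (the path is a placeholder):

  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.io.{LongWritable, Text}
  import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

  val conf = new Configuration(sc.hadoopConfiguration)
  conf.set("textinputformat.record.delimiter", "$") // records end at '$' instead of '\n'

  val records = sc.newAPIHadoopFile(
    "hdfs:///path/to/input", // placeholder path
    classOf[TextInputFormat],
    classOf[LongWritable],
    classOf[Text],
    conf
  ).map { case (_, v) => v.toString } // copy out of the reused Text object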
 
 
  Thanks,
  Mohit


