Re: Spark corrupts text lines

2016-06-14 Thread Kristoffer Sjögren

I managed to get remote debugging up and running and can in fact
reproduce the error and get a breakpoint triggered as it happens.

But it seems like the code does not go through TextInputFormat, or at
least the breakpoint is not triggered from this class. I don't know which
other class to look at for where the actual split could occur.

Any pointers?



On Tue, Jun 14, 2016 at 4:03 PM, Kristoffer Sjögren  wrote:
> I'm pretty confident the lines are encoded correctly since I can read
> them both locally and on Spark (by ignoring the faulty line and
> proceed to next). I also get the correct number of lines through
> Spark, again by ignoring the faulty line.
>
> I get the same error by reading the original file using Spark, save as
> new text file, then try decoding again.
>
> context.textFile("/orgfile").saveAsTextFile("/newfile");
>
> Ok, not much left than to do some remote debugging.
>
>
> On Tue, Jun 14, 2016 at 3:38 PM, Kristoffer Sjögren  wrote:
>> Thanks for you help. Really appreciate it!
>>
>> Give me some time i'll come back after I've tried your suggestions.
>>
>> On Tue, Jun 14, 2016 at 3:28 PM, Kristoffer Sjögren  wrote:
>>> I cannot reproduce it by running the file through Spark in local mode
>>> on my machine. So it does indeed seems to be something related to
>>> split across partitions.
>>>
>>> On Tue, Jun 14, 2016 at 3:04 PM, Kristoffer Sjögren  
>>> wrote:
 Can you do remote debugging in Spark? Didn't know that. Do you have a link?

 Also noticed isSplittable in
 org.apache.hadoop.mapreduce.lib.input.TextInputFormat which checks for
 org.apache.hadoop.io.compress.SplittableCompressionCodec. Maybe there
 are some way to tell it not to split?

 On Tue, Jun 14, 2016 at 2:42 PM, Sean Owen  wrote:
> It really sounds like the line is being split across partitions. This
> is what TextInputFormat does but should be perfectly capable of
> putting together lines that break across files (partitions). If you're
> into debugging, that's where I would start if you can. Breakpoints
> around how TextInputFormat is parsing lines. See if you can catch it
> when it returns a line that doesn't contain what you expect.
>
> On Tue, Jun 14, 2016 at 1:38 PM, Kristoffer Sjögren  
> wrote:
>> That's funny. The line after is the rest of the whole line that got
>> split in half. Every following lines after that are fine.
>>
>> I managed to reproduce without gzip also so maybe it's no gzip's fault
>> after all..
>>
>> I'm clueless...
>>
>> On Tue, Jun 14, 2016 at 12:53 PM, Kristoffer Sjögren  
>> wrote:
>>> Seems like it's the gzip. It works if download the file, gunzip and
>>> put it back to another directory and read it the same way.
>>>
>>> Hm.. I wonder what happens with the lines after it..
>>>
>>>
>>>
>>> On Tue, Jun 14, 2016 at 11:52 AM, Sean Owen  wrote:
 What if you read it uncompressed from HDFS?
 gzip compression is unfriendly to MR in that it can't split the file.
 It still should just work, certainly if the line is in one file. But,
 a data point worth having.

 On Tue, Jun 14, 2016 at 10:49 AM, Kristoffer Sjögren 
  wrote:
> The line is in one file. I did download the file manually from HDFS,
> read and decoded it line-by-line successfully without Spark.
>
>
>
> On Tue, Jun 14, 2016 at 11:44 AM, Sean Owen  
> wrote:
>> The only thing I can think of is that a line is being broken across 
>> two files?
>> Hadoop easily puts things back together in this case, or should. 
>> There
>> could be some weird factor preventing that. One first place to look:
>> are you using a weird line separator? or at least different from the
>> host OS?
>>
>> On Tue, Jun 14, 2016 at 10:41 AM, Kristoffer Sjögren 
>>  wrote:
>>> I should mention that we're in the end want to store the input from
>>> Protobuf binary to Parquet using the following code. But this comes
>>> after the lines has been decoded from base64 into binary.
>>>
>>>
>>> public static  void save(JavaRDD rdd, Class
>>> clazz, String path) {
>>>   try {
>>> Job job = Job.getInstance();
>>> ParquetOutputFormat.setWriteSupportClass(job, 
>>> ProtoWriteSupport.class);
>>> ProtoParquetOutputFormat.setProtobufClass(job, clazz);
>>> rdd.mapToPair(order -> new Tuple2<>(null, order))
>>>   .saveAsNewAPIHadoopFile(path, Void.class, clazz,
>>> ParquetOutputFormat.class, job.getConfiguration());
>>>   } catch (IOException e) {
>>> throw new RuntimeException(e);
>>>   }
>>> }
>>>
>>>
>>>
>>> 
>>>   org.apache.parquet
>>>   parquet-protob

Re: Spark corrupts text lines

2016-06-14 Thread Kristoffer Sjögren

I'm pretty confident the lines are encoded correctly since I can read
them both locally and on Spark (by ignoring the faulty line and
proceeding to the next). I also get the correct number of lines through
Spark, again by ignoring the faulty line.

I get the same error when reading the original file using Spark, saving
it as a new text file, and then trying to decode again.

context.textFile("/orgfile").saveAsTextFile("/newfile");

OK, not much left to do other than some remote debugging.


On Tue, Jun 14, 2016 at 3:38 PM, Kristoffer Sjögren  wrote:
> Thanks for you help. Really appreciate it!
>
> Give me some time i'll come back after I've tried your suggestions.
>
> On Tue, Jun 14, 2016 at 3:28 PM, Kristoffer Sjögren  wrote:
>> I cannot reproduce it by running the file through Spark in local mode
>> on my machine. So it does indeed seems to be something related to
>> split across partitions.
>>
>> On Tue, Jun 14, 2016 at 3:04 PM, Kristoffer Sjögren  wrote:
>>> Can you do remote debugging in Spark? Didn't know that. Do you have a link?
>>>
>>> Also noticed isSplittable in
>>> org.apache.hadoop.mapreduce.lib.input.TextInputFormat which checks for
>>> org.apache.hadoop.io.compress.SplittableCompressionCodec. Maybe there
>>> are some way to tell it not to split?
>>>
>>> On Tue, Jun 14, 2016 at 2:42 PM, Sean Owen  wrote:
 It really sounds like the line is being split across partitions. This
 is what TextInputFormat does but should be perfectly capable of
 putting together lines that break across files (partitions). If you're
 into debugging, that's where I would start if you can. Breakpoints
 around how TextInputFormat is parsing lines. See if you can catch it
 when it returns a line that doesn't contain what you expect.

 On Tue, Jun 14, 2016 at 1:38 PM, Kristoffer Sjögren  
 wrote:
> That's funny. The line after is the rest of the whole line that got
> split in half. Every following lines after that are fine.
>
> I managed to reproduce without gzip also so maybe it's no gzip's fault
> after all..
>
> I'm clueless...
>
> On Tue, Jun 14, 2016 at 12:53 PM, Kristoffer Sjögren  
> wrote:
>> Seems like it's the gzip. It works if download the file, gunzip and
>> put it back to another directory and read it the same way.
>>
>> Hm.. I wonder what happens with the lines after it..
>>
>>
>>
>> On Tue, Jun 14, 2016 at 11:52 AM, Sean Owen  wrote:
>>> What if you read it uncompressed from HDFS?
>>> gzip compression is unfriendly to MR in that it can't split the file.
>>> It still should just work, certainly if the line is in one file. But,
>>> a data point worth having.
>>>
>>> On Tue, Jun 14, 2016 at 10:49 AM, Kristoffer Sjögren  
>>> wrote:
 The line is in one file. I did download the file manually from HDFS,
 read and decoded it line-by-line successfully without Spark.



 On Tue, Jun 14, 2016 at 11:44 AM, Sean Owen  wrote:
> The only thing I can think of is that a line is being broken across 
> two files?
> Hadoop easily puts things back together in this case, or should. There
> could be some weird factor preventing that. One first place to look:
> are you using a weird line separator? or at least different from the
> host OS?
>
> On Tue, Jun 14, 2016 at 10:41 AM, Kristoffer Sjögren 
>  wrote:
>> I should mention that we're in the end want to store the input from
>> Protobuf binary to Parquet using the following code. But this comes
>> after the lines has been decoded from base64 into binary.
>>
>>
>> public static  void save(JavaRDD rdd, Class
>> clazz, String path) {
>>   try {
>> Job job = Job.getInstance();
>> ParquetOutputFormat.setWriteSupportClass(job, 
>> ProtoWriteSupport.class);
>> ProtoParquetOutputFormat.setProtobufClass(job, clazz);
>> rdd.mapToPair(order -> new Tuple2<>(null, order))
>>   .saveAsNewAPIHadoopFile(path, Void.class, clazz,
>> ParquetOutputFormat.class, job.getConfiguration());
>>   } catch (IOException e) {
>> throw new RuntimeException(e);
>>   }
>> }
>>
>>
>>
>> <dependency>
>>   <groupId>org.apache.parquet</groupId>
>>   <artifactId>parquet-protobuf</artifactId>
>>   <version>1.8.1</version>
>> </dependency>
>>
>> On Tue, Jun 14, 2016 at 11:37 AM, Kristoffer Sjögren 
>>  wrote:
>>> I'm trying to figure out exactly what information could be useful 
>>> but
>>> it's all as straight forward.
>>>
>>> - It's text files
>>> - Lines ends with a new line character.
>>> - Files are gzipped before added to HDFS
>>> - Files are read as gzipped files from HDFS by Spark
>>> - There are some extra configuration
>>>
>>>

Re: Spark corrupts text lines

2016-06-14 Thread Kristoffer Sjögren

Thanks for your help. Really appreciate it!

Give me some time; I'll come back after I've tried your suggestions.

On Tue, Jun 14, 2016 at 3:28 PM, Kristoffer Sjögren  wrote:
> I cannot reproduce it by running the file through Spark in local mode
> on my machine. So it does indeed seems to be something related to
> split across partitions.
>
> On Tue, Jun 14, 2016 at 3:04 PM, Kristoffer Sjögren  wrote:
>> Can you do remote debugging in Spark? Didn't know that. Do you have a link?
>>
>> Also noticed isSplittable in
>> org.apache.hadoop.mapreduce.lib.input.TextInputFormat which checks for
>> org.apache.hadoop.io.compress.SplittableCompressionCodec. Maybe there
>> are some way to tell it not to split?
>>
>> On Tue, Jun 14, 2016 at 2:42 PM, Sean Owen  wrote:
>>> It really sounds like the line is being split across partitions. This
>>> is what TextInputFormat does but should be perfectly capable of
>>> putting together lines that break across files (partitions). If you're
>>> into debugging, that's where I would start if you can. Breakpoints
>>> around how TextInputFormat is parsing lines. See if you can catch it
>>> when it returns a line that doesn't contain what you expect.
>>>
>>> On Tue, Jun 14, 2016 at 1:38 PM, Kristoffer Sjögren  
>>> wrote:
 That's funny. The line after is the rest of the whole line that got
 split in half. Every following lines after that are fine.

 I managed to reproduce without gzip also so maybe it's no gzip's fault
 after all..

 I'm clueless...

 On Tue, Jun 14, 2016 at 12:53 PM, Kristoffer Sjögren  
 wrote:
> Seems like it's the gzip. It works if download the file, gunzip and
> put it back to another directory and read it the same way.
>
> Hm.. I wonder what happens with the lines after it..
>
>
>
> On Tue, Jun 14, 2016 at 11:52 AM, Sean Owen  wrote:
>> What if you read it uncompressed from HDFS?
>> gzip compression is unfriendly to MR in that it can't split the file.
>> It still should just work, certainly if the line is in one file. But,
>> a data point worth having.
>>
>> On Tue, Jun 14, 2016 at 10:49 AM, Kristoffer Sjögren  
>> wrote:
>>> The line is in one file. I did download the file manually from HDFS,
>>> read and decoded it line-by-line successfully without Spark.
>>>
>>>
>>>
>>> On Tue, Jun 14, 2016 at 11:44 AM, Sean Owen  wrote:
 The only thing I can think of is that a line is being broken across 
 two files?
 Hadoop easily puts things back together in this case, or should. There
 could be some weird factor preventing that. One first place to look:
 are you using a weird line separator? or at least different from the
 host OS?

 On Tue, Jun 14, 2016 at 10:41 AM, Kristoffer Sjögren 
  wrote:
> I should mention that we're in the end want to store the input from
> Protobuf binary to Parquet using the following code. But this comes
> after the lines has been decoded from base64 into binary.
>
>
> public static  void save(JavaRDD rdd, Class
> clazz, String path) {
>   try {
> Job job = Job.getInstance();
> ParquetOutputFormat.setWriteSupportClass(job, 
> ProtoWriteSupport.class);
> ProtoParquetOutputFormat.setProtobufClass(job, clazz);
> rdd.mapToPair(order -> new Tuple2<>(null, order))
>   .saveAsNewAPIHadoopFile(path, Void.class, clazz,
> ParquetOutputFormat.class, job.getConfiguration());
>   } catch (IOException e) {
> throw new RuntimeException(e);
>   }
> }
>
>
>
> <dependency>
>   <groupId>org.apache.parquet</groupId>
>   <artifactId>parquet-protobuf</artifactId>
>   <version>1.8.1</version>
> </dependency>
>
> On Tue, Jun 14, 2016 at 11:37 AM, Kristoffer Sjögren 
>  wrote:
>> I'm trying to figure out exactly what information could be useful but
>> it's all as straight forward.
>>
>> - It's text files
>> - Lines ends with a new line character.
>> - Files are gzipped before added to HDFS
>> - Files are read as gzipped files from HDFS by Spark
>> - There are some extra configuration
>>
>> conf.set("spark.files.overwrite", "true");
>> conf.set("spark.hadoop.validateOutputSpecs", "false");
>>
>> Here's the code using Java 8 Base64 class.
>>
>> context.textFile("/log.gz")
>> .map(line -> line.split("&timestamp="))
>> .map(split -> Base64.getDecoder().decode(split[0]));
>>
>>
>> On Tue, Jun 14, 2016 at 11:26 AM, Sean Owen  
>> wrote:
>>> It's really the MR InputSplit code that splits files into records.
>>> Nothing particularly interesting happens in that process, except for
>>> breaking on newlines.
>>>
>>>

Re: Spark corrupts text lines

2016-06-14 Thread Sean Owen

It takes a little setup, but you can do remote debugging:
http://danosipov.com/?p=779  ... and then use similar config to
connect your IDE to a running executor.

Before that you might strip your program down to only a call to
textFile that then checks the lines according to whatever logic would
decide whether it is valid.

gzip isn't splittable, so you should already have one partition per
file instead of potentially several per file. If the line is entirely
in one file then, hm, it really shouldn't be that issue.

Are you sure lines before and after are parsed correctly? I'm wondering if
somehow you are parsing a huge amount of text as a line before it and
this is just where it happens to finally hit some buffer limit. Any
weird Hadoop settings like a small block size?

I suspect there is something more basic going on here. For example, are
you sure that the line you get in your program is truly not a line in the
input? You have another line here that has it as a prefix, but is that
really the same line of input?
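
The stripped-down check suggested above could look roughly like the
following minimal sketch. It assumes each line is a base64 payload
followed by a `&timestamp=` separator, as in the decoding snippet quoted
in this thread; `LineCheck` and `isValid` are hypothetical names, not
part of Spark or Hadoop.

```java
import java.util.Base64;

final class LineCheck {
    // Assumed separator between the base64 payload and the rest of the line.
    static final String SEPARATOR = "&timestamp=";

    /** Returns true if the text before the separator is decodable base64. */
    static boolean isValid(String line) {
        String payload = line.split(SEPARATOR)[0];
        try {
            Base64.getDecoder().decode(payload);
            return true;
        } catch (IllegalArgumentException e) {
            // Thrown by the JDK decoder for illegal characters or a bad final unit.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid("aGVsbG8=&timestamp=123"));
        System.out.println(isValid("aGVs*G8=&timestamp=123"));
    }
}
```

In a job this could be applied as
context.textFile("/log.gz").filter(line -> !LineCheck.isValid(line))
to keep only the offending lines. Note that the JDK decoder does not
require padding, so a line cut at an unlucky offset can still decode
without error; comparing line counts and lengths is a useful extra check.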

On Tue, Jun 14, 2016 at 2:04 PM, Kristoffer Sjögren  wrote:
> Can you do remote debugging in Spark? Didn't know that. Do you have a link?
>
> Also noticed isSplittable in
> org.apache.hadoop.mapreduce.lib.input.TextInputFormat which checks for
> org.apache.hadoop.io.compress.SplittableCompressionCodec. Maybe there
> are some way to tell it not to split?
>
> On Tue, Jun 14, 2016 at 2:42 PM, Sean Owen  wrote:
>> It really sounds like the line is being split across partitions. This
>> is what TextInputFormat does but should be perfectly capable of
>> putting together lines that break across files (partitions). If you're
>> into debugging, that's where I would start if you can. Breakpoints
>> around how TextInputFormat is parsing lines. See if you can catch it
>> when it returns a line that doesn't contain what you expect.
>>
>> On Tue, Jun 14, 2016 at 1:38 PM, Kristoffer Sjögren  wrote:
>>> That's funny. The line after is the rest of the whole line that got
>>> split in half. Every following lines after that are fine.
>>>
>>> I managed to reproduce without gzip also so maybe it's no gzip's fault
>>> after all..
>>>
>>> I'm clueless...
>>>
>>> On Tue, Jun 14, 2016 at 12:53 PM, Kristoffer Sjögren  
>>> wrote:
 Seems like it's the gzip. It works if download the file, gunzip and
 put it back to another directory and read it the same way.

 Hm.. I wonder what happens with the lines after it..



 On Tue, Jun 14, 2016 at 11:52 AM, Sean Owen  wrote:
> What if you read it uncompressed from HDFS?
> gzip compression is unfriendly to MR in that it can't split the file.
> It still should just work, certainly if the line is in one file. But,
> a data point worth having.
>
> On Tue, Jun 14, 2016 at 10:49 AM, Kristoffer Sjögren  
> wrote:
>> The line is in one file. I did download the file manually from HDFS,
>> read and decoded it line-by-line successfully without Spark.
>>
>>
>>
>> On Tue, Jun 14, 2016 at 11:44 AM, Sean Owen  wrote:
>>> The only thing I can think of is that a line is being broken across two 
>>> files?
>>> Hadoop easily puts things back together in this case, or should. There
>>> could be some weird factor preventing that. One first place to look:
>>> are you using a weird line separator? or at least different from the
>>> host OS?
>>>
>>> On Tue, Jun 14, 2016 at 10:41 AM, Kristoffer Sjögren  
>>> wrote:
 I should mention that we're in the end want to store the input from
 Protobuf binary to Parquet using the following code. But this comes
 after the lines has been decoded from base64 into binary.


 public static  void save(JavaRDD rdd, Class
 clazz, String path) {
   try {
 Job job = Job.getInstance();
 ParquetOutputFormat.setWriteSupportClass(job, 
 ProtoWriteSupport.class);
 ProtoParquetOutputFormat.setProtobufClass(job, clazz);
 rdd.mapToPair(order -> new Tuple2<>(null, order))
   .saveAsNewAPIHadoopFile(path, Void.class, clazz,
 ParquetOutputFormat.class, job.getConfiguration());
   } catch (IOException e) {
 throw new RuntimeException(e);
   }
 }



 <dependency>
   <groupId>org.apache.parquet</groupId>
   <artifactId>parquet-protobuf</artifactId>
   <version>1.8.1</version>
 </dependency>

 On Tue, Jun 14, 2016 at 11:37 AM, Kristoffer Sjögren 
  wrote:
> I'm trying to figure out exactly what information could be useful but
> it's all as straight forward.
>
> - It's text files
> - Lines ends with a new line character.
> - Files are gzipped before added to HDFS
> - Files are read as gzipped files from HDFS by Spark
> - There are some extra configuration
>
> conf.set("spark.files.overwrite", "true");
> conf.set("spark.ha

Re: Spark corrupts text lines

2016-06-14 Thread Kristoffer Sjögren

I cannot reproduce it by running the file through Spark in local mode
on my machine. So it does indeed seem to be something related to
splitting across partitions.

On Tue, Jun 14, 2016 at 3:04 PM, Kristoffer Sjögren  wrote:
> Can you do remote debugging in Spark? Didn't know that. Do you have a link?
>
> Also noticed isSplittable in
> org.apache.hadoop.mapreduce.lib.input.TextInputFormat which checks for
> org.apache.hadoop.io.compress.SplittableCompressionCodec. Maybe there
> are some way to tell it not to split?
>
> On Tue, Jun 14, 2016 at 2:42 PM, Sean Owen  wrote:
>> It really sounds like the line is being split across partitions. This
>> is what TextInputFormat does but should be perfectly capable of
>> putting together lines that break across files (partitions). If you're
>> into debugging, that's where I would start if you can. Breakpoints
>> around how TextInputFormat is parsing lines. See if you can catch it
>> when it returns a line that doesn't contain what you expect.
>>
>> On Tue, Jun 14, 2016 at 1:38 PM, Kristoffer Sjögren  wrote:
>>> That's funny. The line after is the rest of the whole line that got
>>> split in half. Every following lines after that are fine.
>>>
>>> I managed to reproduce without gzip also so maybe it's no gzip's fault
>>> after all..
>>>
>>> I'm clueless...
>>>
>>> On Tue, Jun 14, 2016 at 12:53 PM, Kristoffer Sjögren  
>>> wrote:
 Seems like it's the gzip. It works if download the file, gunzip and
 put it back to another directory and read it the same way.

 Hm.. I wonder what happens with the lines after it..



 On Tue, Jun 14, 2016 at 11:52 AM, Sean Owen  wrote:
> What if you read it uncompressed from HDFS?
> gzip compression is unfriendly to MR in that it can't split the file.
> It still should just work, certainly if the line is in one file. But,
> a data point worth having.
>
> On Tue, Jun 14, 2016 at 10:49 AM, Kristoffer Sjögren  
> wrote:
>> The line is in one file. I did download the file manually from HDFS,
>> read and decoded it line-by-line successfully without Spark.
>>
>>
>>
>> On Tue, Jun 14, 2016 at 11:44 AM, Sean Owen  wrote:
>>> The only thing I can think of is that a line is being broken across two 
>>> files?
>>> Hadoop easily puts things back together in this case, or should. There
>>> could be some weird factor preventing that. One first place to look:
>>> are you using a weird line separator? or at least different from the
>>> host OS?
>>>
>>> On Tue, Jun 14, 2016 at 10:41 AM, Kristoffer Sjögren  
>>> wrote:
 I should mention that we're in the end want to store the input from
 Protobuf binary to Parquet using the following code. But this comes
 after the lines has been decoded from base64 into binary.


 public static  void save(JavaRDD rdd, Class
 clazz, String path) {
   try {
 Job job = Job.getInstance();
 ParquetOutputFormat.setWriteSupportClass(job, 
 ProtoWriteSupport.class);
 ProtoParquetOutputFormat.setProtobufClass(job, clazz);
 rdd.mapToPair(order -> new Tuple2<>(null, order))
   .saveAsNewAPIHadoopFile(path, Void.class, clazz,
 ParquetOutputFormat.class, job.getConfiguration());
   } catch (IOException e) {
 throw new RuntimeException(e);
   }
 }



 <dependency>
   <groupId>org.apache.parquet</groupId>
   <artifactId>parquet-protobuf</artifactId>
   <version>1.8.1</version>
 </dependency>

 On Tue, Jun 14, 2016 at 11:37 AM, Kristoffer Sjögren 
  wrote:
> I'm trying to figure out exactly what information could be useful but
> it's all as straight forward.
>
> - It's text files
> - Lines ends with a new line character.
> - Files are gzipped before added to HDFS
> - Files are read as gzipped files from HDFS by Spark
> - There are some extra configuration
>
> conf.set("spark.files.overwrite", "true");
> conf.set("spark.hadoop.validateOutputSpecs", "false");
>
> Here's the code using Java 8 Base64 class.
>
> context.textFile("/log.gz")
> .map(line -> line.split("&timestamp="))
> .map(split -> Base64.getDecoder().decode(split[0]));
>
>
> On Tue, Jun 14, 2016 at 11:26 AM, Sean Owen  
> wrote:
>> It's really the MR InputSplit code that splits files into records.
>> Nothing particularly interesting happens in that process, except for
>> breaking on newlines.
>>
>> Do you have one huge line in the file? are you reading as a text 
>> file?
>> can you give any more detail about exactly how you parse it? it could
>> be something else in your code.
>>
>> On Tue, Jun 14, 2016 at 10:24 AM, Kristoffer Sjögren 
>>  wrot

Re: Spark corrupts text lines

2016-06-14 Thread Kristoffer Sjögren

Can you do remote debugging in Spark? Didn't know that. Do you have a link?

I also noticed isSplitable in
org.apache.hadoop.mapreduce.lib.input.TextInputFormat, which checks for
org.apache.hadoop.io.compress.SplittableCompressionCodec. Maybe there
is some way to tell it not to split?

On Tue, Jun 14, 2016 at 2:42 PM, Sean Owen  wrote:
> It really sounds like the line is being split across partitions. This
> is what TextInputFormat does but should be perfectly capable of
> putting together lines that break across files (partitions). If you're
> into debugging, that's where I would start if you can. Breakpoints
> around how TextInputFormat is parsing lines. See if you can catch it
> when it returns a line that doesn't contain what you expect.
>
> On Tue, Jun 14, 2016 at 1:38 PM, Kristoffer Sjögren  wrote:
>> That's funny. The line after is the rest of the whole line that got
>> split in half. Every following lines after that are fine.
>>
>> I managed to reproduce without gzip also so maybe it's no gzip's fault
>> after all..
>>
>> I'm clueless...
>>
>> On Tue, Jun 14, 2016 at 12:53 PM, Kristoffer Sjögren  
>> wrote:
>>> Seems like it's the gzip. It works if download the file, gunzip and
>>> put it back to another directory and read it the same way.
>>>
>>> Hm.. I wonder what happens with the lines after it..
>>>
>>>
>>>
>>> On Tue, Jun 14, 2016 at 11:52 AM, Sean Owen  wrote:
 What if you read it uncompressed from HDFS?
 gzip compression is unfriendly to MR in that it can't split the file.
 It still should just work, certainly if the line is in one file. But,
 a data point worth having.

 On Tue, Jun 14, 2016 at 10:49 AM, Kristoffer Sjögren  
 wrote:
> The line is in one file. I did download the file manually from HDFS,
> read and decoded it line-by-line successfully without Spark.
>
>
>
> On Tue, Jun 14, 2016 at 11:44 AM, Sean Owen  wrote:
>> The only thing I can think of is that a line is being broken across two 
>> files?
>> Hadoop easily puts things back together in this case, or should. There
>> could be some weird factor preventing that. One first place to look:
>> are you using a weird line separator? or at least different from the
>> host OS?
>>
>> On Tue, Jun 14, 2016 at 10:41 AM, Kristoffer Sjögren  
>> wrote:
>>> I should mention that we're in the end want to store the input from
>>> Protobuf binary to Parquet using the following code. But this comes
>>> after the lines has been decoded from base64 into binary.
>>>
>>>
>>> public static  void save(JavaRDD rdd, Class
>>> clazz, String path) {
>>>   try {
>>> Job job = Job.getInstance();
>>> ParquetOutputFormat.setWriteSupportClass(job, 
>>> ProtoWriteSupport.class);
>>> ProtoParquetOutputFormat.setProtobufClass(job, clazz);
>>> rdd.mapToPair(order -> new Tuple2<>(null, order))
>>>   .saveAsNewAPIHadoopFile(path, Void.class, clazz,
>>> ParquetOutputFormat.class, job.getConfiguration());
>>>   } catch (IOException e) {
>>> throw new RuntimeException(e);
>>>   }
>>> }
>>>
>>>
>>>
>>> <dependency>
>>>   <groupId>org.apache.parquet</groupId>
>>>   <artifactId>parquet-protobuf</artifactId>
>>>   <version>1.8.1</version>
>>> </dependency>
>>>
>>> On Tue, Jun 14, 2016 at 11:37 AM, Kristoffer Sjögren  
>>> wrote:
 I'm trying to figure out exactly what information could be useful but
 it's all as straight forward.

 - It's text files
 - Lines ends with a new line character.
 - Files are gzipped before added to HDFS
 - Files are read as gzipped files from HDFS by Spark
 - There are some extra configuration

 conf.set("spark.files.overwrite", "true");
 conf.set("spark.hadoop.validateOutputSpecs", "false");

 Here's the code using Java 8 Base64 class.

 context.textFile("/log.gz")
 .map(line -> line.split("&timestamp="))
 .map(split -> Base64.getDecoder().decode(split[0]));


 On Tue, Jun 14, 2016 at 11:26 AM, Sean Owen  wrote:
> It's really the MR InputSplit code that splits files into records.
> Nothing particularly interesting happens in that process, except for
> breaking on newlines.
>
> Do you have one huge line in the file? are you reading as a text file?
> can you give any more detail about exactly how you parse it? it could
> be something else in your code.
>
> On Tue, Jun 14, 2016 at 10:24 AM, Kristoffer Sjögren 
>  wrote:
>> Hi
>>
>> We have log files that are written in base64 encoded text files
>> (gzipped) where each line is ended with a new line character.
>>
>> For some reason a particular line [1] is split by Spark [2] making it
>> unparsable by the base64 decoder. It does this consequently no matter
>> if I gi

Re: Spark corrupts text lines

2016-06-14 Thread Sean Owen

It really sounds like the line is being split across partitions.
Splitting is what TextInputFormat does, but it should be perfectly capable
of putting together lines that break across files (partitions). If you're
into debugging, that's where I would start if you can: set breakpoints
around how TextInputFormat is parsing lines. See if you can catch it
when it returns a line that doesn't contain what you expect.
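
The reassembly described above can be illustrated with a small
simulation. This is a simplified sketch of the convention that
line-oriented record readers such as Hadoop's are understood to follow,
not the actual Hadoop code: every split except the first discards its
leading partial line, and every split keeps reading past its end to
finish the last line it started. `SplitSim` and `readSplit` are
hypothetical names.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class SplitSim {
    /** Returns the lines owned by the split covering bytes [start, start + length). */
    static List<String> readSplit(byte[] data, int start, int length) {
        int end = start + length;
        int pos = start;
        if (start != 0) {
            // Not the first split: skip up to and including the first newline.
            // The reader of the previous split is responsible for that line.
            while (pos < data.length && data[pos] != '\n') pos++;
            pos++;
        }
        List<String> lines = new ArrayList<>();
        // Read every line that starts at or before 'end', even if it finishes after it.
        while (pos <= end && pos < data.length) {
            int lineEnd = pos;
            while (lineEnd < data.length && data[lineEnd] != '\n') lineEnd++;
            lines.add(new String(data, pos, lineEnd - pos, StandardCharsets.UTF_8));
            pos = lineEnd + 1;
        }
        return lines;
    }

    public static void main(String[] args) {
        byte[] data = "aaa\nbbbb\ncc\n".getBytes(StandardCharsets.UTF_8);
        int cut = 6; // falls in the middle of "bbbb"
        System.out.println(readSplit(data, 0, cut));                 // prints [aaa, bbbb]
        System.out.println(readSplit(data, cut, data.length - cut)); // prints [cc]
    }
}
```

Under this rule the same three lines come back no matter where the cut
falls, which is the behaviour to compare against when a breakpoint shows
a truncated line.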

On Tue, Jun 14, 2016 at 1:38 PM, Kristoffer Sjögren  wrote:
> That's funny. The line after is the rest of the whole line that got
> split in half. Every following lines after that are fine.
>
> I managed to reproduce without gzip also so maybe it's no gzip's fault
> after all..
>
> I'm clueless...
>
> On Tue, Jun 14, 2016 at 12:53 PM, Kristoffer Sjögren  wrote:
>> Seems like it's the gzip. It works if download the file, gunzip and
>> put it back to another directory and read it the same way.
>>
>> Hm.. I wonder what happens with the lines after it..
>>
>>
>>
>> On Tue, Jun 14, 2016 at 11:52 AM, Sean Owen  wrote:
>>> What if you read it uncompressed from HDFS?
>>> gzip compression is unfriendly to MR in that it can't split the file.
>>> It still should just work, certainly if the line is in one file. But,
>>> a data point worth having.
>>>
>>> On Tue, Jun 14, 2016 at 10:49 AM, Kristoffer Sjögren  
>>> wrote:
 The line is in one file. I did download the file manually from HDFS,
 read and decoded it line-by-line successfully without Spark.



 On Tue, Jun 14, 2016 at 11:44 AM, Sean Owen  wrote:
> The only thing I can think of is that a line is being broken across two 
> files?
> Hadoop easily puts things back together in this case, or should. There
> could be some weird factor preventing that. One first place to look:
> are you using a weird line separator? or at least different from the
> host OS?
>
> On Tue, Jun 14, 2016 at 10:41 AM, Kristoffer Sjögren  
> wrote:
>> I should mention that we're in the end want to store the input from
>> Protobuf binary to Parquet using the following code. But this comes
>> after the lines has been decoded from base64 into binary.
>>
>>
>> public static  void save(JavaRDD rdd, Class
>> clazz, String path) {
>>   try {
>> Job job = Job.getInstance();
>> ParquetOutputFormat.setWriteSupportClass(job, 
>> ProtoWriteSupport.class);
>> ProtoParquetOutputFormat.setProtobufClass(job, clazz);
>> rdd.mapToPair(order -> new Tuple2<>(null, order))
>>   .saveAsNewAPIHadoopFile(path, Void.class, clazz,
>> ParquetOutputFormat.class, job.getConfiguration());
>>   } catch (IOException e) {
>> throw new RuntimeException(e);
>>   }
>> }
>>
>>
>>
>> <dependency>
>>   <groupId>org.apache.parquet</groupId>
>>   <artifactId>parquet-protobuf</artifactId>
>>   <version>1.8.1</version>
>> </dependency>
>>
>> On Tue, Jun 14, 2016 at 11:37 AM, Kristoffer Sjögren  
>> wrote:
>>> I'm trying to figure out exactly what information could be useful but
>>> it's all as straight forward.
>>>
>>> - It's text files
>>> - Lines ends with a new line character.
>>> - Files are gzipped before added to HDFS
>>> - Files are read as gzipped files from HDFS by Spark
>>> - There are some extra configuration
>>>
>>> conf.set("spark.files.overwrite", "true");
>>> conf.set("spark.hadoop.validateOutputSpecs", "false");
>>>
>>> Here's the code using Java 8 Base64 class.
>>>
>>> context.textFile("/log.gz")
>>> .map(line -> line.split("&timestamp="))
>>> .map(split -> Base64.getDecoder().decode(split[0]));
>>>
>>>
>>> On Tue, Jun 14, 2016 at 11:26 AM, Sean Owen  wrote:
 It's really the MR InputSplit code that splits files into records.
 Nothing particularly interesting happens in that process, except for
 breaking on newlines.

 Do you have one huge line in the file? are you reading as a text file?
 can you give any more detail about exactly how you parse it? it could
 be something else in your code.

 On Tue, Jun 14, 2016 at 10:24 AM, Kristoffer Sjögren 
  wrote:
> Hi
>
> We have log files that are written in base64 encoded text files
> (gzipped) where each line is ended with a new line character.
>
> For some reason a particular line [1] is split by Spark [2] making it
> unparsable by the base64 decoder. It does this consequently no matter
> if I gives it the particular file that contain the line or a bunch of
> files.
>
> I know the line is not corrupt because I can manually download the
> file from HDFS, gunzip it and read/decode all the lines without
> problems.
>
> Was thinking that maybe there is a limit to number of characters per
> line but that doesn't sound right? Maybe the combination of characters
> makes Spark think it's

Re: Spark corrupts text lines

2016-06-14 Thread Kristoffer Sjögren

That's funny. The line after it is the rest of the line that got
split in half. Every line after that is fine.

I managed to reproduce it without gzip as well, so maybe it's not gzip's
fault after all..

I'm clueless...
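
One way to confirm that the following line really is the remainder of
the broken one is to scan adjacent lines and test whether their
concatenation decodes. A minimal sketch, assuming every complete line
carries the `&timestamp=` marker seen in the decoding snippet quoted in
this thread; `SplitLineFinder` and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Base64;
import java.util.List;

final class SplitLineFinder {
    // Assumed marker that terminates the base64 payload of a complete line.
    static final String MARKER = "&timestamp=";

    /** True if the line contains the marker and the text before it is valid base64. */
    static boolean decodes(String line) {
        int idx = line.indexOf(MARKER);
        if (idx < 0) return false;
        try {
            Base64.getDecoder().decode(line.substring(0, idx));
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    /** Indices i where line i is broken on its own but line i + line i+1 decodes. */
    static List<Integer> findSplitPairs(List<String> lines) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i + 1 < lines.size(); i++) {
            String head = lines.get(i);
            String tail = lines.get(i + 1);
            if (!decodes(head) && decodes(head + tail)) hits.add(i);
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "aGVsbG8=&timestamp=1", "d29y", "bGQ=&timestamp=2", "Zm9v&timestamp=3");
        System.out.println(findSplitPairs(lines));
    }
}
```

Run over the lines collected from a single partition, a non-empty result
would point at the exact pair of records that belong together.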

On Tue, Jun 14, 2016 at 12:53 PM, Kristoffer Sjögren  wrote:
> Seems like it's the gzip. It works if download the file, gunzip and
> put it back to another directory and read it the same way.
>
> Hm.. I wonder what happens with the lines after it..
>
>
>
> On Tue, Jun 14, 2016 at 11:52 AM, Sean Owen  wrote:
>> What if you read it uncompressed from HDFS?
>> gzip compression is unfriendly to MR in that it can't split the file.
>> It still should just work, certainly if the line is in one file. But,
>> a data point worth having.
>>
>> On Tue, Jun 14, 2016 at 10:49 AM, Kristoffer Sjögren  
>> wrote:
>>> The line is in one file. I did download the file manually from HDFS,
>>> read and decoded it line-by-line successfully without Spark.
>>>
>>>
>>>
>>> On Tue, Jun 14, 2016 at 11:44 AM, Sean Owen  wrote:
>>>> The only thing I can think of is that a line is being broken across two
>>>> files?
>>>> Hadoop easily puts things back together in this case, or should. There
>>>> could be some weird factor preventing that. One first place to look:
>>>> are you using a weird line separator? Or at least different from the
>>>> host OS?
>>>>
>>>> On Tue, Jun 14, 2016 at 10:41 AM, Kristoffer Sjögren wrote:
>>>>> I should mention that in the end we want to store the input from
>>>>> Protobuf binary as Parquet using the following code. But this comes
>>>>> after the lines have been decoded from base64 into binary.
>>>>>
>>>>>
>>>>> public static <T extends Message> void save(JavaRDD<T> rdd, Class<T> clazz,
>>>>>     String path) {
>>>>>   try {
>>>>>     Job job = Job.getInstance();
>>>>>     ParquetOutputFormat.setWriteSupportClass(job, ProtoWriteSupport.class);
>>>>>     ProtoParquetOutputFormat.setProtobufClass(job, clazz);
>>>>>     rdd.mapToPair(order -> new Tuple2<>(null, order))
>>>>>       .saveAsNewAPIHadoopFile(path, Void.class, clazz,
>>>>>         ParquetOutputFormat.class, job.getConfiguration());
>>>>>   } catch (IOException e) {
>>>>>     throw new RuntimeException(e);
>>>>>   }
>>>>> }
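>>>>>
>>>>> <dependency>
>>>>>   <groupId>org.apache.parquet</groupId>
>>>>>   <artifactId>parquet-protobuf</artifactId>
>>>>>   <version>1.8.1</version>
>>>>> </dependency>
>>>>>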
>>>>> On Tue, Jun 14, 2016 at 11:37 AM, Kristoffer Sjögren wrote:
>>>>>> I'm trying to figure out exactly what information could be useful but
>>>>>> it's all fairly straightforward.
>>>>>>
>>>>>> - It's text files
>>>>>> - Lines end with a newline character.
>>>>>> - Files are gzipped before being added to HDFS
>>>>>> - Files are read as gzipped files from HDFS by Spark
>>>>>> - There is some extra configuration
>>>>>>
>>>>>> conf.set("spark.files.overwrite", "true");
>>>>>> conf.set("spark.hadoop.validateOutputSpecs", "false");
>>>>>>
>>>>>> Here's the code using Java 8 Base64 class.
>>>>>>
>>>>>> context.textFile("/log.gz")
>>>>>> .map(line -> line.split("&timestamp="))
>>>>>> .map(split -> Base64.getDecoder().decode(split[0]));
>>>>>>
>>>>>>
>>>>>> On Tue, Jun 14, 2016 at 11:26 AM, Sean Owen wrote:
>>>>>>> It's really the MR InputSplit code that splits files into records.
>>>>>>> Nothing particularly interesting happens in that process, except for
>>>>>>> breaking on newlines.
>>>>>>>
>>>>>>> Do you have one huge line in the file? Are you reading as a text file?
>>>>>>> Can you give any more detail about exactly how you parse it? It could
>>>>>>> be something else in your code.
>>>>>>>
>>>>>>> On Tue, Jun 14, 2016 at 10:24 AM, Kristoffer Sjögren wrote:
>>>>>>>> Hi
>>>>>>>>
>>>>>>>> We have log files that are written as base64-encoded text files
>>>>>>>> (gzipped) where each line ends with a newline character.
>>>>>>>>
>>>>>>>> For some reason a particular line [1] is split by Spark [2], making it
>>>>>>>> unparsable by the base64 decoder. It does this consistently, no matter
>>>>>>>> whether I give it the particular file that contains the line or a bunch of
>>>>>>>> files.
>>>>>>>>
>>>>>>>> I know the line is not corrupt because I can manually download the
>>>>>>>> file from HDFS, gunzip it and read/decode all the lines without
>>>>>>>> problems.
>>>>>>>>
>>>>>>>> Was thinking that maybe there is a limit to the number of characters per
>>>>>>>> line but that doesn't sound right? Maybe the combination of characters
>>>>>>>> makes Spark think it's a new line?
>>>>>>>>
>>>>>>>> I'm clueless.
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> -Kristoffer
>>>>>>>>
>>>>>>>> [1] Original line:

 CsAJCtwGCghwYWdlVmlldxC4PhjM1v66BSJFaHR0cDovL25hLnNlL3Nwb3J0ZW4vc3BvcnR0dC8xLjM5MTU5MjEtdXBwZ2lmdGVyLXNtZWRlcm5hLW1vdC1rb25rdXJzKjhVcHBnaWZ0ZXI6IFNtZWRlcm5hIG1vdCBrb25rdXJzIC0gU3BvcnQgKFRUKSAtIHd3dy5uYS5zZTJXaHR0cDovL25hLnNlL255aGV0ZXIvb3JlYnJvLzEuMzk2OTU0My1rcnlwaGFsLW9wcG5hci1mb3Itb2JvLWF0dC1iZWhhbGxhLXRqYW5zdGViaWxhcm5hOqECcWdrZWplNGpmLHBkMzBmdDRuNCxwbHB0b3JqNncscGxwczBvamZvLHBkYjVrZGM4eCxwbHBzN293Y3UscGE0czN1bXp5LHBhNHJla25sNyxwYTRy

Re: Spark corrupts text lines

2016-06-14 Thread Jeff Zhang

Can you read this file using an MR job?

On Tue, Jun 14, 2016 at 5:26 PM, Sean Owen  wrote:

> It's really the MR InputSplit code that splits files into records.
> Nothing particularly interesting happens in that process, except for
> breaking on newlines.
>
> Do you have one huge line in the file? Are you reading as a text file?
> Can you give any more detail about exactly how you parse it? It could
> be something else in your code.
>
> On Tue, Jun 14, 2016 at 10:24 AM, Kristoffer Sjögren
> wrote:
> > Hi
> >
> > We have log files that are written as base64-encoded text files
> > (gzipped) where each line ends with a newline character.
> >
> > For some reason a particular line [1] is split by Spark [2], making it
> > unparsable by the base64 decoder. It does this consistently, no matter
> > whether I give it the particular file that contains the line or a bunch of
> > files.
> >
> > I know the line is not corrupt because I can manually download the
> > file from HDFS, gunzip it and read/decode all the lines without
> > problems.
> >
> > Was thinking that maybe there is a limit to the number of characters per
> > line but that doesn't sound right? Maybe the combination of characters
> > makes Spark think it's a new line?
> >
> > I'm clueless.
> >
> > Cheers,
> > -Kristoffer
> >
> > [1] Original line:
> >
> >
> CsAJCtwGCghwYWdlVmlldxC4PhjM1v66BSJFaHR0cDovL25hLnNlL3Nwb3J0ZW4vc3BvcnR0dC8xLjM5MTU5MjEtdXBwZ2lmdGVyLXNtZWRlcm5hLW1vdC1rb25rdXJzKjhVcHBnaWZ0ZXI6IFNtZWRlcm5hIG1vdCBrb25rdXJzIC0gU3BvcnQgKFRUKSAtIHd3dy5uYS5zZTJXaHR0cDovL25hLnNlL255aGV0ZXIvb3JlYnJvLzEuMzk2OTU0My1rcnlwaGFsLW9wcG5hci1mb3Itb2JvLWF0dC1iZWhhbGxhLXRqYW5zdGViaWxhcm5hOqECcWdrZWplNGpmLHBkMzBmdDRuNCxwbHB0b3JqNncscGxwczBvamZvLHBkYjVrZGM4eCxwbHBzN293Y3UscGE0czN1bXp5LHBhNHJla25sNyxwYTRyd3dxam4scGE0c21ra2Z4LHBkM2tpa3BmMixwZDNqcjE5dGMscGQ0ZGQ0M2F3LHAwZ3MwbmlqMSxwYTRvZTNrbXoscGE0cWJ3eDZxLHBkM2s2NW00dyxwYTRyazc3Z3IscGQzMHAzdW8wLHBkNGM1ajV5dixwbHB0c211NmcscGM3bXNibmM5LHBhNHFpaTdsZCxwbHB0dnpqdnUscGE0bmlsdmFnLHBhNHB6cjN2cyxwZDNsZDNnYmkscGl1a2M2NmVlLHB5dHoyOThzNErIAgoTNzI0NTY2NzU0MzQxNTUyOTQ4ORAAGAAioQJxZ2tlamU0amYscGQzMGZ0NG40LHBscHRvcmo2dyxwbHBzMG9qZm8scGRiNWtkYzh4LHBscHM3b3djdSxwYTRzM3VtenkscGE0cmVrbmw3LHBhNHJ3d3FqbixwYTRzbWtrZngscGQza2lrcGYyLHBkM2pyMTl0YyxwZDRkZDQzYXcscDBnczBuaWoxLHBhNG9lM2tteixwYTRxYnd4NnEscGQzazY1bTR3LHBhNHJrNzdncixwZDMwcDN1bzAscGQ0YzVqNXl2LHBscHRzbXU2ZyxwYzdtc2JuYzkscGE0cWlpN2xkLHBscHR2emp2dSxwYTRuaWx2YWcscGE0cHpyM3ZzLHBkM2xkM2diaSxwaXVrYzY2ZWUscHl0ejI5OHM0KgkzOTUxLDM5NjAS3gIIxNjxhJTVsJcVEqUBTW96aWxsYS81LjAgKExpbnV4OyBBbmRyb2lkIDUuMS4xOyBTQU1TVU5HIFNNLUczODhGIEJ1aWxkL0xNWTQ4QikgQXBwbGVXZWJLaXQvNTM3LjM2IChLSFRNTCwgbGlrZSBHZWNrbykgU2Ftc3VuZ0Jyb3dzZXIvMy4zIENocm9tZS8zOC4wLjIxMjUuMTAyIE1vYmlsZSBTYWZhcmkvNTM3LjM2IjUKDDYyLjIwLjE5Ni44MBWgd3NBHRgibUIiAlNFKgfDlnJlYnJvMg5UZWxpYVNvbmVyYSBBQigAMdejcD0K1+s/OABCCAiAAhWamRlAQgcIURUAAOBAQggIlAEVzczMP0IHCFQVmpkJQUIICJYBFTMzE0BCBwhYFZqZ+UBCCAj6ARWamdk/QggImwEVzcysQEoHCAYVO6ysPkoHCAQVRYO4PkoHCAEVIg0APw==×tamp=1465887564
> >
> >
> > [2] Line as spark hands it over:
> >
> >
> CsAJCtwGCghwYWdlVmlldxC4PhjM1v66BSJFaHR0cDovL25hLnNlL3Nwb3J0ZW4vc3BvcnR0dC8xLjM5MTU5MjEtdXBwZ2lmdGVyLXNtZWRlcm5hLW1vdC1rb25rdXJzKjhVcHBnaWZ0ZXI6IFNtZWRlcm5hIG1vdCBrb25rdXJzIC0gU3BvcnQgKFRUKSAtIHd3dy5uYS5zZTJXaHR0cDovL25hLnNlL255aGV0ZXIvb3JlYnJvLzEuMzk2OTU0My1rcnlwaGFsLW9wcG5hci1mb3Itb2JvLWF0dC1iZWhhbGxhLXRqYW5zdGViaWxhcm5hOqECcWdrZWplNGpmLHBkMzBmdDRuNCxwbHB0b3JqNncscGxwczBvamZvLHBkYjVrZGM4eCxwbHBzN293Y3UscGE0czN1bXp5LHBhNHJla25sNyxwYTRyd3dxam4scGE0c21ra2Z4LHBkM2tpa3BmMixwZDNqcjE5dGMscGQ0ZGQ0M2F3LHAwZ3MwbmlqMSxwYTRvZTNrbXoscGE0cWJ3eDZxLHBkM2s2NW00dyxwYTRyazc3Z3IscGQzMHAzdW8wLHBkNGM1ajV5dixwbHB0c211NmcscGM3bXNibmM5LHBhNHFpaTdsZCxwbHB0dnpqdnUscGE0bmlsdmFnLHBhNHB6cjN2cyxwZDNsZDNnYmkscGl1a2M2NmVlLHB5dHoyOThzNErIAgoTNzI0NTY2NzU0MzQxNTUyOTQ4ORAAGAAioQJxZ2tlamU0amYscGQzMGZ0NG40LHBscHRvcmo2dyxwbHBzMG9qZm8scGRiNWtkYzh4LHBscHM3b3djdSxwYTRzM3VtenkscGE0cmVrbmw3LHBhNHJ3d3FqbixwYTRzbWtrZngscGQza2lrcGYyLHBkM2pyMTl0YyxwZDRkZDQzYXcscDBnczBuaWoxLHBhNG9lM2tteixwYTRxYnd4NnEscGQzazY1bTR3LHBhNHJrNzdncixwZDMwcDN1bzAscGQ0YzVqNXl2LHBscHRzbXU2ZyxwYzdtc2JuYzkscGE0cWlpN2xkLHBscHR2emp2dSxwYTRuaWx2YWcscGE0
> >
> > -
> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> > For additional commands, e-mail: user-h...@spark.apache.org
> >
>


-- 
Best Regards

Jeff Zhang

Re: Spark corrupts text lines

2016-06-14 Thread Sean Owen

It's really the MR InputSplit code that splits files into records.
Nothing particularly interesting happens in that process, except for
breaking on newlines.
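
To illustrate the point about breaking on newlines: the sketch below is a simplified model, not Hadoop's actual record reader code, of the usual convention for lines that straddle a split boundary. A reader whose split does not start at byte 0 skips its first (possibly partial) line, and every reader finishes the last line it started even if that runs past the end of its split. The class name SplitLineSketch and the method readSplit are made up for this example, and '\n' is assumed to be the only line terminator.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model of split-aware line reading.
final class SplitLineSketch {

    /** Returns the lines that belong to the split covering bytes [start, end) of data. */
    static List<String> readSplit(byte[] data, int start, int end) {
        int pos = start;
        if (start != 0) {
            // Not the first split: the reader of the previous split finishes
            // this line, so skip up to and including the first newline.
            while (pos < data.length && data[pos] != '\n') {
                pos++;
            }
            pos++;
        }
        List<String> lines = new ArrayList<>();
        // Keep reading while the next line starts at or before 'end';
        // the line is always read to its newline, even past 'end'.
        while (pos < data.length && pos <= end) {
            int lineStart = pos;
            while (pos < data.length && data[pos] != '\n') {
                pos++;
            }
            lines.add(new String(data, lineStart, pos - lineStart, StandardCharsets.UTF_8));
            pos++; // step over the newline
        }
        return lines;
    }

    public static void main(String[] args) {
        byte[] data = "aaa\nbbbb\ncc\n".getBytes(StandardCharsets.UTF_8);
        // The boundary at byte 6 falls in the middle of "bbbb".
        System.out.println(readSplit(data, 0, 6));
        System.out.println(readSplit(data, 6, 12));
    }
}
```

In this model the first split yields "aaa" and "bbbb" and the second yields only "cc", so a line that crosses the boundary is delivered whole, exactly once. A record arriving cut in half therefore suggests something is not following this convention.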

Do you have one huge line in the file? Are you reading as a text file?
Can you give any more detail about exactly how you parse it? It could
be something else in your code.
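
One way to rule out the parsing step is to make it report bad lines instead of throwing. The following is a rough sketch under assumptions: LogLineDecoder is a made-up name, and the line layout (Base64 payload followed by "&timestamp=" and a number) is taken from the example line in the original post.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.Optional;

// Hypothetical helper, not code from the thread.
final class LogLineDecoder {

    // Assumed separator between the Base64 payload and the timestamp.
    static final String SEPARATOR = "&timestamp=";

    /** Returns the decoded payload, or empty if the line is truncated or not valid Base64. */
    static Optional<byte[]> decode(String line) {
        int idx = line.indexOf(SEPARATOR);
        if (idx < 0) {
            // No separator at all: the line was probably cut before its end.
            return Optional.empty();
        }
        try {
            return Optional.of(Base64.getDecoder().decode(line.substring(0, idx)));
        } catch (IllegalArgumentException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        String payload = Base64.getEncoder()
                .encodeToString("hello".getBytes(StandardCharsets.UTF_8));
        String whole = payload + SEPARATOR + "1465887564";
        String cut = whole.substring(0, 5); // simulates a line split in half
        System.out.println(decode(whole).isPresent());
        System.out.println(decode(cut).isPresent());
    }
}
```

Logging the length and the last few characters of every line for which decode returns empty would show whether the record was already truncated when it reached the map function, or whether the split on the separator is what breaks it.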

On Tue, Jun 14, 2016 at 10:24 AM, Kristoffer Sjögren  wrote:
> Hi
>
> We have log files that are written as base64-encoded text files
> (gzipped) where each line ends with a newline character.
>
> For some reason a particular line [1] is split by Spark [2], making it
> unparsable by the base64 decoder. It does this consistently, no matter
> whether I give it the particular file that contains the line or a bunch of
> files.
>
> I know the line is not corrupt because I can manually download the
> file from HDFS, gunzip it and read/decode all the lines without
> problems.
>
> Was thinking that maybe there is a limit to the number of characters per
> line but that doesn't sound right? Maybe the combination of characters
> makes Spark think it's a new line?
>
> I'm clueless.
>
> Cheers,
> -Kristoffer
>
> [1] Original line:
>
> CsAJCtwGCghwYWdlVmlldxC4PhjM1v66BSJFaHR0cDovL25hLnNlL3Nwb3J0ZW4vc3BvcnR0dC8xLjM5MTU5MjEtdXBwZ2lmdGVyLXNtZWRlcm5hLW1vdC1rb25rdXJzKjhVcHBnaWZ0ZXI6IFNtZWRlcm5hIG1vdCBrb25rdXJzIC0gU3BvcnQgKFRUKSAtIHd3dy5uYS5zZTJXaHR0cDovL25hLnNlL255aGV0ZXIvb3JlYnJvLzEuMzk2OTU0My1rcnlwaGFsLW9wcG5hci1mb3Itb2JvLWF0dC1iZWhhbGxhLXRqYW5zdGViaWxhcm5hOqECcWdrZWplNGpmLHBkMzBmdDRuNCxwbHB0b3JqNncscGxwczBvamZvLHBkYjVrZGM4eCxwbHBzN293Y3UscGE0czN1bXp5LHBhNHJla25sNyxwYTRyd3dxam4scGE0c21ra2Z4LHBkM2tpa3BmMixwZDNqcjE5dGMscGQ0ZGQ0M2F3LHAwZ3MwbmlqMSxwYTRvZTNrbXoscGE0cWJ3eDZxLHBkM2s2NW00dyxwYTRyazc3Z3IscGQzMHAzdW8wLHBkNGM1ajV5dixwbHB0c211NmcscGM3bXNibmM5LHBhNHFpaTdsZCxwbHB0dnpqdnUscGE0bmlsdmFnLHBhNHB6cjN2cyxwZDNsZDNnYmkscGl1a2M2NmVlLHB5dHoyOThzNErIAgoTNzI0NTY2NzU0MzQxNTUyOTQ4ORAAGAAioQJxZ2tlamU0amYscGQzMGZ0NG40LHBscHRvcmo2dyxwbHBzMG9qZm8scGRiNWtkYzh4LHBscHM3b3djdSxwYTRzM3VtenkscGE0cmVrbmw3LHBhNHJ3d3FqbixwYTRzbWtrZngscGQza2lrcGYyLHBkM2pyMTl0YyxwZDRkZDQzYXcscDBnczBuaWoxLHBhNG9lM2tteixwYTRxYnd4NnEscGQzazY1bTR3LHBhNHJrNzdncixwZDMwcDN1bzAscGQ0YzVqNXl2LHBscHRzbXU2ZyxwYzdtc2JuYzkscGE0cWlpN2xkLHBscHR2emp2dSxwYTRuaWx2YWcscGE0cHpyM3ZzLHBkM2xkM2diaSxwaXVrYzY2ZWUscHl0ejI5OHM0KgkzOTUxLDM5NjAS3gIIxNjxhJTVsJcVEqUBTW96aWxsYS81LjAgKExpbnV4OyBBbmRyb2lkIDUuMS4xOyBTQU1TVU5HIFNNLUczODhGIEJ1aWxkL0xNWTQ4QikgQXBwbGVXZWJLaXQvNTM3LjM2IChLSFRNTCwgbGlrZSBHZWNrbykgU2Ftc3VuZ0Jyb3dzZXIvMy4zIENocm9tZS8zOC4wLjIxMjUuMTAyIE1vYmlsZSBTYWZhcmkvNTM3LjM2IjUKDDYyLjIwLjE5Ni44MBWgd3NBHRgibUIiAlNFKgfDlnJlYnJvMg5UZWxpYVNvbmVyYSBBQigAMdejcD0K1+s/OABCCAiAAhWamRlAQgcIURUAAOBAQggIlAEVzczMP0IHCFQVmpkJQUIICJYBFTMzE0BCBwhYFZqZ+UBCCAj6ARWamdk/QggImwEVzcysQEoHCAYVO6ysPkoHCAQVRYO4PkoHCAEVIg0APw==×tamp=1465887564
>
>
> [2] Line as spark hands it over:
>
> CsAJCtwGCghwYWdlVmlldxC4PhjM1v66BSJFaHR0cDovL25hLnNlL3Nwb3J0ZW4vc3BvcnR0dC8xLjM5MTU5MjEtdXBwZ2lmdGVyLXNtZWRlcm5hLW1vdC1rb25rdXJzKjhVcHBnaWZ0ZXI6IFNtZWRlcm5hIG1vdCBrb25rdXJzIC0gU3BvcnQgKFRUKSAtIHd3dy5uYS5zZTJXaHR0cDovL25hLnNlL255aGV0ZXIvb3JlYnJvLzEuMzk2OTU0My1rcnlwaGFsLW9wcG5hci1mb3Itb2JvLWF0dC1iZWhhbGxhLXRqYW5zdGViaWxhcm5hOqECcWdrZWplNGpmLHBkMzBmdDRuNCxwbHB0b3JqNncscGxwczBvamZvLHBkYjVrZGM4eCxwbHBzN293Y3UscGE0czN1bXp5LHBhNHJla25sNyxwYTRyd3dxam4scGE0c21ra2Z4LHBkM2tpa3BmMixwZDNqcjE5dGMscGQ0ZGQ0M2F3LHAwZ3MwbmlqMSxwYTRvZTNrbXoscGE0cWJ3eDZxLHBkM2s2NW00dyxwYTRyazc3Z3IscGQzMHAzdW8wLHBkNGM1ajV5dixwbHB0c211NmcscGM3bXNibmM5LHBhNHFpaTdsZCxwbHB0dnpqdnUscGE0bmlsdmFnLHBhNHB6cjN2cyxwZDNsZDNnYmkscGl1a2M2NmVlLHB5dHoyOThzNErIAgoTNzI0NTY2NzU0MzQxNTUyOTQ4ORAAGAAioQJxZ2tlamU0amYscGQzMGZ0NG40LHBscHRvcmo2dyxwbHBzMG9qZm8scGRiNWtkYzh4LHBscHM3b3djdSxwYTRzM3VtenkscGE0cmVrbmw3LHBhNHJ3d3FqbixwYTRzbWtrZngscGQza2lrcGYyLHBkM2pyMTl0YyxwZDRkZDQzYXcscDBnczBuaWoxLHBhNG9lM2tteixwYTRxYnd4NnEscGQzazY1bTR3LHBhNHJrNzdncixwZDMwcDN1bzAscGQ0YzVqNXl2LHBscHRzbXU2ZyxwYzdtc2JuYzkscGE0cWlpN2xkLHBscHR2emp2dSxwYTRuaWx2YWcscGE0
>