Re: Spark corrupts text lines
I managed to get remote debugging up and running, and I can in fact
reproduce the error and hit a breakpoint as it happens. But the code does
not seem to go through TextInputFormat, or at least the breakpoint in that
class is never triggered. I don't know which other class to look in for
where the actual split could occur. Any pointers?

On Tue, Jun 14, 2016 at 4:03 PM, Kristoffer Sjögren wrote:
> I'm pretty confident the lines are encoded correctly, since I can read
> them both locally and on Spark (by ignoring the faulty line and
> proceeding to the next). I also get the correct number of lines through
> Spark, again by ignoring the faulty line.
>
> I get the same error when I read the original file using Spark, save it
> as a new text file, and then try decoding again.
>
> context.textFile("/orgfile").saveAsTextFile("/newfile");
>
> OK, not much left to do but some remote debugging.
Re: Spark corrupts text lines
I'm pretty confident the lines are encoded correctly, since I can read
them both locally and on Spark (by ignoring the faulty line and proceeding
to the next). I also get the correct number of lines through Spark, again
by ignoring the faulty line.

I get the same error when I read the original file using Spark, save it as
a new text file, and then try decoding again.

context.textFile("/orgfile").saveAsTextFile("/newfile");

OK, not much left to do but some remote debugging.

On Tue, Jun 14, 2016 at 3:38 PM, Kristoffer Sjögren wrote:
> Thanks for your help. Really appreciate it!
>
> Give me some time; I'll come back after I've tried your suggestions.
Re: Spark corrupts text lines
Thanks for your help. Really appreciate it!

Give me some time; I'll come back after I've tried your suggestions.

On Tue, Jun 14, 2016 at 3:28 PM, Kristoffer Sjögren wrote:
> I cannot reproduce it by running the file through Spark in local mode
> on my machine. So it does indeed seem to be something related to the
> split across partitions.
Re: Spark corrupts text lines
It takes a little setup, but you can do remote debugging:
http://danosipov.com/?p=779 ... and then use similar config to connect
your IDE to a running executor.

Before that you might strip your program down to only a call to textFile
that then checks the lines according to whatever logic would decide
whether each one is valid.

gzip isn't splittable, so you should already have one partition per file
instead of potentially several per file. If the line is entirely in one
file then, hm, it really shouldn't be that issue.

Are you sure lines before and after are parsed correctly? I'm wondering if
somehow you are parsing a huge amount of text as a line before it and this
is just where it happens to finally hit some buffer limit. Any weird
Hadoop settings like a small block size?

I suspect there is something more basic going on here. Like, are you sure
that the line you get in your program is truly not a line in the input?
You have another line here that has it as a prefix, but ... is that really
the same line of input?

On Tue, Jun 14, 2016 at 2:04 PM, Kristoffer Sjögren wrote:
> Can you do remote debugging in Spark? Didn't know that. Do you have a
> link?
>
> Also noticed isSplitable in
> org.apache.hadoop.mapreduce.lib.input.TextInputFormat which checks for
> org.apache.hadoop.io.compress.SplittableCompressionCodec. Maybe there
> is some way to tell it not to split?
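The stripped-down check suggested above could look roughly like the following. This is a minimal sketch in plain Java rather than a Spark job, assuming each line has the form `<base64 payload>&timestamp=<value>` as in the snippet quoted elsewhere in the thread. The names `LineCheck`, `isDecodable` and `badLineIndexes` are made up for illustration; in a Spark program the same predicate would presumably be applied to the lines returned by textFile.

```java
import java.util.ArrayList;
import java.util.Base64;
import java.util.List;

/** Hypothetical helper: flags lines whose payload is not valid Base64. */
final class LineCheck {

    // Assumption: a line looks like "<base64 payload>&timestamp=<value>".
    static boolean isDecodable(String line) {
        // Limit of 2 keeps trailing empty strings, so index 0 always exists.
        String payload = line.split("&timestamp=", 2)[0];
        try {
            Base64.getDecoder().decode(payload);
            return true;
        } catch (IllegalArgumentException e) {
            // Thrown by the JDK decoder for input that is not valid Base64.
            return false;
        }
    }

    /** Returns the zero-based indexes of the lines that fail to decode. */
    static List<Integer> badLineIndexes(List<String> lines) {
        List<Integer> bad = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            if (!isDecodable(lines.get(i))) {
                bad.add(i);
            }
        }
        return bad;
    }
}
```

Note that the java.util.Base64 decoder does not require padding, so a truncated payload can still decode without an exception. A failed decode shows that a line is bad, but a successful one does not prove the line is whole.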
Re: Spark corrupts text lines
I cannot reproduce it by running the file through Spark in local mode on
my machine. So it does indeed seem to be something related to the split
across partitions.

On Tue, Jun 14, 2016 at 3:04 PM, Kristoffer Sjögren wrote:
> Can you do remote debugging in Spark? Didn't know that. Do you have a
> link?
>
> Also noticed isSplitable in
> org.apache.hadoop.mapreduce.lib.input.TextInputFormat which checks for
> org.apache.hadoop.io.compress.SplittableCompressionCodec. Maybe there
> is some way to tell it not to split?
Re: Spark corrupts text lines
Can you do remote debugging in Spark? Didn't know that. Do you have a
link?

Also noticed isSplitable in
org.apache.hadoop.mapreduce.lib.input.TextInputFormat which checks for
org.apache.hadoop.io.compress.SplittableCompressionCodec. Maybe there is
some way to tell it not to split?

On Tue, Jun 14, 2016 at 2:42 PM, Sean Owen wrote:
> It really sounds like the line is being split across partitions. This
> is what TextInputFormat does, but it should be perfectly capable of
> putting together lines that break across files (partitions). If you're
> into debugging, that's where I would start if you can: breakpoints
> around how TextInputFormat is parsing lines. See if you can catch it
> when it returns a line that doesn't contain what you expect.
Re: Spark corrupts text lines
It really sounds like the line is being split across partitions. This is
what TextInputFormat does, but it should be perfectly capable of putting
together lines that break across files (partitions). If you're into
debugging, that's where I would start if you can: breakpoints around how
TextInputFormat is parsing lines. See if you can catch it when it returns
a line that doesn't contain what you expect.

On Tue, Jun 14, 2016 at 1:38 PM, Kristoffer Sjögren wrote:
> That's funny. The line after is the rest of the whole line that got
> split in half. All the lines following that are fine.
>
> I managed to reproduce it without gzip as well, so maybe it's not
> gzip's fault after all.
>
> I'm clueless...
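For reference, the way a line-oriented reader is generally expected to stitch lines back together at split boundaries can be modeled in a few lines. This is a simplified sketch, not Hadoop's actual code: the names `SplitModel` and `readSplit` are hypothetical, the whole file is held in a byte array, and only '\n' is treated as a separator. The rule modeled is that a split which does not start at offset 0 discards everything up to and including the first newline, and every split reads any line that begins at or before its end offset, even if that means reading past the end.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, simplified model of reading lines from one byte-range split. */
final class SplitModel {

    /** Returns the lines that the split [start, end) is responsible for. */
    static List<String> readSplit(byte[] data, int start, int end) {
        int pos = start;
        if (start != 0) {
            // The previous split owns the line that crosses (or ends at) the
            // boundary, so skip to just after the first newline.
            pos = indexOfNewline(data, pos);
            pos = (pos == data.length) ? data.length : pos + 1;
        }
        List<String> lines = new ArrayList<>();
        // Read every line that BEGINS at or before 'end', running past the
        // end of the split if needed to finish the last line.
        while (pos <= end && pos < data.length) {
            int nl = indexOfNewline(data, pos);
            lines.add(new String(data, pos, nl - pos, StandardCharsets.UTF_8));
            pos = nl + 1;
        }
        return lines;
    }

    /** Index of the next '\n' at or after 'from', or data.length if none. */
    private static int indexOfNewline(byte[] data, int from) {
        for (int i = from; i < data.length; i++) {
            if (data[i] == '\n') {
                return i;
            }
        }
        return data.length;
    }
}
```

Under this model, cutting a file at any byte offset and reading the two splits independently yields each line exactly once. A reader that returns half a line is therefore breaking this contract somewhere, which is what the breakpoints are meant to find.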
Re: Spark corrupts text lines
That's funny. The line after is the rest of the whole line that got split in half. Every following lines after that are fine. I managed to reproduce without gzip also so maybe it's no gzip's fault after all.. I'm clueless... On Tue, Jun 14, 2016 at 12:53 PM, Kristoffer Sjögren wrote: > Seems like it's the gzip. It works if download the file, gunzip and > put it back to another directory and read it the same way. > > Hm.. I wonder what happens with the lines after it.. > > > > On Tue, Jun 14, 2016 at 11:52 AM, Sean Owen wrote: >> What if you read it uncompressed from HDFS? >> gzip compression is unfriendly to MR in that it can't split the file. >> It still should just work, certainly if the line is in one file. But, >> a data point worth having. >> >> On Tue, Jun 14, 2016 at 10:49 AM, Kristoffer Sjögren >> wrote: >>> The line is in one file. I did download the file manually from HDFS, >>> read and decoded it line-by-line successfully without Spark. >>> >>> >>> >>> On Tue, Jun 14, 2016 at 11:44 AM, Sean Owen wrote: The only thing I can think of is that a line is being broken across two files? Hadoop easily puts things back together in this case, or should. There could be some weird factor preventing that. One first place to look: are you using a weird line separator? or at least different from the host OS? On Tue, Jun 14, 2016 at 10:41 AM, Kristoffer Sjögren wrote: > I should mention that we're in the end want to store the input from > Protobuf binary to Parquet using the following code. But this comes > after the lines has been decoded from base64 into binary. 
>
> public static void save(JavaRDD rdd, Class clazz, String path) {
>   try {
>     Job job = Job.getInstance();
>     ParquetOutputFormat.setWriteSupportClass(job, ProtoWriteSupport.class);
>     ProtoParquetOutputFormat.setProtobufClass(job, clazz);
>     rdd.mapToPair(order -> new Tuple2<>(null, order))
>       .saveAsNewAPIHadoopFile(path, Void.class, clazz,
>         ParquetOutputFormat.class, job.getConfiguration());
>   } catch (IOException e) {
>     throw new RuntimeException(e);
>   }
> }
>
> <dependency>
>   <groupId>org.apache.parquet</groupId>
>   <artifactId>parquet-protobuf</artifactId>
>   <version>1.8.1</version>
> </dependency>
>
> On Tue, Jun 14, 2016 at 11:37 AM, Kristoffer Sjögren wrote:
>> I'm trying to figure out exactly what information could be useful but
>> it's all as straight forward.
>>
>> - It's text files
>> - Lines ends with a new line character.
>> - Files are gzipped before added to HDFS
>> - Files are read as gzipped files from HDFS by Spark
>> - There are some extra configuration
>>
>> conf.set("spark.files.overwrite", "true");
>> conf.set("spark.hadoop.validateOutputSpecs", "false");
>>
>> Here's the code using Java 8 Base64 class.
>>
>> context.textFile("/log.gz")
>> .map(line -> line.split("&timestamp="))
>> .map(split -> Base64.getDecoder().decode(split[0]));
>>
>> On Tue, Jun 14, 2016 at 11:26 AM, Sean Owen wrote:
>>> It's really the MR InputSplit code that splits files into records.
>>> Nothing particularly interesting happens in that process, except for
>>> breaking on newlines.
>>>
>>> Do you have one huge line in the file? are you reading as a text file?
>>> can you give any more detail about exactly how you parse it? it could
>>> be something else in your code.
>>>
>>> On Tue, Jun 14, 2016 at 10:24 AM, Kristoffer Sjögren wrote:
Hi We have log files that are written in base64 encoded text files (gzipped) where each line is ended with a new line character. For some reason a particular line [1] is split by Spark [2] making it unparsable by the base64 decoder. 
It does this consequently no matter if I gives it the particular file that contain the line or a bunch of files. I know the line is not corrupt because I can manually download the file from HDFS, gunzip it and read/decode all the lines without problems. Was thinking that maybe there is a limit to number of characters per line but that doesn't sound right? Maybe the combination of characters makes Spark think it's new line? I'm clueless. Cheers, -Kristoffer [1] Original line: CsAJCtwGCghwYWdlVmlldxC4PhjM1v66BSJFaHR0cDovL25hLnNlL3Nwb3J0ZW4vc3BvcnR0dC8xLjM5MTU5MjEtdXBwZ2lmdGVyLXNtZWRlcm5hLW1vdC1rb25rdXJzKjhVcHBnaWZ0ZXI6IFNtZWRlcm5hIG1vdCBrb25rdXJzIC0gU3BvcnQgKFRUKSAtIHd3dy5uYS5zZTJXaHR0cDovL25hLnNlL255aGV0ZXIvb3JlYnJvLzEuMzk2OTU0My1rcnlwaGFsLW9wcG5hci1mb3Itb2JvLWF0dC1iZWhhbGxhLXRqYW5zdGViaWxhcm5hOqECcWdrZWplNGpmLHBkMzBmdDRuNCxwbHB0b3JqNncscGxwczBvamZvLHBkYjVrZGM4eCxwbHBzN293Y3UscGE0czN1bXp5LHBhNHJla25sNyxwYTRy
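A minimal sketch of the local check described in the original message: read the gunzipped lines and try to decode each one without Spark. It assumes the payload is everything before a literal "&timestamp=" separator. The class and method names (Base64LineCheck, decodes, failingLines) are made up for illustration and are not from the thread.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Base64;
import java.util.List;
import java.util.stream.Collectors;
import java.util.zip.GZIPInputStream;

final class Base64LineCheck {
    // Assumption: the base64 payload ends where this separator starts.
    static final String SEPARATOR = "&timestamp=";

    // Returns true if the part of the line before the separator is valid base64.
    static boolean decodes(String line) {
        int cut = line.indexOf(SEPARATOR);
        String payload = cut >= 0 ? line.substring(0, cut) : line;
        try {
            Base64.getDecoder().decode(payload);
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    // Returns the 1-based numbers of all lines that fail to decode.
    static List<Integer> failingLines(List<String> lines) {
        List<Integer> bad = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            if (!decodes(lines.get(i))) {
                bad.add(i + 1);
            }
        }
        return bad;
    }

    // Usage: java Base64LineCheck /path/to/local/copy.gz
    public static void main(String[] args) throws IOException {
        try (InputStream in = new GZIPInputStream(Files.newInputStream(Paths.get(args[0])));
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(in, StandardCharsets.UTF_8))) {
            List<String> lines = reader.lines().collect(Collectors.toList());
            System.out.println("Lines that fail to decode: " + failingLines(lines));
        }
    }
}
```

If this reports no failing lines for a file that fails under Spark, that would support the view that the file itself is fine and the line is being broken somewhere in the read path.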
Re: Spark corrupts text lines
Can you read this file using an MR job? On Tue, Jun 14, 2016 at 5:26 PM, Sean Owen wrote: > It's really the MR InputSplit code that splits files into records. > Nothing particularly interesting happens in that process, except for > breaking on newlines. > > Do you have one huge line in the file? are you reading as a text file? > can you give any more detail about exactly how you parse it? it could > be something else in your code. > > On Tue, Jun 14, 2016 at 10:24 AM, Kristoffer Sjögren > wrote: > > Hi > > > > We have log files that are written in base64 encoded text files > > (gzipped) where each line is ended with a new line character. > > > > For some reason a particular line [1] is split by Spark [2] making it > > unparsable by the base64 decoder. It does this consequently no matter > > if I gives it the particular file that contain the line or a bunch of > > files. > > > > I know the line is not corrupt because I can manually download the > > file from HDFS, gunzip it and read/decode all the lines without > > problems. > > > > Was thinking that maybe there is a limit to number of characters per > > line but that doesn't sound right? Maybe the combination of characters > > makes Spark think it's new line? > > > > I'm clueless. 
> > > > Cheers, > > -Kristoffer > > > > [1] Original line: > > > > > CsAJCtwGCghwYWdlVmlldxC4PhjM1v66BSJFaHR0cDovL25hLnNlL3Nwb3J0ZW4vc3BvcnR0dC8xLjM5MTU5MjEtdXBwZ2lmdGVyLXNtZWRlcm5hLW1vdC1rb25rdXJzKjhVcHBnaWZ0ZXI6IFNtZWRlcm5hIG1vdCBrb25rdXJzIC0gU3BvcnQgKFRUKSAtIHd3dy5uYS5zZTJXaHR0cDovL25hLnNlL255aGV0ZXIvb3JlYnJvLzEuMzk2OTU0My1rcnlwaGFsLW9wcG5hci1mb3Itb2JvLWF0dC1iZWhhbGxhLXRqYW5zdGViaWxhcm5hOqECcWdrZWplNGpmLHBkMzBmdDRuNCxwbHB0b3JqNncscGxwczBvamZvLHBkYjVrZGM4eCxwbHBzN293Y3UscGE0czN1bXp5LHBhNHJla25sNyxwYTRyd3dxam4scGE0c21ra2Z4LHBkM2tpa3BmMixwZDNqcjE5dGMscGQ0ZGQ0M2F3LHAwZ3MwbmlqMSxwYTRvZTNrbXoscGE0cWJ3eDZxLHBkM2s2NW00dyxwYTRyazc3Z3IscGQzMHAzdW8wLHBkNGM1ajV5dixwbHB0c211NmcscGM3bXNibmM5LHBhNHFpaTdsZCxwbHB0dnpqdnUscGE0bmlsdmFnLHBhNHB6cjN2cyxwZDNsZDNnYmkscGl1a2M2NmVlLHB5dHoyOThzNErIAgoTNzI0NTY2NzU0MzQxNTUyOTQ4ORAAGAAioQJxZ2tlamU0amYscGQzMGZ0NG40LHBscHRvcmo2dyxwbHBzMG9qZm8scGRiNWtkYzh4LHBscHM3b3djdSxwYTRzM3VtenkscGE0cmVrbmw3LHBhNHJ3d3FqbixwYTRzbWtrZngscGQza2lrcGYyLHBkM2pyMTl0YyxwZDRkZDQzYXcscDBnczBuaWoxLHBhNG9lM2tteixwYTRxYnd4NnEscGQzazY1bTR3LHBhNHJrNzdncixwZDMwcDN1bzAscGQ0YzVqNXl2LHBscHRzbXU2ZyxwYzdtc2JuYzkscGE0cWlpN2xkLHBscHR2emp2dSxwYTRuaWx2YWcscGE0cHpyM3ZzLHBkM2xkM2diaSxwaXVrYzY2ZWUscHl0ejI5OHM0KgkzOTUxLDM5NjAS3gIIxNjxhJTVsJcVEqUBTW96aWxsYS81LjAgKExpbnV4OyBBbmRyb2lkIDUuMS4xOyBTQU1TVU5HIFNNLUczODhGIEJ1aWxkL0xNWTQ4QikgQXBwbGVXZWJLaXQvNTM3LjM2IChLSFRNTCwgbGlrZSBHZWNrbykgU2Ftc3VuZ0Jyb3dzZXIvMy4zIENocm9tZS8zOC4wLjIxMjUuMTAyIE1vYmlsZSBTYWZhcmkvNTM3LjM2IjUKDDYyLjIwLjE5Ni44MBWgd3NBHRgibUIiAlNFKgfDlnJlYnJvMg5UZWxpYVNvbmVyYSBBQigAMdejcD0K1+s/OABCCAiAAhWamRlAQgcIURUAAOBAQggIlAEVzczMP0IHCFQVmpkJQUIICJYBFTMzE0BCBwhYFZqZ+UBCCAj6ARWamdk/QggImwEVzcysQEoHCAYVO6ysPkoHCAQVRYO4PkoHCAEVIg0APw==&timestamp=1465887564 > > > > > > [2] Line as spark hands it over: > > > > > 
CsAJCtwGCghwYWdlVmlldxC4PhjM1v66BSJFaHR0cDovL25hLnNlL3Nwb3J0ZW4vc3BvcnR0dC8xLjM5MTU5MjEtdXBwZ2lmdGVyLXNtZWRlcm5hLW1vdC1rb25rdXJzKjhVcHBnaWZ0ZXI6IFNtZWRlcm5hIG1vdCBrb25rdXJzIC0gU3BvcnQgKFRUKSAtIHd3dy5uYS5zZTJXaHR0cDovL25hLnNlL255aGV0ZXIvb3JlYnJvLzEuMzk2OTU0My1rcnlwaGFsLW9wcG5hci1mb3Itb2JvLWF0dC1iZWhhbGxhLXRqYW5zdGViaWxhcm5hOqECcWdrZWplNGpmLHBkMzBmdDRuNCxwbHB0b3JqNncscGxwczBvamZvLHBkYjVrZGM4eCxwbHBzN293Y3UscGE0czN1bXp5LHBhNHJla25sNyxwYTRyd3dxam4scGE0c21ra2Z4LHBkM2tpa3BmMixwZDNqcjE5dGMscGQ0ZGQ0M2F3LHAwZ3MwbmlqMSxwYTRvZTNrbXoscGE0cWJ3eDZxLHBkM2s2NW00dyxwYTRyazc3Z3IscGQzMHAzdW8wLHBkNGM1ajV5dixwbHB0c211NmcscGM3bXNibmM5LHBhNHFpaTdsZCxwbHB0dnpqdnUscGE0bmlsdmFnLHBhNHB6cjN2cyxwZDNsZDNnYmkscGl1a2M2NmVlLHB5dHoyOThzNErIAgoTNzI0NTY2NzU0MzQxNTUyOTQ4ORAAGAAioQJxZ2tlamU0amYscGQzMGZ0NG40LHBscHRvcmo2dyxwbHBzMG9qZm8scGRiNWtkYzh4LHBscHM3b3djdSxwYTRzM3VtenkscGE0cmVrbmw3LHBhNHJ3d3FqbixwYTRzbWtrZngscGQza2lrcGYyLHBkM2pyMTl0YyxwZDRkZDQzYXcscDBnczBuaWoxLHBhNG9lM2tteixwYTRxYnd4NnEscGQzazY1bTR3LHBhNHJrNzdncixwZDMwcDN1bzAscGQ0YzVqNXl2LHBscHRzbXU2ZyxwYzdtc2JuYzkscGE0cWlpN2xkLHBscHR2emp2dSxwYTRuaWx2YWcscGE0 > > > > - > > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > > For additional commands, e-mail: user-h...@spark.apache.org > > > > - > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > > -- Best Regards Jeff Zhang
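One way to pin down where the break happens is to compare the original line [1] with the line as Spark hands it over [2] and report the offset at which they stop matching. The sketch below is only an illustration; SplitPoint, commonPrefixLength and describe are hypothetical names. Whether the resulting offset lines up with any buffer or block size is something to check, not something the thread establishes.

```java
final class SplitPoint {
    // Length of the longest prefix shared by both strings.
    static int commonPrefixLength(String a, String b) {
        int n = Math.min(a.length(), b.length());
        int i = 0;
        while (i < n && a.charAt(i) == b.charAt(i)) {
            i++;
        }
        return i;
    }

    // Describes how the received line relates to the original one.
    static String describe(String original, String received) {
        int p = commonPrefixLength(original, received);
        if (p == original.length() && p == received.length()) {
            return "identical";
        }
        if (p == received.length()) {
            // The received line is a strict prefix of the original.
            return "truncated after " + p + " characters";
        }
        return "differs at offset " + p;
    }

    public static void main(String[] args) {
        // args[0] = original line [1], args[1] = line as received from Spark [2]
        System.out.println(describe(args[0], args[1]));
    }
}
```

Running it with the two lines from the original message as arguments would give the character offset of the break, which can then be compared with the start of the following line in Spark's output.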
Re: Spark corrupts text lines
It's really the MR InputSplit code that splits files into records. Nothing particularly interesting happens in that process, except for breaking on newlines. Do you have one huge line in the file? Are you reading it as a text file? Can you give any more detail about exactly how you parse it? It could be something else in your code. On Tue, Jun 14, 2016 at 10:24 AM, Kristoffer Sjögren wrote: > Hi > > We have log files that are written in base64 encoded text files > (gzipped) where each line is ended with a new line character. > > For some reason a particular line [1] is split by Spark [2] making it > unparsable by the base64 decoder. It does this consequently no matter > if I gives it the particular file that contain the line or a bunch of > files. > > I know the line is not corrupt because I can manually download the > file from HDFS, gunzip it and read/decode all the lines without > problems. > > Was thinking that maybe there is a limit to number of characters per > line but that doesn't sound right? Maybe the combination of characters > makes Spark think it's new line? > > I'm clueless. 
> > Cheers, > -Kristoffer > > [1] Original line: > > CsAJCtwGCghwYWdlVmlldxC4PhjM1v66BSJFaHR0cDovL25hLnNlL3Nwb3J0ZW4vc3BvcnR0dC8xLjM5MTU5MjEtdXBwZ2lmdGVyLXNtZWRlcm5hLW1vdC1rb25rdXJzKjhVcHBnaWZ0ZXI6IFNtZWRlcm5hIG1vdCBrb25rdXJzIC0gU3BvcnQgKFRUKSAtIHd3dy5uYS5zZTJXaHR0cDovL25hLnNlL255aGV0ZXIvb3JlYnJvLzEuMzk2OTU0My1rcnlwaGFsLW9wcG5hci1mb3Itb2JvLWF0dC1iZWhhbGxhLXRqYW5zdGViaWxhcm5hOqECcWdrZWplNGpmLHBkMzBmdDRuNCxwbHB0b3JqNncscGxwczBvamZvLHBkYjVrZGM4eCxwbHBzN293Y3UscGE0czN1bXp5LHBhNHJla25sNyxwYTRyd3dxam4scGE0c21ra2Z4LHBkM2tpa3BmMixwZDNqcjE5dGMscGQ0ZGQ0M2F3LHAwZ3MwbmlqMSxwYTRvZTNrbXoscGE0cWJ3eDZxLHBkM2s2NW00dyxwYTRyazc3Z3IscGQzMHAzdW8wLHBkNGM1ajV5dixwbHB0c211NmcscGM3bXNibmM5LHBhNHFpaTdsZCxwbHB0dnpqdnUscGE0bmlsdmFnLHBhNHB6cjN2cyxwZDNsZDNnYmkscGl1a2M2NmVlLHB5dHoyOThzNErIAgoTNzI0NTY2NzU0MzQxNTUyOTQ4ORAAGAAioQJxZ2tlamU0amYscGQzMGZ0NG40LHBscHRvcmo2dyxwbHBzMG9qZm8scGRiNWtkYzh4LHBscHM3b3djdSxwYTRzM3VtenkscGE0cmVrbmw3LHBhNHJ3d3FqbixwYTRzbWtrZngscGQza2lrcGYyLHBkM2pyMTl0YyxwZDRkZDQzYXcscDBnczBuaWoxLHBhNG9lM2tteixwYTRxYnd4NnEscGQzazY1bTR3LHBhNHJrNzdncixwZDMwcDN1bzAscGQ0YzVqNXl2LHBscHRzbXU2ZyxwYzdtc2JuYzkscGE0cWlpN2xkLHBscHR2emp2dSxwYTRuaWx2YWcscGE0cHpyM3ZzLHBkM2xkM2diaSxwaXVrYzY2ZWUscHl0ejI5OHM0KgkzOTUxLDM5NjAS3gIIxNjxhJTVsJcVEqUBTW96aWxsYS81LjAgKExpbnV4OyBBbmRyb2lkIDUuMS4xOyBTQU1TVU5HIFNNLUczODhGIEJ1aWxkL0xNWTQ4QikgQXBwbGVXZWJLaXQvNTM3LjM2IChLSFRNTCwgbGlrZSBHZWNrbykgU2Ftc3VuZ0Jyb3dzZXIvMy4zIENocm9tZS8zOC4wLjIxMjUuMTAyIE1vYmlsZSBTYWZhcmkvNTM3LjM2IjUKDDYyLjIwLjE5Ni44MBWgd3NBHRgibUIiAlNFKgfDlnJlYnJvMg5UZWxpYVNvbmVyYSBBQigAMdejcD0K1+s/OABCCAiAAhWamRlAQgcIURUAAOBAQggIlAEVzczMP0IHCFQVmpkJQUIICJYBFTMzE0BCBwhYFZqZ+UBCCAj6ARWamdk/QggImwEVzcysQEoHCAYVO6ysPkoHCAQVRYO4PkoHCAEVIg0APw==&timestamp=1465887564 > > > [2] Line as spark hands it over: > > 
CsAJCtwGCghwYWdlVmlldxC4PhjM1v66BSJFaHR0cDovL25hLnNlL3Nwb3J0ZW4vc3BvcnR0dC8xLjM5MTU5MjEtdXBwZ2lmdGVyLXNtZWRlcm5hLW1vdC1rb25rdXJzKjhVcHBnaWZ0ZXI6IFNtZWRlcm5hIG1vdCBrb25rdXJzIC0gU3BvcnQgKFRUKSAtIHd3dy5uYS5zZTJXaHR0cDovL25hLnNlL255aGV0ZXIvb3JlYnJvLzEuMzk2OTU0My1rcnlwaGFsLW9wcG5hci1mb3Itb2JvLWF0dC1iZWhhbGxhLXRqYW5zdGViaWxhcm5hOqECcWdrZWplNGpmLHBkMzBmdDRuNCxwbHB0b3JqNncscGxwczBvamZvLHBkYjVrZGM4eCxwbHBzN293Y3UscGE0czN1bXp5LHBhNHJla25sNyxwYTRyd3dxam4scGE0c21ra2Z4LHBkM2tpa3BmMixwZDNqcjE5dGMscGQ0ZGQ0M2F3LHAwZ3MwbmlqMSxwYTRvZTNrbXoscGE0cWJ3eDZxLHBkM2s2NW00dyxwYTRyazc3Z3IscGQzMHAzdW8wLHBkNGM1ajV5dixwbHB0c211NmcscGM3bXNibmM5LHBhNHFpaTdsZCxwbHB0dnpqdnUscGE0bmlsdmFnLHBhNHB6cjN2cyxwZDNsZDNnYmkscGl1a2M2NmVlLHB5dHoyOThzNErIAgoTNzI0NTY2NzU0MzQxNTUyOTQ4ORAAGAAioQJxZ2tlamU0amYscGQzMGZ0NG40LHBscHRvcmo2dyxwbHBzMG9qZm8scGRiNWtkYzh4LHBscHM3b3djdSxwYTRzM3VtenkscGE0cmVrbmw3LHBhNHJ3d3FqbixwYTRzbWtrZngscGQza2lrcGYyLHBkM2pyMTl0YyxwZDRkZDQzYXcscDBnczBuaWoxLHBhNG9lM2tteixwYTRxYnd4NnEscGQzazY1bTR3LHBhNHJrNzdncixwZDMwcDN1bzAscGQ0YzVqNXl2LHBscHRzbXU2ZyxwYzdtc2JuYzkscGE0cWlpN2xkLHBscHR2emp2dSxwYTRuaWx2YWcscGE0 > > - > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
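For readers following along, here is a simplified sketch of the record-splitting step discussed above: a file is cut into byte ranges (splits), and each reader breaks its range into lines on newlines. The convention assumed here, which is my understanding of how Hadoop's line reader behaves and not something confirmed in this thread, is that a reader whose split does not start at offset 0 skips its first (possibly partial) line, and every reader keeps reading past the end of its split to finish the last line it started. SplitLineReader and readSplit are hypothetical names; this is not the Hadoop implementation.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class SplitLineReader {
    // Returns the lines owned by the split [start, end) of data.
    static List<String> readSplit(byte[] data, int start, int end) {
        int pos = start;
        if (start != 0) {
            // Skip the first line: the previous split's reader finishes it.
            while (pos < data.length && data[pos] != '\n') {
                pos++;
            }
            pos++; // step past the newline
        }
        List<String> lines = new ArrayList<>();
        // A line is owned by this split if it starts at or before 'end'.
        while (pos <= end && pos < data.length) {
            int lineStart = pos;
            while (pos < data.length && data[pos] != '\n') {
                pos++; // may run past 'end' to complete the line
            }
            lines.add(new String(data, lineStart, pos - lineStart, StandardCharsets.UTF_8));
            pos++; // step past the newline
        }
        return lines;
    }

    public static void main(String[] args) {
        byte[] data = "aaa\nbbbb\ncc\n".getBytes(StandardCharsets.UTF_8);
        int[] bounds = {0, 5, 10, data.length};
        for (int i = 0; i + 1 < bounds.length; i++) {
            System.out.println("split [" + bounds[i] + ", " + bounds[i + 1] + "): "
                    + readSplit(data, bounds[i], bounds[i + 1]));
        }
    }
}
```

Under this convention every line is returned whole by exactly one split, even when a split boundary falls in the middle of it, which is why a line arriving in two pieces points at something unusual in the read path.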