Re: Spark corrupts text lines

2016-06-14 Thread Kristoffer Sjögren
I managed to get remote debugging up and running and can in fact reproduce the error and get a breakpoint triggered as it happens. But it seems like the code does not go through TextInputFormat, or at least the breakpoint is not triggered from this class? Don't know what other class to look for

Re: Spark corrupts text lines

2016-06-14 Thread Kristoffer Sjögren
I'm pretty confident the lines are encoded correctly since I can read them both locally and on Spark (by ignoring the faulty line and proceeding to the next). I also get the correct number of lines through Spark, again by ignoring the faulty line. I get the same error by reading the original file using

Re: Spark corrupts text lines

2016-06-14 Thread Kristoffer Sjögren
Thanks for your help. Really appreciate it! Give me some time, I'll come back after I've tried your suggestions. On Tue, Jun 14, 2016 at 3:28 PM, Kristoffer Sjögren wrote: > I cannot reproduce it by running the file through Spark in local mode > on my machine. So it does indeed

Re: Spark corrupts text lines

2016-06-14 Thread Sean Owen
It takes a little setup, but you can do remote debugging: http://danosipov.com/?p=779 ... and then use similar config to connect your IDE to a running executor. Before that you might strip your program down to only a call to textFile that then checks the lines according to whatever logic would
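A stripped-down line check of the kind suggested above might look like the following. This is only a sketch: it assumes the lines use standard base64 with padding (the thread does not say which variant the decoder expects), and the helper names `is_valid_base64` and `find_bad_lines` are made up for illustration.

```python
import base64
import binascii


def is_valid_base64(line: str) -> bool:
    """Return True if the line decodes cleanly as standard, padded base64."""
    try:
        # validate=True rejects characters outside the base64 alphabet
        # instead of silently discarding them.
        base64.b64decode(line, validate=True)
        return True
    except (binascii.Error, ValueError):
        # binascii.Error: bad padding or bad characters.
        # ValueError: the string contains non-ASCII characters.
        return False


def find_bad_lines(lines):
    """Return (index, line) pairs for every line that fails to decode."""
    return [(i, line) for i, line in enumerate(lines) if not is_valid_base64(line)]
```

On a cluster the same predicate could be passed to a filter over the RDD returned by `textFile`, for example `sc.textFile(path).filter(lambda l: not is_valid_base64(l))`, where `sc` is an existing SparkContext and `path` is a placeholder. A line that was cut in two would normally show up as two consecutive failures, unless a half happens to be valid base64 on its own.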

Re: Spark corrupts text lines

2016-06-14 Thread Kristoffer Sjögren
I cannot reproduce it by running the file through Spark in local mode on my machine. So it does indeed seem to be something related to splitting across partitions. On Tue, Jun 14, 2016 at 3:04 PM, Kristoffer Sjögren wrote: > Can you do remote debugging in Spark? Didn't know that.

Re: Spark corrupts text lines

2016-06-14 Thread Kristoffer Sjögren
Can you do remote debugging in Spark? Didn't know that. Do you have a link? Also noticed isSplitable in org.apache.hadoop.mapreduce.lib.input.TextInputFormat which checks for org.apache.hadoop.io.compress.SplittableCompressionCodec. Maybe there is some way to tell it not to split? On Tue, Jun

Re: Spark corrupts text lines

2016-06-14 Thread Sean Owen
It really sounds like the line is being split across partitions. This is what TextInputFormat does, but it should be perfectly capable of putting together lines that break across files (partitions). If you're into debugging, that's where I would start if you can. Breakpoints around how TextInputFormat
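For anyone following along, here is a simplified model of how a line reader can reassemble lines that straddle a split boundary. It is not Hadoop's actual code; it only illustrates the usual convention, as I understand it, that every split except the first discards its first (possibly partial) line, and every split reads past its own end to finish the last line it started. The function names are hypothetical.

```python
def read_split(data: bytes, start: int, end: int):
    """Return the lines owned by the byte range [start, end) of data."""
    pos = start
    if start != 0:
        # Not the first split: skip up to and including the next newline.
        # The line skipped here belongs to the previous split.
        nl = data.find(b"\n", start)
        pos = len(data) if nl == -1 else nl + 1
    lines = []
    # A line is owned by this split if it begins at or before `end`,
    # even when its bytes continue past the end of the split.
    while pos <= end and pos < len(data):
        nl = data.find(b"\n", pos)
        stop = len(data) if nl == -1 else nl
        lines.append(data[pos:stop])
        pos = stop + 1
    return lines


def read_all_splits(data: bytes, split_size: int):
    """Read data as consecutive fixed-size splits and concatenate the lines."""
    out = []
    for start in range(0, len(data), split_size):
        end = min(start + split_size, len(data))
        out.extend(read_split(data, start, end))
    return out
```

Under this convention every line comes out exactly once and intact, whatever the split size. A line that arrives cut in half therefore suggests that something in the real read path is departing from this behaviour.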

Re: Spark corrupts text lines

2016-06-14 Thread Kristoffer Sjögren
That's funny. The line after is the rest of the line that got split in half. All lines after that are fine. I managed to reproduce without gzip also, so maybe it's not gzip's fault after all... I'm clueless... On Tue, Jun 14, 2016 at 12:53 PM, Kristoffer Sjögren

Re: Spark corrupts text lines

2016-06-14 Thread Jeff Zhang
Can you read this file using an MR job? On Tue, Jun 14, 2016 at 5:26 PM, Sean Owen wrote: > It's really the MR InputSplit code that splits files into records. > Nothing particularly interesting happens in that process, except for > breaking on newlines. > > Do you have one

Re: Spark corrupts text lines

2016-06-14 Thread Sean Owen
It's really the MR InputSplit code that splits files into records. Nothing particularly interesting happens in that process, except for breaking on newlines. Do you have one huge line in the file? Are you reading as a text file? Can you give any more detail about exactly how you parse it? it

Spark corrupts text lines

2016-06-14 Thread Kristoffer Sjögren
Hi, we have log files written as base64-encoded text files (gzipped) where each line ends with a newline character. For some reason a particular line [1] is split by Spark [2], making it unparsable by the base64 decoder. It does this consistently, no matter whether I give it the