sc.textFile() does not count lines properly?

2015-11-25 Thread George Sigletos
Hello,

I have a text file consisting of 483150 lines (wc -l "my_file.txt").

However, when I read it using textFile:

%pyspark
rdd = sc.textFile("my_file.txt")
print rdd.count()

it returns 554420 lines. Any idea why this is happening? Is it using a different newline delimiter, and how this can be

Re: sc.textFile() does not count lines properly?

2015-11-25 Thread George Sigletos
Found the problem. Control-M characters. Please ignore the post.

On Wed, Nov 25, 2015 at 6:06 PM, George Sigletos wrote:
> Hello,
>
> I have a text file consisting of 483150 lines (wc -l "my_file.txt").
>
> However when I read it using textFile:
>
> %pyspark
> rdd =
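
For readers who hit the same mismatch: Control-M is the carriage return character (`\r`). `wc -l` counts only newline (`\n`) characters, while the line reader behind `sc.textFile()` by default appears to treat `\r`, `\n`, and `\r\n` all as record terminators, so stray `\r` characters inside a line produce extra records. The sketch below reproduces both counting rules in plain Python, without Spark. The function names and the sample bytes are hypothetical, chosen only for illustration, and the second function is an approximation of the default splitting behavior rather than Spark's actual code.

```python
import re


def count_like_wc(data: bytes) -> int:
    """Count lines the way `wc -l` does: one per newline byte."""
    return data.count(b"\n")


def count_like_textfile(data: bytes) -> int:
    """Approximate the default record splitting: a record ends at
    CRLF, LF, or a lone CR (assumption: no custom delimiter is set)."""
    if not data:
        return 0
    # Try CRLF first so that "\r\n" counts as a single terminator.
    records = re.split(rb"\r\n|\n|\r", data)
    # A trailing terminator leaves an empty final piece; drop it.
    if records[-1] == b"":
        records.pop()
    return len(records)


# Hypothetical sample: a lone CR after "a", LF after "b", CRLF after "c".
sample = b"a\rb\nc\r\nd\n"

print(count_like_wc(sample))        # 3
print(count_like_textfile(sample))  # 4
```

Running the same two counts on the real file, read in binary mode, should show whether lone `\r` characters account for the gap between 483150 and 554420. If they do, stripping them before loading is one option (for example with `tr -d '\r'`, which also removes the `\r` in any `\r\n` pairs). Another possibility is setting the Hadoop property `textinputformat.record.delimiter` to `\n` and reading through `sc.newAPIHadoopFile`, though that route has not been verified against this thread's setup.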