Found the problem: Control-M characters (carriage returns). Please ignore the post.

On Wed, Nov 25, 2015 at 6:06 PM, George Sigletos <sigle...@textkernel.nl> wrote:
> Hello,
>
> I have a text file consisting of 483150 lines (wc -l "my_file.txt").
>
> However, when I read it using textFile:
>
> %pyspark
> rdd = sc.textFile("my_file.txt")
> print rdd.count()
>
> it returns 554420 lines. Any idea why this is happening? Is it using a
> different newline delimiter, and how can this be changed?
>
> Thank you,
> George
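
For anyone who finds this thread later, here is a rough sketch of why the two counts can disagree. `wc -l` counts only newline (LF) bytes, while the Hadoop line reader behind `textFile` treats CR, LF, and CRLF each as a line terminator by default, so every bare Control-M adds one extra record. The difference here (554420 - 483150 = 71270) would be consistent with that many stray carriage returns, though I have not seen the file. The helper names and the sample bytes below are made up for illustration, and the code is plain Python with no Spark involved.

```python
import re


def count_lf_lines(data: bytes) -> int:
    """Count LF bytes, which is the number `wc -l` reports."""
    return data.count(b"\n")


def count_bare_cr(data: bytes) -> int:
    """Count carriage returns that are not followed by a line feed."""
    return len(re.findall(rb"\r(?!\n)", data))


def count_records_any_terminator(data: bytes) -> int:
    """Count records when CR, LF, and CRLF each end a line.

    This imitates (as an assumption, not a copy of) the default
    behaviour of a reader that accepts all three terminators.
    """
    if not data:
        return 0
    pieces = re.split(rb"\r\n|\r|\n", data)
    # A trailing terminator leaves an empty last piece, not a record.
    if pieces[-1] == b"":
        pieces.pop()
    return len(pieces)


# Hypothetical sample: one bare CR, one LF, one CRLF, one LF.
sample = b"a\rb\nc\r\nd\n"

lf_lines = count_lf_lines(sample)
bare_cr = count_bare_cr(sample)
records = count_records_any_terminator(sample)

print(lf_lines, bare_cr, records)
```

To check a real file, read it in binary mode (`open("my_file.txt", "rb").read()`) and pass the bytes to the same functions. Stripping the carriage returns before loading should make the counts agree. I believe the Hadoop setting `textinputformat.record.delimiter` can also force a single delimiter, but I have not verified how to pass it from pyspark.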