Re: RDD saveAsText and DataFrame write.mode(SaveMode).text(Path) duplicating rows

Barona, Ricardo Fri, 09 Jun 2017 13:09:29 -0700

Thanks Manjunath, please take a look at line 64

https://github.com/apache/incubator-spot/blob/master/spot-ml/src/main/scala/org/apache/spot/proxy/ProxySuspiciousConnectsAnalysis.scala

I’m trying to get sample data, but no luck so far. I will let you know if I 
get some.

Thanks.

From: "Manjunath, Kiran" <[email protected]>
Date: Friday, June 9, 2017 at 1:47 PM
To: "Barona, Ricardo" <[email protected]>, "[email protected]" 
<[email protected]>
Subject: Re: RDD saveAsText and DataFrame write.mode(SaveMode).text(Path) 
duplicating rows

Can you post your code and sample input?
That should help us understand whether the bug is in your code or in the 
platform.

Regards,
Kiran

From: "Barona, Ricardo" <[email protected]>
Date: Friday, June 9, 2017 at 10:47 PM
To: "[email protected]" <[email protected]>
Subject: RDD saveAsText and DataFrame write.mode(SaveMode).text(Path) 
duplicating rows

In Spark 1.6.0 I’m having an issue with saveAsTextFile and 
write.mode(...).text(...). I have a data frame with 1M+ rows and then I do:

dataFrame.limit(500).map(_.mkString("\t")).toDF("row").write.mode(SaveMode.Overwrite).text("myHDFSFolder/results")

Then, when I check the results file, I see 900+ rows instead of the expected 
500. On further analysis, I found that some of the rows are duplicated.

Does anyone know if this is something that has been reported before?

The only unusual characteristic of my data is that one column holds values 
longer than 2000 characters.

Appreciate your help, thanks.

Cheers,
Ricardo Barona
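
One way to confirm the duplication described in this message is to count how 
often each line appears in the written output. The following is only a minimal 
sketch in plain Scala, not the code used in the thread: the names 
DuplicateCheck and duplicateCounts and the sample lines are hypothetical, and 
in practice the lines would be read back from the results folder.

```scala
// Hypothetical helper for illustration; not part of Spark or the Spot code base.
object DuplicateCheck {

  /** Returns every line that occurs more than once, with its occurrence count. */
  def duplicateCounts(lines: Seq[String]): Map[String, Int] =
    lines.groupBy(identity).collect {
      case (line, occurrences) if occurrences.size > 1 => line -> occurrences.size
    }

  def main(args: Array[String]): Unit = {
    // Made-up sample standing in for the lines of the written results file.
    val sample = Seq("a\t1", "b\t2", "a\t1", "c\t3", "a\t1")

    // Comparing total and distinct counts shows whether any duplication exists.
    println(s"total lines: ${sample.size}, distinct lines: ${sample.distinct.size}")

    // List each duplicated line with the number of times it was written.
    duplicateCounts(sample).foreach { case (line, n) => println(s"$n x $line") }
  }
}
```

If the distinct count matches the requested limit while the total is higher, 
the extra rows are exact copies rather than additional distinct rows.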
