Re: RDD saveAsText and DataFrame write.mode(SaveMode).text(Path) duplicating rows

2017-06-09 Thread Barona, Ricardo
Thanks Manjunath, please take a look at line 64 of https://github.com/apache/incubator-spot/blob/master/spot-ml/src/main/scala/org/apache/spot/proxy/ProxySuspiciousConnectsAnalysis.scala. I'm trying to get sample data but no luck so far. I will let you know if I get some. Thanks. From:

Re: RDD saveAsText and DataFrame write.mode(SaveMode).text(Path) duplicating rows

2017-06-09 Thread Manjunath, Kiran
Can you post your code and sample input? That should help us understand whether the bug is in the code you wrote or in the platform. Regards, Kiran From: "Barona, Ricardo" Date: Friday, June 9, 2017 at 10:47 PM To: "user@spark.apache.org"

RDD saveAsText and DataFrame write.mode(SaveMode).text(Path) duplicating rows

2017-06-09 Thread Barona, Ricardo
In Spark 1.6.0 I'm having an issue with saveAsText and write.mode(SaveMode).text(path): I have a DataFrame with 1M+ rows, and then I do dataFrame.limit(500).map(_.mkString("\t")).toDF("row").write.mode(SaveMode.Overwrite).text("myHDFSFolder/results"); when I check the results files, I see 900+
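
For context, a minimal sketch that mirrors the snippet above, with a synthetic DataFrame standing in for the 1M+ row one from the message (the data, app name and local master are assumptions; the output path is the one quoted above). It reproduces the reported pattern, it is not a fix for it.

    // Minimal sketch of the reported pattern (assumptions: Spark 1.6-style SQLContext,
    // synthetic data, local mode). The output path is the one quoted in the message.
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.{SQLContext, SaveMode}

    object LimitWriteSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("limit-write-sketch").setMaster("local[*]"))
        val sqlContext = new SQLContext(sc)
        import sqlContext.implicits._

        // Synthetic stand-in for the 1M+ row DataFrame described in the message.
        val dataFrame = sc.parallelize(1 to 1000000).map(i => (i, s"value_$i")).toDF("id", "value")

        // The write pattern from the message: take 500 rows, flatten each Row to a
        // tab-separated string, and write the result out as text.
        dataFrame
          .limit(500)
          .map(_.mkString("\t"))
          .toDF("row")
          .write
          .mode(SaveMode.Overwrite)
          .text("myHDFSFolder/results")

        sc.stop()
      }
    }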

Re: Need Spark(Scala) Performance Tuning tips

2017-06-09 Thread Debabrata Ghosh
Thanks a lot for your quick help! Further, I have two more points: a) I heard from my colleagues that if my Scala code uses RDDs, I should replace them with Datasets/DataFrames. Why is that? b) One of the operators, saveAsTextFile, is taking a long time. What would be the probable cause and
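
On point (a), a minimal sketch of what such a migration typically looks like, assuming a Spark 2.x SparkSession and a hypothetical word-count job (the input path, app name and column handling are illustrations, not code from this thread). Expressing the logic on DataFrames/Datasets lets the Catalyst optimizer and Tungsten execution plan the work, which is the usual argument for moving off raw RDDs.

    import org.apache.spark.sql.SparkSession

    object RddToDataFrameSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("rdd-to-df-sketch").master("local[*]").getOrCreate()
        import spark.implicits._

        // RDD style: the lambdas are opaque to Spark, so nothing can be optimized away.
        val rddCounts = spark.sparkContext
          .textFile("input.txt")
          .flatMap(_.split("\\s+"))
          .map(word => (word, 1))
          .reduceByKey(_ + _)

        // Dataset/DataFrame style: the same logic, expressed declaratively so Catalyst
        // can plan it (the single column of a Dataset[String] is named "value").
        val dfCounts = spark.read
          .textFile("input.txt")
          .flatMap(_.split("\\s+"))
          .groupBy("value")
          .count()

        println(rddCounts.count())
        dfCounts.show(10)
        spark.stop()
      }
    }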

Re: Need Spark(Scala) Performance Tuning tips

2017-06-09 Thread Gerard Maas
Also, read Holden's newest book, High Performance Spark: http://shop.oreilly.com/product/0636920046967.do On Fri, Jun 9, 2017 at 5:38 PM, Alonso Isidoro Roman wrote: > a quick search on google: > > https://www.cloudera.com/documentation/enterprise/5-9- >

Re: Need Spark(Scala) Performance Tuning tips

2017-06-09 Thread Alonso Isidoro Roman
a quick search on google: https://www.cloudera.com/documentation/enterprise/5-9-x/topics/admin_spark_tuning.html https://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-1/ http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/ and of course,

Re: StructuredStreaming : StreamingQueryException

2017-06-09 Thread aravias
The bug is related to long checkpoints being truncated when dealing with topics that have a large number of partitions, in my case 120. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/StructuredStreaming-StreamingQueryException-tp28749p28754.html Sent from the
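
For reference, a minimal sketch of the shape of query this thread is about, with hypothetical broker, topic, output and checkpoint values (none of them are from the thread). The checkpointLocation is the state that the report says runs into trouble for topics with many partitions.

    // Minimal sketch (assumptions: Spark 2.1 Structured Streaming, spark-sql-kafka-0-10
    // on the classpath, hypothetical broker/topic/output/checkpoint values).
    import org.apache.spark.sql.SparkSession

    object KafkaCheckpointSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("kafka-checkpoint-sketch").getOrCreate()

        val input = spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "topic-with-120-partitions")
          .load()

        val query = input
          .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
          .writeStream
          .format("parquet")
          .option("path", "hdfs:///data/out")
          .option("checkpointLocation", "hdfs:///checkpoints/kafka-query")
          .start()

        query.awaitTermination()
      }
    }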

Re: StructuredStreaming : StreamingQueryException

2017-06-09 Thread aravias
This is a bug in Spark 2.1.0; it seems to be fixed in Spark 2.1.1 when run with that version. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/StructuredStreaming-StreamingQueryException-tp28749p28753.html Sent from the Apache Spark User List mailing

RowMatrix: tallSkinnyQR

2017-06-09 Thread Arun
hi, for def tallSkinnyQR(computeQ: Boolean = false): QRDecomposition[RowMatrix, Matrix], in the output of this method Q is a distributed matrix and R is a local Matrix. What is the reason R is a local Matrix? -Arun
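
A minimal runnable sketch of the call in question, using tiny synthetic data (the data and local mode are assumptions). The usual intuition, not stated in this thread, is that for a tall-skinny n x k input with n >> k, R is only k x k and therefore small enough to return as a local Matrix, while Q has as many rows as the input and stays distributed.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    object TallSkinnyQRSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("tsqr-sketch").setMaster("local[*]"))

        // A 4 x 2 "tall and skinny" matrix built from dense rows.
        val rows = sc.parallelize(Seq(
          Vectors.dense(1.0, 2.0),
          Vectors.dense(3.0, 4.0),
          Vectors.dense(5.0, 6.0),
          Vectors.dense(7.0, 8.0)
        ))
        val mat = new RowMatrix(rows)

        val qr = mat.tallSkinnyQR(computeQ = true)
        println(qr.R)                         // k x k (here 2 x 2) local Matrix
        qr.Q.rows.collect().foreach(println)  // distributed RowMatrix with the input's row count

        sc.stop()
      }
    }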

Need Spark(Scala) Performance Tuning tips

2017-06-09 Thread Debabrata Ghosh
Hi, I need some help/guidance with performance tuning of Spark code written in Scala. Can you please help? Thanks

Re: [CSV] If number of columns of one row bigger than maxcolumns it stop the whole parsing process.

2017-06-09 Thread Chanh Le
Hi Takeshi, Thank you very much. Regards, Chanh On Thu, Jun 8, 2017 at 11:05 PM Takeshi Yamamuro wrote: > I filed a jira about this issue: > https://issues.apache.org/jira/browse/SPARK-21024 > > On Thu, Jun 8, 2017 at 1:27 AM, Chanh Le wrote: > >>
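
For anyone hitting the same issue, a minimal sketch of where the maxColumns knob sits on the CSV reader (the path and values are hypothetical); the behaviour described in the subject, where one over-wide row aborts the whole parse, is what SPARK-21024 tracks.

    // Minimal sketch (assumptions: Spark 2.x, a hypothetical CSV path and limits).
    import org.apache.spark.sql.SparkSession

    object CsvMaxColumnsSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("csv-maxcolumns-sketch").master("local[*]").getOrCreate()

        val df = spark.read
          .option("header", "true")
          .option("maxColumns", "20480")   // cap on columns per row enforced by the CSV parser
          .option("mode", "PERMISSIVE")    // tolerate malformed rows where the parser allows it
          .csv("data/input.csv")

        df.printSchema()
        spark.stop()
      }
    }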

Re: [Spark Core] Does spark support read from remote Hive server via JDBC

2017-06-09 Thread Даша Ковальчук
Hello, Ranadip! I tried your solution, but still have no results. Also I didn’t find anything in logs. Kerberos disabled, dfs.permissions = false. Thanks. 2017-06-08 20:52 GMT+03:00 Ranadip Chatterjee : > Looks like your session user does not have the required privileges on

Re: [Spark Core] Does spark support read from remote Hive server via JDBC

2017-06-09 Thread Даша Ковальчук
Thank you for your response. Yes, I tried this solution, and it works fine, but that solution is for a collocated Hive cluster. I need to query more than one remote cluster in one Spark session, and because of that I need to use a connection over JDBC. Maybe you know how to query more than one remote server
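
A minimal sketch of the JDBC approach under discussion, with hypothetical HiveServer2 URLs, table names and join key (none of them are from the thread); it assumes the Hive JDBC driver is on the classpath and, as in the thread, Kerberos is disabled. How well Spark's generic JDBC source cooperates with the Hive JDBC dialect is a separate question, so treat this as a sketch of the approach rather than a verified recipe.

    // Minimal sketch (assumptions: Spark 2.x, org.apache.hive:hive-jdbc on the classpath,
    // hypothetical HiveServer2 hosts, tables and join key).
    import org.apache.spark.sql.{DataFrame, SparkSession}

    object MultiClusterHiveJdbcSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("multi-hive-jdbc-sketch").getOrCreate()

        // Helper that reads one table from one remote HiveServer2 over JDBC.
        def hiveTable(jdbcUrl: String, table: String): DataFrame =
          spark.read
            .format("jdbc")
            .option("url", jdbcUrl)
            .option("driver", "org.apache.hive.jdbc.HiveDriver")
            .option("dbtable", table)
            .load()

        // One DataFrame per remote cluster, combined in a single Spark session.
        val clusterA = hiveTable("jdbc:hive2://cluster-a-host:10000/default", "db.events")
        val clusterB = hiveTable("jdbc:hive2://cluster-b-host:10000/default", "db.users")

        clusterA.join(clusterB, Seq("user_id")).show(10)

        spark.stop()
      }
    }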