Re: Benchmark Java/Scala/Python for Apache spark

2019-03-11 Thread Jonathan Winandy
Hello Snehasish If you are not using UDFs, you will have very similar performance with those languages on SQL. So it comes down to: * if you know Python, go for Python. * if you are used to the JVM, and are ready for a bit of a paradigm shift, go for Scala. Our team is using Scala, however we help o
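
The claim above hinges on where the work actually runs: DataFrame/SQL expressions built from built-in functions are compiled into a Catalyst plan and executed by the same engine whichever frontend language produced them, while UDFs are opaque to the optimiser (and Python UDFs add serialisation overhead). A minimal Scala sketch of the contrast; all names (demo, df, upperUdf) are assumed rather than taken from the thread:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf, upper}

def demo(spark: SparkSession): Unit = {
  import spark.implicits._
  val df = Seq("a", "b", "c").toDF("letter")

  // Built-in function: fully optimised, language-independent Catalyst plan.
  df.select(upper(col("letter"))).show()

  // UDF: a black box for the optimiser (and, in Python, serialisation overhead too).
  val upperUdf = udf((s: String) => s.toUpperCase)
  df.select(upperUdf(col("letter"))).show()
}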

Re: Thoughts on dataframe cogroup?

2019-02-25 Thread Jonathan Winandy
For info, our team has defined its own cogroup on dataframes in the past, on different projects and using different methods (RDD[Row] based or union-all + collect_list based). I might be biased, but I find the approach very useful in projects to simplify and speed up transformations, and remove a lot of
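
As an illustration only (not the team's actual helpers, which the archive does not show), here is a minimal sketch of a DataFrame "cogroup" on a single string key column, using groupBy + collect_list on each side followed by a full outer join; the RDD[Row] based and union-all + collect_list based variants mentioned above would be structured differently:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, collect_list, struct}

def cogroupDf(left: DataFrame, right: DataFrame, key: String): DataFrame = {
  // Collect each side's non-key columns into one list of structs per key value.
  def grouped(df: DataFrame, out: String): DataFrame =
    df.groupBy(col(key))
      .agg(collect_list(struct(df.columns.filter(_ != key).map(col): _*)).as(out))

  // A full outer join keeps keys that appear on only one side, like RDD.cogroup does.
  grouped(left, "left_rows").join(grouped(right, "right_rows"), Seq(key), "full_outer")
}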

Re: Spark madness

2017-05-22 Thread Jonathan Winandy
Hi Saikat, You may be using the wrong mailing list for your question (=> spark user). If you want to make a single string, it's: red.collect.mkString("\n") Be careful of driver explosion! Cheers, Jonathan On Fri, 19 May 2017, 05:21 Saikat Kanjilal, wrote: > One additional point, the following l
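
A small sketch of the advice above; the identifier red comes from the message, everything else is illustrative:

import org.apache.spark.SparkContext

def joinLines(sc: SparkContext): String = {
  val red = sc.parallelize(Seq("line 1", "line 2", "line 3"))
  // collect() ships every element to the driver first: fine for small data,
  // but this is exactly the "driver explosion" risk on large RDDs.
  red.collect().mkString("\n")
  // Safer when a preview is enough: red.take(100).mkString("\n")
}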

Re:

2015-08-06 Thread Jonathan Winandy
of n and I think it parallelises nicely for large values. Please tell me what you think. Have a nice day, Jonathan On 5 August 2015 at 19:18, Jonathan Winandy wrote: > Hello ! > > You could try something like that : > > def exists[T](rdd:RDD[T])(f:T=>Boolean, n:Long):Boo

Re:

2015-08-05 Thread Jonathan Winandy
Hello ! You could try something like that : def exists[T](rdd:RDD[T])(f:T=>Boolean, n:Long):Boolean = { val context: SparkContext = rdd.sparkContext val grp: String = Random.alphanumeric.take(10).mkString context.setJobGroup(grp, "exist") val count: Accumulator[Long] = context.accumulato
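
The archive cuts the snippet off. Below is a hedged reconstruction of the idea using the current accumulator API: count matches in an accumulator while the scan runs in a background thread, and cancel the job group from the driver once at least n matches have been seen. Everything beyond the visible excerpt (the polling loop, cancellation, clean-up) is an assumption, not Jonathan's exact code:

import scala.concurrent.{ExecutionContext, Future}
import scala.util.Random
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

def exists[T](rdd: RDD[T])(f: T => Boolean, n: Long): Boolean = {
  val context: SparkContext = rdd.sparkContext
  val grp: String = Random.alphanumeric.take(10).mkString
  val count = context.longAccumulator("exist")

  import ExecutionContext.Implicits.global
  val job = Future {
    // setJobGroup is thread-local, so it must be set in the thread that submits the job.
    context.setJobGroup(grp, "exist")
    rdd.foreach(x => if (f(x)) count.add(1L))
  }

  // Poll the accumulator on the driver and stop the scan early once n matches exist.
  while (!job.isCompleted && count.value < n) Thread.sleep(50)
  if (!job.isCompleted) context.cancelJobGroup(grp)

  count.value >= n
}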

Re: New Feature Request

2015-07-31 Thread Jonathan Winandy
Hello ! You could try something like that : def exists[T](rdd:RDD[T])(f:T=>Boolean, n:Int):Boolean = { rdd.filter(f).countApprox(timeout = 1).getFinalValue().low > n } It would work for large datasets and large values of n. Have a nice day, Jonathan On 31 July 2015 at 11:29, Carsten Sc
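
Restating the one-liner above with a small usage example (nums and demo are illustrative names): getFinalValue() waits for the final count and .low is its lower bound, so the check is conservative:

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

def exists[T](rdd: RDD[T])(f: T => Boolean, n: Int): Boolean =
  rdd.filter(f).countApprox(timeout = 1).getFinalValue().low > n

def demo(sc: SparkContext): Boolean = {
  val nums: RDD[Int] = sc.parallelize(1 to 1000000)
  // "Are there more than 100 even numbers in the RDD?"
  exists(nums)(_ % 2 == 0, 100)
}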

Re: Converting DataFrame to RDD of case class

2015-07-27 Thread Jonathan Winandy
Hello ! Can both methods be compared in terms of performance? I tried the pull request and it felt slow compared to manual mapping. Cheers, Jonathan On Mon, Jul 27, 2015, 8:51 PM Reynold Xin wrote: > There is this pull request: https://github.com/apache/spark/pull/5713 > > We mean to merge it for
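
For context, a hedged sketch of the two approaches being compared; the names (Person, df, demo) are illustrative, and the encoder-based variant is shown with the Dataset-era .as[T] API rather than the exact code from the pull request:

import org.apache.spark.sql.{DataFrame, SparkSession}

case class Person(name: String, age: Int)

def demo(spark: SparkSession): Unit = {
  import spark.implicits._
  val df: DataFrame = Seq(("Alice", 34), ("Bob", 29)).toDF("name", "age")

  // Manual mapping: pull each field out of the Row by name.
  val manual = df.rdd.map(row => Person(row.getAs[String]("name"), row.getAs[Int]("age")))

  // Encoder-based: let Spark derive the conversion from the case class.
  val typed = df.as[Person].rdd

  println(manual.count() == typed.count())
}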

Re: countByValue on dataframe with multiple columns

2015-07-21 Thread Jonathan Winandy
ering each n-tuple of column values as the key (which is what the groupBy is doing by default). >> >> Regards, >> >> Olivier >> >> 2015-07-20 14:18 GMT+02:00 Jonathan Winandy : >> >>> Ahoy ! >>> >>> Maybe you can get

Re: countByValue on dataframe with multiple columns

2015-07-20 Thread Jonathan Winandy
Ahoy ! Maybe you can get countByValue by using sql.GroupedData : // some DF val df: DataFrame = sqlContext.createDataFrame(sc.parallelize(List("A", "B", "B", "A")).map(Row.apply(_)), StructType(List(StructField("n", StringType)))) df.groupBy("n").count().show() // generic def countByValueDf(df:
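
The generic helper is cut off by the archive. A hedged sketch of what a countByValueDf could look like, grouping on every column so each distinct row (the n-tuple of column values) is counted:

import org.apache.spark.sql.DataFrame

def countByValueDf(df: DataFrame): DataFrame =
  df.groupBy(df.columns.map(df(_)): _*).count()

// e.g. countByValueDf(df).show() lists each distinct row together with its frequency.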