Hello Snehasish
If you are not using UDFs, you will have very similar performance with
those languages on SQL.
So it comes down to:
* if you know Python, go for Python.
* if you are used to the JVM, and are ready for a bit of a paradigm shift, go
for Scala.
Our team is using Scala, however we help o…
For info, our team has defined its own cogroup on DataFrames in the past,
on different projects and using different methods (RDD[Row]-based, or
union-all + collect_list-based).
I might be biased, but I find the approach very useful in projects to simplify
and speed up transformations, and to remove a lot of …
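For illustration, a minimal sketch of what the "union all + collect_list" variant can look like on DataFrames (this is my own reconstruction, not our actual code; the function and column names cogroupDf, left_rows and right_rows are made up):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// Sketch only: cogroup two DataFrames on a key column by wrapping each side's row
// in a struct, unioning the two sides, and collecting each side's rows per key.
// Assumes the key column has the same name and type on both sides.
def cogroupDf(left: DataFrame, right: DataFrame, key: String): DataFrame = {
  val lStruct = struct(left.columns.map(col): _*)
  val rStruct = struct(right.columns.map(col): _*)
  // Null placeholders cast to the other side's struct type so both selects union cleanly.
  val lType = left.select(lStruct).schema.head.dataType
  val rType = right.select(rStruct).schema.head.dataType

  val l = left.select(col(key), lStruct.as("l"), lit(null).cast(rType).as("r"))
  val r = right.select(col(key), lit(null).cast(lType).as("l"), rStruct.as("r"))

  // collect_list skips nulls, so each array only ends up holding its own side's rows.
  l.union(r)
    .groupBy(col(key))
    .agg(collect_list(col("l")).as("left_rows"), collect_list(col("r")).as("right_rows"))
}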
Hi Saikat,
You may be using the wrong mailing list for your question (=> spark user).
If you want to make a single string, it's:
red.collect.mkString("\n")
Be careful of driver explosion !
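For example, a self-contained sketch of that pattern (the object and variable names are made up for illustration):

import org.apache.spark.sql.SparkSession

object MkStringExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("mkstring-example").getOrCreate()
    val rdd = spark.sparkContext.parallelize(Seq("line 1", "line 2", "line 3"))
    // collect() pulls every element back to the driver, so keep this to small datasets.
    val single: String = rdd.collect().mkString("\n")
    println(single)
    spark.stop()
  }
}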
Cheers,
Jonathan
On Fri, 19 May 2017, 05:21 Saikat Kanjilal, wrote:
> One additional point, the following l
…of n, and I think it parallelises nicely for large values.
Please tell me what you think.
Have a nice day,
Jonathan
On 5 August 2015 at 19:18, Jonathan Winandy
wrote:
> Hello !
>
> You could try something like that :
>
> def exists[T](rdd:RDD[T])(f:T=>Boolean, n:Long):Boo
Hello !
You could try something like that :
def exists[T](rdd: RDD[T])(f: T => Boolean, n: Long): Boolean = {
  val context: SparkContext = rdd.sparkContext
  val grp: String = Random.alphanumeric.take(10).mkString
  context.setJobGroup(grp, "exist")
  val count: Accumulator[Long] = context.accumulato…
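The snippet above is cut off; here is a rough sketch of how the idea can be completed (my reconstruction, not the original code: count matches asynchronously, watch the accumulator from the driver, and cancel the job group once more than n matches have been seen):

import scala.util.Random
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Reconstruction sketch, using the Spark 2.x accumulator API rather than the
// old Accumulator[Long] shown above.
def exists[T](rdd: RDD[T])(f: T => Boolean, n: Long): Boolean = {
  val context: SparkContext = rdd.sparkContext
  val grp: String = Random.alphanumeric.take(10).mkString
  context.setJobGroup(grp, "exist")
  val count = context.longAccumulator("exist-count")

  // Count matches in the background so the driver can watch the accumulator.
  val job = rdd.foreachAsync(t => if (f(t)) count.add(1L))

  // The driver's view of the accumulator advances as tasks finish; once the
  // threshold is crossed, cancel whatever is left of the job group.
  while (!job.isCompleted && count.sum <= n) Thread.sleep(50)
  if (!job.isCompleted) context.cancelJobGroup(grp)

  count.sum > n
}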
Hello !
You could try something like that :
def exists[T](rdd: RDD[T])(f: T => Boolean, n: Int): Boolean = {
  rdd.filter(f).countApprox(timeout = 1).getFinalValue().low > n
}
It would work for large datasets and large values of n.
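For example (assuming a SparkContext named sc is already available; the data is made up):

// Are there more than 100 even numbers in the RDD?
val numbers = sc.parallelize(1 to 1000000)
val enoughEvens: Boolean = exists(numbers)(_ % 2 == 0, 100)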
Have a nice day,
Jonathan
On 31 July 2015 at 11:29, Carsten Sc
Hello !
Can both methods be compared in terms of performance? I tried the pull request
and it felt slow compared to manual mapping.
Cheers,
Jonathan
On Mon, Jul 27, 2015, 8:51 PM Reynold Xin wrote:
> There is this pull request: https://github.com/apache/spark/pull/5713
>
> We mean to merge it for
…ering each n-tuple of column values as the key (which is what the
>> groupBy is doing by default).
>>
>> Regards,
>>
>> Olivier
>>
>> 2015-07-20 14:18 GMT+02:00 Jonathan Winandy :
>>
>>> Ahoy !
>>>
>>> Maybe you can get
Ahoy !
Maybe you can get countByValue by using sql.GroupedData :
// some DF
val df: DataFrame = sqlContext.createDataFrame(
  sc.parallelize(List("A", "B", "B", "A")).map(Row.apply(_)),
  StructType(List(StructField("n", StringType))))
df.groupBy("n").count().show()
// generic
def countByValueDf(df:
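The generic version is cut off above; a minimal sketch of what it could look like (my own reconstruction, grouping on all columns of the DataFrame):

// Count occurrences of each distinct row, i.e. each n-tuple of column values.
def countByValueDf(df: DataFrame): DataFrame =
  df.groupBy(df.columns.map(df.col): _*).count()

// e.g. countByValueDf(df).show()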