Multiple column aggregations

2019-02-08 Thread Sonu Jyotshna
Hello, I have a requirement where I need to group by multiple columns and aggregate them, but not all at the same time. I have a structure containing accountid, some other columns, and orderid. I need to handle scenarios such as an account having multiple orders, so grouping by account and aggregating will work
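
A minimal PySpark sketch of the grouping described above; the column names (accountid, orderid) and the threshold are assumptions based on the question, not code from the thread:

from pyspark.sql import SparkSession
from pyspark.sql.functions import countDistinct

spark = SparkSession.builder.getOrCreate()

# Toy data standing in for the poster's table: accountid, some columns, orderid.
df = spark.createDataFrame(
    [("a1", "x", "o1"), ("a1", "y", "o2"), ("a2", "z", "o3")],
    ["accountid", "somecol", "orderid"])

# Group by account, count distinct orders, then keep accounts with several orders.
multi_order_accounts = (df.groupBy("accountid")
                          .agg(countDistinct("orderid").alias("n_orders"))
                          .filter("n_orders > 1"))
multi_order_accounts.show()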

Pyspark elementwise matrix multiplication

2019-02-08 Thread Simon Dirmeier
Dear all, I wonder if there is a way to take the element-wise product of 2 matrices (RowMatrix, DistributedMatrix, ...) in pyspark? I cannot find a good answer/API entry on the topic. Thank you for all the help. Best, Simon

Element-wise multiplication in Pyspark

2019-02-08 Thread Simon Dirmeier
Dear all, is there a way to take the element-wise product of 2 matrices in pyspark, e.g. RowMatrix, DistributedMatrix? I cannot find a good answer/API entry. Thanks for all the help. Best, Simon
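
As far as I know there is no element-wise (Hadamard) product in the pyspark.mllib distributed-matrix API. One possible workaround, sketched below under the assumption that both matrices have identically partitioned and aligned rows, is to zip the underlying row RDDs and multiply row by row:

from pyspark import SparkContext
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.linalg.distributed import RowMatrix

sc = SparkContext.getOrCreate()

a = RowMatrix(sc.parallelize([Vectors.dense([1.0, 2.0]), Vectors.dense([3.0, 4.0])], 2))
b = RowMatrix(sc.parallelize([Vectors.dense([5.0, 6.0]), Vectors.dense([7.0, 8.0])], 2))

# rdd.zip requires both RDDs to have the same number of partitions and the same
# number of elements per partition, so the rows must already be aligned.
product_rows = a.rows.zip(b.rows).map(
    lambda pair: Vectors.dense(pair[0].toArray() * pair[1].toArray()))
hadamard = RowMatrix(product_rows)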

Re: Spark 2.4 partitions and tasks

2019-02-08 Thread Pedro Tuero
I did a repartition to 1 (hardcoded) before the keyBy and it finishes in 1.2 minutes. The questions remain open, because I don't want to hardcode parallelism. On Fri., Feb. 8, 2019 at 12:50, Pedro Tuero (tuerope...@gmail.com) wrote: > 128 is the default parallelism defined for the
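
A minimal sketch of one way to avoid the hardcoded repartition, derived from the discussion rather than from the thread's actual job: take the target partition count from the parent RDD and pass it explicitly to the shuffle that follows the keyBy:

from pyspark import SparkContext

sc = SparkContext.getOrCreate()
rdd = sc.parallelize(range(1000000))   # stands in for the real input RDD

keyed = rdd.keyBy(lambda x: x % 1000)

# Reuse the parent's partition count instead of a hardcoded number, so the
# shuffle does not fall back to spark.default.parallelism (128 on this cluster).
target = rdd.getNumPartitions()
counts = keyed.reduceByKey(lambda a, b: a + b, numPartitions=target)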

Re: Spark 2.4 partitions and tasks

2019-02-08 Thread Pedro Tuero
128 is the default parallelism defined for the cluster. The question now is why the keyBy operation is using the default parallelism instead of the number of partitions of the RDD created by the previous step (5580). Any clues? On Thu., Feb. 7, 2019 at 15:30, Pedro Tuero (tuerope...@gmail.com)
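
A quick diagnostic sketch (names and numbers assumed, not taken from the thread) for seeing where the 128 comes from: print the partition count at each step and compare it with the cluster default:

from pyspark import SparkContext

sc = SparkContext.getOrCreate()
prev_rdd = sc.parallelize(range(10000), 5580)   # stands in for the 5580-partition RDD

print(sc.defaultParallelism)            # e.g. 128, the cluster default
print(prev_rdd.getNumPartitions())      # 5580
keyed = prev_rdd.keyBy(lambda x: x % 10)
print(keyed.getNumPartitions())         # keyBy itself is a narrow map, so still 5580;
                                        # a following shuffle is where the default can kick in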

Re: Aws

2019-02-08 Thread Pedro Tuero
Hi Noritaka, I start clusters from the Java API. Clusters running on 5.16 have no manual configurations in the EMR console Configuration tab, so I assume the value of this property should be the default on 5.16. I enabled maximize resource allocation because otherwise, the number of cores
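
For reference, and assuming the property in question is EMR's maximizeResourceAllocation setting, it is normally passed through the EMR Configurations API at cluster launch. A hedged sketch using boto3 (the thread uses the Java API, but the configuration shape is the same; names and instance types are placeholders):

import boto3

emr = boto3.client("emr", region_name="us-east-1")   # region is an assumption

response = emr.run_job_flow(
    Name="spark-cluster",
    ReleaseLabel="emr-5.16.0",
    Applications=[{"Name": "Spark"}],
    Configurations=[{
        # EMR's "spark" classification; maximizeResourceAllocation sizes executors
        # to use all of the cores and memory of each node.
        "Classification": "spark",
        "Properties": {"maximizeResourceAllocation": "true"},
    }],
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)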

Spark 2.4 Regression with posexplode and structs

2019-02-08 Thread Andreas Weise
Hi, after upgrading from 2.3.2 to 2.4.0 we noticed a regression when using posexplode() in conjunction with selecting fields of another struct column. Given a structure like this: >>> df = (spark.range(1) ... .withColumn("my_arr", array(lit("1"), lit("2"))) ...
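
The snippet in the message is cut off; a hypothetical reconstruction of the kind of query being described (posexplode of an array column selected together with a field of a separate struct column) might look like this:

from pyspark.sql import SparkSession
from pyspark.sql.functions import array, lit, struct, posexplode

spark = SparkSession.builder.getOrCreate()

df = (spark.range(1)
      .withColumn("my_arr", array(lit("1"), lit("2")))
      .withColumn("my_struct", struct(lit("a").alias("field_a"))))

# Selecting a struct field alongside posexplode is the combination reported
# to have regressed between 2.3.2 and 2.4.0.
df.select(posexplode("my_arr"), "my_struct.field_a").show()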