[Pyspark 2.4] Large number of row groups in parquet files created using spark

2019-07-24 Thread Rishi Shah
Hi All, I have the following code which produces one 600MB parquet file as expected, however within this parquet file there are 42 row groups! I would expect it to create at most 6 row groups, could someone please shed some light on this? Is there any config setting which I can enable while submitting
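One knob commonly discussed for this symptom is Parquet's target row-group size, `parquet.block.size` (128 MB by default); that said, many small row groups inside a large file often indicate the writer flushing early under memory pressure rather than the setting itself. A hedged sketch of passing the setting at submit time (the script name `job.py` is a placeholder):

```shell
# Sketch only: raising parquet.block.size *may* reduce the row-group
# count, assuming memory pressure is not forcing early flushes.
# 134217728 bytes = 128 MB (the Parquet default, shown explicitly).
spark-submit \
  --conf spark.hadoop.parquet.block.size=134217728 \
  job.py
```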

[Spark SQL] dependencies to use test helpers

2019-07-24 Thread James Pirz
I have a Scala application in which I have added some extra rules to Catalyst. While adding some unit tests, I am trying to use some existing functions from Catalyst's test code, specifically comparePlans() and normalizePlan() under PlanTestBase
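A common way to reach helpers like PlanTestBase from an external project is to depend on Spark's published test jars via the `tests` classifier. A sketch, assuming an sbt build; the Spark version shown is illustrative:

```scala
// build.sbt sketch: pull in Spark's test-jar artifacts so that
// test-side classes (e.g. PlanTestBase, comparePlans, normalizePlan)
// are on the Test classpath. Version 2.4.3 is an assumption.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-catalyst" % "2.4.3" % Test classifier "tests",
  "org.apache.spark" %% "spark-sql"      % "2.4.3" % Test classifier "tests",
  "org.apache.spark" %% "spark-core"     % "2.4.3" % Test classifier "tests"
)
```

The `spark-core` and `spark-sql` test jars are included because Catalyst's test traits frequently extend helpers defined there.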

How to get Peak CPU Utilization Rate in Spark

2019-07-24 Thread Praups Kumar
Hi Spark dev, MapReduce can use ResourceCalculatorProcessTree in Task.java to get peak CPU utilization. The same is done at the YARN NodeManager level in ContainersMonitorImpl. However, I am not able to find any way to get the peak CPU utilization of an executor in Spark. Please help me
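Spark does not expose a ready-made peak-CPU metric for executors in this version, so any figure has to be derived the same way ResourceCalculatorProcessTree does: sample process CPU time against wall time. The sketch below illustrates that ratio for the current process using only the Python standard library; a real executor-side solution would instead poll the executor JVM's `/proc/<pid>/stat` or hook a listener, and `cpu_utilization_sample` is a name invented here:

```python
import time

def cpu_utilization_sample(work):
    """Average CPU utilization of `work` in this process.

    Illustrative only: Spark executors are separate JVMs, so a real
    implementation would sample the executor process externally (as
    YARN's ContainersMonitorImpl does) rather than in-process.
    """
    t0_wall = time.monotonic()
    t0_cpu = time.process_time()   # CPU seconds consumed by this process
    work()
    cpu = time.process_time() - t0_cpu
    wall = time.monotonic() - t0_wall
    return cpu / wall if wall > 0 else 0.0

# A busy loop should show high utilization of one core.
util = cpu_utilization_sample(lambda: sum(i * i for i in range(10**6)))
```

Tracking the *peak* would mean taking such samples on a short interval and keeping the maximum, which is exactly what the YARN monitor thread does.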

Re: Spark 2.3 Dataframe Grouby operation throws IllegalArgumentException on Large dataset

2019-07-24 Thread Chris Teoh
This might be a hint. Maybe invalid data? Caused by: java.lang.IllegalArgumentException: Missing required char ':' at 'struct>' On Wed., 24 Jul. 2019, 2:15 pm Balakumar iyer S, wrote: > Hi Bobby Evans, > > I apologise for the delayed response , yes you are right I missed out to > paste the