multiple group by action

2018-08-24 Thread 崔苗
Hi, we have some user data with columns (userId, company, client, country, region, city), and now we want to count userId by multiple columns, such as: select count(distinct userId) group by company; select count(distinct userId) group by company, client; select count(distinct userId) group by
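
A minimal sketch of one way to compute several of these groupings in a single job, assuming the data is available as a table named users (the table and column names come from the question; everything else is illustrative, not the poster's code):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.countDistinct

    val spark = SparkSession.builder().appName("multi-group-by").getOrCreate()
    val users = spark.table("users")  // assumed table holding the user data

    // cube() produces one row per combination of the listed columns
    // (company only, client only, company+client, and the grand total),
    // so the distinct-user counts for all groupings come out of one pass.
    users
      .cube("company", "client")
      .agg(countDistinct("userId").alias("distinct_users"))
      .show()

The same idea can be written in SQL with GROUP BY GROUPING SETS ((company), (company, client)).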

Re: Persisting driver logs in yarn client mode (SPARK-25118)

2018-08-24 Thread Marcelo Vanzin
I think this would be useful, but I also share Saisai's and Marco's concern about the extra step when shutting down the application. If that could be minimized, this would be a much more interesting feature. E.g., you could upload logs incrementally to HDFS, asynchronously, while the app is
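
A rough sketch of the kind of incremental, asynchronous upload being suggested, using the Hadoop FileSystem API from a background thread (the paths, interval, and thread handling here are illustrative assumptions, not part of SPARK-25118):

    import java.io.{File, FileInputStream}
    import java.util.concurrent.{Executors, TimeUnit}
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.hadoop.io.IOUtils

    val localLog = new File("/tmp/driver.log")            // assumed local driver log
    val hdfsTarget = new Path("/spark-logs/driver.log")   // assumed HDFS destination
    val fs = FileSystem.get(new Configuration())

    // Periodically push the latest snapshot of the log to HDFS so there is
    // little or nothing left to upload when the application shuts down.
    val scheduler = Executors.newSingleThreadScheduledExecutor()
    scheduler.scheduleAtFixedRate(new Runnable {
      override def run(): Unit = {
        val out = fs.create(hdfsTarget, true)             // overwrite previous snapshot
        val in = new FileInputStream(localLog)
        IOUtils.copyBytes(in, out, 4096, true)            // closes both streams
      }
    }, 30, 30, TimeUnit.SECONDS)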

Re: python tests: any reason for a huge tests.py?

2018-08-24 Thread Reynold Xin
We should break it. On Fri, Aug 24, 2018 at 9:53 AM Imran Rashid wrote:
> Hi,
>
> another question from looking more at Python recently. Is there any
> reason we've got a ton of tests in one humongous tests.py file, rather than
> breaking it out into smaller files?
>
> Having one huge file

python tests: any reason for a huge tests.py?

2018-08-24 Thread Imran Rashid
Hi, another question from looking more at Python recently. Is there any reason we've got a ton of tests in one humongous tests.py file, rather than breaking it out into smaller files? Having one huge file doesn't seem great for code organization, and it also makes the test parallelization in

Re: Spark data quality bug when reading parquet files from hive metastore

2018-08-24 Thread Driesprong, Fokko
Hi Andrew, This blog gives an idea of how the schema is resolved: https://blog.godatadriven.com/multiformat-spark-partition There is some optimisation going on when reading Parquet using Spark. Hope this helps. Cheers, Fokko. On Wed, 22 Aug 2018 at 23:59, t4 wrote: >
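
The optimisation referred to is most likely Spark's conversion of Hive metastore Parquet tables to its native Parquet reader. A small hedged illustration of the knobs involved, assuming a SparkSession available as spark (as in spark-shell); the table path is hypothetical, and disabling the conversion is only a way to narrow the problem down, not a recommended fix:

    // Fall back to the Hive SerDe instead of Spark's built-in Parquet reader,
    // to check whether the native reader's schema resolution is the culprit.
    spark.conf.set("spark.sql.hive.convertMetastoreParquet", "false")

    // Schema merging across Parquet part files is off by default; it can be
    // requested per read when partitions were written with different schemas.
    val df = spark.read.option("mergeSchema", "true").parquet("/path/to/table")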

Off Heap Memory

2018-08-24 Thread Jack Kolokasis
Hello, I recently started studying Spark's memory management system. My question is about the offHeapExecutionMemoryPool and offHeapStorageMemoryPool. 1. How does Spark use the offHeapExecutionMemoryPool? 2. How is the offHeap memory used (I understand the allocation side), but it is
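
For context, a minimal configuration sketch (these are standard Spark settings; the application name and size value are arbitrary examples): the off-heap execution and storage pools are only backed by real memory once off-heap use is explicitly enabled and sized.

    import org.apache.spark.sql.SparkSession

    // With these settings the unified memory manager carves the off-heap
    // execution and storage pools out of spark.memory.offHeap.size bytes
    // allocated outside the JVM heap; with the default (disabled), the
    // off-heap pools are not used at all.
    val spark = SparkSession.builder()
      .appName("offheap-demo")
      .config("spark.memory.offHeap.enabled", "true")
      .config("spark.memory.offHeap.size", "2g")
      .getOrCreate()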