Hi,
we have some user data with columns (userId, company, client, country, region, city),
and now we want to count distinct userId grouped by multiple columns, such as:
select count(distinct userId) group by company
select count(distinct userId) group by company,client
select count(distinct userId) group by
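A minimal sketch of these per-group distinct counts with the DataFrame API; the table name user_data and the variable names are only illustrative:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.countDistinct

val spark = SparkSession.builder().appName("distinct-user-counts").getOrCreate()
// Assumed source: a table with columns userId, company, client, country, region, city
val users = spark.table("user_data")

// Distinct users per company
users.groupBy("company").agg(countDistinct("userId").as("users")).show()

// Distinct users per (company, client)
users.groupBy("company", "client").agg(countDistinct("userId").as("users")).show()

If many grouping combinations are wanted in a single pass, cube and rollup on DataFrames, or GROUPING SETS in SQL, cover the same ground.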
I think this would be useful, but I also share Saisai's and Marco's
concern about the extra step when shutting down the application. If
that could be minimized, this would be a much more interesting feature;
e.g., you could upload logs incrementally to HDFS, asynchronously,
while the app is
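For illustration only, here is a rough sketch of that kind of incremental, asynchronous upload using the Hadoop FileSystem API; the object and method names are hypothetical and this is not an existing Spark feature:

import java.util.concurrent.{Executors, TimeUnit}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical helper: copy a local event-log file to HDFS on a fixed
// schedule in a background thread, so nothing heavy is left for shutdown.
object IncrementalLogUploader {
  def start(localLog: java.io.File, hdfsDir: String): Unit = {
    val scheduler = Executors.newSingleThreadScheduledExecutor()
    val task = new Runnable {
      override def run(): Unit = {
        val fs = FileSystem.get(new java.net.URI(hdfsDir), new Configuration())
        // Overwrite the previous snapshot of the log with the latest contents.
        fs.copyFromLocalFile(false, true,
          new Path(localLog.getAbsolutePath), new Path(hdfsDir, localLog.getName))
      }
    }
    scheduler.scheduleAtFixedRate(task, 30, 30, TimeUnit.SECONDS)
  }
}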
We should break it.
On Fri, Aug 24, 2018 at 9:53 AM Imran Rashid wrote:
> Hi,
>
> another question from looking more at python recently. Is there any
> reason we've got a ton of tests in one humongous tests.py file, rather than
> breaking it out into smaller files?
>
> Having one huge file
Hi,
another question from looking more at Python recently. Is there any reason
we've got a ton of tests in one humongous tests.py file, rather than
breaking it out into smaller files?
Having one huge file doesn't seem great for code organization, and it also
makes the test parallelization in
Hi Andrew,
This blog gives an idea of how the schema is resolved:
https://blog.godatadriven.com/multiformat-spark-partition There is some
optimisation going on when reading Parquet using Spark. Hope this helps.
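To make the partition-column behaviour concrete, here is a small sketch; the directory layout and paths are only an example, and mergeSchema is the standard Parquet option for reconciling differing part-file schemas:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("partition-schema").getOrCreate()

// Example layout: /data/events/country=NL/part-00000.parquet
//                 /data/events/country=US/part-00000.parquet
// The partition column country is inferred from the directory names,
// not read from the Parquet footers.
val events = spark.read
  .option("mergeSchema", "true")  // merge schemas of individual part files
  .parquet("/data/events")

events.printSchema()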
Cheers, Fokko
On Wed, 22 Aug 2018 at 23:59, t4 wrote:
Hello,
I recently started studying Spark's memory management system. My
question is about the offHeapExecutionMemoryPool and
offHeapStorageMemoryPool.
1. How does Spark use the offHeapExecutionMemoryPool?
2. How is the offHeap memory used (I understand the allocation side),
but it is
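For what it's worth, the off-heap pools only receive memory when off-heap use is switched on. A minimal sketch of the relevant configuration; the 2 GB size is just an example value:

import org.apache.spark.sql.SparkSession

// With these settings the unified memory manager tracks execution and
// storage memory in the off-heap pools; the off-heap budget is split
// between offHeapExecutionMemoryPool and offHeapStorageMemoryPool.
val spark = SparkSession.builder()
  .appName("offheap-example")
  .config("spark.memory.offHeap.enabled", "true")
  .config("spark.memory.offHeap.size", 2L * 1024 * 1024 * 1024)  // bytes
  .getOrCreate()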