Hi Julien,

One quick and easy-to-implement idea is to use sampling on your dataset,
i.e., sample a large enough subset of your data and check which columns
contain only a single value within that sample. Repeat the process a few
times and then run the full test only on the surviving columns.

This will allow you to load only a subset of your dataset if it is stored
in Parquet.
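
A rough, untested sketch of the idea in Scala (df is your DataFrame and
fraction is a sample size you would tune yourself):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{approx_count_distinct, countDistinct}

def constantColumns(df: DataFrame, fraction: Double = 0.01): Seq[String] = {
  // Cheap pre-filter: any column showing more than one distinct value in a
  // small sample is certainly not constant and can be kept right away.
  val sampleCounts = df.sample(withReplacement = false, fraction)
    .select(df.columns.map(c => approx_count_distinct(c).alias(c)): _*)
    .head()
  val candidates = df.columns.filter(c => sampleCounts.getAs[Long](c) <= 1)

  // Exact check, but only over the surviving candidate columns.
  // Note: countDistinct ignores nulls, so an all-null column counts as constant too.
  if (candidates.isEmpty) Seq.empty
  else {
    val fullCounts = df
      .select(candidates.map(c => countDistinct(c).alias(c)): _*)
      .head()
    candidates.filter(c => fullCounts.getAs[Long](c) <= 1).toSeq
  }
}

// and then: df.drop(constantColumns(df): _*)

You can of course repeat the sampling step a couple of times before paying
for the full scan on the remaining candidates.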

Best,
Anastasios

On Thu, May 31, 2018 at 10:34 AM, <julio.ces...@free.fr> wrote:

> Hi there !
>
> I have a potentially large dataset (in terms of both rows and columns),
>
> and I want to find the fastest way to drop the columns that are useless to
> me, i.e. the columns containing only a single unique value!
>
> I'd like to know what you think I could do to make this as fast as
> possible using Spark.
>
>
> I already have a solution using distinct().count() or approxCountDistinct(),
> but these may not be the best choice, as they require going through all
> the data, even when the first 2 values tested in a column already differ
> (in which case I know I can keep the column).
>
>
> Thanks for your ideas!
>
> Julien
>


-- 
-- Anastasios Zouzias
<a...@zurich.ibm.com>
