Re: Efficient sampling from a Hive table

Thomas Dudziak Wed, 26 Aug 2015 09:13:23 -0700

Sorry, I meant without reading from all splits. This is a single partition
in the table.
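
To make the goal concrete: since I don't care which rows I get, reading whole
splits until enough rows are collected would be sufficient. Below is a rough
sketch of that idea in plain Python, not Spark code. The names `splits_needed`
and `take_from_splits` and the split counts are hypothetical, and the estimate
assumes rows are spread evenly across splits.

```python
def splits_needed(total_rows, total_splits, wanted_rows):
    """Estimate how many whole splits must be read.

    Assumes rows are spread evenly across splits (an assumption,
    not something the table guarantees).
    """
    # Integer ceiling of wanted_rows / (total_rows / total_splits).
    needed = -(-wanted_rows * total_splits // total_rows)
    return min(total_splits, needed)


def take_from_splits(splits, wanted_rows):
    """Read splits in order and stop once enough rows are collected.

    Returns the rows taken and the number of splits actually read.
    The result is NOT a uniform random sample: it is simply the
    first rows encountered.
    """
    taken = []
    splits_read = 0
    for split in splits:
        if len(taken) >= wanted_rows:
            break
        taken.extend(split)
        splits_read += 1
    return taken[:wanted_rows], splits_read


# Toy stand-in for the table: 1000 rows in 20 splits of 50 rows each.
toy_splits = [list(range(i * 50, (i + 1) * 50)) for i in range(20)]
rows, splits_read = take_from_splits(toy_splits, 100)
```

With the toy data, only the leading splits are touched and the rest are never
read. That is the behaviour I am after, at the cost of getting whichever rows
come first rather than a random sample.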


On Wed, Aug 26, 2015 at 8:53 AM, Thomas Dudziak <tom...@gmail.com> wrote:

> I have a sizeable table (2.5T, 1b rows) that I want to get ~100m rows from,
> and I don't particularly care which rows. Doing a LIMIT unfortunately
> results in two stages: the first stage reads the whole table, and the
> second then performs the limit with a single worker, which is not very
> efficient.
> Is there a better way to sample a subset of rows in Spark, ideally in a
> single stage and without reading all partitions?
>
> cheers,
> Tom
>
