Re: Feature request: split dataset based on condition

Andrew Melo Mon, 04 Feb 2019 13:48:39 -0800

Hi Ryan,

On Mon, Feb 4, 2019 at 12:17 PM Ryan Blue <[email protected]> wrote:
>
> To partition by a condition, you would need to create a column with the 
> result of that condition. Then you would partition by that column. The sort 
> option would also work here.


We actually do something similar to filter based on physics properties
by running a python UDF to create a column then filtering on that
column. Doing something similar to sort/partition would also require a
shuffle though, right?

>
> I don't think that there is much of a use case for this. You have a set of 
> conditions on which to partition your data, and partitioning is already 
> supported. The idea to use conditions to create separate data frames would 
> actually make that harder because you'd need to create and name tables for 
> each one.

At the end, however, we do need separate dataframes for each of these
subsamples, unless there's something basic I'm missing in how the
partitioning works. After the input datasets are split into signal and
background regions, we still need to perform further (different)
computations on each of the subsamples. e.g. for subsamples with
exactly 2 electrons, we'll need to calculate the sum of their 4-d
momenta, while samples with <2 electrons will need subtract two
different physical quantities -- several more steps before we get to
the point where we'll histogram the different subsamples for the
outputs.

Cheers
Andrew


>
> On Mon, Feb 4, 2019 at 9:16 AM Andrew Melo <[email protected]> wrote:
>>
>> Hello Ryan,
>>
>> On Mon, Feb 4, 2019 at 10:52 AM Ryan Blue <[email protected]> wrote:
>> >
>> > Andrew, can you give us more information about why partitioning the output 
>> > data doesn't work for your use case?
>> >
>> > It sounds like all you need to do is to create a table partitioned by A 
>> > and B, then you would automatically get the divisions you want. If what 
>> > you're looking for is a way to scale the number of combinations then you 
>> > can use formats that support more partitions, or you could sort by the 
>> > fields and rely on Parquet row group pruning to filter out data you don't 
>> > want.
>> >
>>
>> TBH, I don't understand what that would look like in pyspark and what
>> the consequences would be. Looking at the docs, it doesn't appear to
>> be the syntax for partitioning on a condition (most of our conditions
>> are of the form 'X > 30'). The use of Spark is still somewhat new in
>> our field, so it's possible we're not using it correctly.
>>
>> Cheers
>> Andrew
>>
>> > rb
>> >
>> > On Mon, Feb 4, 2019 at 8:33 AM Andrew Melo <[email protected]> wrote:
>> >>
>> >> Hello
>> >>
>> >> On Sat, Feb 2, 2019 at 12:19 AM Moein Hosseini <[email protected]> wrote:
>> >> >
>> >> > I've seen many application need to split dataset to multiple datasets 
>> >> > based on some conditions. As there is no method to do it in one place, 
>> >> > developers use filter method multiple times. I think it can be useful 
>> >> > to have method to split dataset based on condition in one iteration, 
>> >> > something like partition method of scala (of-course scala partition 
>> >> > just split list into two list, but something more general can be more 
>> >> > useful).
>> >> > If you think it can be helpful, I can create Jira issue and work on it 
>> >> > to send PR.
>> >>
>> >> This would be a really useful feature for our use case (processing
>> >> collision data from the LHC). We typically want to take some sort of
>> >> input and split into multiple disjoint outputs based on some
>> >> conditions. E.g. if we have two conditions A and B, we'll end up with
>> >> 4 outputs (AB, !AB, A!B, !A!B). As we add more conditions, the
>> >> combinatorics explode like n^2, when we could produce them all up
>> >> front with this "multi filter" (or however it would be called).
>> >>
>> >> Cheers
>> >> Andrew
>> >>
>> >> >
>> >> > Best Regards
>> >> > Moein
>> >> >
>> >> > --
>> >> >
>> >> > Moein Hosseini
>> >> > Data Engineer
>> >> > mobile: +98 912 468 1859
>> >> > site: www.moein.xyz
>> >> > email: [email protected]
>> >> >
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe e-mail: [email protected]
>> >>
>> >
>> >
>> > --
>> > Ryan Blue
>> > Software Engineer
>> > Netflix
>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix

---------------------------------------------------------------------
To unsubscribe e-mail: [email protected]

Re: Feature request: split dataset based on condition

Reply via email to