Hello Ryan, On Mon, Feb 4, 2019 at 10:52 AM Ryan Blue <rb...@netflix.com> wrote: > > Andrew, can you give us more information about why partitioning the output > data doesn't work for your use case? > > It sounds like all you need to do is to create a table partitioned by A and > B, then you would automatically get the divisions you want. If what you're > looking for is a way to scale the number of combinations then you can use > formats that support more partitions, or you could sort by the fields and > rely on Parquet row group pruning to filter out data you don't want. >
TBH, I don't understand what that would look like in pyspark and what the consequences would be. Looking at the docs, it doesn't appear to be the syntax for partitioning on a condition (most of our conditions are of the form 'X > 30'). The use of Spark is still somewhat new in our field, so it's possible we're not using it correctly. Cheers Andrew > rb > > On Mon, Feb 4, 2019 at 8:33 AM Andrew Melo <andrew.m...@gmail.com> wrote: >> >> Hello >> >> On Sat, Feb 2, 2019 at 12:19 AM Moein Hosseini <moein...@gmail.com> wrote: >> > >> > I've seen many application need to split dataset to multiple datasets >> > based on some conditions. As there is no method to do it in one place, >> > developers use filter method multiple times. I think it can be useful to >> > have method to split dataset based on condition in one iteration, >> > something like partition method of scala (of-course scala partition just >> > split list into two list, but something more general can be more useful). >> > If you think it can be helpful, I can create Jira issue and work on it to >> > send PR. >> >> This would be a really useful feature for our use case (processing >> collision data from the LHC). We typically want to take some sort of >> input and split into multiple disjoint outputs based on some >> conditions. E.g. if we have two conditions A and B, we'll end up with >> 4 outputs (AB, !AB, A!B, !A!B). As we add more conditions, the >> combinatorics explode like n^2, when we could produce them all up >> front with this "multi filter" (or however it would be called). >> >> Cheers >> Andrew >> >> > >> > Best Regards >> > Moein >> > >> > -- >> > >> > Moein Hosseini >> > Data Engineer >> > mobile: +98 912 468 1859 >> > site: www.moein.xyz >> > email: moein...@gmail.com >> > >> >> --------------------------------------------------------------------- >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org >> > > > -- > Ryan Blue > Software Engineer > Netflix --------------------------------------------------------------------- To unsubscribe e-mail: dev-unsubscr...@spark.apache.org