Re: Spark Data Frame. PreSorded partitions

2017-12-04 Thread Jörn Franke
Well, usually you sort only on a certain column and not on all columns, so most of the columns will always be unsorted. Spark may then still need to sort if, for example, you join (for some join types) on an unsorted column. That being said, depending on the data you may not want to sort it, but cluster

Re: Spark Data Frame. PreSorded partitions

2017-12-04 Thread Li Jin
Sorry, s/ordered distributed/ordered distribution/g
On Mon, Dec 4, 2017 at 10:37 AM, Li Jin wrote:
> Just to give another data point: most of the data we use with Spark are
> sorted on disk, having a way to allow data source to pass ordered
> distributed to DataFrames is

Re: Spark Data Frame. PreSorded partitions

2017-12-04 Thread Li Jin
Just to give another data point: most of the data we use with Spark is sorted on disk, so having a way to allow a data source to pass ordered distributed to DataFrames is really useful for us.
On Mon, Dec 4, 2017 at 9:12 AM, Николай Ижиков wrote:
> Hello, guys.
>
> Thank you

Re: Spark Data Frame. PreSorded partitions

2017-12-04 Thread Wenchen Fan
Data Source V2 is still under development. Ordering reporting is one of the planned features, but it's not done yet. We are still thinking about what the API should be, e.g. we need to include sort order, nulls first/last, and other sorting-related properties. On Mon, Dec 4, 2017 at 10:12 PM,
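To make the properties mentioned above concrete, here is a minimal Python sketch of what an ordering report might need to carry. It is not the Data Source V2 API (which, per the message, was not yet designed); `ReportedOrdering` and `honors` are hypothetical names made up for illustration, and the sketch assumes numeric column values.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

Row = Dict[str, Optional[int]]


@dataclass(frozen=True)
class ReportedOrdering:
    """Hypothetical description of how a source has ordered its rows."""
    column: str
    ascending: bool = True
    nulls_first: bool = True


def _key(row: Row, ordering: ReportedOrdering) -> Tuple[int, int]:
    value = row[ordering.column]
    if value is None:
        # Nulls rank before (0) or after (2) every non-null value (1).
        return (0 if ordering.nulls_first else 2, 0)
    # Negating the value turns a descending order into an ascending check.
    return (1, value if ordering.ascending else -value)


def honors(rows: List[Row], ordering: ReportedOrdering) -> bool:
    """Return True if consecutive rows never violate the reported ordering."""
    keys = [_key(row, ordering) for row in rows]
    return all(a <= b for a, b in zip(keys, keys[1:]))


rows = [{"x": None}, {"x": 1}, {"x": 3}]
nulls_first_ok = honors(rows, ReportedOrdering("x"))
nulls_last_ok = honors(rows, ReportedOrdering("x", nulls_first=False))
```

The same rows satisfy "ascending, nulls first" but not "ascending, nulls last", which is why null placement has to be part of whatever a source reports.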

Re: Spark Data Frame. PreSorded partitions

2017-12-04 Thread Николай Ижиков
Hello, guys. Thank you for the answers!
> I think pushing down a sort could make a big difference.
> You can, however, propose that it be included in the data source API 2.
Jörn, are you talking about this JIRA issue? - https://issues.apache.org/jira/browse/SPARK-15689 Is there any additional

Re: Spark Data Frame. PreSorded partitions

2017-12-04 Thread Holden Karau
I think pushing down a sort (or, really, the case where the data is already naturally returned in sorted order on some column) could make a big difference. Probably the simplest argument for a lot of time being spent sorting (in some use cases) is the fact that it's still one of the standard
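As a rough illustration of why already-sorted input can matter, the sketch below uses plain Python rather than Spark. It assumes each partition arrives sorted on the ordering column; the data and the names `merge_presorted` and `full_sort` are hypothetical.

```python
import heapq
from typing import Iterable, List


def merge_presorted(partitions: Iterable[List[int]]) -> List[int]:
    """Combine partitions that are each already sorted into one sorted list."""
    # heapq.merge streams its inputs and never re-sorts within a partition.
    return list(heapq.merge(*partitions))


def full_sort(partitions: Iterable[List[int]]) -> List[int]:
    """Baseline: ignore any existing order and sort everything from scratch."""
    return sorted(value for part in partitions for value in part)


# Hypothetical partitions, each sorted on the column the query orders by.
parts = [[1, 4, 9], [2, 3, 10], [5, 6, 7, 8]]
merged = merge_presorted(parts)
```

Both functions return the same result, but the merge only compares the current head of each partition, whereas the baseline repeats work the source has already done.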

Re: Spark Data Frame. PreSorded partitions

2017-12-04 Thread Jörn Franke
I do not think that the data source API exposes such a thing. You can, however, propose that it be included in the data source API 2. However, there are some caveats, because sorted can mean two different things (weak vs strict order). Then, is a lot of time really lost because of sorting? The best

Spark Data Frame. PreSorded partitions

2017-12-03 Thread Николай Ижиков
Cross-posting from @user. Hello, guys! I am working on an implementation of a custom DataSource for the Spark Data Frame API and have a question: if I have a `SELECT * FROM table1 ORDER BY some_column` query, I can sort the data inside a partition in my data source. Do I have a built-in option to tell Spark