Re: [DISCUSS] SPIP: APIs for Table Metadata Operations

2018-09-04 Thread Russell Spitzer
They are based on a physical column; the column is real. The function only exists in the datasource. For example: SELECT ttl(a), ttl(b) FROM ks.tab. On Tue, Sep 4, 2018 at 11:16 PM Reynold Xin wrote: > Russell your special columns wouldn’t actually work with option 1 because > Spark

Re: [DISCUSS] SPIP: APIs for Table Metadata Operations

2018-09-04 Thread Reynold Xin
Russell, your special columns wouldn’t actually work with option 1, because Spark would have to fail them in analysis without an actual physical column. On Tue, Sep 4, 2018 at 9:12 PM Russell Spitzer wrote: > I'm a big fan of 1 as well. I had to implement something similar using > custom

Re: [DISCUSS] SPIP: APIs for Table Metadata Operations

2018-09-04 Thread Russell Spitzer
I'm a big fan of 1 as well. I had to implement something similar using custom expressions, and it was a bit more work than it should have been. In particular, our use case is that columns have certain metadata (ttl, writetime) which exists not as separate columns but as special values that can be surfaced.
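A plain-Python sketch of the idea in this use case (not the connector's actual API; the class and function names here are illustrative): datasource-specific metadata such as ttl and writetime hangs off a real, physical column and is surfaced through a function call rather than being stored as a separate column.

```python
# Illustrative sketch: per-column metadata (ttl, writetime) attached to a
# physical column, surfaced via a function rather than a separate column.
class Column:
    def __init__(self, name, value, metadata=None):
        self.name = name
        self.value = value
        self.metadata = metadata or {}

def ttl(col):
    """Resolve the datasource-specific TTL metadata of a physical column."""
    if "ttl" not in col.metadata:
        raise ValueError(f"no ttl metadata on column {col.name}")
    return col.metadata["ttl"]

# The column itself is real; ttl(a) reads its side-channel metadata.
a = Column("a", 42, metadata={"ttl": 86400, "writetime": 1536105600})
print(ttl(a))  # 86400
```

This mirrors why analysis support matters: the function is only meaningful against a column the source knows about, which is the point made in the reply above about failing such calls in analysis.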

Re: [DISCUSS] SPIP: APIs for Table Metadata Operations

2018-09-04 Thread Ryan Blue
Thanks for posting the summary. I'm strongly in favor of option 1. I think that API footprint is fairly small, but worth it. Not only does it make sources easier to implement by handling parsing, it also makes sources more reliable because Spark handles validation the same way across sources. A

Re: Select top (100) percent equivalent in spark

2018-09-04 Thread Wenchen Fan
+ Liang-Chi and Herman, I think this is a common requirement to get top N records. For now we guarantee it by the `TakeOrderedAndProject` operator. However, this operator may not be used if the spark.sql.execution.topKSortFallbackThreshold config has a small value. Shall we reconsider
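A hypothetical sketch of the strategy being discussed, in plain Python rather than Spark's planner code: a bounded heap-based top-K (what `TakeOrderedAndProject` provides) is used when the limit is under a threshold, otherwise the query falls back to a full sort followed by a limit. The threshold name mirrors `spark.sql.execution.topKSortFallbackThreshold`; the logic is illustrative only.

```python
import heapq

# Assumed threshold for the sketch; Spark's actual default differs.
TOP_K_SORT_FALLBACK_THRESHOLD = 1000

def top_n(rows, n, key):
    if n <= TOP_K_SORT_FALLBACK_THRESHOLD:
        # Bounded-memory top-K, analogous to TakeOrderedAndProject.
        return heapq.nlargest(n, rows, key=key)
    # Fallback: full sort, then limit.
    return sorted(rows, key=key, reverse=True)[:n]

print(top_n([3, 1, 4, 1, 5, 9], 2, key=lambda x: x))  # [9, 5]
```

If the threshold is set very small, the guarantee of the bounded path is lost, which is the concern raised in this message.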

Re: data source api v2 refactoring

2018-09-04 Thread Wenchen Fan
I'm switching to another Gmail account; let's see if it still gets dropped this time. Hi Ryan, I'm thinking about the write path and feel the abstraction should be the same. We still have logical and physical writing, and the table can create different logical writes based on how the data is written.

Re: Select top (100) percent equivalent in spark

2018-09-04 Thread Chetan Khatri
Thanks On Wed 5 Sep, 2018, 2:15 AM Russell Spitzer, wrote: > RDD: Top > http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD@top(num:Int)(implicitord:Ordering[T]):Array[T] > Which is pretty much what Sean suggested > > For Dataframes I think doing an order and

Re: [DISCUSS] SPIP: APIs for Table Metadata Operations

2018-09-04 Thread Reynold Xin
Ryan, Michael and I discussed this offline today. Some notes here: His use case is to support partitioning data by derived columns, rather than physical columns, because he didn't want his users to keep adding the "date" column when in reality it is purely derived from some timestamp column.
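The use case above can be sketched in a few lines: the "date" partition value is computed by the source from a timestamp column, so users never supply it themselves. The particular derivation shown (UTC calendar day) is an assumption for illustration.

```python
from datetime import datetime, timezone

def derived_date_partition(ts_seconds):
    """Derive a date partition value from a raw epoch timestamp (UTC)."""
    return datetime.fromtimestamp(ts_seconds, tz=timezone.utc).strftime("%Y-%m-%d")

# Users write only the timestamp; the partition column is derived from it.
print(derived_date_partition(1536105600))  # 2018-09-05
```

With derived partitioning, this mapping lives in the table definition rather than in every user's insert statement.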

Re: [Feedback Requested] SPARK-25299: Using Distributed Storage for Persisting Shuffle Data

2018-09-04 Thread Matt Cheah
Yuanjian, Thanks for sharing your progress! I was wondering if there was any prototype code that we could read to get an idea of what the implementation looks like? We can evaluate the design together and also benchmark workloads from across the community – that is, we can collect more data

Re: Select top (100) percent equivalent in spark

2018-09-04 Thread Russell Spitzer
RDD: Top http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD@top(num:Int)(implicitord:Ordering[T]):Array[T] Which is pretty much what Sean suggested. For Dataframes I think doing an order and limit would be equivalent after optimizations. On Tue, Sep 4, 2018 at 2:28
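The equivalence claimed here can be shown with a plain-Python analogue (illustrative only, not Spark code): `RDD.top(n)` returns the n largest elements in descending order, which is the same result as a descending sort followed by a limit.

```python
import heapq
import random

# Unique sample values so the comparison is unambiguous.
data = random.sample(range(1000), 50)

top_style = heapq.nlargest(5, data)                # like rdd.top(5)
sort_limit_style = sorted(data, reverse=True)[:5]  # like orderBy(desc).limit(5)

assert top_style == sort_limit_style
print(top_style)
```

The practical difference is cost, not result: the top-style path keeps only n elements in memory, while sort-then-limit sorts everything first unless the optimizer rewrites it.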

Fwd: data source api v2 refactoring

2018-09-04 Thread Ryan Blue
Latest from Wenchen in case it was dropped. -- Forwarded message - From: Wenchen Fan Date: Mon, Sep 3, 2018 at 6:16 AM Subject: Re: data source api v2 refactoring To: Cc: Ryan Blue, Reynold Xin, <dev@spark.apache.org> Hi Mridul, I'm not sure what's going on, my email was

Re: Select top (100) percent equivalent in spark

2018-09-04 Thread Sean Owen
Sort and take head(n)? On Tue, Sep 4, 2018 at 12:07 PM Chetan Khatri wrote: > Dear Spark dev, anything equivalent in spark ? >

Select top (100) percent equivalent in spark

2018-09-04 Thread Chetan Khatri
Dear Spark dev, is there anything equivalent in Spark?

[ML] Setting Non-Transform Params for a Pipeline & PipelineModel

2018-09-04 Thread Aleksander Eskilson
In a nutshell, what I'd like to do is instantiate a Pipeline (or an extension class of Pipeline) with metadata that is copied to the PipelineModel when fitted, persisted with it, and read again when the fitted model is loaded by another consumer. These params are specific to the PipelineModel more than
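A minimal plain-Python sketch of the desired behavior (not Spark ML's actual API; the class shapes and param names are assumptions): a Pipeline carries extra metadata params that are copied onto the fitted PipelineModel, so a later consumer holding only the model can still read them.

```python
class PipelineModel:
    def __init__(self, stages, metadata=None):
        self.stages = stages
        self.metadata = dict(metadata or {})

class Pipeline:
    def __init__(self, stages, metadata=None):
        self.stages = stages
        self.metadata = dict(metadata or {})
    def fit(self, data):
        # Real fitting elided; the point is that metadata survives fit()
        # and would be persisted alongside the model.
        return PipelineModel(self.stages, metadata=self.metadata)

model = Pipeline(stages=[], metadata={"schema_version": "2"}).fit(data=None)
print(model.metadata["schema_version"])  # readable after fitting/loading
```

In Spark ML terms this would mean the params survive both `Pipeline.fit` and the model's save/load round trip, which is the persistence piece the message asks about.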

Re: data source api v2 refactoring

2018-09-04 Thread Marcelo Vanzin
Same here, I don't see anything from Wenchen... just replies to him. On Sat, Sep 1, 2018 at 9:31 PM Mridul Muralidharan wrote: > > > Is it only me or are all others getting Wenchen’s mails ? (Obviously Ryan did > :-) ) > I did not see it in the mail thread I received or in archives ... [1] >

Re: Nightly Builds in the docs (in spark-nightly/spark-master-bin/latest? Can't seem to find it)

2018-09-04 Thread shane knapp
the docs and publishing builds need some attention... i was planning on looking into this after the 2.4 cut and the ubuntu port is a little further along. see: https://amplab.cs.berkeley.edu/jenkins/label/spark-docs/ https://amplab.cs.berkeley.edu/jenkins/label/spark-packaging/

Re: Jenkins automatic disabling service - who and why?

2018-09-04 Thread shane knapp
fyi, i haven't upgraded jenkins in a couple of years... (yeah, i know... it's on my todo list). i'm just assuming that it's an artifact of old PRs going 'stale' somehow, but since that's not mentioned anywhere in the plugin docs i wouldn't bet good money on that. :) On Mon, Sep 3, 2018 at

Re: Spark JIRA tags clarification and management

2018-09-04 Thread Kazuaki Ishizaki
Of course, we would like to eliminate all of the following tags: "flanky" or "flankytest". Kazuaki Ishizaki From: Hyukjin Kwon To: dev Cc: Xiao Li, Wenchen Fan Date: 2018/09/04 14:20 Subject: Re: Spark JIRA tags clarification and management Thanks, Reynold. +Adding