Re: How to persist a database/table created in SparkSession

2017-12-05 Thread Wenchen Fan
Try `SparkSession.builder().enableHiveSupport()`? On Tue, Dec 5, 2017 at 3:22 PM, 163 wrote: > Hi, > How can I persist a database/table created in a Spark application? > object TestPersistentDB { > def main(args: Array[String]): Unit = { > …
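A minimal sketch of the suggestion, reusing the `TestPersistentDB` object from the quoted mail (the database and table names here are made up):

    import org.apache.spark.sql.SparkSession

    object TestPersistentDB {
      def main(args: Array[String]): Unit = {
        // enableHiveSupport backs the catalog with a persistent Hive metastore,
        // so databases/tables survive across applications instead of living in
        // the default in-memory catalog.
        val spark = SparkSession.builder()
          .appName("TestPersistentDB")
          .enableHiveSupport()
          .getOrCreate()

        spark.sql("CREATE DATABASE IF NOT EXISTS testdb")
        spark.range(10).write.saveAsTable("testdb.numbers") // persisted table
        spark.stop()
      }
    }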

Re: Spark DataFrame: PreSorted partitions

2017-12-04 Thread Wenchen Fan
Data Source V2 is still under development. Ordering reporting is one of the planned features, but it's not done yet; we are still thinking about what the API should be, e.g. it needs to cover sort order, null first/last, and other sorting-related properties. On Mon, Dec 4, 2017 at 10:12 PM, …

Re: [VOTE] Spark 2.2.1 (RC2)

2017-11-29 Thread Wenchen Fan
+1 On Thu, Nov 30, 2017 at 1:28 AM, Kazuaki Ishizaki wrote: > +1 (non-binding) > > I tested it on Ubuntu 16.04 and OpenJDK8 on ppc64le. All of the tests for > core/sql-core/sql-catalyst/mllib/mllib-local have passed. > > $ java -version > openjdk version "1.8.0_131" >

Re: [discuss][SQL] Partitioned column type inference proposal

2017-11-14 Thread Wenchen Fan
My 2 cents: 1. when merging NullType with another type, the result should always be that type. 2. when merging StringType with another type, the result should always be StringType. 3. when merging integral types, the priority from high to low is: DecimalType, LongType, IntegerType. This is because …
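For illustration only, here is a sketch of the proposed precedence written as a merge function. This is not Spark's actual implementation, just one reading of the three rules above (using `DecimalType.SYSTEM_DEFAULT` as a stand-in for real precision handling):

    import org.apache.spark.sql.types._

    // Hypothetical sketch of the merge rules described above.
    def mergeInferredType(a: DataType, b: DataType): DataType = (a, b) match {
      case (NullType, t) => t                              // rule 1: NullType yields
      case (t, NullType) => t
      case (StringType, _) | (_, StringType) => StringType // rule 2: StringType wins
      case (_: DecimalType, _) | (_, _: DecimalType) =>    // rule 3: decimal > long > int
        DecimalType.SYSTEM_DEFAULT
      case (LongType, _) | (_, LongType) => LongType
      case _ => IntegerType
    }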

Re: [Vote] SPIP: Continuous Processing Mode for Structured Streaming

2017-11-03 Thread Wenchen Fan
+1. I think this architecture makes a lot of sense: letting executors talk to the source/sink directly brings very low latency. On Thu, Nov 2, 2017 at 9:01 AM, Sean Owen wrote: > +0 simply because I don't feel I know enough to have an opinion. I have no > reason to doubt the …

Re: Dataset API Question

2017-10-25 Thread Wenchen Fan
It's because of a different API design. `RDD.checkpoint` returns Unit, which means it mutates the RDD's state, so you need an `RDD.isCheckpointed` method to check whether the RDD is checkpointed. `Dataset.checkpoint` returns a new Dataset, which means there is no isCheckpointed state in Dataset, and …
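A small sketch of the contrast, assuming a running SparkSession (the checkpoint directory path is made up):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("checkpoint-demo").getOrCreate()
    spark.sparkContext.setCheckpointDir("/tmp/checkpoints")

    val rdd = spark.sparkContext.parallelize(1 to 10)
    rdd.checkpoint()            // returns Unit: marks this same RDD for checkpointing
    rdd.count()                 // an action materializes the checkpoint
    println(rdd.isCheckpointed) // true: state is queried on the mutated RDD

    val ds = spark.range(10)
    val checkpointed = ds.checkpoint() // returns a new, already-checkpointed Dataset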

Re: [VOTE][SPIP] SPARK-22026 data source v2 write path

2017-10-16 Thread Wenchen Fan
This vote passes with 3 binding +1 votes, 5 non-binding +1 votes, and no -1 votes. Thanks all! +1 votes (binding): Wenchen Fan, Reynold Xin, Cheng Lian. +1 votes (non-binding): Xiao Li, Weichen Xu, Vaquar Khan, Liwei Lin, Dongjoon Hyun. On Tue, Oct 17, 2017 at 12:30 AM, Dongjoon Hyun <dongjoo…

Re: [VOTE][SPIP] SPARK-22026 data source v2 write path

2017-10-09 Thread Wenchen Fan
I'm adding my own +1 (binding). On Tue, Oct 10, 2017 at 9:07 AM, Wenchen Fan <cloud0...@gmail.com> wrote: > I'm going to update the proposal: for the last point, although the > user-facing API (`df.write.format(...).option(...).mode(...).save()`) > mixes data and metadata

Re: [VOTE][SPIP] SPARK-22026 data source v2 write path

2017-10-09 Thread Wenchen Fan
…this is a good idea because of the following technical reasons. Thanks! On Tue, Oct 3, 2017 at 12:03 AM, Wenchen Fan <cloud0...@gmail.com> wrote: > Hi all, > After merging the infrastructure of the data source v2 read path and having > some discussion of the write path, I'm now …

Re: [VOTE] Spark 2.1.2 (RC4)

2017-10-03 Thread Wenchen Fan
+1 On Tue, Oct 3, 2017 at 11:00 PM, Kazuaki Ishizaki wrote: > +1 (non-binding) > > I tested it on Ubuntu 16.04 and OpenJDK8 on ppc64le. All of the tests for > core/sql-core/sql-catalyst/mllib/mllib-local have passed. > > $ java -version > openjdk version "1.8.0_131" >

[VOTE][SPIP] SPARK-22026 data source v2 write path

2017-10-02 Thread Wenchen Fan
Hi all, After merging the infrastructure of the data source v2 read path and having some discussion of the write path, I'm now sending this email to call a vote for the Data Source v2 write path. The full document of the Data Source API V2 is:…

Re: [discuss] Data Source V2 write path

2017-10-01 Thread Wenchen Fan
> …so we can introduce consistent behavior across sources for v2. > rb > On Thu, Sep 28, 2017 at 8:49 PM, Wenchen Fan <cloud0...@gmail.com> wrote: >> > When this CTAS logical node is turned into a physical plan, the relation gets turned into a `Dat…

Re: [discuss] Data Source V2 write path

2017-09-28 Thread Wenchen Fan
> Comments inline. I've written up what I'm proposing with a bit more detail. > On Tue, Sep 26, 2017 at 11:17 AM, Wenchen Fan <cloud0...@gmail.com> wrote: >> I'm trying to give a summary: >> Ideally data…

Re: [discuss] Data Source V2 write path

2017-09-26 Thread Wenchen Fan
…provide more details in options and do CTAS at the Spark side. These can be done via options. After catalog federation, hopefully only file-format data sources will still use this backdoor. On Tue, Sep 26, 2017 at 8:52 AM, Wenchen Fan <cloud0...@gmail.com> wrote: > I think it is a bad idea to …

Re: [discuss] Data Source V2 write path

2017-09-25 Thread Wenchen Fan
> …I guess that's not terrible. I just don't understand why it is necessary. > On Mon, Sep 25, 2017 at 11:26 AM, Wenchen Fan <cloud0...@gmail.com> wrote: >> Catalog federation is to publish the Spark catalog API (kind of a data source API for metadata), so that Spark…

Re: [discuss] Data Source V2 write path

2017-09-25 Thread Wenchen Fan
> If a table has no metastore (Hadoop FS tables) then we can just pass the table metadata in when creating the writer, since there is no existence in this case. > rb > On Sun, Sep 24, 2017 at 7:17 PM, Wenchen Fan <cloud0...@gmail.com> wrote: >> I agree it w…

Re: [discuss] Data Source V2 write path

2017-09-25 Thread Wenchen Fan
…<r...@databricks.com> wrote: > Can there be an explicit create function? > On Sun, Sep 24, 2017 at 7:17 PM, Wenchen Fan <cloud0...@gmail.com> wrote: >> I agree it would be a clean approach if the data source is only responsible for writing into an already-con…

Re: [discuss] Data Source V2 write path

2017-09-24 Thread Wenchen Fan
> …ed in the same way like partitions are in the current format? > On Wed, Sep 20, 2017 at 3:10 AM, Wenchen Fan <cloud0...@gmail.com> wrote: >> Hi all, >> I want to have some discussion about the Data Source V2 write path before starting a vote.…

[discuss] Data Source V2 write path

2017-09-20 Thread Wenchen Fan
Hi all, I want to have some discussion about the Data Source V2 write path before starting a vote. The Data Source V1 write path asks implementations to write a DataFrame directly, which is painful: 1. Exposing an upper-level API like DataFrame to the Data Source API is not good for maintenance. 2. Data…
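For reference, the V1 write-path entry point hands the whole DataFrame to the implementation. The trait below is paraphrased from memory of `org.apache.spark.sql.sources` in Spark 2.x, quoted here only to show the shape of the API being criticized:

    import org.apache.spark.sql.{DataFrame, SQLContext, SaveMode}
    import org.apache.spark.sql.sources.BaseRelation

    // V1 write path: the implementation receives an entire DataFrame,
    // coupling data sources to an upper-level, optimizer-facing API.
    trait CreatableRelationProvider {
      def createRelation(
          sqlContext: SQLContext,
          mode: SaveMode,
          parameters: Map[String, String],
          data: DataFrame): BaseRelation
    }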

Re: What is d3kbcqa49mib13.cloudfront.net ?

2017-09-16 Thread Wenchen Fan
> …ental convenience. While that may be ok when distributing artifacts, it's more of a problem when actually building and testing artifacts. In the latter case, the download should really only be from an Apache mirror. > On Thu, Sep…

Re: What is d3kbcqa49mib13.cloudfront.net ?

2017-09-14 Thread Wenchen Fan
That test case is trying to test the backward compatibility of `HiveExternalCatalog`. It downloads official Spark releases, creates tables with them, and then reads these tables via the current Spark. About the download link: I just picked it from the Spark website, and this link is the default…

Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2 read path

2017-09-11 Thread Wenchen Fan
This vote passes with 4 binding +1 votes, 10 non-binding votes, one +0 vote, and no -1 votes. Thanks all! +1 votes (binding): Wenchen Fan Herman van Hövell tot Westerflier Michael Armbrust Reynold Xin +1 votes (non-binding): Xiao Li Sameer Agarwal Suresh Thalamati Ryan Blue Xingbo Jiang

Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2 read path

2017-09-11 Thread Wenchen Fan
…*From:* wangzhenhua (G) <wangzhen...@huawei.com> *Sent:* Friday, September 8, 2017 2:20:07 AM *To:* Dongjoon Hyun; 蒋星博 *Cc:* Michael Armbrust; Reynold Xin; Andrew Ash; Herman van Hövell tot Westerflier; Ryan Blue; Spark dev list; Suresh Thala…

Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2 read path

2017-09-06 Thread Wenchen Fan
adding my own +1 (binding) On Thu, Sep 7, 2017 at 10:29 AM, Wenchen Fan <cloud0...@gmail.com> wrote: > Hi all, > > In the previous discussion, we decided to split the read and write path of > data source v2 into 2 SPIPs, and I'm sending this email to call a vote for > Dat

Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2

2017-09-06 Thread Wenchen Fan
> …the write side to a separate SPIP, too, since there isn't much detail in the proposal and I think we should be more deliberate with things like schema evolution. > On Thu, Aug 31, 2017 at 10:33 AM, Wenchen Fan <cloud0...@gmail.com> wrote: >> Hi Ryan, >> I…

Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2

2017-09-06 Thread Wenchen Fan
> …ordering that'd matter in the current set of pushdowns is limit - it should always mean the root of the pushed-down tree. > On Fri, Sep 1, 2017 at 3:22 AM, Wenchen Fan <cloud0...@gmail.com> wrote: >> > Ideally also getting sort orders _after_ getting filters. >>…

Re: [VOTE][SPIP] SPARK-21190: Vectorized UDFs in Python

2017-09-05 Thread Wenchen Fan
+1 on the design and proposed API. One detail I'd like to discuss is the 0-parameter UDF: how can we specify the size hint? This can be done in the PR review though. On Sat, Sep 2, 2017 at 2:07 AM, Felix Cheung wrote: > +1 on this, and I like the suggestion of type in…

Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2

2017-08-31 Thread Wenchen Fan
> …that the proposal says this: >> Ideally the partitioning/bucketing concept should not be exposed in the Data Source API V2, because they are just techniques for data skipping and pre-partitioning. However, these 2 concepts are already…

Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2

2017-08-31 Thread Wenchen Fan
> …consider ways to fix that problem instead of carrying the problem forward to Data Source V2. We can solve this by adding a high-level API for DDL and a better write/insert API that works well with it. Clearly, that discussion is independent of the r…

Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2

2017-08-30 Thread Wenchen Fan
> …existing datasources leverage the cool Spark features, and one that lets people who just want to implement basic features do that - I'd try to include some kind of layering here. I could probably sketch out something here if that'd be useful? > James > On Tue, 29 Aug 2017…

Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2

2017-08-29 Thread Wenchen Fan
> …, and then I return the parts I can’t handle. > I’d prefer in general that this be implemented by passing some kind of query plan to the datasource, which enables this kind of replacement. Explicitly don’t want to give the whole query plan - that soun…

Re: Welcoming Saisai (Jerry) Shao as a committer

2017-08-29 Thread Wenchen Fan
Congratulations, Saisai! > On 29 Aug 2017, at 10:38 PM, Kevin Yu wrote: > Congratulations, Jerry! > On Tue, Aug 29, 2017 at 6:35 AM, Meisam Fathi wrote: > Congratulations, Jerry! > Thanks, Meisam > On…

[VOTE] [SPIP] SPARK-15689: Data Source API V2

2017-08-28 Thread Wenchen Fan
Hi all, It has been almost 2 weeks since I proposed data source V2 for discussion, and we have already gotten some feedback on the JIRA ticket and the prototype PR, so I'd like to call for a vote. The full document of the Data Source API V2 is:…

Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2

2017-08-17 Thread Wenchen Fan
> …don't think there has really been any discussion of this API change yet, or at least it hasn't occurred on the JIRA ticket. > On Thu, Aug 17, 2017 at 8:05 AM Wenchen Fan <cloud0...@gmail.com> wrote: >> adding my own +1 (binding) >> On Thu, Aug 17, 2017 at 9:0…

Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2

2017-08-17 Thread Wenchen Fan
adding my own +1 (binding) On Thu, Aug 17, 2017 at 9:02 PM, Wenchen Fan <cloud0...@gmail.com> wrote: > Hi all, > Following the SPIP process, I'm putting this SPIP up for a vote. > The current data source API doesn't work well because of some limitations like: n…

[VOTE] [SPIP] SPARK-15689: Data Source API V2

2017-08-17 Thread Wenchen Fan
Hi all, Following the SPIP process, I'm putting this SPIP up for a vote. The current data source API doesn't work well because of limitations such as no partitioning/bucketing support, no columnar read, and difficulty supporting more operator push-down. I'm proposing a Data Source API V2 to…

Re: How to tune the performance of Tpch query5 within Spark

2017-07-14 Thread Wenchen Fan
Try replacing your UDF with Spark built-in expressions; it should be as simple as `$"x" * (lit(1) - $"y")`. > On 14 Jul 2017, at 5:46 PM, 163 wrote: > I modified the TPC-H query 5 to a DataFrame: > val forders = >…
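A sketch of the replacement, assuming a DataFrame `forders` with numeric columns `x` and `y` (names taken from the expression above; the UDF shown is a hypothetical stand-in for the one in the query):

    import org.apache.spark.sql.functions._

    // UDF version: a black box to Catalyst, so no codegen or optimization inside it.
    val revenueUdf = udf((x: Double, y: Double) => x * (1 - y))
    val viaUdf = forders.select(revenueUdf(col("x"), col("y")).as("revenue"))

    // Built-in expression version: fully visible to the optimizer.
    val viaExprs = forders.select((col("x") * (lit(1) - col("y"))).as("revenue"))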

Re: [SQL] Syntax "case when" isn't supported in JOIN

2017-07-13 Thread Wenchen Fan
It’s not about case when, but about rand(): non-deterministic expressions are not allowed in join conditions. > On 13 Jul 2017, at 6:43 PM, wangshuang wrote: > I'm trying to execute Hive SQL on Spark SQL (also on the Spark Thrift Server). For optimizing data skew, we use "case…
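One common workaround (a sketch with made-up DataFrames `left` and `right` and string keys, not the poster's actual query): materialize the non-deterministic value into a column before the join, so the join condition itself stays deterministic.

    import org.apache.spark.sql.functions._

    // Salt null join keys with rand() *before* the join, not inside the condition.
    val saltedLeft = left.withColumn("join_key",
      when(col("key").isNull, concat(lit("null_"), floor(rand() * 100).cast("string")))
        .otherwise(col("key")))

    val joined = saltedLeft.join(right, saltedLeft("join_key") === right("key"), "left_outer")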

Re: [SQL] Return Type of Round Func

2017-07-04 Thread Wenchen Fan
Hive compatibility is not a strong requirement for Spark SQL, and for round, SQL Server also returns the same type as the input; see https://docs.microsoft.com/en-us/sql/t-sql/functions/round-transact-sql#return-types
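A quick way to check this in spark-shell, as a sketch (assuming a `spark` session):

    // round keeps the input's type family: double in -> double out,
    // decimal in -> decimal out (with the scale adjusted).
    spark.sql(
      """SELECT round(CAST(3.14159 AS DOUBLE), 2)         AS d,
        |       round(CAST(3.14159 AS DECIMAL(10, 5)), 2) AS dec_col
        |""".stripMargin).printSchema()
    // d is double, dec_col is decimal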

Re: [VOTE] Apache Spark 2.2.0 (RC6)

2017-07-03 Thread Wenchen Fan
+1 > On 3 Jul 2017, at 8:22 PM, Nick Pentreath wrote: > +1 (binding) > On Mon, 3 Jul 2017 at 11:53, Yanbo Liang wrote: > +1 > On Mon, Jul 3, 2017 at 5:35 AM, Herman van Hövell tot Westerflier…

Re: A question about rdd transformation

2017-06-23 Thread Wenchen Fan
The exception message should include the lineage of the un-serializable object; can you post that too? > On 23 Jun 2017, at 11:23 AM, Lionel Luffy wrote: > Adding the dev list. Who can help with the question below? > Thanks & Best Regards, LL > -- Forwarded message -- From: Lionel…

Re: appendix

2017-06-20 Thread Wenchen Fan
You should make HBase a data source (it seems we already have an HBase connector?), create a DataFrame from HBase, and do the join in Spark SQL. > On 21 Jun 2017, at 10:17 AM, sunerhan1...@sina.com wrote: > Hello, > My scenario is like this: > 1. val df = hivecontext/carboncontext.sql("sql") >…
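A rough sketch of the suggested shape. The HBase format string and option name below are placeholders, not a real connector API; which options exist depends entirely on the HBase connector actually used:

    // Placeholder format/options: substitute whatever HBase connector is
    // available in your environment.
    val hbaseDF = spark.read
      .format("your.hbase.connector")     // hypothetical format string
      .option("hbase.table", "my_table")  // hypothetical option name
      .load()

    val hiveDF = spark.sql("SELECT key, value FROM some_hive_table")
    val joined = hiveDF.join(hbaseDF, Seq("key"))  // the join runs in Spark SQL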

Re: dataframe mappartitions problem

2017-06-20 Thread Wenchen Fan
`Dataset.mapPartitions` takes `func: Iterator[T] => Iterator[U]`, which means Spark needs to deserialize the internal binary format to type `T`, and this deserialization is costly. If you really need to do this kind of hack, you can use the internal API `Dataset.queryExecution.toRdd.mapPartitions`, which…
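A sketch of that internal-API escape hatch, assuming a DataFrame `df` whose first column is a Long (note this API is internal and unstable across Spark versions):

    import org.apache.spark.sql.catalyst.InternalRow

    // Operates on Spark's binary InternalRow format directly, skipping the
    // per-row deserialization to T that Dataset.mapPartitions performs.
    val partialSums = df.queryExecution.toRdd.mapPartitions { (iter: Iterator[InternalRow]) =>
      var sum = 0L
      iter.foreach(row => sum += row.getLong(0))
      Iterator.single(sum)
    }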

Re: [VOTE] Apache Spark 2.2.0 (RC3)

2017-06-02 Thread Wenchen Fan
I'm -1 on this. I merged a PR to master/2.2 today and broke the build. I'm really sorry for the trouble; I should not have been so aggressive when merging PRs. The actual cause is some misleading comments in the code and a bug in Spark's testing framework…

Re: When will spark 2.0 support dataset python API?

2017-05-31 Thread Wenchen Fan
We tried, but didn’t get much benefit from a Python Dataset: Python is dynamically typed, and there is not much we can do to optimize the execution of Python functions. > On 31 May 2017, at 3:36 AM, Cyanny LIANG wrote: > Hi, > Since the Dataset API has become a common way to process…

Re: [VOTE] Apache Spark 2.1.1 (RC3)

2017-04-24 Thread Wenchen Fan
See https://issues.apache.org/jira/browse/SPARK-19611. On Mon, Apr 24, 2017 at 2:22 PM, Holden Karau <hol...@pigscanfly.ca> wrote: > What's the regression this fixed in 2.1 from 2.0? > On Fri, Apr 21, 2017 at 7:45 PM, Wenchen Fan <wenc...@databricks.com> w…

Re: [VOTE] Apache Spark 2.1.1 (RC3)

2017-04-21 Thread Wenchen Fan
IIRC, the new "spark.sql.hive.caseSensitiveInferenceMode" mechanism will scan all table files only once, and it writes the inferred schema back to the metastore so that we don't need to do the schema inference again. So technically this will introduce a performance regression for the first query, but…
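For users hit by that first-query cost, the mode can be changed; a sketch, with the three values I believe this config accepts in 2.1.1:

    // INFER_AND_SAVE (default): infer on first access, write the schema back.
    // INFER_ONLY: infer on every access, never write back.
    // NEVER_INFER: skip inference, use the case-insensitive metastore schema.
    spark.conf.set("spark.sql.hive.caseSensitiveInferenceMode", "NEVER_INFER")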
