SPIP to improve online serving of Spark MLlib Models

2018-12-03 Thread aholler
Hi, Folks, I have filed a JIRA ticket (https://issues.apache.org/jira/browse/SPARK-26247) for an SPIP on improving the model load latency and serving interfaces for MLlib model online serving, as discussed with Joseph Bradley and with Felix Cheung as the SPIP Shepherd. The associated SPIP doc is l…
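To make the latency concern concrete, here is a minimal sketch of what loading a persisted MLlib model looks like today (the model path is a placeholder, and the timing code is purely illustrative of the cold-start cost the SPIP aims to reduce):

    import org.apache.spark.ml.PipelineModel
    import org.apache.spark.sql.SparkSession

    // Loading a saved model currently requires an active SparkSession and
    // runs Spark jobs under the hood, which dominates cold-start latency
    // when serving online. The path below is hypothetical.
    val spark = SparkSession.builder().appName("model-load-latency").getOrCreate()

    val start = System.nanoTime()
    val model = PipelineModel.load("/models/my-pipeline") // hypothetical path
    val elapsedMs = (System.nanoTime() - start) / 1e6
    println(f"Model load took $elapsedMs%.1f ms")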

Re: DataSourceV2 community sync #3

2018-12-03 Thread Thakrar, Jayesh
Thank you Ryan and Xiao – sharing all this info really gives a very good insight!

Re: DataSourceV2 community sync #3

2018-12-03 Thread Ryan Blue
Jayesh, I don’t think this need is very narrow. To have reliable behavior for CTAS, you need to: 1. Check whether the table already exists and fail if it does. Right now, it is up to the source whether to continue with the write when the table already exists or to throw an exception, which is unreliable across…
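As a rough illustration of the contract described above, a minimal sketch in Scala; note that TableCatalog and its methods here are hypothetical names for this example, not the actual DataSourceV2 proposal:

    import org.apache.spark.sql.types.StructType

    // Hypothetical catalog interface, for illustration only.
    trait TableCatalog {
      def tableExists(name: String): Boolean
      def createTable(name: String, schema: StructType): Unit
      def dropTable(name: String): Unit
    }

    // CTAS with consistent behavior: fail up front if the table exists,
    // and roll back the created table if the write itself fails.
    def createTableAsSelect(
        catalog: TableCatalog,
        name: String,
        schema: StructType)(write: => Unit): Unit = {
      if (catalog.tableExists(name)) {
        throw new IllegalStateException(s"Table $name already exists")
      }
      catalog.createTable(name, schema)
      try {
        write
      } catch {
        case e: Throwable =>
          catalog.dropTable(name) // undo the create so CTAS stays consistent
          throw e
      }
    }

With the existence check in the engine rather than in each source, every CTAS behaves the same way regardless of which data source backs the table.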

Re: DataSourceV2 community sync #3

2018-12-03 Thread Thakrar, Jayesh
Thank you Xiao – I was wondering what the motivation for the catalog was. If CTAS is the only candidate, would it suffice to have that as part of the data source interface only? If we look at BI, ETL, and reporting tools, which interface with many tables from different data sources at the same time…

Re: DataSourceV2 community sync #3

2018-12-03 Thread Ryan Blue
Jayesh, The current catalog in Spark is a little weird. It uses a Hive catalog and, in addition to regular Hive tables, adds metadata that only Spark understands to track tables. Some of those tables are actually just pointers to tables that exist in some other source of truth. This is what makes t…
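For a concrete example of the mix Ryan describes (assuming an active SparkSession `spark`; the URL and table names are placeholders): a `USING` table is stored in the Hive metastore with Spark-only provider metadata, while a JDBC-backed table is essentially a pointer to data whose source of truth lives elsewhere.

    // A data source table: recorded in the Hive metastore, but with provider
    // metadata (here, "parquet") that only Spark knows how to interpret.
    spark.sql("CREATE TABLE events (id BIGINT, ts TIMESTAMP) USING parquet")

    // Effectively a pointer: the real table lives in Postgres; the metastore
    // entry only records how to reach it. URL and dbtable are placeholders.
    spark.sql("""
      CREATE TABLE pg_users
      USING jdbc
      OPTIONS (url 'jdbc:postgresql://db.example.com:5432/app', dbtable 'users')
    """)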

Re: DataSourceV2 community sync #3

2018-12-03 Thread Ryan Blue
Do you agree on my definition of catalog in Spark SQL? I think we agree on what a catalog is: a service that can manage the metadata and definitions of databases, views, tables, functions, roles, etc. External objects accessed through our data source APIs are called “tables”. I do not think we wi…
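Spark’s existing public API already reflects this notion of a catalog, though only for the single built-in one. A quick look, assuming an active SparkSession `spark`:

    // The current spark.catalog API exposes exactly the kinds of objects in
    // the definition above: databases, tables (and views), and functions.
    // What it lacks, and what this thread is about, is pluggability so that
    // more than one catalog can be used at a time.
    spark.catalog.listDatabases().show()
    spark.catalog.listTables("default").show()
    spark.catalog.listFunctions().show()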

Re: DataSourceV2 community sync #3

2018-12-03 Thread Xiao Li
Hi, Jayesh, This is a good question. Spark is a unified analytics engine for various data sources. We are able to get the table schema from the underlying data sources via our data source APIs, which resolves most of the user requirements. Spark does not need the other info (like database, func…
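This schema resolution is visible in the existing reader API. For example (the connection details are placeholders, and an active SparkSession `spark` is assumed):

    // The schema below comes from the underlying source (here, a JDBC
    // table), not from anything Spark stores: the data source reports it.
    val users = spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://db.example.com:5432/app") // placeholder
      .option("dbtable", "users")
      .load()

    users.printSchema() // schema resolved by the source via the data source API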