Re: SPIP: Catalog API for view metadata

2020-11-10 Thread Ryan Blue
An extra RPC call is a concern for the catalog implementation. It is simple
to cache the result of a call to avoid a second one if the catalog chooses.

I don't think that an extra RPC that can be easily avoided is a reasonable
justification to add caches in Spark. For one thing, it doesn't solve the
problem because the proposed API still requires separate lookups for tables
and views.

The only solution that would help is to use a combined trait, but that has
issues. For one, view substitution is much cleaner when it happens well
before table resolution. And, View and Table are very different objects;
returning Object from this API doesn't make much sense.

One extra RPC is not unreasonable, and the choice should be left to
sources. That's the easiest place to cache results from the underlying
store.
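As a sketch of that catalog-side caching (the types and the load function here are illustrative, not Spark's actual DSv2 API — just the memoization idea):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Sketch of a catalog-side cache: the first lookup for an identifier pays
// the RPC to the underlying store; a second lookup for the same identifier
// (e.g. tried as a view, then as a table) reuses the cached result.
class CachingCatalogLookup<K, V> {
    private final Map<K, V> cache = new ConcurrentHashMap<>();
    private int rpcCalls = 0; // round trips to the metastore

    V lookup(K ident, Function<K, V> loadFromStore) {
        return cache.computeIfAbsent(ident, k -> {
            rpcCalls++;
            return loadFromStore.apply(k);
        });
    }

    int rpcCalls() { return rpcCalls; }
}
```

A real implementation would also need invalidation on DDL, which is why this belongs in the source, where those events are visible.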

On Mon, Nov 9, 2020 at 8:18 PM Wenchen Fan  wrote:

> Moving back the discussion to this thread. The current argument is how to
> avoid extra RPC calls for catalogs supporting both table and view. There
> are several options:
> 1. ignore it as extra RPC calls are cheap compared to the query execution
> 2. have a per session cache for loaded table/view
> 3. have a per query cache for loaded table/view
> 4. add a new trait TableViewCatalog
>
> I think it's important to avoid perf regression with new APIs. RPC calls
> can be significant for short queries. We may also double the RPC
> traffic which is bad for the metastore service. Normally I would not
> recommend caching as cache invalidation is a hard problem. Personally I
> prefer option 4 as it only affects catalogs that support both table and
> view, and it fits the hive catalog very well.
>
> On Fri, Sep 4, 2020 at 4:21 PM John Zhuge  wrote:
>
>> SPIP
>> 
>> has been updated. Please review.
>>
>> On Thu, Sep 3, 2020 at 9:22 AM John Zhuge  wrote:
>>
>>> Wenchen, sorry for the delay, I will post an update shortly.
>>>
>>> On Thu, Sep 3, 2020 at 2:00 AM Wenchen Fan  wrote:
>>>
 Any updates here? I agree that a new View API is better, but we need a
 solution to avoid performance regression. We need to elaborate on the cache
 idea.

 On Thu, Aug 20, 2020 at 7:43 AM Ryan Blue  wrote:

> I think it is a good idea to keep tables and views separate.
>
> The main two arguments I’ve heard for combining lookup into a single
> function are the ones brought up in this thread. First, an identifier in a
> catalog must be either a view or a table and should not collide. Second, a
> single lookup is more likely to require a single RPC. I think the RPC
> concern is well addressed by caching, which we already do in the Spark
> catalog, so I’ll primarily focus on the first.
>
> Table/view name collision is unlikely to be a problem. Metastores that
> support both today store them in a single namespace, so this is not a
> concern for even a naive implementation that talks to the Hive MetaStore. 
> I
> know that a new metastore catalog could choose to implement both
> ViewCatalog and TableCatalog and store the two sets separately, but that
> would be a very strange choice: if the metastore itself has different
> namespaces for tables and views, then it makes much more sense to expose
> them through separate catalogs because Spark will always prefer one over
> the other.
>
> In a similar line of reasoning, catalogs that expose both views and
> tables are much more rare than catalogs that only expose one. For example,
> v2 catalogs for JDBC and Cassandra expose data through the Table interface
> and implementing ViewCatalog would make little sense. Exposing new data
> sources to Spark requires TableCatalog, not ViewCatalog. View catalogs are
> likely to be the same. Say I have a way to convert Pig statements or some
> other representation into a SQL view. It would make little sense to 
> combine
> that with some other TableCatalog.
>
> I also don’t think there is benefit from an API perspective to justify
> combining the Table and View interfaces. The two share only schema and
> properties, and are handled very differently internally — a View’s SQL
> query is parsed and substituted into the plan, while a Table is wrapped in
> a relation that eventually becomes a Scan node using SupportsRead. A 
> view’s
> SQL also needs additional context to be resolved correctly: the current
> catalog and namespace from the time the view was created.
>
> Query planning is distinct between tables and views, so Spark doesn’t
> benefit from combining them. I think it has actually caused problems that
> both were resolved by the same method in v1: the resolution rule grew
> extremely complicated trying to look up a reference just once because it
> had to parse a view plan and resolve relations within it 

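The creation-time context mentioned above is why a view entry is more than a SQL string. A minimal sketch of what a view catalog entry must retain (field names are illustrative, not the SPIP's actual API):

```java
// Illustrative sketch (not the SPIP's actual API): besides the SQL text,
// a view entry must retain the catalog and namespace that were current
// when the view was created, so its SQL can be re-resolved correctly later.
class ViewEntry {
    final String sql;                 // the view's SQL query text
    final String currentCatalog;      // catalog current at creation time
    final String[] currentNamespace;  // namespace current at creation time

    ViewEntry(String sql, String currentCatalog, String[] currentNamespace) {
        this.sql = sql;
        this.currentCatalog = currentCatalog;
        this.currentNamespace = currentNamespace;
    }
}
```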
Re: [DISCUSS] Disable streaming query with possible correctness issue by default

2020-11-10 Thread Ryan Blue
+1, I agree with Tom.

On Tue, Nov 10, 2020 at 3:00 PM Dongjoon Hyun 
wrote:

> +1 for Apache Spark 3.1.0.
>
> Bests,
> Dongjoon.
>
> On Tue, Nov 10, 2020 at 6:17 AM Tom Graves 
> wrote:
>
>> +1 since it's a correctness issue, I think it's OK to change the behavior
>> to make sure the user is aware of it and let them decide.
>>
>> Tom
>>
>> On Saturday, November 7, 2020, 01:00:11 AM CST, Liang-Chi Hsieh <
>> vii...@gmail.com> wrote:
>>
>>
>> Hi devs,
>>
>> In Spark structured streaming, chained stateful operators can produce
>> incorrect results under the global watermark. SPARK-33259
>> (https://issues.apache.org/jira/browse/SPARK-33259) has an example
>> demonstrating what the correctness issue could be.
>>
>> Currently we don't prevent users from running such queries, because the
>> possible correctness issue with chained stateful operators in a streaming
>> query is not obvious to users. From the user's perspective, it will likely
>> be considered a Spark bug, as in SPARK-33259. In the worst case, users are
>> not aware of the correctness issue and act on wrong results.
>>
>> IMO, it is better to disable such queries and let users choose to run the
>> query if they understand there is such a risk, instead of implicitly
>> running the query and letting users find out the correctness issue by
>> themselves.
>>
>> I would like to propose disabling, by default, streaming queries with a
>> possible correctness issue in chained stateful operators. The behavior can
>> be controlled by a SQL config, so if users understand the risk and still
>> want to run the query, they can disable the check.
>>
>> In the PR (https://github.com/apache/spark/pull/30210), the concern raised
>> so far is that this changes the current behavior and by default will break
>> some existing streaming queries. But I think it is easy to disable the
>> check with the new config. There is no objection in the PR so far, only a
>> suggestion to hear more voices. Please let me know if you have any
>> thoughts.
>>
>> Thanks.
>> Liang-Chi Hsieh
>>
>>
>>
>> --
>> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>

-- 
Ryan Blue
Software Engineer
Netflix
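For archive readers: the proposal quoted above gates the new check behind a SQL config, so a user who understands and accepts the risk can opt out per session. A sketch of what that opt-out looks like (the exact config name is defined in the linked PR and may differ):

```sql
-- Hypothetical opt-out: allow chained stateful operators to run despite
-- the possible-correctness check (config name may differ from the PR).
SET spark.sql.streaming.statefulOperator.checkCorrectness.enabled=false;
```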


Re: [DISCUSS] Disable streaming query with possible correctness issue by default

2020-11-10 Thread Dongjoon Hyun
+1 for Apache Spark 3.1.0.

Bests,
Dongjoon.

On Tue, Nov 10, 2020 at 6:17 AM Tom Graves 
wrote:

> +1 since it's a correctness issue, I think it's OK to change the behavior to
> make sure the user is aware of it and let them decide.
>
> Tom
>
> On Saturday, November 7, 2020, 01:00:11 AM CST, Liang-Chi Hsieh <
> vii...@gmail.com> wrote:
>
>
> Hi devs,
>
> In Spark structured streaming, chained stateful operators can produce
> incorrect results under the global watermark. SPARK-33259
> (https://issues.apache.org/jira/browse/SPARK-33259) has an example
> demonstrating what the correctness issue could be.
>
> Currently we don't prevent users from running such queries, because the
> possible correctness issue with chained stateful operators in a streaming
> query is not obvious to users. From the user's perspective, it will likely
> be considered a Spark bug, as in SPARK-33259. In the worst case, users are
> not aware of the correctness issue and act on wrong results.
>
> IMO, it is better to disable such queries and let users choose to run the
> query if they understand there is such a risk, instead of implicitly running
> the query and letting users find out the correctness issue by themselves.
>
> I would like to propose disabling, by default, streaming queries with a
> possible correctness issue in chained stateful operators. The behavior can
> be controlled by a SQL config, so if users understand the risk and still
> want to run the query, they can disable the check.
>
> In the PR (https://github.com/apache/spark/pull/30210), the concern raised
> so far is that this changes the current behavior and by default will break
> some existing streaming queries. But I think it is easy to disable the
> check with the new config. There is no objection in the PR so far, only a
> suggestion to hear more voices. Please let me know if you have any thoughts.
>
> Thanks.
> Liang-Chi Hsieh
>
>
>
>
>


Hive isolation and context classloaders

2020-11-10 Thread Steve Loughran
I'm staring at https://issues.apache.org/jira/browse/HADOOP-17372 and a
stack trace which claims that a com.amazonaws class doesn't implement an
interface which it very much does

2020-11-10 05:27:33,517 [ScalaTest-main-running-S3DataFrameExampleSuite]
WARN  fs.FileSystem (FileSystem.java:createFileSystem(3466)) - Failed to
initialize fileystem s3a://stevel-ireland: java.io.IOException: Class class
com.amazonaws.auth.EnvironmentVariableCredentialsProvider does not
implement AWSCredentialsProvider
- DataFrames *** FAILED ***
  org.apache.spark.sql.AnalysisException: java.lang.RuntimeException:
java.io.IOException: Class class
com.amazonaws.auth.EnvironmentVariableCredentialsProvider does not
implement AWSCredentialsProvider;
  at
org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:106)

This is happening because Hive wants to instantiate the FS for the cluster
filesystem (full stack trace in the JIRA for the curious):

FileSystem.get(startSs.sessionConf);


The cluster FS is set to S3, and the s3a code builds up its list of
credential providers via a configuration lookup:

conf.getClasses("fs.s3a.aws.credentials.provider",
    "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider," +
    "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider," +
    "com.amazonaws.auth.EnvironmentVariableCredentialsProvider," +
    "org.apache.hadoop.fs.s3a.auth.IAMInstanceCredentialsProvider");

followed by a validation that whatever was loaded can be passed into the
AWS SDK:

if (!AWSCredentialsProvider.class.isAssignableFrom(credClass)) {
  throw new IOException("Class " + credClass + " " + NOT_AWS_PROVIDER);
}

What appears to be happening is that the loading of the AWS credential
provider is failing because it goes through a Configuration based on the
HiveConf, which uses the context classloader that created that conf; so the
AWS SDK class EnvironmentVariableCredentialsProvider is loaded in the
isolated classloader. But S3AFileSystem, being org.apache.hadoop code, is
loaded in the base classloader. As a result, it doesn't consider
EnvironmentVariableCredentialsProvider to implement the credential provider
API.

What to do?

I could make this specific issue evaporate by just subclassing the AWS SDK
credential providers somewhere in o.a.h.fs.s3a and putting them on the
default list, but that leaves the issue lurking for anyone else and for
other configuration-driven extension points. Anyone who uses the plugin
options for the S3A and abfs connectors MUST use a class whose name begins
with org.apache.hadoop, or they won't be able to initialize Hive.

Alternatively, I could ignore the context classloader and make the
Configuration.getClasses() method use whatever classloader loaded the
actual S3AFileSystem class. I worry that if I do that, something else will
go horribly wrong somewhere completely random in the future, which anything
going near classloaders inevitably does at some point.
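A minimal sketch of that alternative, as an illustration rather than the actual Hadoop change: resolve the extension class against the classloader that loaded the framework class (falling back to the system loader when the framework class is bootstrap-loaded), then validate assignability within that single loader:

```java
// Sketch: load an implementation class with the same classloader that
// loaded the target interface, so isAssignableFrom compares classes from
// one loader. Falls back to the system loader for bootstrap classes,
// whose getClassLoader() returns null.
class ClassLoaderCompat {
    static Class<?> loadForInterface(String implName, Class<?> iface)
            throws ClassNotFoundException {
        ClassLoader cl = iface.getClassLoader();
        if (cl == null) { // interface was loaded by the bootstrap loader
            cl = ClassLoader.getSystemClassLoader();
        }
        Class<?> impl = Class.forName(implName, false, cl);
        if (!iface.isAssignableFrom(impl)) {
            throw new ClassNotFoundException(
                "Class " + impl + " does not implement " + iface.getName());
        }
        return impl;
    }
}
```

Because a class's runtime identity is the pair (name, classloader), loading both sides through one loader is what makes the assignability check meaningful.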

Suggestions?


Draft ASF board report for November

2020-11-10 Thread Matei Zaharia
Hi all,

It’s time to send in our quarterly ASF board report on Nov 11, so I wanted to 
include anything notable going on that we want to appear in the board archive. 
Here is my draft; let me know if you have suggested changes.

===

Apache Spark is a fast and general engine for large-scale data processing. It 
offers high-level APIs in Java, Scala, Python, R and SQL as well as a rich set 
of libraries including stream processing, machine learning, and graph analytics.

Project status:

- We released Apache Spark 3.0.1 on September 8th and Spark 2.4.7 on September 
12th as maintenance releases with bug fixes for these two branches.

- The community is working on a number of new features in the Spark 3.x branch, 
including improved data catalog APIs, a push-based shuffle implementation, and 
better error messages to make Spark applications easier to debug. The largest 
changes are being discussed as SPIPs on our mailing list.

Trademarks:

- One of the two software projects we reached out to in July to change its name 
due to a trademark issue has changed it. We are still waiting for a reply from 
the other one, but it may be that development there has stopped.

Latest releases:

- Spark 2.4.7 was released on September 12th, 2020.
- Spark 3.0.1 was released on September 8th, 2020.
- Spark 3.0.0 was released on June 18th, 2020.

Committers and PMC:

- The latest committers were added on July 14th, 2020 (Huaxin Gao, Jungtaek Lim 
and Dilip Biswal).
- The latest PMC member was added on Sept 4th, 2019 (Dongjoon Hyun). The PMC 
has been discussing some new candidates to add as PMC members.



Re: [DISCUSS] Disable streaming query with possible correctness issue by default

2020-11-10 Thread Tom Graves
+1 since it's a correctness issue, I think it's OK to change the behavior to
make sure the user is aware of it and let them decide.
Tom
On Saturday, November 7, 2020, 01:00:11 AM CST, Liang-Chi Hsieh 
 wrote:  
 
 Hi devs,

In Spark structured streaming, chained stateful operators can produce
incorrect results under the global watermark. SPARK-33259
(https://issues.apache.org/jira/browse/SPARK-33259) has an example
demonstrating what the correctness issue could be.

Currently we don't prevent users from running such queries, because the
possible correctness issue with chained stateful operators in a streaming
query is not obvious to users. From the user's perspective, it will likely
be considered a Spark bug, as in SPARK-33259. In the worst case, users are
not aware of the correctness issue and act on wrong results.

IMO, it is better to disable such queries and let users choose to run the
query if they understand there is such a risk, instead of implicitly running
the query and letting users find out the correctness issue by themselves.

I would like to propose disabling, by default, streaming queries with a
possible correctness issue in chained stateful operators. The behavior can
be controlled by a SQL config, so if users understand the risk and still
want to run the query, they can disable the check.

In the PR (https://github.com/apache/spark/pull/30210), the concern raised
so far is that this changes the current behavior and by default will break
some existing streaming queries. But I think it is easy to disable the
check with the new config. There is no objection in the PR so far, only a
suggestion to hear more voices. Please let me know if you have any thoughts.

Thanks.
Liang-Chi Hsieh


