Re: [VOTE] Release Spark 3.1.1 (RC1)

2021-01-19 Thread JackyLee
+1






Re: spark-on-k8s is still experimental?

2020-08-03 Thread JackyLee
+1. It has been working well in our company, and we have used it to support
online services since March of this year.






Re: Catalog API for Partition

2020-07-21 Thread JackyLee
The `partitioning` in `TableCatalog.createTable` is a partition schema for the
table; it doesn't contain the partition metadata of an actual partition.
Besides, the metadata of an actual partition may cover many partition columns,
as with Hive partitions.
Thus I created a `TablePartition` to hold the partition metadata, which can be
distinguished from `Transform`, and created `Partition Catalog APIs` to manage
partition metadata.

In short, a `TablePartition` combines multiple `Transform`s with the partition
metadata of an actual partition in Hive or another data source.
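
For concreteness, here is a rough sketch of the shape such an API could take.
The names `TablePartition` and `SupportsPartitions` and the signatures below
are illustrative only and do not necessarily match the PR:

    import java.util.{Map => JMap}
    import org.apache.spark.sql.catalyst.InternalRow
    import org.apache.spark.sql.connector.catalog.TableCatalog

    // Illustrative only: a partition is identified by its values (one per
    // partition column declared via Transform) plus free-form metadata,
    // e.g. the storage location of a Hive partition.
    case class TablePartition(ident: InternalRow, properties: JMap[String, String])

    // Illustrative only: a catalog mix-in for managing partition metadata.
    trait SupportsPartitions extends TableCatalog {
      def createPartition(table: String, partition: TablePartition): Unit
      def dropPartition(table: String, ident: InternalRow): Boolean
      def listPartitions(table: String): Array[TablePartition]
    }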






Re: Catalog API for Partition

2020-07-17 Thread JackyLee
Hi Wenchen. Thanks for your attention and reply.

First, these Partition Catalog APIs are not specific to Hive; they can be used
with a lakehouse, MySQL, or any other source that supports partitions.
Second, these Partition Catalog APIs are designed only for better data
management, not for speeding up data scans. The APIs used to speed up Hive
data scans are different from these.

Currently, we use the Hive Catalog APIs to speed up Hive data scans and to
write data into Hive. However, we are trying to redefine HiveTable, which
implements FileTable, and to use partition pruning to speed up Hive scans.
Personally, I think this is a better way to support Hive in DataSourceV2.

Thanks again.
Jacky Lee






Catalog API for Partition

2020-07-16 Thread JackyLee
Hi devs,

In order to support partition commands for DataSourceV2 and the lakehouse, I'm
trying to add a Partition API for multiple catalogs.

These APIs are widely used in MySQL, Hive, and other data sources, and we can
use them to manage partition metadata in a lakehouse.
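
As a rough illustration of the kind of commands these APIs are meant to back
(the catalog and table names below are made up, and `spark` is an existing
SparkSession; the exact SQL surface is defined by the PR, not by this sketch):

    // Hypothetical catalog/table names, shown only to illustrate the partition
    // DDL that a v2 catalog with partition support could serve.
    spark.sql("ALTER TABLE lake.db.events ADD PARTITION (dt = '2020-07-16')")
    spark.sql("SHOW PARTITIONS lake.db.events")
    spark.sql("ALTER TABLE lake.db.events DROP PARTITION (dt = '2020-07-16')")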

JIRA: https://issues.apache.org/jira/browse/SPARK-31694
PR: https://github.com/apache/spark/pull/28617

We have already used these APIs to support a lakehouse on Delta Lake and Hive
on DataSourceV2, and they do solve partition support for DataSourceV2.
Could anyone review it?

Thanks,
Jacky Lee






Resolve _temporary directory uncleaned

2020-07-16 Thread JackyLee
Hi devs,

In InsertIntoHiveTable and InsertIntoHiveDirCommand, we use
deleteExternalTmpPath to clean temporary directories after the job is
committed, and we cancel deleteOnExit if it succeeded. But sometimes (e.g.,
when speculative tasks are enabled), temporary directories may be left
uncleaned. This happens when some tasks are still running after we have
called deleteExternalTmpPath.
Thus I tried to add a JobCleaned status to clean up temporary directories. The
JobCleaned status is reached once all stages of a job have been cleaned, so it
is a good place to do job-level cleanup.
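
The actual change hooks into the scheduler, but the idea can be roughly
approximated with the public listener API. The sketch below is only an analogy
(onJobEnd can still race with lingering speculative tasks, which is exactly the
gap the JobCleaned status closes), and the staging path is hypothetical:

    import org.apache.hadoop.fs.Path
    import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd}
    import org.apache.spark.sql.SparkSession

    // Rough analogy only: delete a staging directory once a job finishes.
    class TmpDirCleaner(spark: SparkSession, stagingDir: Path) extends SparkListener {
      override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit = {
        val fs = stagingDir.getFileSystem(spark.sparkContext.hadoopConfiguration)
        if (fs.exists(stagingDir)) {
          fs.delete(stagingDir, true) // recursive delete of the temporary directory
        }
      }
    }

    // Usage (path is hypothetical):
    // spark.sparkContext.addSparkListener(new TmpDirCleaner(spark, new Path("/tmp/hive-staging")))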

JIRA: https://issues.apache.org/jira/browse/SPARK-31438
PR: https://github.com/apache/spark/pull/28280

There has been some discussion in https://github.com/apache/spark/pull/28129.

This PR has been open for about 3 months. Could anyone review it?

Thanks,
Jacky Lee






Re: Apache Spark 3.1 Feature Expectation (Dec. 2020)

2020-06-29 Thread JackyLee
Thank you for putting this forward.
Can we include support for the view catalog and the partition catalog in
version 3.1?
AFAIK, these are great features for DSv2 and the catalog APIs. With them, we
can work well with warehouses such as Delta or Hive.

https://github.com/apache/spark/pull/28147
https://github.com/apache/spark/pull/28617

Thanks.






Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

2020-05-11 Thread JackyLee
+1. Agree with Xiao Li and Jungtaek Lim.

This seems to be controversial and cannot be settled in a short time. It is
necessary to choose option 1 to unblock Spark 3.0 and support this in 3.1.






Re: [DISCUSS] Supporting hive on DataSourceV2

2020-03-24 Thread JackyLee
Hi Blue,

I have created a JIRA for supporting Hive on DataSourceV2; we can associate
the specific modules with it:
https://issues.apache.org/jira/browse/SPARK-31241

Could you provide a Google Doc of the current design, so that we can discuss
and improve it in detail here?






Re: [DISCUSS] Supporting hive on DataSourceV2

2020-03-23 Thread JackyLee
Glad to hear that you have already supported it; that is just what we are
doing. The exceptions you mentioned don't conflict with Hive support; we can
easily make it compatible.

> Do you have an idea about where the connector should be developed? I don’t
> think it makes sense for it to be part of Spark. That would keep complexity
> in the main project and require updating Hive versions slowly. Using a
> separate project would mean less code in Spark specific to one source, and
> could more easily support multiple Hive versions. Maybe we should create a
> project for catalog plug-ins?

AFAIK, it is necessary to create a new project; users need to create their own
connector according to their own needs. In our implementation of Hive on
DataSourceV2, we put the basic Partition API and commands in the main project,
and put a default HiveCatalog and HiveConnector into an external project.
Users can use our project or implement their own HiveConnector. Maybe this is
a good way to support it.
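
For reference, DataSourceV2 already lets an external project ship a catalog and
have users wire it in purely through configuration. A hedged skeleton (the
class name is made up and every method is left as a stub):

    import java.util
    import org.apache.spark.sql.connector.catalog.{Identifier, Table, TableCatalog, TableChange}
    import org.apache.spark.sql.connector.expressions.Transform
    import org.apache.spark.sql.types.StructType
    import org.apache.spark.sql.util.CaseInsensitiveStringMap

    // Skeleton of a catalog shipped from an external project; it only shows how
    // a plug-in hooks into the TableCatalog API, not our actual HiveCatalog.
    class MyHiveCatalog extends TableCatalog {
      private var catalogName: String = _

      override def initialize(name: String, options: CaseInsensitiveStringMap): Unit = {
        catalogName = name
      }
      override def name(): String = catalogName

      override def listTables(namespace: Array[String]): Array[Identifier] = ???
      override def loadTable(ident: Identifier): Table = ???
      override def createTable(
          ident: Identifier,
          schema: StructType,
          partitions: Array[Transform],
          properties: util.Map[String, String]): Table = ???
      override def alterTable(ident: Identifier, changes: TableChange*): Table = ???
      override def dropTable(ident: Identifier): Boolean = ???
      override def renameTable(oldIdent: Identifier, newIdent: Identifier): Unit = ???
    }

    // Registered via configuration, e.g. in spark-defaults.conf:
    //   spark.sql.catalog.my_hive = com.example.MyHiveCatalog
    // and addressed in SQL as:  SELECT * FROM my_hive.db.tbl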

Looking forward to your patch submission; we can cooperate in this area.






[DISCUSS] Supporting hive on DataSourceV2

2020-03-23 Thread JackyLee
Hi devs,
I’d like to start a discussion about supporting Hive on DataSourceV2. We’re
now working on a project that uses DataSourceV2 to provide multi-source
support; it works very well with the data lake solution, yet it does not yet
support HiveTable.

There are 3 reasons why we need to support Hive on DataSourceV2.
1. Hive itself is one of Spark's data sources.
2. HiveTable is essentially a FileTable with its own input and output
formats, so it works fine as a FileTable.
3. HiveTable should be stateless, so users can freely read or write Hive
using batch or micro-batch.

We implemented stateless Hive on DataSourceV1; it lets users write into Hive
in streaming or batch mode, and it has been widely used in our company.
Recently, we have been trying to support Hive on DataSourceV2; multiple Hive
catalogs and the DDL commands are already supported.
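
From the user's point of view, the goal looks roughly like the snippet below.
The catalog name and classes are hypothetical, and the streaming read/write
methods shown here only appeared in later Spark releases, so treat it as a
sketch of the intent rather than of the current implementation:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("hive-on-dsv2-sketch")
      // Hypothetical: a Hive catalog plug-in registered under the name "hive_cat".
      .config("spark.sql.catalog.hive_cat", "com.example.MyHiveCatalog")
      .getOrCreate()

    // Batch read and write through the v2 catalog.
    val df = spark.table("hive_cat.db.events")
    df.write.mode("append").saveAsTable("hive_cat.db.events_copy")

    // Micro-batch write into the same table; this only works if the table is
    // stateless on the writer side, which is requirement 3 above.
    spark.readStream
      .table("hive_cat.db.events")
      .writeStream
      .option("checkpointLocation", "/tmp/ckpt") // checkpoint path is illustrative
      .toTable("hive_cat.db.events_copy")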

Looking forward to more discussions on this.






Question about spark on k8s

2020-01-03 Thread JackyLee
Hello, devs.

In our scenario, we run Spark on Kata-like containers, and we found that the
code hard-codes the Kube-DNS domain. If Kube-DNS is not configured in the
environment, tasks fail.

My question is: why is the Kube-DNS domain name written into the code? Isn't
it better to read the domain name from the service?






Re: [VOTE] SPIP: Identifiers for multi-catalog Spark

2019-02-19 Thread JackyLee
+1






Re: Welcome Jose Torres as a Spark committer

2019-01-29 Thread JackyLee
Congrats, Joe!

Best,
Jacky






Re: Ask for reviewing on Structured Streaming PRs

2019-01-14 Thread JackyLee
Agree with rxin. Maybe we should consider these PRs, especially the large
ones, after the DataSource V2 API is ready.






Re: Support SqlStreaming in spark

2018-12-27 Thread JackyLee
Hi, Wenchen

Thank you for your recognition of streaming on SQL. I have written the
SQLStreaming design document:
https://docs.google.com/document/d/19degwnIIcuMSELv6BQ_1VQI5AIVcvGeqOm5xE2-aRA0/edit#

Your questions are answered here:
https://docs.google.com/document/d/19degwnIIcuMSELv6BQ_1VQI5AIVcvGeqOm5xE2-aRA0/edit#heading=h.t96f9l205fk1

There may be some details that I have not considered; we can discuss them in
more depth.

Thanks






Re: Support SqlStreaming in spark

2018-12-25 Thread JackyLee
No problem






Re: Support SqlStreaming in spark

2018-12-21 Thread JackyLee
Hi Wenchen,
I have been working on SQLStreaming for a year, and I have promoted it
within my company.
I have seen the designs for Kafka and Calcite, and I believe my design is
better than theirs. They support pure SQL, not a table API, for streaming:
users can only use the specified streaming statement, and the same statement
cannot run batch queries.
But in my opinion, the Table API is actually the key to SQLStreaming;
pure SQL is just another expression of the streaming Table API.






Re: Support SqlStreaming in spark

2018-12-21 Thread JackyLee
Hi Wenchen and Arun Mahadevan,
Thanks for your replies.

SQLStreaming is not just a way to support pure SQL, but also a way to
define a table API for streaming.
I have redefined SQLStreaming so that it supports the table API. Users can
use SQL or the table API to run SQLStreaming.

I will update the design document of SQLStreaming. Could you help me
improve the design doc?

Again, thanks for your attention.






Why use EMPTY_DATA_SCHEMA when creating a datasource table

2018-12-17 Thread JackyLee
Hi, everyone

I have some questions about creating a datasource table.
In HiveExternalCatalog.createDataSourceTable,
newSparkSQLSpecificMetastoreTable replaces the table schema with
EMPTY_DATA_SCHEMA plus table.partitionSchema.
So, why do we use EMPTY_DATA_SCHEMA? Why not declare the schema in some other
way?
There are a lot of datasource tables that don't have a partitionSchema; will
their schemas all be replaced with EMPTY_DATA_SCHEMA?
Even if Spark itself can parse this, what happens if the user views the table
information from the Hive side?
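
For context, my (possibly wrong) understanding is that the case-preserved
schema of a datasource table is also serialized into table properties (the
spark.sql.sources.schema.* keys), and Spark restores it from there even when
the metastore only holds the placeholder columns. A quick way to check:

    // Hedged illustration: the real schema of a datasource table lives in table
    // properties, which Spark reads back; the Hive side may only show placeholder
    // columns when the schema is not Hive-compatible.
    spark.sql("CREATE TABLE t (id BIGINT, name STRING) USING parquet")
    spark.sql("SHOW TBLPROPERTIES t").show(truncate = false)
    // Look for keys such as spark.sql.sources.schema.numParts and
    // spark.sql.sources.schema.part.0 (exact property names vary by version).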

Can anyone help me?
Thanks.






Re: Public v2 interface location

2018-11-30 Thread JackyLee
Hi, Ryan Blue.

I don't think it would be a good idea to add a sql-api module.
I prefer adding the SQL API to sql/core. SQL is just another representation
of Dataset, so there is no need to add a new module for this. Besides, it
would be easier to add the SQL API in core.

By the way, I don't think this is a good time to add a SQL API; we have not
yet settled many details of the DataSource V2 API.






Re: DataSourceV2 community sync #3

2018-11-27 Thread JackyLee
+1

Please add me to the Google Hangout invite. 






Re: DataSourceV2 capability API

2018-11-12 Thread JackyLee
I don't know whether it is right to shape the table API as
ContinuousScanBuilder -> ContinuousScan -> ContinuousBatch; it makes
batch/micro-batch/continuous too different from each other.
In my opinion, these are basically similar at the table level. So is it
possible to design an API like this?
ScanBuilder -> Scan -> ContinuousBatch/MicroBatch/SingleBatch
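
A rough sketch of the shape being suggested; the trait names below are purely
illustrative and are not the actual DataSourceV2 interfaces:

    // Illustrative only: a single builder/scan pair whose Scan hands back
    // whichever execution mode the source supports, instead of parallel
    // Continuous* variants of every interface.
    trait SingleBatch
    trait MicroBatchStream
    trait ContinuousStream

    trait Scan {
      def toBatch: SingleBatch
      def toMicroBatchStream(checkpointLocation: String): MicroBatchStream
      def toContinuousStream(checkpointLocation: String): ContinuousStream
    }

    trait ScanBuilder {
      def build(): Scan
    }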






Re: Plan on Structured Streaming in next major/minor release?

2018-11-04 Thread JackyLee
Can these things be added to this list?
1. [SPARK-24630] Support SQLStreaming in Spark
   This patch defines the Table API for Structured Streaming.
2. [SPARK-25937] Support user-defined schema in Kafka Source & Sink
   This patch makes it easier for users to work with Structured Streaming.
3. SS supports dynamic partition scheduling
   SS uses a serial execution engine, which means SS cannot catch up with the
incoming data effectively when there is back pressure or the computing speed
drops. If dynamic partition scheduling for SS were implemented, the partition
number would be increased automatically when needed, so SS could effectively
catch up. The main idea is to trade computing resources for time.






Re: Support SqlStreaming in spark

2018-10-21 Thread JackyLee
The code of SQLStreaming has been pushed:

https://github.com/apache/spark/pull/22575






Re: data source api v2 refactoring

2018-10-21 Thread JackyLee
I have pushed a patch for SQLStreaming, which resolves exactly the problem
just discussed.
The JIRA:
https://issues.apache.org/jira/browse/SPARK-24630
The patch:
https://github.com/apache/spark/pull/22575

SQLStreaming just defines the table API for Structured Streaming, and the
table APIs for streaming and batch are fully compatible.

With SQLStreaming, we can create a stream just like this:
val table = spark.createTable()
spark.table(temp)






Re: Plan on Structured Streaming in next major/minor release?

2018-10-21 Thread JackyLee
Thanks for raising them.

FYI, I believe this open issue could also be considered:

https://issues.apache.org/jira/browse/SPARK-24630

A new ability to express Structured Streaming in pure SQL.






Re: Support SqlStreaming in spark

2018-06-28 Thread JackyLee
Spark JIRA:
https://issues.apache.org/jira/projects/SPARK/issues/SPARK-24630

Benefits:

Firstly, users who are unfamiliar with streaming can easily use SQL to run
Structured Streaming, especially when migrating offline tasks to real-time
processing tasks.
Secondly, supporting a SQL API in Structured Streaming can also combine
Structured Streaming with Hive. Users can store the source/sink metadata in a
table and use the Hive metastore to manage it. Users who want to read this
data can easily create a stream by accessing the table, which can greatly
reduce the development and maintenance costs of Structured Streaming.
Finally, it becomes easy to achieve unified management and permission control
of sources and sinks, and the management of private data becomes more
controllable, especially in financial or security areas.






Support SqlStreaming in spark

2018-06-14 Thread JackyLee
Hello 

Nowadays, more and more streaming products are beginning to support SQL
streaming, such as Kafka SQL, Flink SQL, and Storm SQL. Supporting SQL
streaming can not only lower the barrier to entry for streaming, but also
make streaming easier for everyone to adopt.

At present, Structured Streaming is relatively mature, and it is based on the
Dataset API, which makes it possible to provide a SQL portal for Structured
Streaming and to run Structured Streaming in SQL.
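
Even today, the Dataset foundation allows a limited form of this: a streaming
DataFrame can be registered as a temporary view and queried with plain SQL.
The snippet below (source and sink choices are only for illustration) shows
the seam that a full SQL portal would build on:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("sql-over-streaming").getOrCreate()

    // Register a streaming source as a view; the built-in rate source is used
    // purely for illustration.
    spark.readStream
      .format("rate")
      .option("rowsPerSecond", "10")
      .load()
      .createOrReplaceTempView("stream_src")

    // Plain SQL over the streaming view yields another streaming DataFrame.
    val counts = spark.sql(
      "SELECT value % 10 AS bucket, COUNT(*) AS cnt FROM stream_src GROUP BY value % 10")

    // Write the continuous aggregation to the console sink.
    counts.writeStream
      .outputMode("complete")
      .format("console")
      .start()
      .awaitTermination()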

To support SQL streaming, there are two key points:
1. The analysis phase should be able to parse streaming-type SQL.
2. The Analyzer should be able to map metadata information to the
corresponding relation.

Running Structured Streaming in SQL can bring some benefits:
1. It reduces the entry threshold of Structured Streaming and attracts users
more easily.
2. It encapsulates the meta information of sources and sinks into tables,
which can be maintained and managed uniformly and makes them more accessible
to users.
3. Metadata permission management, based on Hive, can tie Structured
Streaming into an overall permission management scheme more closely.

We have found some ways to solve this problem. It would be a pleasure to
discuss them with you.

Thanks,  

Jackey Lee


