Re: Re: [Spark SQL]: SQL, Python, Scala and R API Consistency

2021-02-03 Thread 大啊
+1 for this work!
I agree with the clarification below:
> SQL is a somewhat different case. There are functions that aren't _that_ useful
> in general, kind of niche, but nevertheless exist in other SQL systems, most
> notably Hive. It's useful to try to expand SQL support to cover those to ease
> migration and interoperability. But it may not make enough sense to maintain
> those functions in Scala, and Python, and R, because they're niche.

At 2021-01-29 04:40:07, "Sean Owen"  wrote:

I think I can articulate the general idea here, though I expect it is not 
deployed consistently.


Yes there's a general desire to make APIs consistent across languages. Python 
and Scala should track pretty closely, even if R isn't really that consistent.


SQL is a somewhat different case. There are functions that aren't _that_ useful 
in general, kind of niche, but nevertheless exist in other SQL systems, most 
notably Hive. It's useful to try to expand SQL support to cover those to ease 
migration and interoperability. But it may not make enough sense to maintain 
those functions in Scala, and Python, and R, because they're niche.


I think that was what you saw with regexp_extract_all. As you can see there 
isn't perfect agreement on where to draw those lines. But I think the theory 
has been mostly consistent over time, if not the execution.


It isn't that regexp_extract_all (for example) is useless outside SQL, just, 
where do you draw the line? Supporting 10s of random SQL functions across 3 
other languages has a cost, which has to be weighed against benefit, which we 
can never measure well except anecdotally: one or two people say "I want this" 
in a sea of hundreds of thousands of users.


(I'm not sure about CalendarType - just I know that date/time types are hard 
even within, say, the JVM, let alone across languages)


For this specific case, I think there is a fine argument that
regexp_extract_all should be added simply for consistency with regexp_extract.
I can also see the argument that regexp_extract was a step too far, but what's
public is now a public API.
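
(For concreteness, a minimal sketch of where things stood at the time of this
thread: both functions resolve through the SQL function registry, but only
regexp_extract also had a Scala wrapper in org.apache.spark.sql.functions.
Assumes an active SparkSession named spark.)

    // Reachable from Scala only through SQL / expr, no functions.* wrapper:
    spark.sql(
      "SELECT regexp_extract_all('100-200, 300-400', '(\\\\d+)-(\\\\d+)', 1)").show()
    // By contrast, regexp_extract has both the SQL form and a Scala wrapper:
    spark.sql(
      "SELECT regexp_extract('100-200, 300-400', '(\\\\d+)-(\\\\d+)', 1)").show()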


I'll also say that the cost of adding API functions grows as a project matures, 
and whereas it might have made sense to add this at an earlier time, it might 
not make sense now.


I come out neutral on this specific case, but would not override the opinion of 
other committers. But I hope that explains the logic that I think underpins 
what you're hearing.




On Thu, Jan 28, 2021 at 2:23 PM MrPowers  wrote:

Thank you all for your amazing work on this project.  Spark has a great
public interface and the source code is clean.  The core team has done a
great job building and maintaining this project.  My emails / GitHub
comments focus on the 1% that we might be able to improve.

Pull requests / suggestions for improvements can come across as negative,
but I'm nothing but happy & positive about this project.  The source code is
delightful to read and the internal abstractions are beautiful.

*API consistency*

The SQL, Scala, and Python APIs are generally consistent.  They all have a
reverse function, for example.

Some of the new PRs have arguments against consistent rollout of functions
across the APIs.  This seems like a break from the traditional Spark
development process, in which functions were implemented in all APIs (except
for functions that only make sense for certain APIs, like createDataset and
toDS).

The default has shifted from consistent application of functions across APIs
to "case by case determination".

*Examples*

* The regexp_extract_all function was recently added to the SQL API.  It was
then added to the Scala API, but then removed from the Scala API.

* There is an ongoing discussion on whether CalendarType will be added to the
Python API.

*Arguments against adding functions like regexp_extract_all to the Scala
API:*

* Some of these functions are SQL specific and don't make sense for the
other languages

* Scala users can access the SQL functions via expr

*Argument rebuttal*

I don't understand the "some of the functions are SQL specific" argument.
regexp_extract_all fills a gap in the API.  Users have been forced to use
UDF workarounds for this in the past.  Users from all APIs need this
solution.
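
(As a minimal sketch of the kind of UDF workaround referred to above, in
Scala and with hypothetical names; this is exactly the boilerplate a native
regexp_extract_all makes unnecessary. Assumes an active SparkSession named
spark.)

    import org.apache.spark.sql.functions.{col, lit, udf}
    import spark.implicits._

    // Hypothetical stand-in for the missing function: collect every match of
    // one capture group using an ordinary Scala regex inside a UDF.
    val regexpExtractAllUdf = udf { (s: String, pattern: String, group: Int) =>
      pattern.r.findAllMatchIn(s).map(_.group(group)).toList
    }
    val df = Seq("100-200, 300-400").toDF("s")
    df.select(regexpExtractAllUdf(col("s"), lit("(\\d+)-(\\d+)"), lit(1))).show()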

Using expr isn't developer friendly.  Scala / Python users don't want to
manipulate SQL strings.  Nesting functions in SQL strings is complicated.
The quoting and escaping are all different.  Figuring out how to invoke
regexp_replace(col("word1"), "//", "\\,") via expr would be a real pain: you
would need to figure out SQL quoting, SQL escaping, and how to reference
column names instead of using a column object.
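
(To make the pain concrete, a minimal sketch of the two styles side by side,
assuming an active SparkSession named spark; note that the doubled-up
escaping inside the expr string is exactly the part that is easy to get
wrong.)

    import org.apache.spark.sql.functions.{col, expr, regexp_replace}
    import spark.implicits._

    val df = Seq("a//b").toDF("word1")

    // Native Column API: pattern and replacement are plain Scala strings.
    val viaApi = df.withColumn("cleaned",
      regexp_replace(col("word1"), "//", "\\,"))

    // Via expr: the same call as a SQL string -- a bare column name, SQL
    // string literals, and an extra layer of escaping on top of Scala's.
    val viaExpr = df.withColumn("cleaned",
      expr("regexp_replace(word1, '//', '\\\\,')"))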

Any of the org.apache.spark.sql.functions can be invoked via expr.  The core
reason the Scala/Python APIs exist is so that developers don't need to
manipulate strings for expr.

regexp_extract_all should be added to the Scala API for the sa

Re: [DISCUSS] Add RocksDB StateStore

2021-02-03 Thread redsk
Hi, 

FYI, I have been using the project at
https://github.com/chermenin/spark-states
for a few months and it has been working well for me.

-Nico






Re: [VOTE] Release Spark 3.1.1 (RC1)

2021-02-03 Thread Kent Yao

Sending https://github.com/apache/spark/pull/31460

Based on my research so far, when there is an existing io.file.buffer.size in
hive-site.xml, the hadoopConf finally gets reset by that. In many real-world
cases, when interacting with the Hive catalog through Spark SQL, users may
just share the hive-site.xml from their Hive jobs and copy it to
SPARK_HOME/conf without modification. In Spark, when we generate Hadoop
configurations, we use spark.buffer.size (65536) to reset io.file.buffer.size
(4096). But when we load hive-site.xml, we may ignore this behavior and reset
io.file.buffer.size again according to hive-site.xml.

The PR fixes two problems:
1. The configuration priority for setting Hadoop and Hive configs here is not
right; literally, the order should be spark > spark.hive > spark.hadoop >
hive > hadoop.
2. This breaks the spark.buffer.size config's behavior for tuning IO
performance with HDFS if there is an existing io.file.buffer.size in
hive-site.xml.
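
(A quick way to check which value actually won, as a minimal sketch assuming
an active SparkSession named spark; after the fix, the spark.buffer.size
default of 65536 should be the effective value.)

    // Inspect the effective Hadoop configuration Spark hands to HDFS clients.
    val bufSize = spark.sparkContext.hadoopConfiguration.get("io.file.buffer.size")
    println(s"io.file.buffer.size = $bufSize")  // 4096 means hive-site.xml won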

Kent Yao @ Data Science Center, Hangzhou Research Institute, NetEase Corp.
A Spark enthusiast.
kyuubi: a unified multi-tenant JDBC interface for large-scale data processing
and analytics, built on top of Apache Spark.
spark-authorizer: A Spark SQL extension which provides SQL Standard
Authorization for Apache Spark.
spark-postgres: A library for reading data from and transferring data to
Postgres / Greenplum with Spark SQL and DataFrames, 10~100x faster.
spark-func-extras: A library that brings excellent and useful functions from
various modern database management systems to Apache Spark.

On 02/3/2021 15:36, Maxim Gekk wrote:


Hi All,

> Also I am investigating a performance regression in some TPC-DS queries
> (q88 for instance) that is caused by a recent commit in 3.1 ...

I have found that the perf regression is caused by the Hadoop config:
io.file.buffer.size = 4096
Before the commit
https://github.com/apache/spark/commit/278f6f45f46ccafc7a31007d51ab9cb720c9cb14,
we had:
io.file.buffer.size = 65536

Maxim Gekk
Software Engineer
Databricks, Inc.

On Wed, Feb 3, 2021 at 2:37 AM Hyukjin Kwon wrote:
Yeah, agree. I changed. Thanks for the heads up, Tom.

On Wed, Feb 3, 2021 at 8:31 AM, Tom Graves wrote:
Ok, thanks for the update. That is marked as an improvement; if it's a
blocker, can we mark it as such and describe why? I searched JIRAs and didn't
see any critical or blocker issues open.

Tom

On Tuesday, February 2, 2021, 05:12:24 PM CST, Hyukjin Kwon  wrote:



There is one here: https://github.com/apache/spark/pull/31440. There look to
be several issues being identified (to confirm that they are issues in OSS
too) and fixed in parallel. There have been some unexpected delays here, as
several more issues were found. I will try to file and share the relevant
JIRAs as soon as I can confirm.

On Wed, Feb 3, 2021 at 2:36 AM, Tom Graves wrote:
Just curious if we have an update on the next RC? Is there a JIRA for the
TPC-DS issue?

Thanks,
Tom

On Wednesday, January 27, 2021, 05:46:27 PM CST, Hyukjin Kwon  wrote:



Just to share the current status: most of the known issues were resolved.
Let me know if there are some more. One thing left is a performance
regression in TPC-DS being investigated. Once this is identified (and fixed
if it should be), I will cut another RC right away. I roughly expect to cut
another RC next Monday. Thanks guys.

On Wed, Jan 27, 2021 at 5:26 AM, Terry Kim wrote:
Hi,
Please check if the following regression should be included:
https://github.com/apache/spark/pull/31352
Thanks,
Terry

On Tue, Jan 26, 2021 at 7:54 AM Holden Karau wrote:
If we're ok waiting for it, I'd like to get
https://github.com/apache/spark/pull/31298 in as well (it's not a regression,
but it is a bug fix).

On Tue, Jan 26, 2021 at 6:38 AM Hyukjin Kwon wrote:
It looks like a cool one, but it's a pretty big one and affects the plans
considerably ... maybe it's best to avoid adding it into 3.1.1, in particular
during the RC period, if this isn't a clear regression that affects many
users.

On Tue, Jan 26, 2021 at 11:23 PM, Peter Toth wrote:
Hey,
Sorry for chiming in a bit late, but I would like to suggest my PR
(https://github.com/apache/spark/pull/28885) for review and inclusion into
3.1.1. Currently, invalid reuse reference nodes appear in many queries,
causing performance issues and incorrect explain plans. Now that
https://github.com/apache/spark/pull/31243 got merged, these invalid

Re: Public API access to UDTs

2021-02-03 Thread Sean Owen
I opened https://github.com/apache/spark/pull/31461 to track the discussion
further. It narrowly proposes making a few types public.

On Mon, Feb 1, 2021 at 8:52 AM Fitch, Simeon  wrote:

> 🙇
>
> On Mon, Feb 1, 2021 at 9:38 AM Sean Owen  wrote:
>
>> I'm not hearing any objection to making it public as a @DeveloperApi?
>> Anyone object to a PR on that?
>>
>> On Fri, Jan 29, 2021 at 8:46 AM Sean Owen  wrote:
>>
>>> I'm also interested: are there problems with opening up this API beyond
>>> needing to freeze it and keep it stable? It's pretty stable.
>>> As @DeveloperApi at least?
>>> Are there implications for storing UDTs in particular engines or formats?
>>> Just making it public for developers, even with a 'use at your own risk'
>>> warning, seems like a pretty small change?
>>>
>>> On Thu, Jan 28, 2021 at 5:10 PM Fitch, Simeon  wrote:
>>>
 Hi,

 First time posting here, so apologies if I need to be directing this
 topic elsewhere.

 I'm the author of RasterFrames, and a contributor to GeoMesa's Spark
 SQL module. Both make use of decently low-level Catalyst constructs,
 including custom UDTs; RasterFrames introduces a geospatial raster type, and
 GeoMesa a geometry type.

 In order to make this work we've circumvented the [`package private`](
 https://bit.ly/3pr0fVv)  restriction on `UDTRegistration` by inserting
 sibling classes into the package namespace. It's a hack, and works fine
 with JVM 8, but violates the [much more restrictive](
 https://bit.ly/3aadO5g) module constructs in JVM 9+.
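
 (For readers less familiar with the workaround described above, a minimal
 sketch of the sibling-class trick with hypothetical raster class names; the
 point is only that the caller must be declared inside Spark's own package
 tree for the private[spark] API to be visible:)

    // Declared in our project, but compiled into Spark's package namespace
    // so that the package-private UDTRegistration object is accessible.
    package org.apache.spark.sql.types

    object RasterUDTRegistrar {
      def register(): Unit =
        UDTRegistration.register(
          "com.example.Raster",    // hypothetical user class
          "com.example.RasterUDT") // hypothetical UserDefinedType subclass
    }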

 We've been monitoring [SPARK-7768](
 https://issues.apache.org/jira/browse/SPARK-7768) (filed in 2015) and
 its [associated PR](https://github.com/apache/spark/pull/16478) for
 years now, but it keeps getting kicked down the road(map).

 As authors of open source systems we completely understand how and why
 this happens, but we are at a critical juncture in our projects' lifecycle,
 anchored to JVM 8 while other systems have moved on to later versions. We'd
 also like to enjoy the benefits of later JVMs.

 So... I'm here to find out how I and others critically needing public
 access to `UDTRegistration` might better advocate for it?

 I think (though I'm not 100% sure) the PR linked above is more extensive
 than what we need, as it also addresses usability around Encoders, for which
 we have our own type class solution. My assumption to date has been that all
 we need is line 32 of `UDTRegistration` deleted (if there's folly therein,
 please say so!). While I understand a reluctance to promote `UDTRegistration`
 to `public`, I note that it has not been changed since 2016, perhaps a good
 indicator that the API is stable enough. Marking it as `@Experimental` could
 be a compromise option.

 Thanks for reading this far and giving this consideration. Any and all
 advice is appreciated.

 Simeon (@metasim)


 --
 Simeon Fitch
 Co-founder & VP of R&D
 Astraea, Inc.


>
> --
> Simeon Fitch
> Co-founder & VP of R&D
> Astraea, Inc.
>
>