Re: Pyspark: Issue using sql in foreachBatch sink

2020-08-03 Thread muru
Thanks Jungtaek for your help. On Fri, Jul 31, 2020 at 6:31 PM Jungtaek Lim wrote: > Python doesn't allow omitting the empty parentheses on a no-argument call, whereas Scala does. > Use `write()`, not `write`. > > On Wed, Jul 29, 2020 at 9:09 AM muru wrote: > >> In a pyspark SS job, trying to use sql instead of sql
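For readers landing here from the archive, a minimal sketch of the pattern under discussion: running SQL against a micro-batch inside a foreachBatch sink in PySpark Structured Streaming. The source, column names, and paths are hypothetical, and this is not the original poster's job.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("foreach-batch-sql-sketch").getOrCreate()

def process_batch(micro_batch_df, batch_id):
    # Register the micro-batch as a temp view and run SQL against it using the
    # batch DataFrame's own session; inside foreachBatch the global session can
    # be a different, cloned session, so the view may not be visible there.
    batch_session = micro_batch_df.sql_ctx.sparkSession
    micro_batch_df.createOrReplaceTempView("micro_batch")
    result = batch_session.sql(
        "SELECT key, count(*) AS cnt FROM micro_batch GROUP BY key"
    )
    result.write.mode("append").parquet("/tmp/out")  # hypothetical output path

stream_df = spark.readStream.format("rate").load().selectExpr("value % 10 AS key")

query = (
    stream_df.writeStream
    .foreachBatch(process_batch)
    .option("checkpointLocation", "/tmp/chk")  # hypothetical checkpoint path
    .start()
)
```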

Re: What is an "analytics engine"?

2020-08-03 Thread tianlangstudio
Hello, sir. "Engine" means that Spark has the capability to process data, but it must be combined with other components to build a data platform: the data platform is like a car and Spark is like the motor. Maybe I am wrong. TianlangStudio Some of the biggest lies: I will start tomorrow / Others are better

RE: DataSource API v2 & Spark-SQL

2020-08-03 Thread Lavelle, Shawn
Thanks for clarifying, Russell. Is a Spark native catalog reference on the roadmap for dsv2, or should I be trying to use something else? ~ Shawn From: Russell Spitzer [mailto:russell.spit...@gmail.com] Sent: Monday, August 3, 2020 8:27 AM To: Lavelle, Shawn Cc: user Subject: Re: DataSource API

Re: CVE-2020-9480: Apache Spark RCE vulnerability in auth-enabled standalone master

2020-08-03 Thread Sean Owen
I'm resending this CVE from several months ago to user@ and dev@, as we understand that a tool to exploit it may be released soon. The most straightforward mitigation for those that are affected (using the standalone master, where spark.authenticate is necessary) is to update to 2.4.6 or 3.0.0+.
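A small sketch (not an official detection tool, only a heuristic check) of confirming from a running PySpark application whether it sits on the affected combination of version and settings described above:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Affected deployments run a standalone master with spark.authenticate enabled
# on a version older than 2.4.6 / 3.0.0.
print("Spark version:     ", spark.version)
print("spark.master:      ", spark.conf.get("spark.master"))
print("spark.authenticate:", spark.conf.get("spark.authenticate", "false"))
```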

Re: [Spark SQL]: Can't write DataFrame after using explode function on multiple columns.

2020-08-03 Thread Henrique Oliveira
Thank you for both tips, I will definitely try the pandas_udfs. About changing the select operation: it's not possible to have multiple explode functions in the same select; sadly, they must be applied one at a time. On Mon, Aug 3, 2020 at 11:41 AM, Patrick McCarthy <pmccar...@dstillery.com>
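For reference, a minimal sketch (hypothetical column names and data) of exploding several array columns one at a time with withColumn, since Spark allows only one generator per select clause:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode

spark = SparkSession.builder.getOrCreate()

# Hypothetical input: two array columns that both need to be exploded.
df = spark.createDataFrame(
    [(1, ["a", "b"], [10, 20])],
    ["id", "letters", "numbers"],
)

# Only one generator (explode) is allowed per select, so each column is
# exploded in its own projection; every step multiplies the row count.
exploded = (
    df.withColumn("letter", explode("letters"))
      .withColumn("number", explode("numbers"))
      .drop("letters", "numbers")
)
exploded.show()
```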

Re: [Spark SQL]: Can't write DataFrame after using explode function on multiple columns.

2020-08-03 Thread Patrick McCarthy
If you use pandas_udfs in 2.4 they should be quite performant (or at least won't suffer serialization overhead), so it might be worth looking into. I didn't run your code, but one consideration is that the while loop might be making the DAG a lot bigger than it has to be. You might see if defining those
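As an illustration of the pandas_udf suggestion, a minimal sketch (hypothetical data and transformation) of a Spark 2.4-style scalar pandas UDF, which processes whole pandas Series over Arrow rather than one row at a time:

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, PandasUDFType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 2.0), (2, 3.5)], ["id", "value"])  # hypothetical data

# Spark 2.4-style scalar pandas UDF: receives and returns a pandas Series,
# avoiding the per-row (de)serialization cost of a plain Python UDF.
@pandas_udf("double", PandasUDFType.SCALAR)
def plus_one(v: pd.Series) -> pd.Series:
    return v + 1.0

df.withColumn("value_plus_one", plus_one("value")).show()
```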

Re: [Spark SQL]: Can't write DataFrame after using explode function on multiple columns.

2020-08-03 Thread Henrique Oliveira
Hi Patrick, thank you for your quick response. That's exactly what I think. Actually, the result of this processing is an intermediate table that is going to be used to generate other views. Another approach I'm trying now is to move the "explosion" step into this "view generation" step; this

Re: [Spark SQL]: Can't write DataFrame after using explode function on multiple columns.

2020-08-03 Thread Patrick McCarthy
This seems like a very expensive operation. Why do you want to write out all the exploded values? If you just want all combinations of values, could you instead do it at read time with a UDF or something? On Sat, Aug 1, 2020 at 8:34 PM hesouol wrote: > I forgot to add some information. By "can't
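One way to read the UDF suggestion, sketched with hypothetical columns and a made-up helper: build the combinations of the array columns inside a single UDF and explode the result once, rather than exploding and materializing each column separately.

```python
from itertools import product

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, udf
from pyspark.sql.types import ArrayType, StringType, StructField, StructType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, ["a", "b"], ["x", "y"])], ["id", "c1", "c2"])  # hypothetical

combo_schema = ArrayType(StructType([
    StructField("c1", StringType()),
    StructField("c2", StringType()),
]))

# Hypothetical helper: returns every (c1, c2) pair so a single explode yields
# all combinations, instead of chaining one explode per column.
@udf(combo_schema)
def combos(a, b):
    return [(x, y) for x, y in product(a or [], b or [])]

(df.select("id", explode(combos("c1", "c2")).alias("combo"))
   .select("id", "combo.c1", "combo.c2")
   .show())
```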

Re: DataSource API v2 & Spark-SQL

2020-08-03 Thread Russell Spitzer
That's a bad error message. Basically, you can't make a Spark native catalog reference for a dsv2 source. You have to use that DataSource's catalog or use the programmatic API. Both dsv1 and dsv2 programmatic APIs work (plus or minus some options). On Mon, Aug 3, 2020, 7:28 AM Lavelle, Shawn wrote:
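As an illustration of the programmatic route Russell describes, a minimal sketch (the provider class, option, and view name are hypothetical) of loading a custom source through the DataFrameReader API and exposing it to SQL via a temp view instead of the native catalog:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Programmatic API: load the custom source directly through DataFrameReader
# rather than registering it with spark.catalog.createTable, which the thread
# reports failing for DSv2 sources.
df = (
    spark.read.format("com.example.customsource")   # hypothetical provider class
         .option("endpoint", "https://example.com")  # hypothetical option
         .load()
)

df.createOrReplaceTempView("custom_source_view")
spark.sql("SELECT * FROM custom_source_view LIMIT 10").show()
```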

DataSource API v2 & Spark-SQL

2020-08-03 Thread Lavelle, Shawn
Hello Spark community, I have a custom datasource in v1 API that I'm trying to port to v2 API, in Java. Currently I have a DataSource registered via catalog.createTable(name, , schema, options map). When trying to do this in data source API v2, I get an error saying my class (package)

What is an "analytics engine"?

2020-08-03 Thread Boris Gershfield
Hi, I'm new to Apache Spark and am trying to write an essay about Big Data platforms. On the Apache Spark homepage we are told that "Apache Spark™ is a unified analytics engine for large-scale data processing". I don't fully understand the meaning of "engine", nor can I find a standard