Observable Metrics on Spark Datasets

2021-03-15 Thread Enrico Minack

Hi Spark-Devs,

the observable metrics that have been added to the Dataset API in 3.0.0 
are a great improvement over the Accumulator APIs that seem to have much 
weaker guarantees. I have two questions regarding follow-up contributions:


*1. Add observe to Python DataFrame*

As far as I can see from the master branch, there is no equivalent in the Python 
API. Is this something that is planned to happen, or is it missing because there 
are no listeners in PySpark, which would render this method useless in 
Python? I would be happy to contribute here.


*2. Add Observation class to simplify result access*

The Dataset.observe method requires users to register listeners 
with QueryExecutionListener or StreamingQueryListener to obtain the 
result. I think for simple setups, this could be hidden behind a common 
helper class, here called Observation, which reduces the usage of observe 
to three lines of code:


// our Dataset (this does not count as a line of code)
val df = Seq((1, "a"), (2, "b"), (4, "c"), (8, "d")).toDF("id", "value")

// define the observation we want to make
val observation = Observation("stats", count($"id"), sum($"id"))

// add the observation to the Dataset and execute an action on it
val cnt = df.observe(observation).count()

// retrieve the result
assert(observation.get === Row(4, 15))

The Observation class can handle the registration and de-registration of 
the listener, as well as properly accessing the result across thread 
boundaries.
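
For illustration, a minimal sketch of how such an Observation helper could work 
on top of the existing listener API. The method name on, the Promise-based 
hand-off, and other details are assumptions of this sketch, not part of the 
proposal; the df.observe(observation) call shown above would additionally need 
a thin Dataset overload.

import scala.concurrent.{Await, Promise}
import scala.concurrent.duration.Duration
import org.apache.spark.sql.{Column, Dataset, Row, SparkSession}
import org.apache.spark.sql.execution.QueryExecution
import org.apache.spark.sql.util.QueryExecutionListener

case class Observation(name: String, exprs: Column*) {
  private val promise = Promise[Row]()

  // Attach this observation to a Dataset and register a one-shot
  // listener that extracts the named metrics from the finished query.
  def on[T](ds: Dataset[T]): Dataset[T] = {
    register(ds.sparkSession)
    ds.observe(name, exprs.head, exprs.tail: _*)
  }

  private def register(spark: SparkSession): Unit = {
    val listener = new QueryExecutionListener {
      override def onSuccess(funcName: String, qe: QueryExecution, durationNs: Long): Unit =
        qe.observedMetrics.get(name).foreach { row =>
          promise.trySuccess(row)                 // publish the result exactly once
          spark.listenerManager.unregister(this)  // de-register after completion
        }
      override def onFailure(funcName: String, qe: QueryExecution, exception: Exception): Unit = ()
    }
    spark.listenerManager.register(listener)
  }

  // The listener bus is asynchronous, so get may block briefly after the
  // action returns; the Promise bridges the thread boundary safely.
  def get: Row = Await.result(promise.future, Duration.Inf)
}

With this sketch, the usage above becomes observation.on(df).count() followed 
by observation.get.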


With *2.*, the observe method can be added to PySpark without 
introducing listeners there at all. All the logic happens in the JVM.


Thanks for your thoughts on this.

Regards,
Enrico



Re: [DISCUSS] Support pandas API layer on PySpark

2021-03-15 Thread Maciej
I concur. These two don't have the same target audience or
expressiveness. I cannot imagine most of the PySpark projects I've seen
switching to a Pandas-style API.

If this is to be included, it would be great if we could model it
similarly to SQLAlchemy, with its core and ORM components being equally
important parts of the API.

On 3/15/21 7:12 AM, Reynold Xin wrote:
> I don't think we should deprecate existing APIs.
>
> Spark's own Python API is relatively stable and not difficult to
> support. It has a pretty large number of users and existing code. Also
> pretty easy to learn by data engineers.
>
> pandas API is great for data science, but isn't that great for some
> other tasks. It's super wide. Great for data scientists that have
> learned it, or great for copy-paste from Stack Overflow.
>
>
>
> On Sun, Mar 14, 2021 at 11:08 PM, Dongjoon Hyun
> <dongjoon.h...@gmail.com> wrote:
>
> Thank you for the proposal. It looks like a good addition.
> BTW, what is the future plan for the existing APIs?
> Are we going to deprecate it eventually in favor of Koalas
> (because we don't remove the existing APIs in general)?
>
> > Fourthly, PySpark is still not Pythonic enough. For example, I
> > hear complaints such as "why does PySpark follow pascalCase?" or
> > "PySpark APIs are difficult to learn", and APIs are very difficult
> > to change in Spark (as I emphasized above).
>
>
> On Sun, Mar 14, 2021 at 4:03 AM Hyukjin Kwon <gurwls...@gmail.com> wrote:
>
> Firstly, my biggest reason is that I would like to promote this
> more as built-in support because it is simply
> important to have it, given the impact on the large user group,
> and the needs are increasing
> as the charts indicate. I usually think that features or
> add-ons stay as third parties when they are rather for a
> smaller set of users, address a corner case of needs,
> etc. I think this is similar to the datasources
> we have added. Spark ported CSV and Avro because more and more
> people used them, and it became important
> to have them as built-in support.
>
> Secondly, Koalas needs more help from Spark, PySpark, Python
> and pandas experts from the
> bigger community. Koalas’ team aren’t experts in all the areas,
> and there are many missing corner
> cases to fix; some require deep expertise in specific areas.
>
> One example is the type hints. Koalas uses type hints for
> schema inference.
> Due to limitations in Python’s type hinting, Koalas added its
> own (hacky) way.
> Fortunately, the way Koalas implemented it is now partially
> proposed into Python officially (PEP 646).
> But Koalas could have done better by interacting with the
> Python community more and actively
> joining in the design discussions to lead to the best outcome
> that benefits both and more projects.
>
> Thirdly, I would like to contribute to the growth of PySpark.
> The growth of Koalas is very fast given the
> internal and external stats. The number of users has roughly
> doubled every 4 ~ 6 months.
> I think Koalas will be good momentum to keep Spark up.
>
> Fourthly, PySpark is still not Pythonic enough. For example, I
> hear complaints such as "why does
> PySpark follow pascalCase?" or "PySpark APIs are difficult to
> learn", and APIs are very difficult to change
> in Spark (as I emphasized above). This set of Koalas APIs will
> be able to address these concerns
> in PySpark.
>
> Lastly, I really think PySpark needs its native plotting
> features. As I emphasized before with
> elaboration, I do think this is an important feature missing
> in PySpark that users need.
> I do think Koalas completes what PySpark is currently missing.
>
>
>
> On Sun, Mar 14, 2021 at 7:12 PM Sean Owen wrote:
>
> I like koalas a lot. Playing devil's advocate, why not
> just let it continue to live as an add-on? Usually the
> argument is that it'll be maintained better in Spark, but it's
> well maintained. It adds some overhead to maintaining
> Spark, conversely. On the upside, it makes it a little more
> discoverable. Are there more 'synergies'?
>
> On Sat, Mar 13, 2021, 7:57 PM Hyukjin Kwon
> <gurwls...@gmail.com> wrote:
>
> Hi all,
>
>
> I would like to start the discussion on supporting
> pandas API layer on Spark.
>
>  
>
> If we have a general consensus on having it in ...

Re: [DISCUSS] Support pandas API layer on PySpark

2021-03-15 Thread Nicholas Chammas
On Mon, Mar 15, 2021 at 2:12 AM Reynold Xin  wrote:

> I don't think we should deprecate existing APIs.
>

+1

I strongly prefer Spark's immutable DataFrame API to the Pandas API. I
could be wrong, but I wager most people who have worked with both Spark and
Pandas feel the same way.
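
As a concrete, hypothetical illustration of that difference (the names df, x,
and y are made up): Spark transformations return a new, immutable DataFrame,
whereas idiomatic pandas mutates the frame in place.

// Spark never mutates df: withColumn returns a new DataFrame,
// unlike pandas' in-place assignment df["y"] = df["x"] + 1.
val df2 = df.withColumn("y", $"x" + 1)
// df is unchanged; df2 carries the extra column.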

For the large community of current PySpark users, or users switching to
PySpark from another Spark language API, it doesn't make sense to deprecate
the current API, even by convention.


Re: [DISCUSS] Support pandas API layer on PySpark

2021-03-15 Thread Ismaël Mejía
+1

Bringing a Pandas API for PySpark into upstream Spark will only bring
benefits for everyone (more eyes to use/see/fix/improve the API), as
well as better alignment with core Spark improvements; the extra
weight looks manageable.

On Mon, Mar 15, 2021 at 4:45 PM Nicholas Chammas
 wrote:
>
> On Mon, Mar 15, 2021 at 2:12 AM Reynold Xin  wrote:
>>
>> I don't think we should deprecate existing APIs.
>
>
> +1
>
> I strongly prefer Spark's immutable DataFrame API to the Pandas API. I could 
> be wrong, but I wager most people who have worked with both Spark and Pandas 
> feel the same way.
>
> For the large community of current PySpark users, or users switching to 
> PySpark from another Spark language API, it doesn't make sense to deprecate 
> the current API, even by convention.




Re: [VOTE] SPIP: Add FunctionCatalog

2021-03-15 Thread Ryan Blue
And a late +1 from me.

On Fri, Mar 12, 2021 at 5:46 AM Takeshi Yamamuro 
wrote:

> +1, too.
>
> On Fri, Mar 12, 2021 at 8:51 PM kordex  wrote:
>
>> +1 (for what it's worth). It will definitely help our efforts.
>>
>> On Fri, Mar 12, 2021 at 12:14 PM Gengliang Wang  wrote:
>> >
>> > +1 (non-binding)
>> >
>> > On Fri, Mar 12, 2021 at 3:00 PM Hyukjin Kwon 
>> wrote:
>> >>
>> >> +1
>> >>
>> >> On Fri, Mar 12, 2021 at 2:54 PM Jungtaek Lim wrote:
>> >>>
>> >>> +1 (non-binding) Excellent description on SPIP doc! Thanks for the
>> amazing effort!
>> >>>
>> >>> On Wed, Mar 10, 2021 at 3:19 AM Liang-Chi Hsieh 
>> wrote:
>> 
>> 
>>  +1 (non-binding).
>> 
>>  Thanks for the work!
>> 
>> 
>>  Erik Krogen wrote
>>  > +1 from me (non-binding)
>>  >
>>  > On Tue, Mar 9, 2021 at 9:27 AM huaxin gao <huaxin.gao11@...> wrote:
>>  >
>>  >> +1 (non-binding)
>> 
>> 
>> 
>> 
>> 
>
> --
> ---
> Takeshi Yamamuro
>


-- 
Ryan Blue
Software Engineer
Netflix


[RESULT] [VOTE] SPIP: Add FunctionCatalog

2021-03-15 Thread Ryan Blue
This SPIP is adopted with the following +1 votes and no -1 or +0 votes:

Holden Karau*
John Zhuge
Chao Sun
Dongjoon Hyun*
Russell Spitzer
DB Tsai*
Wenchen Fan*
Kent Yao
Huaxin Gao
Liang-Chi Hsieh
Jungtaek Lim
Hyukjin Kwon*
Gengliang Wang
kordex
Takeshi Yamamuro
Ryan Blue

* = binding

On Mon, Mar 8, 2021 at 3:55 PM Ryan Blue  wrote:

> Hi everyone, I’d like to start a vote for the FunctionCatalog design
> proposal (SPIP).
>
> The proposal is to add a FunctionCatalog interface that can be used to
> load and list functions for Spark to call. There are interfaces for scalar
> and aggregate functions.
>
> In the discussion we’ve come to consensus and I’ve updated the design doc
> to match how functions will be called:
>
> In addition to produceResult(InternalRow), which is optional, functions
> can define produceResult methods with arguments that are Spark’s internal
> data types, like UTF8String. Spark will prefer these methods when calling
> the UDF using codegen.
>
> I’ve also updated the AggregateFunction interface and merged it with the
> partial aggregate interface because Spark doesn’t support non-partial
> aggregates.
>
> The full SPIP doc is here:
> https://docs.google.com/document/d/1PLBieHIlxZjmoUB0ERF-VozCRJ0xw2j3qKvUNWpWA2U/edit#heading=h.82w8qxfl2uwl
>
> Please vote on the SPIP in the next 72 hours. Once it is approved, I’ll do
> a final update of the PR and we can merge the API.
>
> [ ] +1: Accept the proposal as an official SPIP
> [ ] +0
> [ ] -1: I don’t think this is a good idea because …
> --
> Ryan Blue
>


-- 
Ryan Blue
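
To make the calling convention above concrete, here is a sketch of a scalar
function exposing both forms. It follows the description in the vote email
and the SPIP discussion; names and import paths are assumptions that may
differ from the final merged API.

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.connector.catalog.functions.ScalarFunction
import org.apache.spark.sql.types.{DataType, IntegerType, StringType}
import org.apache.spark.unsafe.types.UTF8String

// Sketch only: a string-length UDF.
class StrLen extends ScalarFunction[Int] {
  override def name(): String = "strlen"
  override def inputTypes(): Array[DataType] = Array(StringType)
  override def resultType(): DataType = IntegerType

  // Row-based form (described above as optional).
  override def produceResult(input: InternalRow): Int =
    input.getUTF8String(0).numChars()

  // Specialized form taking Spark's internal string type; per the email,
  // Spark prefers a matching typed method when calling the UDF via codegen.
  def produceResult(str: UTF8String): Int = str.numChars()
}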


Re: Apache Spark 2.4.8 (and EOL of 2.4)

2021-03-15 Thread Liang-Chi Hsieh
To update with current status.

There are three tickets targeting 2.4 that are still ongoing.

SPARK-34719: Correctly resolve the view query with duplicated column names
SPARK-34607: Add `Utils.isMemberClass` to fix a malformed class name error
on jdk8u
SPARK-34726: Fix collectToPython timeouts

SPARK-34719 doesn't have PR for 2.4 yet.

SPARK-34607 and SPARK-34726 are under review. SPARK-34726 is a bit arguable
as it involves a behavior change, even though it is a very rare case. Any
suggestions on the PR are welcome. Thanks.



Dongjoon Hyun-2 wrote
> Thank you for the update.
> 
> +1 for your plan.
> 
> Bests,
> Dongjoon.








Re: Observable Metrics on Spark Datasets

2021-03-15 Thread Jungtaek Lim
If I remember correctly, the major audience of the "observe" API is
Structured Streaming, micro-batch mode. From the example, the abstraction
in 2 isn't something that works with Structured Streaming. It could still be
done with a callback, but the question remains how much complexity is
hidden by the abstraction.

I see you're focusing on PySpark - I'm not sure whether there was an intention
not to expose the query listener / streaming query listener to PySpark, but
if there's some valid reason for that, I'm not sure we would like to expose
them to PySpark in any way. 2 doesn't make sense to me with PySpark - how
do you ensure all the logic happens in the JVM while these values can still
be read from PySpark?
(I see there's support for listeners with DStream in PySpark, so there
might be reasons not to add the same for SQL/SS. Probably a lesson learned?)
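
For reference, a sketch of how observed metrics surface on the streaming side
today, once per micro-batch through the progress event (assuming an observation
named "stats" and a spark session in scope):

import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

spark.streams.addListener(new StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = ()
  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    // observedMetrics maps each observation name to its result Row
    val stats = event.progress.observedMetrics.get("stats")
    if (stats != null) println(s"batch ${event.progress.batchId}: $stats")
  }
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
})

A one-shot Observation.get does not fit this model; a per-batch callback or a
queue of results would be needed instead.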


On Mon, Mar 15, 2021 at 6:59 PM Enrico Minack 
wrote:

> Hi Spark-Devs,
>
> the observable metrics that have been added to the Dataset API in 3.0.0
> are a great improvement over the Accumulator APIs that seem to have much
> weaker guarantees. I have two questions regarding follow-up contributions:
>
> *1. Add observe to Python DataFrame*
>
> As far as I can see from the master branch, there is no equivalent in the Python
> API. Is this something that is planned to happen, or is it missing because there
> are no listeners in PySpark, which would render this method useless in Python?
> I would be happy to contribute here.
>
>
> *2. Add Observation class to simplify result access*
>
> The Dataset.observe method requires users to register listeners
> with QueryExecutionListener or StreamingQueryListener to obtain the
> result. I think for simple setups, this could be hidden behind a common
> helper class, here called Observation, which reduces the usage of observe
> to three lines of code:
>
> // our Dataset (this does not count as a line of code)
> val df = Seq((1, "a"), (2, "b"), (4, "c"), (8, "d")).toDF("id", "value")
>
> // define the observation we want to make
> val observation = Observation("stats", count($"id"), sum($"id"))
>
> // add the observation to the Dataset and execute an action on it
> val cnt = df.observe(observation).count()
>
> // retrieve the result
> assert(observation.get === Row(4, 15))
>
> The Observation class can handle the registration and de-registration of
> the listener, as well as properly accessing the result across thread
> boundaries.
>
> With *2.*, the observe method can be added to PySpark without introducing
> listeners there at all. All the logic happens in the JVM.
>
> Thanks for your thoughts on this.
>
> Regards,
> Enrico
>


Re: Apache Spark 2.4.8 (and EOL of 2.4)

2021-03-15 Thread Takeshi Yamamuro
Hi, viirya

I'm now looking into "SPARK-34607: Add `Utils.isMemberClass` to fix a
malformed class name error on jdk8u".

Bests,
Takeshi

On Tue, Mar 16, 2021 at 4:45 AM Liang-Chi Hsieh  wrote:

> To update with current status.
>
> There are three tickets targeting 2.4 that are still ongoing.
>
> SPARK-34719: Correctly resolve the view query with duplicated column names
> SPARK-34607: Add `Utils.isMemberClass` to fix a malformed class name error
> on jdk8u
> SPARK-34726: Fix collectToPython timeouts
>
> SPARK-34719 doesn't have PR for 2.4 yet.
>
> SPARK-34607 and SPARK-34726 are under review. SPARK-34726 is a bit arguable
> as it involves a behavior change, even though it is a very rare case. Any
> suggestions on the PR are welcome. Thanks.
>
>
>
> Dongjoon Hyun-2 wrote
> > Thank you for the update.
> >
> > +1 for your plan.
> >
> > Bests,
> > Dongjoon.
>
>
>
>
>

-- 
---
Takeshi Yamamuro


Re: Apache Spark 2.4.8 (and EOL of 2.4)

2021-03-15 Thread Liang-Chi Hsieh
Thank you so much, Takeshi!


Takeshi Yamamuro wrote
> Hi, viirya
> 
> I'm now looking into "SPARK-34607: Add `Utils.isMemberClass` to fix a
> malformed class name error on jdk8u".
> 
> Bests,
> Takeshi
> 
> 
> Takeshi Yamamuro




