RE: [VOTE] SPIP: Row-level operations in Data Source V2

2021-11-12 Thread Liang-Chi Hsieh
I’d vote my +1 first.

On 2021/11/13 02:25:05 "L. C. Hsieh" wrote:
> Hi all,
> 
> I’d like to start a vote for SPIP: Row-level operations in Data Source V2.
> 
> The proposal is to add support for executing row-level operations
> such as DELETE, UPDATE, MERGE for v2 tables (SPARK-35801). The
> execution should be the same across data sources and the best way to do
> that is to implement it in Spark.
> 
> Right now, Spark can only parse and to some extent analyze DELETE, UPDATE,
> MERGE commands. Data sources that support row-level changes have to build
> custom Spark extensions to execute such statements. The goal of this effort
> is to come up with a flexible and easy-to-use API that will work across
> data sources.
> 
> Please also refer to:
> 
> - Previous discussion in the dev mailing list: [DISCUSS] SPIP: Row-level
>   operations in Data Source V2
> - JIRA: SPARK-35801
> - PR for handling DELETE statements
> - Design doc
> 
> Please vote on the SPIP for the next 72 hours:
> 
> [ ] +1: Accept the proposal as an official SPIP
> [ ] +0
> [ ] -1: I don’t think this is a good idea because …
> 
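
For illustration, these are the kinds of statements the SPIP above targets. A
minimal sketch against a hypothetical v2 table (catalog, table, and column
names are made up, and `updates` is assumed to be a registered view of changes;
Spark can already parse these statements, the SPIP is about executing them
through a common DSv2 API):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("row-level-ops-sketch").getOrCreate()

    // Delete rows matching a predicate from a v2 table.
    spark.sql("DELETE FROM my_catalog.db.events WHERE event_date < '2020-01-01'")

    // Update rows in place.
    spark.sql("UPDATE my_catalog.db.events SET status = 'archived' WHERE status = 'stale'")

    // Merge a source of changes into the target table.
    spark.sql("""
      MERGE INTO my_catalog.db.events t
      USING updates s
      ON t.id = s.id
      WHEN MATCHED THEN UPDATE SET *
      WHEN NOT MATCHED THEN INSERT *
    """)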



[VOTE][RESULT] SPIP: Storage Partitioned Join for Data Source V2

2021-11-02 Thread Liang-Chi Hsieh
Hi all,

The vote passed with the following 9 +1 votes and no -1 or +0 votes:
Liang-Chi Hsieh*
Russell Spitzer
Dongjoon Hyun*
Huaxin Gao
Ryan Blue
DB Tsai*
Holden Karau*
Cheng Su
Wenchen Fan*

* = binding

Thank you all for your feedback and votes.




Re: [VOTE] Release Spark 3.0.3 (RC1)

2021-06-18 Thread Liang-Chi Hsieh
+1. Docs look good. Binary looks good.

Ran a simple test and some TPC-DS queries.

Thanks for working on this!


wuyi wrote
> Please vote on releasing the following candidate as Apache Spark version
> 3.0.3.
> 
> The vote is open until Jun 21st 3AM (PST) and passes if a majority +1 PMC
> votes are cast, with
> a minimum of 3 +1 votes.
> 
> [ ] +1 Release this package as Apache Spark 3.0.3
> [ ] -1 Do not release this package because ...
> 
> To learn more about Apache Spark, please see https://spark.apache.org/
> 
> The tag to be voted on is v3.0.3-rc1 (commit
> 65ac1e75dc468f53fc778cd2ce1ba3f21067aab8):
> https://github.com/apache/spark/tree/v3.0.3-rc1
> 
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.0.3-rc1-bin/
> 
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
> 
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1386/
> 
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.0.3-rc1-docs/
> 
> The list of bug fixes going into 3.0.3 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12349723
> 
> This release is using the release script of the tag v3.0.3-rc1.
> 
> FAQ
> 
> =
> How can I help test this release?
> =
> 
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
> 
> If you're working in PySpark you can set up a virtual env and install
> the current RC and see if anything important breaks, in the Java/Scala
> you can add the staging repository to your project's resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with an out of date RC going forward).
> 
> ===
> What should happen to JIRA tickets still targeting 3.0.3?
> ===
> 
> The current list of open tickets targeted at 3.0.3 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 3.0.3
> 
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
> 
> ==
> But my bug isn't fixed?
> ==
> 
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.
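
As a concrete illustration of the Java/Scala testing path described in the FAQ
above, a minimal build.sbt sketch that resolves this RC from the staging
repository (the Scala version and dependency coordinates are assumptions;
adjust them to your project):

    // build.sbt -- compile and test a project against the Spark 3.0.3 RC1 staged artifacts.
    ThisBuild / scalaVersion := "2.12.10"

    // Staging repository from the vote e-mail above.
    resolvers += "Apache Spark 3.0.3 RC1 staging" at
      "https://repository.apache.org/content/repositories/orgapachespark-1386/"

    // The RC is versioned as plain 3.0.3 in the staging repository.
    libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.0.3"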








Re: Apache Spark 3.2 Expectation

2021-06-16 Thread Liang-Chi Hsieh


Thanks Dongjoon. I've talked with Dongjoon offline to learn more about this.
As it is a soft cut date, there is no reason to postpone it.

It sounds good then to keep the original branch cut date.

Thank you.



Dongjoon Hyun-2 wrote
> Thank you for volunteering, Gengliang.
> 
> Apache Spark 3.2.0 is the first version enabling AQE by default. I'm also
> watching some on-going improvements on that.
> 
> https://issues.apache.org/jira/browse/SPARK-33828 (SQL Adaptive Query
> Execution QA)
> 
> To Liang-Chi, I'm -1 for postponing the branch cut because this is a soft
> cut and the committers still are able to commit to `branch-3.2` according
> to their decisions.
> 
> Given that Apache Spark had 115 commits in a week in various areas
> concurrently, we should start QA for Apache Spark 3.2 by creating
> branch-3.2 and allowing only limited backporting.
> 
> https://github.com/apache/spark/graphs/commit-activity
> 
> Bests,
> Dongjoon.
> 
> 
> On Wed, Jun 16, 2021 at 9:19 AM Liang-Chi Hsieh <viirya@...> wrote:
> 
>> First, thanks for volunteering as the release manager of Spark 3.2.0,
>> Gengliang!
>>
>> And yes, for the two important Structured Streaming features, RocksDB
>> StateStore and session window, we're working on them and expect to have
>> them
>> in the new release.
>>
>> So I propose to postpone the branch cut date.
>>
>> Thank you!
>>
>> Liang-Chi
>>
>>
>> Gengliang Wang-2 wrote
>> > Thanks, Hyukjin.
>> >
>> > The expected target branch cut date of Spark 3.2 is *July 1st* on
>> > https://spark.apache.org/versioning-policy.html. However, I notice that
>> > there are still multiple important projects in progress now:
>> >
>> > [Core]
>> >
>> > - SPIP: Support push-based shuffle to improve shuffle efficiency
>> >   https://issues.apache.org/jira/browse/SPARK-30602
>> >
>> > [SQL]
>> >
>> > - Support ANSI SQL INTERVAL types
>> >   https://issues.apache.org/jira/browse/SPARK-27790
>> > - Support Timestamp without time zone data type
>> >   https://issues.apache.org/jira/browse/SPARK-35662
>> > - Aggregate (Min/Max/Count) push down for Parquet
>> >   https://issues.apache.org/jira/browse/SPARK-34952
>> >
>> > [Streaming]
>> >
>> > - EventTime based sessionization (session window)
>> >   https://issues.apache.org/jira/browse/SPARK-10816
>> > - Add RocksDB StateStore as external module
>> >   https://issues.apache.org/jira/browse/SPARK-34198
>> >
>> >
>> > I wonder whether we should postpone the branch cut date.
>> > cc Min Shen, Yi Wu, Max Gekk, Huaxin Gao, Jungtaek Lim, Yuanjian
>> > Li, Liang-Chi Hsieh, who work on the projects above.
>> >
>> > On Tue, Jun 15, 2021 at 4:34 PM Hyukjin Kwon <gurwls223@...> wrote:
>> >
>> >> +1, thanks.
>> >>
>> >> On Tue, 15 Jun 2021, 16:17 Gengliang Wang, <ltnwgl@...> wrote:
>> >>
>> >>> Hi,
>> >>>
>> >>> As the expected release date is close,  I would like to volunteer as
>> the
>> >>> release manager for Apache Spark 3.2.0.
>> >>>
>> >>> Thanks,
>> >>> Gengliang
>> >>>
>>
>>
>>
>>
>>








Re: Apache Spark 3.2 Expectation

2021-06-16 Thread Liang-Chi Hsieh
First, thanks for volunteering as the release manager of Spark 3.2.0,
Gengliang!

And yes, for the two important Structured Streaming features, RocksDB
StateStore and session window, we're working on them and expect to have them
in the new release.

So I propose to postpone the branch cut date.

Thank you!

Liang-Chi


Gengliang Wang-2 wrote
> Thanks, Hyukjin.
> 
> The expected target branch cut date of Spark 3.2 is *July 1st* on
> https://spark.apache.org/versioning-policy.html. However, I notice that
> there are still multiple important projects in progress now:
> 
> [Core]
> 
> - SPIP: Support push-based shuffle to improve shuffle efficiency
>   https://issues.apache.org/jira/browse/SPARK-30602
> 
> [SQL]
> 
> - Support ANSI SQL INTERVAL types
>   https://issues.apache.org/jira/browse/SPARK-27790
> - Support Timestamp without time zone data type
>   https://issues.apache.org/jira/browse/SPARK-35662
> - Aggregate (Min/Max/Count) push down for Parquet
>   https://issues.apache.org/jira/browse/SPARK-34952
> 
> [Streaming]
> 
> - EventTime based sessionization (session window)
>   https://issues.apache.org/jira/browse/SPARK-10816
> - Add RocksDB StateStore as external module
>   https://issues.apache.org/jira/browse/SPARK-34198
> 
> 
> I wonder whether we should postpone the branch cut date.
> cc Min Shen, Yi Wu, Max Gekk, Huaxin Gao, Jungtaek Lim, Yuanjian
> Li, Liang-Chi Hsieh, who work on the projects above.
> 
> On Tue, Jun 15, 2021 at 4:34 PM Hyukjin Kwon <gurwls223@...> wrote:
> 
>> +1, thanks.
>>
>> On Tue, 15 Jun 2021, 16:17 Gengliang Wang, <ltnwgl@...> wrote:
>>
>>> Hi,
>>>
>>> As the expected release date is close,  I would like to volunteer as the
>>> release manager for Apache Spark 3.2.0.
>>>
>>> Thanks,
>>> Gengliang
>>>








Re: Apache Spark 3.0.3 Release?

2021-06-08 Thread Liang-Chi Hsieh


+1. Thank you!

Liang-Chi


Dongjoon Hyun-2 wrote
> +1, Thank you! :)
> 
> Bests,
> Dongjoon.
> 
> On Tue, Jun 8, 2021 at 9:05 PM Kent Yao <yaooqinn@...> wrote:
> 
>> +1. Thanks, Yi ~
>>
>> Bests,
>> *Kent Yao *
>> @ Data Science Center, Hangzhou Research Institute, NetEase Corp.
>> *a spark enthusiast*
>> *kyuubi* (https://github.com/yaooqinn/kyuubi) is a unified multi-tenant JDBC
>> interface for large-scale data processing and analytics, built on top of
>> Apache Spark (http://spark.apache.org/).
>> *spark-authorizer* (https://github.com/yaooqinn/spark-authorizer) is a Spark
>> SQL extension which provides SQL Standard Authorization for Apache Spark.
>> *spark-postgres* (https://github.com/yaooqinn/spark-postgres) is a library for
>> reading data from and transferring data to Postgres / Greenplum with Spark
>> SQL and DataFrames, 10~100x faster.
>> *itatchi* (https://github.com/yaooqinn/spark-func-extras) is a library that
>> brings useful functions from various modern database management systems to
>> Apache Spark.
>>
>>
>>
>> On 06/9/2021 11:54, Takeshi Yamamuro <linguin.m.s@...> wrote:
>>
>> +1. Thank you, Yi ~
>>
>> Bests,
>> Takeshi
>>
>> On Wed, Jun 9, 2021 at 12:18 PM Mridul Muralidharan <mridul@...> wrote:
>>
>>>
>>> +1
>>>
>>> Regards,
>>> Mridul








Re: [ANNOUNCE] Apache Spark 3.1.2 released

2021-06-01 Thread Liang-Chi Hsieh
Thank you, Dongjoon!



Takeshi Yamamuro wrote
> Thank you, Dongjoon!
> 
> On Wed, Jun 2, 2021 at 2:29 PM Xiao Li <lixiao@...> wrote:
> 
>> Thank you!
>>
>> Xiao
>>
>> On Tue, Jun 1, 2021 at 9:29 PM Hyukjin Kwon <gurwls223@...> wrote:
>>
>>> awesome!
>>>
>>> On Wed, Jun 2, 2021 at 9:59 AM Dongjoon Hyun <dongjoon.hyun@...> wrote:
>>>
 We are happy to announce the availability of Spark 3.1.2!

 Spark 3.1.2 is a maintenance release containing stability fixes. This
 release is based on the branch-3.1 maintenance branch of Spark. We
 strongly
 recommend all 3.1 users to upgrade to this stable release.

 To download Spark 3.1.2, head over to the download page:
 https://spark.apache.org/downloads.html

 To view the release notes:
 https://spark.apache.org/releases/spark-release-3-1-2.html

 We would like to acknowledge all community members for contributing to
 this
 release. This release would not have been possible without you.

 Dongjoon Hyun

>>>
>>
>> --
>>
>>
> 
> -- 
> ---
> Takeshi Yamamuro








Re: [VOTE] Release Spark 3.1.2 (RC1)

2021-05-25 Thread Liang-Chi Hsieh
+1 (non-binding)

Binary and docs look good. JIRA tickets look good. Ran simple tasks.

Thank you, Dongjoon!


Hyukjin Kwon wrote
> +1
> 
> On Wed, May 26, 2021 at 9:00 AM Cheng Su <chengsu@...> wrote:
> 
>> +1 (non-binding)
>>
>>
>>
>> Checked the related commits in commit history manually.
>>
>>
>>
>> Thanks!
>>
>> Cheng Su
>>
>>
>>
>> *From:* Takeshi Yamamuro <linguin.m.s@...>
>> *Date:* Tuesday, May 25, 2021 at 4:47 PM
>> *To:* Dongjoon Hyun <dongjoon.hyun@...>, dev <dev@...>
>> *Subject: *Re: [VOTE] Release Spark 3.1.2 (RC1)
>>
>>
>>
>> +1 (non-binding)
>>
>>
>>
>> I ran the tests, checked the related jira tickets, and compared TPCDS
>> performance differences between
>>
>> this v3.1.2 candidate and v3.1.1.
>>
>> Everything looks fine.
>>
>>
>>
>> Thank you, Dongjoon!








Re: Resolves too old JIRAs as incomplete

2021-05-20 Thread Liang-Chi Hsieh
+1 

Thanks Takeshi!


Prashant Sharma wrote
> +1
> 
> On Thu, May 20, 2021 at 7:08 PM Wenchen Fan <cloud0fan@...> wrote:
> 
>> +1
>>
>> On Thu, May 20, 2021 at 11:59 AM Dongjoon Hyun <dongjoon.hyun@...> wrote:
>>
>>> +1.
>>>
>>> Thank you, Takeshi.
>>>
 On Wed, May 19, 2021 at 7:49 PM Hyukjin Kwon <gurwls223@...> wrote:
>>>
 Yeah, I wanted to discuss this. I agree since 2.4.x became EOL









[ANNOUNCE] Apache Spark 2.4.8 released

2021-05-17 Thread Liang-Chi Hsieh
We are happy to announce the availability of Spark 2.4.8!

Spark 2.4.8 is a maintenance release containing stability, correctness, and
security fixes. 
This release is based on the branch-2.4 maintenance branch of Spark. We
strongly recommend all 2.4 users to upgrade to this stable release.

To download Spark 2.4.8, head over to the download page:
http://spark.apache.org/downloads.html

Note that you might need to clear your browser cache or use
`Private`/`Incognito` mode, depending on your browser.

To view the release notes:
https://spark.apache.org/releases/spark-release-2-4-8.html

We would like to acknowledge all community members for contributing to this
release. This release would not have been possible without you.







Re: Apache Spark 3.1.2 Release?

2021-05-17 Thread Liang-Chi Hsieh
+1 sounds good. Thanks Dongjoon for volunteering on this!


Liang-Chi


Dongjoon Hyun-2 wrote
> Hi, All.
> 
> Since Apache Spark 3.1.1 tag creation (Feb 21),
> new 172 patches including 9 correctness patches and 4 K8s patches arrived
> at branch-3.1.
> 
> Shall we make a new release, Apache Spark 3.1.2, as the second release at
> 3.1 line?
> I'd like to volunteer for the release manager for Apache Spark 3.1.2.
> I'm thinking about starting the first RC next week.
> 
> $ git log --oneline v3.1.1..HEAD | wc -l
>  172
> 
> # Known correctness issues
> SPARK-34534 New protocol FetchShuffleBlocks in OneForOneBlockFetcher
> lead to data loss or correctness
> SPARK-34545 PySpark Python UDF return inconsistent results when
> applying 2 UDFs with different return type to 2 columns together
> SPARK-34681 Full outer shuffled hash join when building left side
> produces wrong result
> SPARK-34719 fail if the view query has duplicated column names
> SPARK-34794 Nested higher-order functions broken in DSL
> SPARK-34829 transform_values return identical values when it's used
> with udf that returns reference type
> SPARK-34833 Apply right-padding correctly for correlated subqueries
> SPARK-35381 Fix lambda variable name issues in nested DataFrame
> functions in R APIs
> SPARK-35382 Fix lambda variable name issues in nested DataFrame
> functions in Python APIs
> 
> # Notable K8s patches since K8s GA
> SPARK-34674 Close SparkContext after the Main method has finished
> SPARK-34948 Add ownerReference to executor configmap to fix leakages
> SPARK-34820 add apt-update before gnupg install
> SPARK-34361 In case of downscaling avoid killing of executors already
> known by the scheduler backend in the pod allocator
> 
> Bests,
> Dongjoon.








[VOTE][RESULT] Release Spark 2.4.8 (RC4)

2021-05-14 Thread Liang-Chi Hsieh
The vote passes. Thanks to all who helped with the release!

 (* = binding)
+1:
- Dongjoon Hyun *
- Takeshi Yamamuro
- Maxim Gekk
- John Zhuge
- Hyukjin Kwon *
- Kent Yao
- Sean Owen *
- Kousuke Saruta
- Holden Karau *
- Wenchen Fan *
- Mridul Muralidharan *
- Ismaël Mejía

+0: None

-1: None








Re: [VOTE] Release Spark 2.4.8 (RC4)

2021-05-11 Thread Liang-Chi Hsieh
The staging repository for this release can be accessed now too:
https://repository.apache.org/content/repositories/orgapachespark-1383/

Thanks for the guidance.


Liang-Chi Hsieh wrote
> Seems it is closed now after clicking close button in the UI. 








Re: [VOTE] Release Spark 2.4.8 (RC4)

2021-05-11 Thread Liang-Chi Hsieh
Seems it is closed now after clicking the close button in the UI.


Sean Owen-2 wrote
> Is there a separate process that pushes to maven central? That's what we
> have to have in the end.
> 
> On Tue, May 11, 2021, 12:31 PM Liang-Chi Hsieh <viirya@...> wrote:
> 
>> I don't know what will happen if I manually close it now.
>>
>> Not sure if the current status causes a problem? If not, maybe leave it as
>> it is?
>>
>>
>> Sean Owen-2 wrote
>> > Hm, yes I see it at
>> >
>> http://pool.sks-keyservers.net/pks/lookup?search=0x653c2301fea493ee=on=index
>> > but not on keyserver.ubuntu.com for some reason.
>> > What happens if you try to close it again, perhaps even manually in the
>> UI
>> > there? I don't want to click it unless it messes up the workflow
>>
>>
>>
>>
>>








Re: [VOTE] Release Spark 2.4.8 (RC4)

2021-05-11 Thread Liang-Chi Hsieh
Oh, I see. We cannot do the release on it as it is still in open status.
Okay, let me try to close it manually via the UI.


Sean Owen-2 wrote
> Is there a separate process that pushes to maven central? That's what we
> have to have in the end.
> 
> On Tue, May 11, 2021, 12:31 PM Liang-Chi Hsieh <viirya@...> wrote:
> 
>> I don't know what will happen if I manually close it now.
>>
>> Not sure if the current status causes a problem? If not, maybe leave it as
>> it is?
>>
>>
>> Sean Owen-2 wrote
>> > Hm, yes I see it at
>> >
>> http://pool.sks-keyservers.net/pks/lookup?search=0x653c2301fea493ee=on=index
>> > but not on keyserver.ubuntu.com for some reason.
>> > What happens if you try to close it again, perhaps even manually in the
>> UI
>> > there? I don't want to click it unless it messes up the workflow
>>
>>
>>
>>
>>








Re: [VOTE] Release Spark 2.4.8 (RC4)

2021-05-11 Thread Liang-Chi Hsieh
I don't know what will happen if I manually close it now.

Not sure if the current status causes a problem? If not, maybe leave it as it
is?


Sean Owen-2 wrote
> Hm, yes I see it at
> http://pool.sks-keyservers.net/pks/lookup?search=0x653c2301fea493ee=on=index
> but not on keyserver.ubuntu.com for some reason.
> What happens if you try to close it again, perhaps even manually in the UI
> there? I don't want to click it unless it messes up the workflow








Re: [VOTE] Release Spark 2.4.8 (RC4)

2021-05-11 Thread Liang-Chi Hsieh
I did upload my public key to
https://dist.apache.org/repos/dist/dev/spark/KEYS.
I also uploaded it to a public keyserver before cutting RC1.

I just also tried to search for the public key and can find it.



cloud0fan wrote
> [image: image.png]
> 
> I checked the log in https://repository.apache.org/#stagingRepositories,
> seems the gpg key is not uploaded to the public keyserver. Liang-Chi can
> you take a look?








Re: [VOTE] Release Spark 2.4.8 (RC4)

2021-05-10 Thread Liang-Chi Hsieh
Yeah, I don't know why it happens.

I remember RC1 also had the same issue, but RC2 and RC3 didn't.

Does it affect the RC?


John Zhuge wrote
> Got this error when browsing the staging repository:
> 
> 404 - Repository "orgapachespark-1383 (staging: open)"
> [id=orgapachespark-1383] exists but is not exposed.
>  
> John Zhuge








[VOTE] Release Spark 2.4.8 (RC4)

2021-05-09 Thread Liang-Chi Hsieh
Please vote on releasing the following candidate as Apache Spark version
2.4.8.

The vote is open until May 14th at 9AM PST and passes if a majority +1 PMC
votes are cast, with a minimum of 3 +1 votes.

[ ] +1 Release this package as Apache Spark 2.4.8
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see http://spark.apache.org/

There are currently no issues targeting 2.4.8 (try project = SPARK AND
"Target Version/s" = "2.4.8" AND status in (Open, Reopened, "In Progress"))

The tag to be voted on is v2.4.8-rc4 (commit
163fbd2528a18bf062bddf7b7753631a12a369b5):
https://github.com/apache/spark/tree/v2.4.8-rc4

The release files, including signatures, digests, etc. can be found at:
https://dist.apache.org/repos/dist/dev/spark/v2.4.8-rc4-bin/

Signatures used for Spark RCs can be found in this file:
https://dist.apache.org/repos/dist/dev/spark/KEYS

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1383/

The documentation corresponding to this release can be found at:
https://dist.apache.org/repos/dist/dev/spark/v2.4.8-rc4-docs/

The list of bug fixes going into 2.4.8 can be found at the following URL:
https://s.apache.org/spark-v2.4.8-rc4

This release is using the release script of the tag v2.4.8-rc4.

FAQ


=
How can I help test this release?
=

If you are a Spark user, you can help us test this release by taking
an existing Spark workload and running on this release candidate, then
reporting any regressions.

If you're working in PySpark you can set up a virtual env and install
the current RC and see if anything important breaks, in the Java/Scala
you can add the staging repository to your project's resolvers and test
with the RC (make sure to clean up the artifact cache before/after so
you don't end up building with an out of date RC going forward).

===
What should happen to JIRA tickets still targeting 2.4.8?
===

The current list of open tickets targeted at 2.4.8 can be found at:
https://issues.apache.org/jira/projects/SPARK and search for "Target
Version/s" = 2.4.8

Committers should look at those and triage. Extremely important bug
fixes, documentation, and API tweaks that impact compatibility should
be worked on immediately. Everything else please retarget to an
appropriate release.

==
But my bug isn't fixed?
==

In order to make timely releases, we will typically not hold the
release unless the bug in question is a regression from the previous
release. That being said, if there is something which is a regression
that has not been correctly targeted please ping me or a committer to
help target the issue. 






RE: [VOTE] Release Spark 2.4.8 (RC3)

2021-05-04 Thread Liang-Chi Hsieh
Hi,

Yes, RC3 fails due to an ancient bug found during this RC.
I will cut RC4 soon.

Thank you.


Nicholas Marion wrote
> Hi,
> 
> Was it decided to fail RC3 in favor of RC4?
> 
> Regards,
> 
> Nicholas T. Marion
> AI and Analytics Development Lead | IzODA CPO, IBM
> Phone: 1-845-433-5010 | Tie-Line: 293-5010
> E-mail: nmarion@.ibm
> LinkedIn: http://www.linkedin.com/in/nicholasmarion
> 2455 South Rd, Poughkeepsie, New York 12601-5400, United States
> 
> From: Liang-Chi Hsieh <viirya@...>
> To: dev@...
> Date: 04/30/2021 03:12 PM
> Subject:  [EXTERNAL] Re: [VOTE] Release Spark 2.4.8 (RC3)
> 
> 
> 
> Hi all,
> 
> Thanks for actively voting. Unfortunately, we found a very ancient bug
> (SPARK-35278), and the fix (
> https://github.com/apache/spark/pull/32404
>  ) is
> going to be merged soon. We may fail this RC3.
> 
> I will go to cut RC4 as soon as the fix is merged.
> 
> Thank you!
> 
> 
> 








Re: [VOTE] Release Spark 2.4.8 (RC3)

2021-04-30 Thread Liang-Chi Hsieh
Hi all,

Thanks for actively voting. Unfortunately, we found a very ancient bug
(SPARK-35278), and the fix (https://github.com/apache/spark/pull/32404) is
going to be merged soon. We may fail this RC3.

I will go to cut RC4 as soon as the fix is merged.

Thank you!






Re: [DISCUSS] Add RocksDB StateStore

2021-04-28 Thread Liang-Chi Hsieh
I am fine with the RocksDB state store as a built-in state store. Actually, the
proposal to have it as an external module was to avoid the concerns raised in
the previous effort.

The need to have it as experimental doesn't necessarily mean it has to be an
external module, I think. They are two different things. So I don't think the
risk is closely tied to whether it is an external module or a built-in one,
except for the case where we make it the default state store from the
beginning. If it is not the default one, and we explicitly mention it is an
experimental feature, the risk is not very different between an external module
and a built-in one. Being built-in just makes it easier for users to try it.

That said, even though the coming RocksDB state store has been supported for
years, I think it is safer to have it as an experimental feature first as it
lands in OSS Spark.

Anyway, I think it is okay to add the RocksDB state store among the built-in
state stores along with HDFSBasedStateStore.

I also feel that we could just have RocksDB and replace LevelDB with RocksDB.
But this is another story.


Liang-Chi
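
For context, a rough sketch of how a user would opt in to such a provider once
it is built in. The config key is the existing state store provider setting;
the RocksDB provider class name below is an assumption based on the proposal
and may differ in the final implementation:

    import org.apache.spark.sql.SparkSession

    // Switch streaming state from the default HDFS-backed provider to the
    // (proposed) RocksDB-backed one for all queries in this session.
    val spark = SparkSession.builder()
      .appName("rocksdb-statestore-sketch")
      .config("spark.sql.streaming.stateStore.providerClass",
        "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider")
      .getOrCreate()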


Jungtaek Lim-2 wrote
> I think adding RocksDB state store to sql/core directly would be
> OK. Personally I also voted "either way is fine with me" against RocksDB
> state store implementation in Spark ecosystem. The overall stance hasn't
> changed, but I'd like to point out that the risk becomes quite lower than
> before, given the fact we can leverage Databricks RocksDB state store
> implementation.
> 
> I feel there were two major reasons to add RocksDB state store to external
> module;
> 
> 1. stability
> 
> Databricks RocksDB state store implementation has been supported for
> years,
> it won't require more time to incubate. We may want to review thoughtfully
> to ensure the open sourced proposal fits to the Apache Spark and still
> retains stability, but this is quite better than the previous targets to
> adopt which may not be tested in production for years.
> 
> That makes me think that we don't have to put it into external and
> consider
> it as experimental.
> 
> 2. dependency
> 
> From Yuanjian's mail, JNI library is the only dependency, which seems fine
> to add by default. We already have LevelDB as one of core dependencies and
> don't concern too much about the JNI library dependency. Probably someone
> might figure out that there are outstanding benefits on replacing LevelDB
> with RocksDB and then RocksDB can even be the one of core dependencies.
> 
> On Tue, Apr 27, 2021 at 6:41 PM Yuanjian Li <xyliyuanjian@...> wrote:
> 
>> Hi all,
>>
>> Following the latest comments in SPARK-34198
>> (https://issues.apache.org/jira/browse/SPARK-34198), Databricks decided
>> to donate the commercial implementation of the RocksDBStateStore.
>> Compared
>> with the original decision, there’s only one topic we want to raise again
>> for discussion: can we directly add the RockDBStateStoreProvider in the
>> sql/core module? This suggestion based on the following reasons:
>>
>>1.
>>
>>The RocksDBStateStore aims to solve the problem of the original
>>HDFSBasedStateStore, which is built-in.
>>2.
>>
>>End users can conveniently set the config to use the new
>>implementation.
>>3.
>>
>>We can set the RocksDB one as the default one in the future.
>>
>>
>> For the consideration of the dependency, I also checked the rocksdbjni we
>> might introduce. As a JNI package
>> (https://repo1.maven.org/maven2/org/rocksdb/rocksdbjni/6.2.2/rocksdbjni-6.2.2.pom),
>> it should not have any dependency conflicts with Apache Spark.
>>
>> Any suggestions are welcome!
>>
>> Best,
>>
>> Yuanjian








[VOTE] Release Spark 2.4.8 (RC3)

2021-04-28 Thread Liang-Chi Hsieh
Please vote on releasing the following candidate as Apache Spark version
2.4.8.

The vote is open until May 4th at 9AM PST and passes if a majority +1 PMC
votes are cast, with a minimum of 3 +1 votes.

[ ] +1 Release this package as Apache Spark 2.4.8
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see http://spark.apache.org/

There are currently no issues targeting 2.4.8 (try project = SPARK AND
"Target Version/s" = "2.4.8" AND status in (Open, Reopened, "In Progress"))

The tag to be voted on is v2.4.8-rc3 (commit
e89526d2401b3a04719721c923a6f630e555e286):
https://github.com/apache/spark/tree/v2.4.8-rc3

The release files, including signatures, digests, etc. can be found at:
https://dist.apache.org/repos/dist/dev/spark/v2.4.8-rc3-bin/

Signatures used for Spark RCs can be found in this file:
https://dist.apache.org/repos/dist/dev/spark/KEYS

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1377/

The documentation corresponding to this release can be found at:
https://dist.apache.org/repos/dist/dev/spark/v2.4.8-rc3-docs/

The list of bug fixes going into 2.4.8 can be found at the following URL:
https://s.apache.org/spark-v2.4.8-rc3

This release is using the release script of the tag v2.4.8-rc3.

FAQ


=
How can I help test this release?
=

If you are a Spark user, you can help us test this release by taking
an existing Spark workload and running on this release candidate, then
reporting any regressions.

If you're working in PySpark you can set up a virtual env and install
the current RC and see if anything important breaks, in the Java/Scala
you can add the staging repository to your project's resolvers and test
with the RC (make sure to clean up the artifact cache before/after so
you don't end up building with an out of date RC going forward).

===
What should happen to JIRA tickets still targeting 2.4.8?
===

The current list of open tickets targeted at 2.4.8 can be found at:
https://issues.apache.org/jira/projects/SPARK and search for "Target
Version/s" = 2.4.8

Committers should look at those and triage. Extremely important bug
fixes, documentation, and API tweaks that impact compatibility should
be worked on immediately. Everything else please retarget to an
appropriate release.

==
But my bug isn't fixed?
==

In order to make timely releases, we will typically not hold the
release unless the bug in question is a regression from the previous
release. That being said, if there is something which is a regression
that has not been correctly targeted please ping me or a committer to
help target the issue. 






Re: [VOTE] Release Spark 2.4.8 (RC2)

2021-04-15 Thread Liang-Chi Hsieh
Thanks all for voting. Unfortunately, we found a long-standing correctness
bug, SPARK-35080, and 2.4 is affected too. That is to say, we need to drop RC2
in favor of RC3.

The fix is ready for merging at https://github.com/apache/spark/pull/32179.









[VOTE] Release Spark 2.4.8 (RC2)

2021-04-11 Thread Liang-Chi Hsieh
Please vote on releasing the following candidate as Apache Spark version
2.4.8.

The vote is open until Apr 15th at 9AM PST and passes if a majority +1 PMC
votes are cast, with a minimum of 3 +1 votes.

[ ] +1 Release this package as Apache Spark 2.4.8
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see http://spark.apache.org/

There are currently no issues targeting 2.4.8 (try project = SPARK AND
"Target Version/s" = "2.4.8" AND status in (Open, Reopened, "In Progress"))

The tag to be voted on is v2.4.8-rc2 (commit
a0ab27ca6b46b8e5a7ae8bb91e30546082fc551c):
https://github.com/apache/spark/tree/v2.4.8-rc2

The release files, including signatures, digests, etc. can be found at:
https://dist.apache.org/repos/dist/dev/spark/v2.4.8-rc2-bin/

Signatures used for Spark RCs can be found in this file:
https://dist.apache.org/repos/dist/dev/spark/KEYS

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1373/

The documentation corresponding to this release can be found at:
https://dist.apache.org/repos/dist/dev/spark/v2.4.8-rc2-docs/

The list of bug fixes going into 2.4.8 can be found at the following URL:
https://s.apache.org/spark-v2.4.8-rc2

This release is using the release script of the tag v2.4.8-rc2.

FAQ


=
How can I help test this release?
=

If you are a Spark user, you can help us test this release by taking
an existing Spark workload and running on this release candidate, then
reporting any regressions.

If you're working in PySpark you can set up a virtual env and install
the current RC and see if anything important breaks, in the Java/Scala
you can add the staging repository to your project's resolvers and test
with the RC (make sure to clean up the artifact cache before/after so
you don't end up building with an out of date RC going forward).

===
What should happen to JIRA tickets still targeting 2.4.8?
===

The current list of open tickets targeted at 2.4.8 can be found at:
https://issues.apache.org/jira/projects/SPARK and search for "Target
Version/s" = 2.4.8

Committers should look at those and triage. Extremely important bug
fixes, documentation, and API tweaks that impact compatibility should
be worked on immediately. Everything else please retarget to an
appropriate release.

==
But my bug isn't fixed?
==

In order to make timely releases, we will typically not hold the
release unless the bug in question is a regression from the previous
release. That being said, if there is something which is a regression
that has not been correctly targeted please ping me or a committer to
help target the issue.






Re: [VOTE] Release Spark 2.4.8 (RC1)

2021-04-07 Thread Liang-Chi Hsieh


I'm working on the fix for master. I think the fix is the same for 2.4.

Okay. So I think we are in favor of RC2 and RC1 is dropped. Then I will get
the fix merged first and then prepare RC2.

Thank you.

Liang-Chi


Mridul Muralidharan wrote
> Do we have a fix for this in 3.x/master which can be backported without
> too
> much surrounding change ?
> Given we are expecting 2.4.7 to probably be the last release for 2.4, if
> we
> can fix it, that would be great.
> 
> Regards,
> Mridul








Re: [VOTE] Release Spark 2.4.8 (RC1)

2021-04-07 Thread Liang-Chi Hsieh
Thanks for voting.

A while after I started running the release script to cut RC1, I found a
nested column pruning bug, SPARK-34963, and unfortunately it exists in 2.4.7
too. As RC1 is already cut, I will continue this vote.

The bug looks like a corner case to me and it has not been reported yet, even
though we have supported nested column pruning since 2.4. So maybe it is okay
not to fix it in 2.4?




cloud0fan wrote
> +1
> 
> On Thu, Apr 8, 2021 at 9:24 AM Sean Owen <srowen@...> wrote:
> 
>> Looks good to me testing on Java 8, Hadoop 2.7, Ubuntu, with about all
>> profiles enabled.
>> I still get an odd failure in the Hive versions suite, but I keep seeing
>> that in my env and think it's something odd about my setup.
>> +1
>>








[VOTE] Release Spark 2.4.8 (RC1)

2021-04-07 Thread Liang-Chi Hsieh


Please vote on releasing the following candidate as Apache Spark version
2.4.8.

The vote is open until Apr 10th at 9AM PST and passes if a majority +1 PMC
votes are cast, with a minimum of 3 +1 votes.

[ ] +1 Release this package as Apache Spark 2.4.8
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see http://spark.apache.org/

There are currently no issues targeting 2.4.8 (try project = SPARK AND
"Target Version/s" = "2.4.8" AND status in (Open, Reopened, "In Progress"))

The tag to be voted on is v2.4.8-rc1 (commit
53d37e4e17254c4cfee1abcb60c36f865b255046):
https://github.com/apache/spark/tree/v2.4.8-rc1

The release files, including signatures, digests, etc. can be found at:
https://dist.apache.org/repos/dist/dev/spark/v2.4.8-rc1-bin/

Signatures used for Spark RCs can be found in this file:
https://dist.apache.org/repos/dist/dev/spark/KEYS

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1368/

The documentation corresponding to this release can be found at:
https://dist.apache.org/repos/dist/dev/spark/v2.4.8-rc1-docs/

The list of bug fixes going into 2.4.8 can be found at the following URL:
https://s.apache.org/spark-v2.4.8-rc1

This release is using the release script of the tag v2.4.8-rc1.

FAQ


=
How can I help test this release?
=

If you are a Spark user, you can help us test this release by taking
an existing Spark workload and running on this release candidate, then
reporting any regressions.

If you're working in PySpark you can set up a virtual env and install
the current RC and see if anything important breaks, in the Java/Scala
you can add the staging repository to your project's resolvers and test
with the RC (make sure to clean up the artifact cache before/after so
you don't end up building with an out of date RC going forward).

===
What should happen to JIRA tickets still targeting 2.4.8?
===

The current list of open tickets targeted at 2.4.8 can be found at:
https://issues.apache.org/jira/projects/SPARK and search for "Target
Version/s" = 2.4.8

Committers should look at those and triage. Extremely important bug
fixes, documentation, and API tweaks that impact compatibility should
be worked on immediately. Everything else please retarget to an
appropriate release.

==
But my bug isn't fixed?
==

In order to make timely releases, we will typically not hold the
release unless the bug in question is a regression from the previous
release. That being said, if there is something which is a regression
that has not been correctly targeted please ping me or a committer to
help target the issue.







Re: Apache Spark 2.4.8 (and EOL of 2.4)

2021-04-05 Thread Liang-Chi Hsieh


Hi Mingjia,

Thanks for fixing it. I can see it is included.

Liang-Chi


mingjia-2 wrote
> Hi, All.
> 
> I fixed SPARK-32708 (https://issues.apache.org/jira/browse/SPARK-32708)
> a while ago, after the 2.4.7 release.
> PR: https://github.com/apache/spark/pull/29564
> 
> Since it's not listed as one of the JIRAs in Dongjoon's initial email, I'd
> like to check if it will be included in 2.4.8. Can anyone please confirm?
> :-)
> 
> 
> Thanks,
> Mingjia
> 
> 
> 








Re: Apache Spark 2.4.8 (and EOL of 2.4)

2021-04-04 Thread Liang-Chi Hsieh


Thanks Hyukjin and Dongjoon! :)

Then I will start RC.


Dongjoon Hyun-2 wrote
> Given that Maven passed already with that profile and you tested locally,
> I'm +1 for starting the RC.
> 
> Thanks,
> Dongjoon.
> 
> On Sun, Apr 4, 2021 at 2:24 AM Hyukjin Kwon <gurwls223@...> wrote:
> 
>> I would +1 for just going ahead. That looks flaky to me too.
>>
>> Thanks Liang-Chi for driving this!
>>
>>>
>>>








Re: Apache Spark 2.4.8 (and EOL of 2.4)

2021-04-04 Thread Liang-Chi Hsieh
Hi devs,

Currently there are no open or ongoing issues targeting 2.4.

On the QA test dashboard, only spark-branch-2.4-test-sbt-hadoop-2.6 is in red
status. The failed test is
org.apache.spark.sql.streaming.StreamingQueryManagerSuite.awaitAnyTermination
with timeout and resetTerminated. It looks like a flaky test to me. It passed
locally too.

So I'm wondering if I could directly go ahead and cut 2.4.8 RC1 given the one
red light? Or do we need to re-trigger the failed Jenkins build and wait for it
to be green?


Liang-Chi






Re: Welcoming six new Apache Spark committers

2021-03-26 Thread Liang-Chi Hsieh
Congrats! Welcome!


Matei Zaharia wrote
> Hi all,
> 
> The Spark PMC recently voted to add several new committers. Please join me
> in welcoming them to their new role! Our new committers are:
> 
> - Maciej Szymkiewicz (contributor to PySpark)
> - Max Gekk (contributor to Spark SQL)
> - Kent Yao (contributor to Spark SQL)
> - Attila Zsolt Piros (contributor to decommissioning and Spark on
> Kubernetes)
> - Yi Wu (contributor to Spark Core and SQL)
> - Gabor Somogyi (contributor to Streaming and security)
> 
> All six of them contributed to Spark 3.1 and we’re very excited to have
> them join as committers.
> 
> Matei and the Spark PMC








Re: Apache Spark 2.4.8 (and EOL of 2.4)

2021-03-26 Thread Liang-Chi Hsieh
To update the current status.

The only remaining issue for 2.4 is:

[SPARK-34855][CORE] spark context - avoid using local lazy val for callSite

We are waiting for the author to submit a PR for the 2.4 branch.




Liang-Chi Hsieh wrote
> Thank you so much, Takeshi!
> 
> 
> Takeshi Yamamuro wrote
>> Hi, viirya
>> 
>> I'm looking now into "SPARK-34607: Add `Utils.isMemberClass` to fix a
>> malformed class name error
>> on jdk8u" .
>> 
>> Bests,
>> Takeshi
>> 
>> 
>> Takeshi Yamamuro
> 
> 
> 
> 
> 








Re: [VOTE] SPIP: Support pandas API layer on PySpark

2021-03-26 Thread Liang-Chi Hsieh
+1 (non-binding)


rxin wrote
> +1. Would open up a huge persona for Spark.
> 
> On Fri, Mar 26 2021 at 11:30 AM, Bryan Cutler <cutlerb@...> wrote:
> 
>> 
>> +1 (non-binding)
>> 
>> 
>> On Fri, Mar 26, 2021 at 9:49 AM Maciej <mszymkiewicz@...> wrote:
>> 
>> 
>>> +1 (nonbinding)








Re: Apache Spark 2.4.8 (and EOL of 2.4)

2021-03-15 Thread Liang-Chi Hsieh
Thank you so much, Takeshi!


Takeshi Yamamuro wrote
> Hi, viirya
> 
> I'm looking now into "SPARK-34607: Add `Utils.isMemberClass` to fix a
> malformed class name error
> on jdk8u" .
> 
> Bests,
> Takeshi
> 
> 
> Takeshi Yamamuro








Re: Apache Spark 2.4.8 (and EOL of 2.4)

2021-03-15 Thread Liang-Chi Hsieh
To update with current status.

There are three tickets targeting 2.4 that are still ongoing.

SPARK-34719: Correctly resolve the view query with duplicated column names
SPARK-34607: Add `Utils.isMemberClass` to fix a malformed class name error
on jdk8u
SPARK-34726: Fix collectToPython timeouts

SPARK-34719 doesn't have a PR for 2.4 yet.

SPARK-34607 and SPARK-34726 are under review. SPARK-34726 is a bit arguable
as it involves a behavior change, even though it is a very rare case. Any
suggestions on the PR are welcome. Thanks.



Dongjoon Hyun-2 wrote
> Thank you for the update.
> 
> +1 for your plan.
> 
> Bests,
> Dongjoon.








Re: [DISCUSS] Support pandas API layer on PySpark

2021-03-13 Thread Liang-Chi Hsieh
From a Python developer perspective, this direction makes sense to me.
As pandas is almost the standard library in this area, if PySpark
supports the pandas API out of the box, usability would be at a higher level.

For maintenance cost, IIUC, there are some Spark committers in the Koalas
community and they are pretty active. So it seems we don't need to worry about
who will be interested in doing the maintenance.

It is good that it is a separate package and does not break anything in
the existing code. How about test code? Does it fit into the PySpark test
framework?


Hyukjin Kwon wrote
> Hi all,
> 
> I would like to start the discussion on supporting pandas API layer on
> Spark.
> 
> If we have a general consensus on having it in PySpark, I will initiate
> and
> drive an SPIP with a detailed explanation about the implementation’s
> overview and structure.
> 
> I would appreciate it if I can know whether you guys support this or not
> before starting the SPIP.
> 
> I do recommend taking a quick look at blog posts and talks made for pandas
> on Spark:
> https://koalas.readthedocs.io/en/latest/getting_started/videos_blogs.html.
> They explain why we need this far better.








Re: [build system] github fetches timing out

2021-03-10 Thread Liang-Chi Hsieh
Thanks Shane for looking at it!


shane knapp ☠ wrote
> ...and just like that, overnight the builds started successfully git
> fetching!
> 
> -- 
> Shane Knapp
> Computer Guy / Voice of Reason
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu








Re: [build system] github fetches timing out

2021-03-10 Thread Liang-Chi Hsieh








Re: Apache Spark 2.4.8 (and EOL of 2.4)

2021-03-09 Thread Liang-Chi Hsieh
I just contacted Shane and it seems there is an ongoing GitHub fetch timeout
issue on Jenkins.

That being said, currently the QA tests are unavailable. I guess it is unsafe
to make a release cut due to the lack of reliable QA test results.

I may defer the cut until the QA tests come back, if there is no objection.

WDYT?






Re: [VOTE] SPIP: Add FunctionCatalog

2021-03-09 Thread Liang-Chi Hsieh


+1 (non-binding).

Thanks for the work!


Erik Krogen wrote
> +1 from me (non-binding)
> 
> On Tue, Mar 9, 2021 at 9:27 AM huaxin gao <huaxin.gao11@...> wrote:
> 
>> +1 (non-binding)








Re: Apache Spark 2.4.8 (and EOL of 2.4)

2021-03-09 Thread Liang-Chi Hsieh
Hi devs,

I was going to cut the branch yesterday. I'd like to share the current
progress. I hit a problem during a dry run of the release script and fixed it
in SPARK-34672.

The latest dry run looks okay, as build, docs, and publish all succeed. But the
last step (pushing the tag) has a fatal error; I'm not sure if it is due to dry
run mode?


= Creating release tag v2.4.8-rc1...
Command: /opt/spark-rm/release-tag.sh
Log file: tag.log

= Building Spark...
Command: /opt/spark-rm/release-build.sh package
Log file: build.log

= Building documentation...
Command: /opt/spark-rm/release-build.sh docs
Log file: docs.log

= Publishing release
Command: /opt/spark-rm/release-build.sh publish-release
Log file: publish.log
fatal: Not a git repository (or any parent up to mount point /opt/spark-rm)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).


Currently the 2.4-related Jenkins QA tests aren't green. Some are not test
failures but

 > git fetch --tags --progress https://github.com/apache/spark.git
+refs/heads/*:refs/remotes/origin/* # timeout=10
ERROR: Timeout after 10 minutes
ERROR: Error fetching remote repo 'origin'
hudson.plugins.git.GitException: Failed to fetch from
https://github.com/apache/spark.git
at hudson.plugins.git.GitSCM.fetchFrom(GitSCM.java:996)
at hudson.plugins.git.GitSCM.retrieveChanges(GitSCM.java:1237)
at hudson.plugins.git.GitSCM.checkout(GitSCM.java:1297)
at hudson.scm.SCM.checkout(SCM.java:505)
at hudson.model.AbstractProject.checkout(AbstractProject.java:1206)
at
hudson.model.AbstractBuild$AbstractBuildExecution.defaultCheckout(AbstractBuild.java:574)
at jenkins.scm.SCMCheckoutStrategy.checkout(SCMCheckoutStrategy.java:86)
at
hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:499)
at hudson.model.Run.execute(Run.java:1894)
at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:43)
at hudson.model.ResourceController.execute(ResourceController.java:97)
at hudson.model.Executor.run(Executor.java:428)
Caused by: hudson.plugins.git.GitException: Command "git fetch --tags
--progress https://github.com/apache/spark.git
+refs/heads/*:refs/remotes/origin/*" returned status code 128:


But there is also one test failure:
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.4-test-sbt-hadoop-2.7/1221/console

I will take a look.






Re: Apache Spark 2.4.8 (and EOL of 2.4)

2021-03-08 Thread Liang-Chi Hsieh


Thank you Dongjoon.

I'm going to cut the branch now. Hopefully I can make it soon (I need to get
familiar with the process, as it's my first time :) )

Liang-Chi


Dongjoon Hyun-2 wrote
> Thank you, Liang-Chi! Next Monday sounds good.
> 
> To All. Please ping Liang-Chi if you have a missed backport.
> 
> Bests,
> Dongjoon.








Re: Apache Spark 2.4.8 (and EOL of 2.4)

2021-03-04 Thread Liang-Chi Hsieh


Thanks all for the input.

If there is no objection, I am going to cut the branch next Monday.

Thanks.
Liang-Chi


Takeshi Yamamuro wrote
> +1 for releasing 2.4.8 and thanks, Liang-chi, for volunteering.
> Btw, does anyone roughly know how many v2.4 users there still are, based on
> some stats (e.g., # of v2.4.7 downloads from the official repos)?
> Have most users started using v3.x?
> 
> On Thu, Mar 4, 2021 at 8:34 AM Hyukjin Kwon <gurwls223@...> wrote:
> 
>> Yeah, I would prefer to have a 2.4.8 release as an EOL too. I don't mind
>> having 2.4.9 as EOL too if that's preferred from more people.
>>
> Takeshi Yamamuro








Re: [DISCUSS] SPIP: FunctionCatalog

2021-03-04 Thread Liang-Chi Hsieh


Yeah, in short this is a great compromise approach and I would like to see
this proposal move forward to the next step. This discussion is valuable.


Chao Sun wrote
> +1 on Dongjoon's proposal. Great to see this is getting moved forward and
> thanks everyone for the insightful discussion!
> 
> 
> 
> On Thu, Mar 4, 2021 at 8:58 AM Ryan Blue <rblue@...> wrote:
> 
>> Okay, great. I'll update the SPIP doc and call a vote in the next day or
>> two.








Re: Apache Spark 2.4.8 (and EOL of 2.4)

2021-03-03 Thread Liang-Chi Hsieh


Thanks Dongjoon!

+1 and I volunteer to do the release of 2.4.8 if it passes.


Liang-Chi







Re: [DISCUSS] Add RocksDB StateStore

2021-02-13 Thread Liang-Chi Hsieh
Hi devs,

Thanks for all the input. I think overall there is positive input in the
Spark community about having the RocksDB state store as an external module. So
let's go forward in this direction to improve Structured Streaming. I
will keep updating the JIRA, SPARK-34198.

Thanks all again for the input and discussion.

Liang-Chi Hsieh








Re: [DISCUSS] SPIP: FunctionCatalog

2021-02-12 Thread Liang-Chi Hsieh
Basically the proposal makes sense to me and I'd like to support the SPIP, as
it looks like we have a strong need for this important feature.

Thanks Ryan for working on this, and I also look forward to Wenchen's
implementation. Thanks for the discussion too.

Actually, the SupportsInvoke extension proposed by Ryan looks like a good
alternative to me. Besides Wenchen's alternative implementation, is there a
chance we could also have a SupportsInvoke prototype for comparison?
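
To make the comparison more concrete, below is a rough sketch (in Scala) of what
a simple catalog scalar function could look like under the proposal. The
interface and method names (UnboundFunction, BoundFunction, ScalarFunction,
produceResult) follow the SPIP design doc; the exact package and signatures are
my assumption of how it may land, not a final API.

    import org.apache.spark.sql.catalyst.InternalRow
    import org.apache.spark.sql.connector.catalog.functions.{BoundFunction, ScalarFunction, UnboundFunction}
    import org.apache.spark.sql.types.{DataType, IntegerType, StructType}

    // A trivial "int_add" function: bind() records the input schema and returns
    // a ScalarFunction whose produceResult() is invoked row by row.
    object IntAdd extends UnboundFunction {
      override def name(): String = "int_add"
      override def description(): String = "int_add(a, b): adds two integer columns"
      override def bind(inputType: StructType): BoundFunction = new ScalarFunction[Int] {
        override def name(): String = "int_add"
        override def inputTypes(): Array[DataType] = Array(IntegerType, IntegerType)
        override def resultType(): DataType = IntegerType
        override def produceResult(input: InternalRow): Int = input.getInt(0) + input.getInt(1)
      }
    }

The SupportsInvoke idea would let such a function additionally expose an
invoke(Int, Int): Int method that codegen can call directly, avoiding the
InternalRow handling in produceResult().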


John Zhuge wrote
> Excited to see our Spark community rallying behind this important feature!
> 
> The proposal lays a solid foundation of minimal feature set with careful
> considerations for future optimizations and extensions. Can't wait to see
> it leading to more advanced functionalities like views with shared custom
> functions, function pushdown, lambda, etc. It has already borne fruit from
> the constructive collaborations in this thread. Looking forward to
> Wenchen's prototype and further discussions including the SupportsInvoke
> extension proposed by Ryan.
> 
> 
> On Fri, Feb 12, 2021 at 4:35 PM Owen O'Malley 

> owen.omalley@

> 
> wrote:
> 
>> I think this proposal is a very good thing giving Spark a standard way of
>> getting to and calling UDFs.
>>
>> I like having the ScalarFunction as the API to call the UDFs. It is
>> simple, yet covers all of the polymorphic type cases well. I think it
>> would
>> also simplify using the functions in other contexts like pushing down
>> filters into the ORC & Parquet readers although there are a lot of
>> details
>> that would need to be considered there.
>>
>> .. Owen
>>
>>
>> On Fri, Feb 12, 2021 at 11:07 PM Erik Krogen 

> ekrogen@.com

> 
>> wrote:
>>
>>> I agree that there is a strong need for a FunctionCatalog within Spark
>>> to
>>> provide support for shareable UDFs, as well as make movement towards
>>> more
>>> advanced functionality like views which themselves depend on UDFs, so I
>>> support this SPIP wholeheartedly.
>>>
>>> I find both of the proposed UDF APIs to be sufficiently user-friendly
>>> and
>>> extensible. I generally think Wenchen's proposal is easier for a user to
>>> work with in the common case, but has greater potential for confusing
>>> and
>>> hard-to-debug behavior due to use of reflective method signature
>>> searches.
>>> The merits on both sides can hopefully be more properly examined with
>>> code,
>>> so I look forward to seeing an implementation of Wenchen's ideas to
>>> provide
>>> a more concrete comparison. I am optimistic that we will not let the
>>> debate
>>> over this point unreasonably stall the SPIP from making progress.
>>>
>>> Thank you to both Wenchen and Ryan for your detailed consideration and
>>> evaluation of these ideas!
>>> --
>>> *From:* Dongjoon Hyun 

> dongjoon.hyun@

> 
>>> *Sent:* Wednesday, February 10, 2021 6:06 PM
>>> *To:* Ryan Blue 

> blue@

> 
>>> *Cc:* Holden Karau 

> holden@

> ; Hyukjin Kwon <
>>> 

> gurwls223@

>>; Spark Dev List 

> dev@.apache

> ; Wenchen Fan
>>> 

> cloud0fan@

> 
>>> *Subject:* Re: [DISCUSS] SPIP: FunctionCatalog
>>>
>>> BTW, I forgot to add my opinion explicitly in this thread because I was
>>> on the PR before this thread.
>>>
>>> 1. The `FunctionCatalog API` PR was made on May 9, 2019 and has been
>>> there for almost two years.
>>> 2. I already gave my +1 on that PR last Saturday because I agreed with
>>> the latest updated design docs and AS-IS PR.
>>>
>>> And, the rest of the progress in this thread is also very satisfying to
>>> me.
>>> (e.g. Ryan's extension suggestion and Wenchen's alternative)
>>>
>>> To All:
>>> Please take a look at the design doc and the PR, and give us some
>>> opinions.
>>> We really need your participation in order to make DSv2 more complete.
>>> This will unblock other DSv2 features, too.
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>>
>>>
>>> On Wed, Feb 10, 2021 at 10:58 AM Dongjoon Hyun 

> dongjoon.hyun@

> 
>>> wrote:
>>>
>>> Hi, Ryan.
>>>
>>> We didn't move past anything (both yours and Wenchen's). What Wenchen
>>> suggested is double-checking the alternatives with the implementation to
>>> give more momentum to our discussion.
>>>
>>> Your new suggestion about optional extention also sounds like a new
>>> reasonable alternative to me.
>>>
>>> We are still discussing this topic together and I hope we can make a
>>> conclude at this time (for Apache Spark 3.2) without being stucked like
>>> last time.
>>>
>>> I really appreciate your leadership in this dicsussion and the moving
>>> direction of this discussion looks constructive to me. Let's give some
>>> time
>>> to the alternatives.
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>> On Wed, Feb 10, 2021 at 10:14 AM Ryan Blue 

> blue@

>  wrote:
>>>
>>> I don’t think we should so quickly move past the drawbacks of this
>>> approach. The problems are significant enough that using invoke is not
>>> sufficient on its own. But, I think we can add it as an optional
>>> extension
>>> to shore up the weaknesses.
>>>

Re: [DISCUSS] Add RocksDB StateStore

2021-02-07 Thread Liang-Chi Hsieh
Thank you for the input, Yikun! We will take these points into account when we
are ready to have a RocksDB state store in Spark SS.


Yikun Jiang wrote
> I worked on some work about rocksdb multi-arch support and version upgrade
> on
> Kafka/Storm/Flink[1][2][3].To avoid these issues happened in spark again,
> I
> want to
> give some inputs in here about rocksdb version selection from multi-arch
> support
> view. Hope it helps.
> 
> The Rocksdb adds Arm64 support [4] since version 6.4.6, and also backports
> all Arm64
> related commits to 5.18.4 and release a all platforms support version.
> 
> So, from multi-arch support view, the better rocksdb version is the
> version
> since
> v6.4.6, or 5.X version is v5.18.4.
> 
> [1] https://issues.apache.org/jira/browse/STORM-3599
> [2] https://github.com/apache/kafka/pull/8284
> [3] https://issues.apache.org/jira/browse/FLINK-13598
> [4] https://github.com/facebook/rocksdb/pull/6250
> 
> Regards,
> Yikun





--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



[DISCUSS] Add RocksDB StateStore

2021-02-02 Thread Liang-Chi Hsieh
Hi devs,

In Spark Structured Streaming, we need a state store for state management in
stateful operators such as streaming aggregates, joins, etc. We have one and
only one state store implementation now: an in-memory hashmap that is backed
up to an HDFS-compliant file system at the end of every micro-batch.

As it basically uses an in-memory map to store states, memory consumption is a
serious issue and the state store size is limited by the size of the executor
memory. Moreover, a state store that uses more memory may also impact the
performance of task execution, which requires memory too.

Internally we see more streaming applications that require large state in
stateful operations. For such requirements, we need a StateStore that does not
rely on memory to store states.

This also seems to be true externally, as several other major streaming
frameworks already use RocksDB for state management. RocksDB is an embedded
DB, and streaming engines can use it to store state instead of keeping it in
memory.

So it seems to me this is a proven choice for large-state usage, but Spark SS
still lacks a built-in state store for this requirement.

Previously there was one attempt, SPARK-28120, to add a RocksDB StateStore to
Spark SS. IIUC, it was pushed back due to two concerns: the extra code
maintenance cost and the RocksDB dependency it introduces.

For the first concern, as more users need the feature, it should become highly
used code in SS and more developers will look at it. For the second one, we
propose (SPARK-34198) to add it as an external module to relieve the
dependency concern.
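
To make the external-module option concrete, here is a minimal sketch of how
such a module could be selected once it exists: Spark already allows plugging
in a custom StateStoreProvider through the
spark.sql.streaming.stateStore.providerClass config, so the module would only
need to ship a provider class. The class name below is hypothetical.

    import org.apache.spark.sql.SparkSession

    // Select the external module's provider (hypothetical class name); the
    // module's jar would be added to the application classpath separately.
    val spark = SparkSession.builder()
      .appName("rocksdb-state-store-sketch")
      .config("spark.sql.streaming.stateStore.providerClass",
        "org.apache.spark.sql.rocksdb.RocksDBStateStoreProvider")
      .getOrCreate()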

Because it was pushed back previously, I'm going to raise this discussion to
know what people think about it now, in advance of submitting any code.

I think there might be some possible opinions:

1. okay to add RocksDB StateStore into sql core module
2. not okay for 1, but okay to add RocksDB StateStore as external module
3. either 1 or 2 is okay
4. not okay to add RocksDB StateStore, no matter into sql core or as
external module

Please let us know if you have some thoughts.

Thank you.

Liang-Chi Hsieh




--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [DISCUSS] Disable streaming query with possible correctness issue by default

2020-11-11 Thread Liang-Chi Hsieh


Thanks all for the responses!

Based on these responses, I think we can go forward with the PR. I will put
the new config in the migration guide. Please help review the PR if you have
more comments.

Thank you!


Yuanjian Li wrote
> Already +1 in the PR. It would be great to mention the new config in the
> SS
> migration guide.
> 
> Ryan Blue 

> rblue@.com

>  于2020年11月11日周三 上午7:48写道:
> 
>> +1, I agree with Tom.
>>
>> On Tue, Nov 10, 2020 at 3:00 PM Dongjoon Hyun 

> dongjoon.hyun@

> 
>> wrote:
>>
>>> +1 for Apache Spark 3.1.0.
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>> On Tue, Nov 10, 2020 at 6:17 AM Tom Graves 

> tgraves_cs@.com

> 
>>> wrote:
>>>
>>>> +1 since its a correctness issue, I think its ok to change the behavior
>>>> to make sure the user is aware of it and let them decide.
>>>>
>>>> Tom
>>>>
>>>> On Saturday, November 7, 2020, 01:00:11 AM CST, Liang-Chi Hsieh <
>>>> 

> viirya@

>> wrote:
>>>>
>>>>
>>>> Hi devs,
>>>>
>>>> In Spark structured streaming, chained stateful operators possibly
>>>> produces
>>>> incorrect results under the global watermark. SPARK-33259
>>>> (https://issues.apache.org/jira/browse/SPARK-33259) has an example
>>>> demostrating what the correctness issue could be.
>>>>
>>>> Currently we don't prevent users running such queries. Because the
>>>> possible
>>>> correctness in chained stateful operators in streaming query is not
>>>> straightforward for users. From users perspective, it will possibly be
>>>> considered as a Spark bug like SPARK-33259. It is also possible the
>>>> worse
>>>> case, users are not aware of the correctness issue and use wrong
>>>> results.
>>>>
>>>> IMO, it is better to disable such queries and let users choose to run
>>>> the
>>>> query if they understand there is such risk, instead of implicitly
>>>> running
>>>> the query and let users to find out correctness issue by themselves.
>>>>
>>>> I would like to propose to disable the streaming query with possible
>>>> correctness issue in chained stateful operators. The behavior can be
>>>> controlled by a SQL config, so if users understand the risk and still
>>>> want
>>>> to run the query, they can disable the check.
>>>>
>>>> In the PR (https://github.com/apache/spark/pull/30210), the concern I
>>>> got
>>>> for now is, this changes current behavior and by default it will break
>>>> some
>>>> existing streaming queries. But I think it is pretty easy to disable
>>>> the
>>>> check with the new config. In the PR currently there is no objection
>>>> but
>>>> suggestion to hear more voices. Please let me know if you have some
>>>> thoughts.
>>>>
>>>> Thanks.
>>>> Liang-Chi Hsieh
>>>>
>>>>
>>>>
>>>> --
>>>> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>>>>
>>>> -
>>>> To unsubscribe e-mail: 

> dev-unsubscribe@.apache

>>>>
>>>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>





--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



[DISCUSS] Disable streaming query with possible correctness issue by default

2020-11-06 Thread Liang-Chi Hsieh
Hi devs,

In Spark Structured Streaming, chained stateful operators can possibly produce
incorrect results under the global watermark. SPARK-33259
(https://issues.apache.org/jira/browse/SPARK-33259) has an example
demonstrating what the correctness issue could be.

Currently we don't prevent users from running such queries, because the
possible correctness issue with chained stateful operators in a streaming
query is not straightforward for users. From a user's perspective, it will
possibly be considered a Spark bug like SPARK-33259. In the worse case, users
are not aware of the correctness issue and use wrong results.

IMO, it is better to disable such queries and let users choose to run them if
they understand there is such a risk, instead of implicitly running the query
and letting users find out the correctness issue by themselves.

I would like to propose disabling streaming queries with a possible
correctness issue in chained stateful operators. The behavior can be
controlled by a SQL config, so if users understand the risk and still want
to run the query, they can disable the check.
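
As a sketch of what that would look like for a user who accepts the risk and
wants to keep running the query (assuming a SparkSession named spark; the
config key below is the one proposed in the PR as far as I recall, so please
treat the exact name as an assumption and check the PR):

    // Opt out of the new correctness check for this session.
    spark.conf.set("spark.sql.streaming.statefulOperator.checkCorrectness.enabled", "false")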

In the PR (https://github.com/apache/spark/pull/30210), the concern raised so
far is that this changes the current behavior and by default it will break
some existing streaming queries. But I think it is pretty easy to disable the
check with the new config. Currently there is no objection on the PR, only a
suggestion to hear more voices. Please let me know if you have some
thoughts.

Thanks.
Liang-Chi Hsieh



--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Welcoming some new Apache Spark committers

2020-07-14 Thread Liang-Chi Hsieh
Congrats! Welcome all!


Dongjoon Hyun-2 wrote
> Welcome everyone! :D
> 
> Bests,
> Dongjoon.
> 
> On Tue, Jul 14, 2020 at 11:21 AM Xiao Li 

> lixiao@

>  wrote:
> 
>> Welcome, Dilip, Huaxin and Jungtaek!
>>
>> Xiao
>>
>> On Tue, Jul 14, 2020 at 11:02 AM Holden Karau 

> holden@

> 
>> wrote:
>>
>>> So excited to have our committer pool growing with these awesome folks,
>>> welcome y'all!
>>>
>>> On Tue, Jul 14, 2020 at 10:59 AM Driesprong, Fokko 

> fokko@

> 
>>> wrote:
>>>
 Welcome!

 Op di 14 jul. 2020 om 19:53 schreef shane knapp ☠ 

> sknapp@

> :

> welcome, all!
>
> On Tue, Jul 14, 2020 at 10:37 AM Matei Zaharia 

> matei.zaharia@

> 
> wrote:
>
>> Hi all,
>>
>> The Spark PMC recently voted to add several new committers. Please
>> join me in welcoming them to their new roles! The new committers are:
>>
>> - Huaxin Gao
>> - Jungtaek Lim
>> - Dilip Biswal
>>
>> All three of them contributed to Spark 3.0 and we’re excited to have
>> them join the project.
>>
>> Matei and the Spark PMC
>> -
>> To unsubscribe e-mail: 

> dev-unsubscribe@.apache

>>
>>
>
> --
> Shane Knapp
> Computer Guy / Voice of Reason
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>

>>>
>>> --
>>> Twitter: https://twitter.com/holdenkarau
>>> Books (Learning Spark, High Performance Spark, etc.):
>>> https://amzn.to/2MaRAG9
>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>
>>
>>
>> --
>> https://databricks.com/sparkaisummit/north-america;
>>





--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [discuss] SparkR CRAN feasibility check server problem

2018-12-12 Thread Liang-Chi Hsieh


Just got a reply from the CRAN admin. It should be fixed now.


Hyukjin Kwon wrote
> Thanks, Liang-chi.
> 
> On Thu, 13 Dec 2018, 8:29 am Liang-Chi Hsieh 

> viirya@

>  wrote:
> 
> 
>> Sorry for late. There is a malformed record at CRAN package page again.
>> I've
>> already asked CRAN admin for help. It should be fixed soon according to
>> past
>> experience.
>>
>> Related discussion will be in
>> https://issues.apache.org/jira/browse/SPARK-24152. I will post here if I
>> get
>> reply from CRAN admin.
>>
>> Thanks.
>>
>>
>> Liang-Chi Hsieh wrote
>> > Thanks for letting me know! I will look into it and ask CRAN admin for
>> > help.
>> >
>> >
>> > Hyukjin Kwon wrote
>> >> Looks it's happening again. Liang-Chi, do you mind if I ask it again?
>> >>
>> >> FYI, R 3.4 is officially deprecated as of
>> >> https://github.com/apache/spark/pull/23012
>> >> We could upgrade R version to 3.4.x in Jenkins, which deals with the
>> >> malformed(?) responses after 3.0 release.
>> >> Then, we could get rid of this problem..!
>> >>
>> >> 2018년 11월 12일 (월) 오후 1:47, Hyukjin Kwon 
>> >
>> >> gurwls223@
>> >
>> >> 님이 작성:
>> >>
>> >>> I made a PR to officially drop R prior to version 3.4 (
>> >>> https://github.com/apache/spark/pull/23012).
>> >>> The tests will probably fail for now since it produces warnings for
>> >>> using
>> >>> R 3.1.x.
>> >>>
>> >>> 2018년 11월 11일 (일) 오전 3:00, Felix Cheung 
>> >
>> >> felixcheung_m@
>> >
>> >> 님이 작성:
>> >>>
>> >>>> It’s a great point about min R version. From what I see, mostly
>> because
>> >>>> of fixes and packages support, most users of R are fairly up to
>> date?
>> >>>> So
>> >>>> perhaps 3.4 as min version is reasonable esp. for Spark 3.
>> >>>>
>> >>>> Are we getting traction with CRAN sysadmin? It seems like this has
>> been
>> >>>> broken a few times.
>> >>>>
>> >>>>
>> >>>> --
>> >>>> *From:* Liang-Chi Hsieh 
>> >
>> >> viirya@
>> >
>> >> 
>> >>>> *Sent:* Saturday, November 10, 2018 2:32 AM
>> >>>> *To:*
>> >
>> >> dev@.apache
>> >
>> >>>> *Subject:* Re: [discuss] SparkR CRAN feasibility check server
>> problem
>> >>>>
>> >>>>
>> >>>> Yeah, thanks Hyukjin Kwon for bringing this up for discussion.
>> >>>>
>> >>>> I don't know how higher versions of R are widely used across R
>> >>>> community.
>> >>>> If
>> >>>> R version 3.1.x was not very commonly used, I think we can discuss
>> to
>> >>>> upgrade minimum R version in next Spark version.
>> >>>>
>> >>>> If we ended up with not upgrading, we can discuss with CRAN sysadmin
>> to
>> >>>> fix
>> >>>> it by the service side automatically that prevents malformed R
>> packages
>> >>>> info. So we don't need to fix it manually every time.
>> >>>>
>> >>>>
>> >>>>
>> >>>> Hyukjin Kwon wrote
>> >>>> >> Can upgrading R able to fix the issue. Is this perhaps not
>> >>>> necessarily
>> >>>> > malform but some new format for new versions perhaps?
>> >>>> > That's my guess. I am not totally sure about it tho.
>> >>>> >
>> >>>> >> Anyway we should consider upgrading R version if that fixes the
>> >>>> problem.
>> >>>> > Yea, we should. If we should, it should be more them R 3.4. Maybe
>> >>>> it's
>> >>>> > good
>> >>>> > time to start to talk about minimum R version. 3.1.x is too old.
>> It's
>> >>>> > released 4.5 years ago.
>> >>>> > R 3.4.0 is released 1.5 years ago. Considering the timing for
>> Spark
>> >>>> 3.0,
>> >>>> > deprecating lower versions, bumping up R to 3.4 might be
>> reasonable
>> >>>> > option.
>> >

Re: [discuss] SparkR CRAN feasibility check server problem

2018-12-12 Thread Liang-Chi Hsieh


Sorry for the late reply. There is a malformed record on the CRAN package page
again. I've already asked the CRAN admin for help. It should be fixed soon,
based on past experience.

Related discussion will be in
https://issues.apache.org/jira/browse/SPARK-24152. I will post here if I get a
reply from the CRAN admin.

Thanks.


Liang-Chi Hsieh wrote
> Thanks for letting me know! I will look into it and ask CRAN admin for
> help.
> 
> 
> Hyukjin Kwon wrote
>> Looks it's happening again. Liang-Chi, do you mind if I ask it again?
>> 
>> FYI, R 3.4 is officially deprecated as of
>> https://github.com/apache/spark/pull/23012
>> We could upgrade R version to 3.4.x in Jenkins, which deals with the
>> malformed(?) responses after 3.0 release.
>> Then, we could get rid of this problem..!
>> 
>> 2018년 11월 12일 (월) 오후 1:47, Hyukjin Kwon 
> 
>> gurwls223@
> 
>> 님이 작성:
>> 
>>> I made a PR to officially drop R prior to version 3.4 (
>>> https://github.com/apache/spark/pull/23012).
>>> The tests will probably fail for now since it produces warnings for
>>> using
>>> R 3.1.x.
>>>
>>> 2018년 11월 11일 (일) 오전 3:00, Felix Cheung 
> 
>> felixcheung_m@
> 
>> 님이 작성:
>>>
>>>> It’s a great point about min R version. From what I see, mostly because
>>>> of fixes and packages support, most users of R are fairly up to date?
>>>> So
>>>> perhaps 3.4 as min version is reasonable esp. for Spark 3.
>>>>
>>>> Are we getting traction with CRAN sysadmin? It seems like this has been
>>>> broken a few times.
>>>>
>>>>
>>>> --
>>>> *From:* Liang-Chi Hsieh 
> 
>> viirya@
> 
>> 
>>>> *Sent:* Saturday, November 10, 2018 2:32 AM
>>>> *To:* 
> 
>> dev@.apache
> 
>>>> *Subject:* Re: [discuss] SparkR CRAN feasibility check server problem
>>>>
>>>>
>>>> Yeah, thanks Hyukjin Kwon for bringing this up for discussion.
>>>>
>>>> I don't know how higher versions of R are widely used across R
>>>> community.
>>>> If
>>>> R version 3.1.x was not very commonly used, I think we can discuss to
>>>> upgrade minimum R version in next Spark version.
>>>>
>>>> If we ended up with not upgrading, we can discuss with CRAN sysadmin to
>>>> fix
>>>> it by the service side automatically that prevents malformed R packages
>>>> info. So we don't need to fix it manually every time.
>>>>
>>>>
>>>>
>>>> Hyukjin Kwon wrote
>>>> >> Can upgrading R able to fix the issue. Is this perhaps not
>>>> necessarily
>>>> > malform but some new format for new versions perhaps?
>>>> > That's my guess. I am not totally sure about it tho.
>>>> >
>>>> >> Anyway we should consider upgrading R version if that fixes the
>>>> problem.
>>>> > Yea, we should. If we should, it should be more them R 3.4. Maybe
>>>> it's
>>>> > good
>>>> > time to start to talk about minimum R version. 3.1.x is too old. It's
>>>> > released 4.5 years ago.
>>>> > R 3.4.0 is released 1.5 years ago. Considering the timing for Spark
>>>> 3.0,
>>>> > deprecating lower versions, bumping up R to 3.4 might be reasonable
>>>> > option.
>>>> >
>>>> > Adding Shane as well.
>>>> >
>>>> > If we ended up with not upgrading it, I will forward this email to
>>>> CRAN
>>>> > sysadmin to discuss further anyway.
>>>> >
>>>> >
>>>> >
>>>> > 2018년 11월 2일 (금) 오후 12:51, Felix Cheung 
>>>>
>>>> > felixcheung@
>>>>
>>>> > 님이 작성:
>>>> >
>>>> >> Thanks for being this up and much appreciate with keeping on top of
>>>> this
>>>> >> at all times.
>>>> >>
>>>> >> Can upgrading R able to fix the issue. Is this perhaps not
>>>> necessarily
>>>> >> malform but some new format for new versions perhaps? Anyway we
>>>> should
>>>> >> consider upgrading R version if that fixes the problem.
>>>> >>
>>>> >> As an option we could also disable the repo check in Jenkins but I
>>>> can
>>>> >> see
>>>> 

Re: [discuss] SparkR CRAN feasibility check server problem

2018-12-12 Thread Liang-Chi Hsieh


Thanks for letting me know! I will look into it and ask CRAN admin for help.


Hyukjin Kwon wrote
> Looks it's happening again. Liang-Chi, do you mind if I ask it again?
> 
> FYI, R 3.4 is officially deprecated as of
> https://github.com/apache/spark/pull/23012
> We could upgrade R version to 3.4.x in Jenkins, which deals with the
> malformed(?) responses after 3.0 release.
> Then, we could get rid of this problem..!
> 
> 2018년 11월 12일 (월) 오후 1:47, Hyukjin Kwon 

> gurwls223@

> 님이 작성:
> 
>> I made a PR to officially drop R prior to version 3.4 (
>> https://github.com/apache/spark/pull/23012).
>> The tests will probably fail for now since it produces warnings for using
>> R 3.1.x.
>>
>> 2018년 11월 11일 (일) 오전 3:00, Felix Cheung 

> felixcheung_m@

> 님이 작성:
>>
>>> It’s a great point about min R version. From what I see, mostly because
>>> of fixes and packages support, most users of R are fairly up to date? So
>>> perhaps 3.4 as min version is reasonable esp. for Spark 3.
>>>
>>> Are we getting traction with CRAN sysadmin? It seems like this has been
>>> broken a few times.
>>>
>>>
>>> --
>>> *From:* Liang-Chi Hsieh 

> viirya@

> 
>>> *Sent:* Saturday, November 10, 2018 2:32 AM
>>> *To:* 

> dev@.apache

>>> *Subject:* Re: [discuss] SparkR CRAN feasibility check server problem
>>>
>>>
>>> Yeah, thanks Hyukjin Kwon for bringing this up for discussion.
>>>
>>> I don't know how higher versions of R are widely used across R
>>> community.
>>> If
>>> R version 3.1.x was not very commonly used, I think we can discuss to
>>> upgrade minimum R version in next Spark version.
>>>
>>> If we ended up with not upgrading, we can discuss with CRAN sysadmin to
>>> fix
>>> it by the service side automatically that prevents malformed R packages
>>> info. So we don't need to fix it manually every time.
>>>
>>>
>>>
>>> Hyukjin Kwon wrote
>>> >> Can upgrading R able to fix the issue. Is this perhaps not
>>> necessarily
>>> > malform but some new format for new versions perhaps?
>>> > That's my guess. I am not totally sure about it tho.
>>> >
>>> >> Anyway we should consider upgrading R version if that fixes the
>>> problem.
>>> > Yea, we should. If we should, it should be more them R 3.4. Maybe it's
>>> > good
>>> > time to start to talk about minimum R version. 3.1.x is too old. It's
>>> > released 4.5 years ago.
>>> > R 3.4.0 is released 1.5 years ago. Considering the timing for Spark
>>> 3.0,
>>> > deprecating lower versions, bumping up R to 3.4 might be reasonable
>>> > option.
>>> >
>>> > Adding Shane as well.
>>> >
>>> > If we ended up with not upgrading it, I will forward this email to
>>> CRAN
>>> > sysadmin to discuss further anyway.
>>> >
>>> >
>>> >
>>> > 2018년 11월 2일 (금) 오후 12:51, Felix Cheung 
>>>
>>> > felixcheung@
>>>
>>> > 님이 작성:
>>> >
>>> >> Thanks for being this up and much appreciate with keeping on top of
>>> this
>>> >> at all times.
>>> >>
>>> >> Can upgrading R able to fix the issue. Is this perhaps not
>>> necessarily
>>> >> malform but some new format for new versions perhaps? Anyway we
>>> should
>>> >> consider upgrading R version if that fixes the problem.
>>> >>
>>> >> As an option we could also disable the repo check in Jenkins but I
>>> can
>>> >> see
>>> >> that could also be problematic.
>>> >>
>>> >>
>>> >> On Thu, Nov 1, 2018 at 7:35 PM Hyukjin Kwon 
>>>
>>> > gurwls223@
>>>
>>> >  wrote:
>>> >>
>>> >>> Hi all,
>>> >>>
>>> >>> I want to raise the CRAN failure issue because it started to block
>>> Spark
>>> >>> PRs time to time. Since the number
>>> >>> of PRs grows hugely in Spark community, this is critical to not
>>> block
>>> >>> other PRs.
>>> >>>
>>> >>> There has been a problem at CRAN (See
>>> >>> https://github.com/apache/spark/pull/20005 for analysis).
>>> >>> To cut

Re: [discuss] SparkR CRAN feasibility check server problem

2018-11-10 Thread Liang-Chi Hsieh


Yeah, thanks Hyukjin Kwon for bringing this up for discussion.

I don't know how widely higher versions of R are used across the R community.
If R 3.1.x is not very commonly used, I think we can discuss upgrading the
minimum R version in the next Spark version.

If we end up not upgrading, we can discuss with the CRAN sysadmin fixing it
automatically on the service side so that malformed R package info is
prevented. Then we wouldn't need to fix it manually every time.



Hyukjin Kwon wrote
>> Can upgrading R able to fix the issue. Is this perhaps  not necessarily
> malform but some new format for new versions perhaps?
> That's my guess. I am not totally sure about it tho.
> 
>> Anyway we should consider upgrading R version if that fixes the problem.
> Yea, we should. If we should, it should be more them R 3.4. Maybe it's
> good
> time to start to talk about minimum R version. 3.1.x is too old. It's
> released 4.5 years ago.
> R 3.4.0 is released 1.5 years ago. Considering the timing for Spark 3.0,
> deprecating lower versions, bumping up R to 3.4 might be reasonable
> option.
> 
> Adding Shane as well.
> 
> If we ended up with not upgrading it, I will forward this email to CRAN
> sysadmin to discuss further anyway.
> 
> 
> 
> 2018년 11월 2일 (금) 오후 12:51, Felix Cheung 

> felixcheung@

> 님이 작성:
> 
>> Thanks for being this up and much appreciate with keeping on top of this
>> at all times.
>>
>> Can upgrading R able to fix the issue. Is this perhaps  not necessarily
>> malform but some new format for new versions perhaps? Anyway we should
>> consider upgrading R version if that fixes the problem.
>>
>> As an option we could also disable the repo check in Jenkins but I can
>> see
>> that could also be problematic.
>>
>>
>> On Thu, Nov 1, 2018 at 7:35 PM Hyukjin Kwon 

> gurwls223@

>  wrote:
>>
>>> Hi all,
>>>
>>> I want to raise the CRAN failure issue because it started to block Spark
>>> PRs time to time. Since the number
>>> of PRs grows hugely in Spark community, this is critical to not block
>>> other PRs.
>>>
>>> There has been a problem at CRAN (See
>>> https://github.com/apache/spark/pull/20005 for analysis).
>>> To cut it short, the root cause is malformed package info from
>>> https://cran.r-project.org/src/contrib/PACKAGES
>>> from server side, and this had to be fixed by requesting it to CRAN
>>> sysaadmin's help.
>>>
>>> https://issues.apache.org/jira/browse/SPARK-24152 <- newly open. I am
>>> pretty sure it's the same issue
>>> https://issues.apache.org/jira/browse/SPARK-25923 <- reopen/resolved 2
>>> times
>>> https://issues.apache.org/jira/browse/SPARK-22812
>>>
>>> This happened 5 times for roughly about 10 months, causing blocking
>>> almost all PRs in Apache Spark.
>>> Historically, it blocked whole PRs for few days once, and whole Spark
>>> community had to stop working.
>>>
>>> I assume this has been not a super big big issue so far for other
>>> projects or other people because apparently
>>> higher version of R has some logics to handle this malformed documents
>>> (at least I verified R 3.4.0 works fine).
>>>
>>> For our side, Jenkins has low R version (R 3.1.1 if that's not updated
>>> from what I have seen before),
>>> which is unable to parse the malformed server's response.
>>>
>>> So, I want to talk about how we are going to handle this. Possible
>>> solutions are:
>>>
>>> 1. We should start a talk with CRAN sysadmin to permanently prevent this
>>> issue
>>> 2. We upgrade R to 3.4.0 in Jenkins (however we will not be able to test
>>> low R versions)
>>> 3. ...
>>>
>>> If if we fine, I would like to suggest to forward this email to CRAN
>>> sysadmin to discuss further about this.
>>>
>>> Adding Liang-Chi Felix and Shivaram who I already talked about this few
>>> times before.
>>>
>>> Thanks all.
>>>
>>>
>>>
>>>





--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: welcome a new batch of committers

2018-10-03 Thread Liang-Chi Hsieh


Congratulations to all new committers!


rxin wrote
> Hi all,
> 
> The Apache Spark PMC has recently voted to add several new committers to
> the project, for their contributions:
> 
> - Shane Knapp (contributor to infra)
> - Dongjoon Hyun (contributor to ORC support and other parts of Spark)
> - Kazuaki Ishizaki (contributor to Spark SQL)
> - Xingbo Jiang (contributor to Spark Core and SQL)
> - Yinan Li (contributor to Spark on Kubernetes)
> - Takeshi Yamamuro (contributor to Spark SQL)
> 
> Please join me in welcoming them!





--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Select top (100) percent equivalent in spark

2018-09-05 Thread Liang-Chi Hsieh


Thanks for pinging me.

It seems to me we should not make assumptions about the value of the
spark.sql.execution.topKSortFallbackThreshold config. As it stands, once it is
changed, a global sort + limit can produce wrong results. I will make a PR
for this.
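
For reference, the user-facing pattern under discussion is a global sort
followed by a limit, which is normally planned as TakeOrderedAndProject. A
rough sketch, assuming a DataFrame named df with a numeric score column:

    import org.apache.spark.sql.functions.col

    val top100 = df.orderBy(col("score").desc).limit(100)
    top100.explain()  // normally shows TakeOrderedAndProject in the physical plan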


cloud0fan wrote
> + Liang-Chi and Herman,
> 
> I think this is a common requirement to get top N records. For now we
> guarantee it by the `TakeOrderedAndProject` operator. However, this
> operator may not be used if the
> spark.sql.execution.topKSortFallbackThreshold config has a small value.
> 
> Shall we reconsider
> https://github.com/apache/spark/commit/5c27b0d4f8d378bd7889d26fb358f478479b9996
> ? Or we don't expect users to set a small value for
> spark.sql.execution.topKSortFallbackThreshold?
> 
> 
> On Wed, Sep 5, 2018 at 11:24 AM Chetan Khatri 

> chetan.opensource@

> 
> wrote:
> 
>> Thanks
>>
>> On Wed 5 Sep, 2018, 2:15 AM Russell Spitzer, 

> russell.spitzer@

> 
>> wrote:
>>
>>> RDD: Top
>>>
>>> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD@top(num:Int)(implicitord:Ordering[T]):Array[T
>>> ]
>>> Which is pretty much what Sean suggested
>>>
>>> For Dataframes I think doing a order and limit would be equivalent after
>>> optimizations.
>>>
>>> On Tue, Sep 4, 2018 at 2:28 PM Sean Owen 

> srowen@

>  wrote:
>>>
 Sort and take head(n)?

 On Tue, Sep 4, 2018 at 12:07 PM Chetan Khatri <
 

> chetan.opensource@

>> wrote:

> Dear Spark dev, anything equivalent in spark ?
>






--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Revisiting Online serving of Spark models?

2018-06-11 Thread Liang-Chi Hsieh


Hi,

It'd be great if the offline discussion could be shared here. Thanks!



Holden Karau wrote
> We’re by the registration sign going to start walking over at 4:05
> 
> On Wed, Jun 6, 2018 at 2:43 PM Maximiliano Felice <

> maximilianofelice@

>> wrote:
> 
>> Hi!
>>
>> Do we meet at the entrance?
>>
>> See you
>>
>>
>> El mar., 5 de jun. de 2018 3:07 PM, Nick Pentreath <
>> 

> nick.pentreath@

>> escribió:
>>
>>> I will aim to join up at 4pm tomorrow (Wed) too. Look forward to it.
>>>
>>> On Sun, 3 Jun 2018 at 00:24 Holden Karau 

> holden@

>  wrote:
>>>
 On Sat, Jun 2, 2018 at 8:39 PM, Maximiliano Felice <
 

> maximilianofelice@

>> wrote:

> Hi!
>
> We're already in San Francisco waiting for the summit. We even think
> that we spotted @holdenk this afternoon.
>
 Unless you happened to be walking by my garage probably not super
 likely, spent the day working on scooters/motorcycles (my style is a
 little
 less unique in SF :)). Also if you see me feel free to say hi unless I
 look
 like I haven't had my first coffee of the day, love chatting with folks
 IRL
 :)

>
> @chris, we're really interested in the Meetup you're hosting. My team
> will probably join it since the beginning of you have room for us, and
> I'll
> join it later after discussing the topics on this thread. I'll send
> you an
> email regarding this request.
>
> Thanks
>
> El vie., 1 de jun. de 2018 7:26 AM, Saikat Kanjilal <
> 

> sxk1969@

>> escribió:
>
>> @Chris This sounds fantastic, please send summary notes for Seattle
>> folks
>>
>> @Felix I work in downtown Seattle, am wondering if we should a tech
>> meetup around model serving in spark at my work or elsewhere close,
>> thoughts?  I’m actually in the midst of building microservices to
>> manage
>> models and when I say models I mean much more than machine learning
>> models
>> (think OR, process models as well)
>>
>> Regards
>>
>> Sent from my iPhone
>>
>> On May 31, 2018, at 10:32 PM, Chris Fregly 

> chris@

>  wrote:
>>
>> Hey everyone!
>>
>> @Felix:  thanks for putting this together.  i sent some of you a
>> quick
>> calendar event - mostly for me, so i don’t forget!  :)
>>
>> Coincidentally, this is the focus of June 6th's *Advanced Spark and
>> TensorFlow Meetup*
>> https://www.meetup.com/Advanced-Spark-and-TensorFlow-Meetup/events/250924195/;
>> @5:30pm
>> on June 6th (same night) here in SF!
>>
>> Everybody is welcome to come.  Here’s the link to the meetup that
>> includes the signup link:
>> *https://www.meetup.com/Advanced-Spark-and-TensorFlow-Meetup/events/250924195/*
>> https://www.meetup.com/Advanced-Spark-and-TensorFlow-Meetup/events/250924195/;
>>
>> We have an awesome lineup of speakers covered a lot of deep,
>> technical
>> ground.
>>
>> For those who can’t attend in person, we’ll be broadcasting live -
>> and
>> posting the recording afterward.
>>
>> All details are in the meetup link above…
>>
>> @holden/felix/nick/joseph/maximiliano/saikat/leif:  you’re more than
>> welcome to give a talk. I can move things around to make room.
>>
>> @joseph:  I’d personally like an update on the direction of the
>> Databricks proprietary ML Serving export format which is similar to
>> PMML
>> but not a standard in any way.
>>
>> Also, the Databricks ML Serving Runtime is only available to
>> Databricks customers.  This seems in conflict with the community
>> efforts
>> described here.  Can you comment on behalf of Databricks?
>>
>> Look forward to your response, joseph.
>>
>> See you all soon!
>>
>> —
>>
>>
>> *Chris Fregly *Founder @ *PipelineAI* https://pipeline.ai/;
>> (100,000
>> Users)
>> Organizer @ *Advanced Spark and TensorFlow Meetup*
>> https://www.meetup.com/Advanced-Spark-and-TensorFlow-Meetup/;
>> (85,000
>> Global Members)
>>
>>
>>
>> *San Francisco - Chicago - Austin -
>> Washington DC - London - Dusseldorf *
>> *Try our PipelineAI Community Edition with GPUs and TPUs!!
>> http://community.pipeline.ai/*
>>
>>
>> On May 30, 2018, at 9:32 AM, Felix Cheung 

> felixcheung_m@

> 
>> wrote:
>>
>> Hi!
>>
>> Thank you! Let’s meet then
>>
>> June 6 4pm
>>
>> Moscone West Convention Center
>> 800 Howard Street, San Francisco, CA 94103
>> https://maps.google.com/?q=800+Howard+Street,+San+Francisco,+CA+94103entry=gmailsource=g;
>>
>> Ground floor (outside of conference area - should be available for
>> all) - we will meet and decide where to go
>>
>> (Would not send invite because that would be too much noise for dev@)
>>
>> To paraphrase Joseph, we 

Re: Accessing Hive Tables in Spark

2018-04-12 Thread Liang-Chi Hsieh

It seems Spark can't access hive-site.xml in cluster mode. One solution is to
add the config `spark.yarn.dist.files=/path/to/hive-site.xml` to your
spark-defaults.conf. And don't forget to call `enableHiveSupport()` when
building the `SparkSession`.
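
A minimal sketch of both pieces, assuming YARN cluster mode and placeholder
paths, database/table, and class names:

    import org.apache.spark.sql.SparkSession

    // Build the session with Hive support so the Hive metastore catalog is used.
    val spark = SparkSession.builder()
      .appName("hive-tables-example")
      .enableHiveSupport()
      .getOrCreate()

    spark.sql("SELECT count(*) FROM my_db.my_table").show()

    // Ship hive-site.xml with the application, e.g.:
    //   spark-submit --master yarn --deploy-mode cluster \
    //     --files /path/to/hive-site.xml \
    //     --class com.example.HiveTablesExample app.jar
    // (equivalently, set spark.yarn.dist.files=/path/to/hive-site.xml in
    //  spark-defaults.conf as suggested above)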


Tushar Singhal wrote
> Hi Everyone,
> 
> I was accessing Hive Tables in Spark SQL using Scala submitted by
> spark-submit command.
> When I ran in cluster mode then got error like : Table not found
> But the same is working while submitted as client mode.
> Please help me to understand why?
> 
> Distribution : Hortonworks
> 
> Thanks in advance !!





--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Welcome Zhenhua Wang as a Spark committer

2018-04-05 Thread Liang-Chi Hsieh
Congratulations! Zhenhua Wang



--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Welcoming some new committers

2018-03-02 Thread Liang-Chi Hsieh

Congrats to everyone!


Kazuaki Ishizaki wrote
> Congratulations to everyone!
> 
> Kazuaki Ishizaki
> 
> 
> 
> From:   Takeshi Yamamuro 

> linguin.m.s@

> 
> To: Spark dev list 

> dev@.apache

> 
> Date:   2018/03/03 10:45
> Subject:Re: Welcoming some new committers
> 
> 
> 
> Congrats, all!
> 
> On Sat, Mar 3, 2018 at 10:34 AM, Takuya UESHIN 

> ueshin@

>  
> wrote:
> Congratulations and welcome!
> 
> On Sat, Mar 3, 2018 at 10:21 AM, Xingbo Jiang 

> jiangxb1987@

>  
> wrote:
> Congratulations to everyone!
> 
> 2018-03-03 8:51 GMT+08:00 Ilan Filonenko 

> if56@

> :
> Congrats to everyone! :) 
> 
> On Fri, Mar 2, 2018 at 7:34 PM Felix Cheung 

> felixcheung_m@

>  
> wrote:
> Congrats and welcome!
> 
> 
> From: Dongjoon Hyun 

> dongjoon.hyun@

> 
> Sent: Friday, March 2, 2018 4:27:10 PM
> To: Spark dev list
> Subject: Re: Welcoming some new committers 
>  
> Congrats to all!
> 
> Bests,
> Dongjoon.
> 
> On Fri, Mar 2, 2018 at 4:13 PM, Wenchen Fan 

> cloud0fan@

>  wrote:
> Congratulations to everyone and welcome!
> 
> On Sat, Mar 3, 2018 at 7:26 AM, Cody Koeninger 

> cody@

>  wrote:
> Congrats to the new committers, and I appreciate the vote of confidence.
> 
> On Fri, Mar 2, 2018 at 4:41 PM, Matei Zaharia 

> matei.zaharia@

>  
> wrote:
>> Hi everyone,
>>
>> The Spark PMC has recently voted to add several new committers to the 
> project, based on their contributions to Spark 2.3 and other past work:
>>
>> - Anirudh Ramanathan (contributor to Kubernetes support)
>> - Bryan Cutler (contributor to PySpark and Arrow support)
>> - Cody Koeninger (contributor to streaming and Kafka support)
>> - Erik Erlandson (contributor to Kubernetes support)
>> - Matt Cheah (contributor to Kubernetes support and other parts of 
> Spark)
>> - Seth Hendrickson (contributor to MLlib and PySpark)
>>
>> Please join me in welcoming Anirudh, Bryan, Cody, Erik, Matt and Seth as 
> committers!
>>
>> Matei
>> -
>> To unsubscribe e-mail: 

> dev-unsubscribe@.apache

>>
> 
> -
> To unsubscribe e-mail: 

> dev-unsubscribe@.apache

> 
> 
> 
> 
> 
> 
> 
> -- 
> Takuya UESHIN
> Tokyo, Japan
> 
> http://twitter.com/ueshin
> 
> 
> 
> -- 
> ---
> Takeshi Yamamuro





-
Liang-Chi Hsieh | @viirya 
Spark Technology Center 
http://www.spark.tc/ 
--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: How to print plan of Structured Streaming DataFrame

2017-11-20 Thread Liang-Chi Hsieh

wordCounts.explain() -> query.explain()?
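
In other words, call explain() on the StreamingQuery handle returned by
start(), not on the streaming DataFrame itself. A short sketch based on the
snippet quoted below:

    val query = wordCounts.writeStream
      .outputMode("complete")
      .format("console")
      .start()

    // Explaining the started query (after at least one micro-batch has run)
    // prints the plans of the most recent execution.
    query.explain()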


Chang Chen wrote
> Hi Guys
> 
> I modified StructuredNetworkWordCount to see what the executed plan is,
> here are my codes:
> 
> val wordCounts = words.groupBy("value").count()
> 
> // Start running the query that prints the running counts to the console
> val query = wordCounts.writeStream
>   .outputMode("complete")
>   .format("console")
>   .start()
> 
> wordCounts.explain()  // additional codes
> 
> 
> But it failed with “AnalysisException: Queries with streaming sources must
> be executed with writeStream.start()”?
> 
> 
> Thanks
> Chang





-
Liang-Chi Hsieh | @viirya 
Spark Technology Center 
http://www.spark.tc/ 
--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Welcoming Tejas Patil as a Spark committer

2017-09-30 Thread Liang-Chi Hsieh

Congrats!


Matei Zaharia wrote
> Hi all,
> 
> The Spark PMC recently added Tejas Patil as a committer on the
> project. Tejas has been contributing across several areas of Spark for
> a while, focusing especially on scalability issues and SQL. Please
> join me in welcoming Tejas!
> 
> Matei
> 
> -
> To unsubscribe e-mail: 

> dev-unsubscribe@.apache





-
Liang-Chi Hsieh | @viirya 
Spark Technology Center 
http://www.spark.tc/ 
--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE][SPIP] SPARK-21190: Vectorized UDFs in Python

2017-09-11 Thread Liang-Chi Hsieh
he input value for 0-parameter UDFs. The return
>> value should be pandas.Series of the specified type and the length of the
>> returned value should be the same as input value.
>> >
>> > We can define vectorized UDFs as:
>> >
>> >   @pandas_udf(DoubleType())
>> >   def plus(v1, v2):
>> >   return v1 + v2
>> >
>> > or we can define as:
>> >
>> >   plus = pandas_udf(lambda v1, v2: v1 + v2, DoubleType())
>> >
>> > We can use it similar to row-by-row UDFs:
>> >
>> >   df.withColumn('sum', plus(df.v1, df.v2))
>> >
>> > As for 0-parameter UDFs, we can define and use as:
>> >
>> >   @pandas_udf(LongType())
>> >   def f0(size):
>> >   return pd.Series(1).repeat(size)
>> >
>> >   df.select(f0())
>> >
>> >
>> >
>> > The vote will be up for the next 72 hours. Please reply with your vote:
>> >
>> > +1: Yeah, let's go forward and implement the SPIP.
>> > +0: Don't really care.
>> > -1: I don't think this is a good idea because of the following
>> technical
>> reasons.
>> >
>> > Thanks!
>> >
>> > --
>> > Takuya UESHIN
>> > Tokyo, Japan
>> >
>> > http://twitter.com/ueshin
>> >
>> >
>> >
>> > --
>> > Takuya UESHIN
>> > Tokyo, Japan
>> >
>> > http://twitter.com/ueshin
>> >
>> >
>> >
>> >
>> > --
>> > Takuya UESHIN
>> > Tokyo, Japan
>> >
>> > http://twitter.com/ueshin
>> >
>> >
>> >
>> >
>> > --
>> > Sameer Agarwal
>> > Software Engineer | Databricks Inc.
>> > http://cs.berkeley.edu/~sameerag
>> >
>> >
>>
>>
>> -
>> To unsubscribe e-mail: 

> dev-unsubscribe@.apache

>>
>>





-
Liang-Chi Hsieh | @viirya 
Spark Technology Center 
http://www.spark.tc/ 
--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org





Re: Spark 2.2.0 - Odd Hive SQL Warnings

2017-09-03 Thread Liang-Chi Hsieh
1.apply(ILoop.scala:909)
> at
> scala.reflect.internal.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:97)
> at
> scala.tools.nsc.interpreter.ILoop.process(ILoop.scala:909)
> at org.apache.spark.repl.Main$.doMain(Main.scala:70)
> at org.apache.spark.repl.Main$.main(Main.scala:53)
> at org.apache.spark.repl.Main.main(Main.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native
> Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:497)
> at
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:755)
> at
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
> at
> org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
> at
> org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
> at
> org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: at least one
> column must be specified for the table
> at
> org.apache.hadoop.hive.ql.metadata.Table.checkValidity(Table.java:193)
> at
> org.apache.hadoop.hive.ql.metadata.Hive.alterTable(Hive.java:495)
> at
> org.apache.hadoop.hive.ql.metadata.Hive.alterTable(Hive.java:484)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native
> Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:497)
> at
> org.apache.spark.sql.hive.client.Shim_v0_12.alterTable(HiveShim.scala:399)
> at
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$alterTable$1.apply$mcV$sp(HiveClientImpl.scala:461)
> at
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$alterTable$1.apply(HiveClientImpl.scala:457)
> at
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$alterTable$1.apply(HiveClientImpl.scala:457)
> at
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:290)
> at
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:231)
> at
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:230)
> at
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:273)
> at
> org.apache.spark.sql.hive.client.HiveClientImpl.alterTable(HiveClientImpl.scala:457)
> at
> org.apache.spark.sql.hive.client.HiveClient$class.alterTable(HiveClient.scala:87)
> at
> org.apache.spark.sql.hive.client.HiveClientImpl.alterTable(HiveClientImpl.scala:79)
> at
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$alterTableSchema$1.apply$mcV$sp(HiveExternalCatalog.scala:643)
> at
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$alterTableSchema$1.apply(HiveExternalCatalog.scala:627)
> at
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$alterTableSchema$1.apply(HiveExternalCatalog.scala:627)
> at
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
> ... 93 more





-
Liang-Chi Hsieh | @viirya 
Spark Technology Center 
http://www.spark.tc/ 
--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Welcoming Saisai (Jerry) Shao as a committer

2017-08-28 Thread Liang-Chi Hsieh

Congratulations, Jerry!


Weiqing Yang wrote
> Congratulations, Jerry!
> 
> On Mon, Aug 28, 2017 at 6:44 PM, Yanbo Liang 

> ybliang8@

>  wrote:
> 
>> Congratulations, Jerry.
>>
>> On Tue, Aug 29, 2017 at 9:42 AM, John Deng 

> mailthis@

>  wrote:
>>
>>>
>>> Congratulations, Jerry !
>>>
>>> On 8/29/2017 09:28,Matei Zaharia

> matei.zaharia@

> 
>>> 

> matei.zaharia@

>  wrote:
>>>
>>> Hi everyone,
>>>
>>> The PMC recently voted to add Saisai (Jerry) Shao as a committer.
>>> Saisai has been contributing to many areas of the
>>> project for a long time, so it’s great to see him join.
>>> Join me in thanking and congratulating him!
>>>
>>> Matei
>>> -
>>> To unsubscribe e-mail: 

> dev-unsubscribe@.apache

>>>
>>>
>>





-
Liang-Chi Hsieh | @viirya 
Spark Technology Center 
http://www.spark.tc/ 
--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Welcoming-Saisai-Jerry-Shao-as-a-committer-tp22241p22249.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Welcoming Hyukjin Kwon and Sameer Agarwal as committers

2017-08-08 Thread Liang-Chi Hsieh

Congrats to Hyukjin and Sameer!


Xiao Li wrote
> Congrats!
> 
> 
> On Mon, 7 Aug 2017 at 10:21 PM Takuya UESHIN 

> ueshin@

>  wrote:
> 
>> Congrats!
>>
>> On Tue, Aug 8, 2017 at 11:38 AM, Felix Cheung 

> felixcheung_m@

> 
>> wrote:
>>
>>> Congrats!!
>>>
>>> --
>>> *From:* Kevin Kim (Sangwoo) 

> kevin@

> 
>>> *Sent:* Monday, August 7, 2017 7:30:01 PM
>>> *To:* Hyukjin Kwon; dev
>>> *Cc:* Bryan Cutler; Mridul Muralidharan; Matei Zaharia; Holden Karau
>>> *Subject:* Re: Welcoming Hyukjin Kwon and Sameer Agarwal as committers
>>>
>>> Thanks for all of your hard work, Hyukjin and Sameer. Congratulations!!
>>>
>>>
>>> 2017년 8월 8일 (화) 오전 9:44, Hyukjin Kwon 

> gurwls223@

> 님이 작성:
>>>
>>>> Thank you all. Will do my best!
>>>>
>>>> 2017-08-08 8:53 GMT+09:00 Holden Karau 

> holden@

> :
>>>>
>>>>> Congrats!
>>>>>
>>>>> On Mon, Aug 7, 2017 at 3:54 PM Bryan Cutler 

> cutlerb@

>  wrote:
>>>>>
>>>>>> Great work Hyukjin and Sameer!
>>>>>>
>>>>>> On Mon, Aug 7, 2017 at 10:22 AM, Mridul Muralidharan 

> mridul@

> >>>> > wrote:
>>>>>>
>>>>>>> Congratulations Hyukjin, Sameer !
>>>>>>>
>>>>>>> Regards,
>>>>>>> Mridul
>>>>>>>
>>>>>>> On Mon, Aug 7, 2017 at 8:53 AM, Matei Zaharia <
>>>>>>> 

> matei.zaharia@

>> wrote:
>>>>>>> > Hi everyone,
>>>>>>> >
>>>>>>> > The Spark PMC recently voted to add Hyukjin Kwon and Sameer
>>>>>>> Agarwal
>>>>>>> as committers. Join me in congratulating both of them and thanking
>>>>>>> them for
>>>>>>> their contributions to the project!
>>>>>>> >
>>>>>>> > Matei
>>>>>>> >
>>>>>>> -
>>>>>>> > To unsubscribe e-mail: 

> dev-unsubscribe@.apache

>>>>>>> >
>>>>>>>
>>>>>>> -
>>>>>>> To unsubscribe e-mail: 

> dev-unsubscribe@.apache

>>>>>>>
>>>>>>>
>>>>>> --
>>>>> Cell : 425-233-8271 <(425)%20233-8271>
>>>>> Twitter: https://twitter.com/holdenkarau
>>>>>
>>>>
>>>>
>>
>>
>> --
>> Takuya UESHIN
>> Tokyo, Japan
>>
>> http://twitter.com/ueshin
>>





-
Liang-Chi Hsieh | @viirya 
Spark Technology Center 
http://www.spark.tc/ 
--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Welcoming-Hyukjin-Kwon-and-Sameer-Agarwal-as-committers-tp22092p22109.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: the uniqueSource in StreamExecution, where is it be changed please?

2017-08-05 Thread Liang-Chi Hsieh

Not sure if you are asking how the returned value of `getOffset` changes.

I think it depends on how the actual `Source` classes implement it. For
example, in `FileStreamSource`, you can see that `getOffset` advances as new
files are found in the source.

Different sources have different ways to derive their offsets.
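
As a toy model of the idea (plain Scala, not the actual Spark internals): a
file-based source can treat "number of files seen so far" as its offset, so
the value returned by getOffset advances whenever new files are discovered
between micro-batches.

    // Illustrative only: the offset is derived from newly discovered input files.
    class ToyFileSource {
      private var seenFiles: Set[String] = Set.empty

      // None until any data exists, then the count of files seen so far.
      def getOffset(currentListing: Seq[String]): Option[Long] = {
        seenFiles ++= currentListing
        if (seenFiles.isEmpty) None else Some(seenFiles.size.toLong)
      }
    }

    val src = new ToyFileSource
    src.getOffset(Nil)              // None
    src.getOffset(Seq("f1", "f2"))  // Some(2)
    src.getOffset(Seq("f2", "f3"))  // Some(3) -- a new file f3 advanced the offset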



萝卜丝炒饭 wrote
> Hi all,
> 
> 
> These days I am learning the code about the StreamExecution.
> In the method constructNextBatch(about line 365), I found the value of
> latestOffsets changed but I can not find where the s.getOffset of
> uniqueSource is  changed.
> here is the code link:
> 
> 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala
> 
> 
> 
> Would you like help understand it please?
> 
> 
> Thanks.
> Robin





-
Liang-Chi Hsieh | @viirya 
Spark Technology Center 
http://www.spark.tc/ 
--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/the-uniqueSource-in-StreamExecution-where-is-it-be-changed-please-tp22084p22087.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Question, Flaky tests: pyspark.sql.tests.ArrowTests tests in Jenkins worker 5(?)

2017-08-05 Thread Liang-Chi Hsieh

Maybe a possible fix:
https://stackoverflow.com/questions/31495657/development-build-of-pandas-giving-importerror-c-extension-hashtable-not-bui


Hyukjin Kwon wrote
> Hi all,
> 
> I am seeing flaky Python tests time to time and if I am not mistaken
> mostly
> in amp-jenkins-worker-05:
> 
> 
> ==
> ERROR: test_filtered_frame (pyspark.sql.tests.ArrowTests)
> --
> Traceback (most recent call last):
>   File
> "/home/anaconda/envs/py3k/lib/python3.4/site-packages/pandas/__init__.py",
> line 25, in 
> 
> from pandas import hashtable, tslib, lib
> ImportError: cannot import name 'hashtable'
> 
> During handling of the above exception, another exception occurred:
> 
> Traceback (most recent call last):
>   File
> "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/sql/tests.py",
> line 3057, in test_filtered_frame
> pdf = df.filter("i < 0").toPandas()
>   File
> "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/sql/dataframe.py",
> line 1727, in toPandas
> import pandas as pd
>   File
> "/home/anaconda/envs/py3k/lib/python3.4/site-packages/pandas/__init__.py",
> line 31, in 
> 
> "the C extensions first.".format(module))
> ImportError: C extension: 'hashtable' not built. If you want to import
> pandas from the source directory, you may need to run 'python setup.py
> build_ext --inplace --force' to build the C extensions first.
> 
> ==
> ERROR: test_null_conversion (pyspark.sql.tests.ArrowTests)
> --
> ...
> 
> ==
> ERROR: test_pandas_round_trip (pyspark.sql.tests.ArrowTests)
> --
> ...
> 
> ==
> ERROR: test_toPandas_arrow_toggle (pyspark.sql.tests.ArrowTests)
> --
> ...
> 
> 
> I sounds environment problem apparently due to missing hashtable (which I
> believe should have been compiled and importable properly).
> 
> I suspect few possibilities such as a bug somewhere or unsuccessful manual
> build from Pandas source but I am unable to reproduce this and check this.
> So, yes. This is rather my guess.
> 
> 
> Does anyone know if this is an environment problem and how to fix this?





-
Liang-Chi Hsieh | @viirya 
Spark Technology Center 
http://www.spark.tc/ 
--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Question-Flaky-tests-pyspark-sql-tests-ArrowTests-tests-in-Jenkins-worker-5-tp22085p22086.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Speeding up Catalyst engine

2017-07-24 Thread Liang-Chi Hsieh

Hi Maciej,

For backporting https://issues.apache.org/jira/browse/SPARK-20392, you can
see the suggestions from committers on the PR. I don't think we expect it to
be merged into 2.2.



Maciej Bryński wrote
> Hi Everyone,
> I'm trying to speed up my Spark streaming application and I have following
> problem.
> I'm using a lot of joins in my app and full catalyst analysis is triggered
> during every join.
> 
> I found 2 options to speed up.
> 
> 1) spark.sql.selfJoinAutoResolveAmbiguity  option
> But looking at code:
> https://github.com/apache/spark/blob/8cd9cdf17a7a4ad6f2eecd7c4b388ca363c20982/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L918
> 
> Shouldn't lines 925-927 be before 920-922 ?
> 
> 2) https://issues.apache.org/jira/browse/SPARK-20392
> 
> Is it safe to use it on top of 2.2.0 ?
> 
> Regards,
> -- 
> Maciek Bryński





-
Liang-Chi Hsieh | @viirya 
Spark Technology Center 
http://www.spark.tc/ 

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [SQL] Syntax "case when" doesn't be supported in JOIN

2017-07-17 Thread Liang-Chi Hsieh

Evaluation order does matter. A non-deterministic expression can change its
output due to internal state, which may depend on the input order.

MonotonicallyIncreasingID is an example of such a stateful expression: once
you change the row order, the evaluation results are different.
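
For illustration, a minimal spark-shell sketch (the data and the column name
"v" are made up) of how the ids produced by monotonically_increasing_id depend
on the physical row order:

import org.apache.spark.sql.functions.{desc, monotonically_increasing_id}
import spark.implicits._

// Same data, same expression -- only the row order differs between the two
// runs, so the id paired with each value typically differs as well.
val df = Seq("a", "b", "c").toDF("v")
df.coalesce(1).withColumn("id", monotonically_increasing_id()).show()
df.orderBy(desc("v")).coalesce(1).withColumn("id", monotonically_increasing_id()).show()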



Chang Chen wrote
> I see.
> 
> Actually, it isn't about evaluation order which user can't specify. It's
> about how many times we evaluate the non-deterministic expression for the
> same row.
> 
> For example, given the SQL:
> 
> SELECT a.col1
> FROM tbl1 a
> LEFT OUTER JOIN tbl2 b
> ON
>  CASE WHEN a.col2 IS NULL TNEN cast(rand(9)*1000 - 99 as string)
> ELSE a.col2 END
> =
>  CASE WHEN b.col3 IS NULL TNEN cast(rand(9)*1000 - 99 as string)
> ELSE b.col3 END;
> 
> I think if we exactly evaluate   join key one time for each row of a and b
> in the whole pipeline, even if the result isn't deterministic, but the
> computation is correct.
> 
> Thanks
> Chang
> 
> 
> On Mon, Jul 17, 2017 at 10:49 PM, Liang-Chi Hsieh 

> viirya@

>  wrote:
> 
>>
>> IIUC, the evaluation order of rows in Join can be different in different
>> physical operators, e.g., Sort-based and Hash-based.
>>
>> But for non-deterministic expressions, different evaluation orders change
>> results.
>>
>>
>>
>> Chang Chen wrote
>> > I see the issue. I will try https://github.com/apache/spark/pull/18652,
>> I
>> > think
>> >
>> > 1 For Join Operator, the left and right plan can't be
>> non-deterministic.
>> > 2 If  Filter can support non-deterministic, why not join condition?
>> > 3 We can't push down or project non-deterministic expression, since it
>> may
>> > change semantics.
>> >
>> > Actually, the real problem is #2. If the join condition could be
>> > non-deterministic, then we needn't insert project.
>> >
>> > Thanks
>> > Chang
>> >
>> >
>> >
>> >
>> > On Mon, Jul 17, 2017 at 3:59 PM, 蒋星博 
>>
>> > jiangxb1987@
>>
>> >  wrote:
>> >
>> >> FYI there have been a related discussion here:
>> https://github.com/apache/
>> >> spark/pull/15417#discussion_r85295977
>> >>
>> >> 2017-07-17 15:44 GMT+08:00 Chang Chen 
>>
>> > baibaichen@
>>
>> > :
>> >>
>> >>> Hi All
>> >>>
>> >>> I don't understand the difference between the semantics, I found
>> Spark
>> >>> does the same thing for GroupBy non-deterministic. From Map-Reduce
>> point
>> >>> of
>> >>> view, Join is also GroupBy in essence .
>> >>>
>> >>> @Liang Chi Hsieh
>> >>> https://plus.google.com/u/0/103179362592085650735?prsrc=4;
>> >>>
>> >>> in which situation,  semantics  will be changed?
>> >>>
>> >>> Thanks
>> >>> Chang
>> >>>
>> >>> On Mon, Jul 17, 2017 at 3:29 PM, Liang-Chi Hsieh 
>>
>> > viirya@
>>
>> > 
>> >>> wrote:
>> >>>
>> >>>>
>> >>>> Thinking about it more, I think it changes the semantics only under
>> >>>> certain
>> >>>> scenarios.
>> >>>>
>> >>>> For the example SQL query shown in previous discussion, it looks the
>> >>>> same
>> >>>> semantics.
>> >>>>
>> >>>>
>> >>>> Xiao Li wrote
>> >>>> > If the join condition is non-deterministic, pushing it down to the
>> >>>> > underlying project will change the semantics. Thus, we are unable
>> to
>> >>>> do it
>> >>>> > in PullOutNondeterministic. Users can do it manually if they do
>> not
>> >>>> care
>> >>>> > the semantics difference.
>> >>>> >
>> >>>> > Thanks,
>> >>>> >
>> >>>> > Xiao
>> >>>> >
>> >>>> >
>> >>>> >
>> >>>> > 2017-07-16 20:07 GMT-07:00 Chang Chen 
>> >>>>
>> >>>> > baibaichen@
>> >>>>
>> >>>> > :
>> >>>> >
>> >>>> >> It is tedious since we have lots of Hive SQL being migrated to
>> >>>> Spark.

Re: [SQL] Syntax "case when" doesn't be supported in JOIN

2017-07-17 Thread Liang-Chi Hsieh

IIUC, the evaluation order of rows in Join can be different in different
physical operators, e.g., Sort-based and Hash-based.

But for non-deterministic expressions, different evaluation orders change
results.



Chang Chen wrote
> I see the issue. I will try https://github.com/apache/spark/pull/18652, I
> think
> 
> 1 For Join Operator, the left and right plan can't be non-deterministic.
> 2 If  Filter can support non-deterministic, why not join condition?
> 3 We can't push down or project non-deterministic expression, since it may
> change semantics.
> 
> Actually, the real problem is #2. If the join condition could be
> non-deterministic, then we needn't insert project.
> 
> Thanks
> Chang
> 
> 
> 
> 
> On Mon, Jul 17, 2017 at 3:59 PM, 蒋星博 

> jiangxb1987@

>  wrote:
> 
>> FYI there have been a related discussion here: https://github.com/apache/
>> spark/pull/15417#discussion_r85295977
>>
>> 2017-07-17 15:44 GMT+08:00 Chang Chen 

> baibaichen@

> :
>>
>>> Hi All
>>>
>>> I don't understand the difference between the semantics, I found Spark
>>> does the same thing for GroupBy non-deterministic. From Map-Reduce point
>>> of
>>> view, Join is also GroupBy in essence .
>>>
>>> @Liang Chi Hsieh
>>> https://plus.google.com/u/0/103179362592085650735?prsrc=4;
>>>
>>> in which situation,  semantics  will be changed?
>>>
>>> Thanks
>>> Chang
>>>
>>> On Mon, Jul 17, 2017 at 3:29 PM, Liang-Chi Hsieh 

> viirya@

> 
>>> wrote:
>>>
>>>>
>>>> Thinking about it more, I think it changes the semantics only under
>>>> certain
>>>> scenarios.
>>>>
>>>> For the example SQL query shown in previous discussion, it looks the
>>>> same
>>>> semantics.
>>>>
>>>>
>>>> Xiao Li wrote
>>>> > If the join condition is non-deterministic, pushing it down to the
>>>> > underlying project will change the semantics. Thus, we are unable to
>>>> do it
>>>> > in PullOutNondeterministic. Users can do it manually if they do not
>>>> care
>>>> > the semantics difference.
>>>> >
>>>> > Thanks,
>>>> >
>>>> > Xiao
>>>> >
>>>> >
>>>> >
>>>> > 2017-07-16 20:07 GMT-07:00 Chang Chen 
>>>>
>>>> > baibaichen@
>>>>
>>>> > :
>>>> >
>>>> >> It is tedious since we have lots of Hive SQL being migrated to
>>>> Spark.
>>>> >> And
>>>> >> this workaround is equivalent  to insert a Project between Join
>>>> operator
>>>> >> and its child.
>>>> >>
>>>> >> Why not do it in PullOutNondeterministic?
>>>> >>
>>>> >> Thanks
>>>> >> Chang
>>>> >>
>>>> >>
>>>> >> On Fri, Jul 14, 2017 at 5:29 PM, Liang-Chi Hsieh 
>>>>
>>>> > viirya@
>>>>
>>>> >  wrote:
>>>> >>
>>>> >>>
>>>> >>> A possible workaround is to add the rand column into tbl1 with a
>>>> >>> projection
>>>> >>> before the join.
>>>> >>>
>>>> >>> SELECT a.col1
>>>> >>> FROM (
>>>> >>>   SELECT col1,
>>>> >>> CASE
>>>> >>>  WHEN col2 IS NULL
>>>> >>>THEN cast(rand(9)*1000 - 99 as string)
>>>> >>>  ELSE
>>>> >>>col2
>>>> >>> END AS col2
>>>> >>> FROM tbl1) a
>>>> >>> LEFT OUTER JOIN tbl2 b
>>>> >>> ON a.col2 = b.col3;
>>>> >>>
>>>> >>>
>>>> >>>
>>>> >>> Chang Chen wrote
>>>> >>> > Hi Wenchen
>>>> >>> >
>>>> >>> > Yes. We also find this error is caused by Rand. However, this is
>>>> >>> classic
>>>> >>> > way to solve data skew in Hive.  Is there any equivalent way in
>>>> Spark?
>>>> >>> >
>>>> >>> > Thanks
>>>> >>> > Chang
>>>> >>> >

Re: [SQL] Syntax "case when" doesn't be supported in JOIN

2017-07-17 Thread Liang-Chi Hsieh

I created a draft pull request for explaining the cases:
https://github.com/apache/spark/pull/18652



Chang Chen wrote
> Hi All
> 
> I don't understand the difference between the semantics, I found Spark
> does
> the same thing for GroupBy non-deterministic. From Map-Reduce point of
> view, Join is also GroupBy in essence .
> 
> @Liang Chi Hsieh
> https://plus.google.com/u/0/103179362592085650735?prsrc=4;
> 
> in which situation,  semantics  will be changed?
> 
> Thanks
> Chang
> 
> On Mon, Jul 17, 2017 at 3:29 PM, Liang-Chi Hsieh 

> viirya@

>  wrote:
> 
>>
>> Thinking about it more, I think it changes the semantics only under
>> certain
>> scenarios.
>>
>> For the example SQL query shown in previous discussion, it looks the same
>> semantics.
>>
>>
>> Xiao Li wrote
>> > If the join condition is non-deterministic, pushing it down to the
>> > underlying project will change the semantics. Thus, we are unable to do
>> it
>> > in PullOutNondeterministic. Users can do it manually if they do not
>> care
>> > the semantics difference.
>> >
>> > Thanks,
>> >
>> > Xiao
>> >
>> >
>> >
>> > 2017-07-16 20:07 GMT-07:00 Chang Chen 
>>
>> > baibaichen@
>>
>> > :
>> >
>> >> It is tedious since we have lots of Hive SQL being migrated to Spark.
>> >> And
>> >> this workaround is equivalent  to insert a Project between Join
>> operator
>> >> and its child.
>> >>
>> >> Why not do it in PullOutNondeterministic?
>> >>
>> >> Thanks
>> >> Chang
>> >>
>> >>
>> >> On Fri, Jul 14, 2017 at 5:29 PM, Liang-Chi Hsieh 
>>
>> > viirya@
>>
>> >  wrote:
>> >>
>> >>>
>> >>> A possible workaround is to add the rand column into tbl1 with a
>> >>> projection
>> >>> before the join.
>> >>>
>> >>> SELECT a.col1
>> >>> FROM (
>> >>>   SELECT col1,
>> >>> CASE
>> >>>  WHEN col2 IS NULL
>> >>>THEN cast(rand(9)*1000 - 99 as string)
>> >>>  ELSE
>> >>>col2
>> >>> END AS col2
>> >>> FROM tbl1) a
>> >>> LEFT OUTER JOIN tbl2 b
>> >>> ON a.col2 = b.col3;
>> >>>
>> >>>
>> >>>
>> >>> Chang Chen wrote
>> >>> > Hi Wenchen
>> >>> >
>> >>> > Yes. We also find this error is caused by Rand. However, this is
>> >>> classic
>> >>> > way to solve data skew in Hive.  Is there any equivalent way in
>> Spark?
>> >>> >
>> >>> > Thanks
>> >>> > Chang
>> >>> >
>> >>> > On Thu, Jul 13, 2017 at 8:25 PM, Wenchen Fan 
>> >>>
>> >>> > cloud0fan@
>> >>>
>> >>> >  wrote:
>> >>> >
>> >>> >> It’s not about case when, but about rand(). Non-deterministic
>> >>> expressions
>> >>> >> are not allowed in join condition.
>> >>> >>
>> >>> >> > On 13 Jul 2017, at 6:43 PM, wangshuang 
>> >>>
>> >>> > cn_wss@
>> >>>
>> >>> >  wrote:
>> >>> >> >
>> >>> >> > I'm trying to execute hive sql on spark sql (Also on spark
>> >>> >> thriftserver), For
>> >>> >> > optimizing data skew, we use "case when" to handle null.
>> >>> >> > Simple sql as following:
>> >>> >> >
>> >>> >> >
>> >>> >> > SELECT a.col1
>> >>> >> > FROM tbl1 a
>> >>> >> > LEFT OUTER JOIN tbl2 b
>> >>> >> > ON
>> >>> >> > * CASE
>> >>> >> >   WHEN a.col2 IS NULL
>> >>> >> >   TNEN cast(rand(9)*1000 - 99 as
>> >>> string)
>> >>> >> >   ELSE
>> >>> >> >   a.col2 END *
>> >>> >> >   = b.col3;
>> >>> >> >

Re: [SQL] Syntax "case when" doesn't be supported in JOIN

2017-07-17 Thread Liang-Chi Hsieh

Thinking about it more, I think it changes the semantics only under certain
scenarios.

For the example SQL query shown in the previous discussion, the semantics
look the same.


Xiao Li wrote
> If the join condition is non-deterministic, pushing it down to the
> underlying project will change the semantics. Thus, we are unable to do it
> in PullOutNondeterministic. Users can do it manually if they do not care
> the semantics difference.
> 
> Thanks,
> 
> Xiao
> 
> 
> 
> 2017-07-16 20:07 GMT-07:00 Chang Chen 

> baibaichen@

> :
> 
>> It is tedious since we have lots of Hive SQL being migrated to Spark. 
>> And
>> this workaround is equivalent  to insert a Project between Join operator
>> and its child.
>>
>> Why not do it in PullOutNondeterministic?
>>
>> Thanks
>> Chang
>>
>>
>> On Fri, Jul 14, 2017 at 5:29 PM, Liang-Chi Hsieh 

> viirya@

>  wrote:
>>
>>>
>>> A possible workaround is to add the rand column into tbl1 with a
>>> projection
>>> before the join.
>>>
>>> SELECT a.col1
>>> FROM (
>>>   SELECT col1,
>>> CASE
>>>  WHEN col2 IS NULL
>>>THEN cast(rand(9)*1000 - 99 as string)
>>>  ELSE
>>>col2
>>> END AS col2
>>> FROM tbl1) a
>>> LEFT OUTER JOIN tbl2 b
>>> ON a.col2 = b.col3;
>>>
>>>
>>>
>>> Chang Chen wrote
>>> > Hi Wenchen
>>> >
>>> > Yes. We also find this error is caused by Rand. However, this is
>>> classic
>>> > way to solve data skew in Hive.  Is there any equivalent way in Spark?
>>> >
>>> > Thanks
>>> > Chang
>>> >
>>> > On Thu, Jul 13, 2017 at 8:25 PM, Wenchen Fan 
>>>
>>> > cloud0fan@
>>>
>>> >  wrote:
>>> >
>>> >> It’s not about case when, but about rand(). Non-deterministic
>>> expressions
>>> >> are not allowed in join condition.
>>> >>
>>> >> > On 13 Jul 2017, at 6:43 PM, wangshuang 
>>>
>>> > cn_wss@
>>>
>>> >  wrote:
>>> >> >
>>> >> > I'm trying to execute hive sql on spark sql (Also on spark
>>> >> thriftserver), For
>>> >> > optimizing data skew, we use "case when" to handle null.
>>> >> > Simple sql as following:
>>> >> >
>>> >> >
>>> >> > SELECT a.col1
>>> >> > FROM tbl1 a
>>> >> > LEFT OUTER JOIN tbl2 b
>>> >> > ON
>>> >> > * CASE
>>> >> >   WHEN a.col2 IS NULL
>>> >> >   TNEN cast(rand(9)*1000 - 99 as
>>> string)
>>> >> >   ELSE
>>> >> >   a.col2 END *
>>> >> >   = b.col3;
>>> >> >
>>> >> >
>>> >> > But I get the error:
>>> >> >
>>> >> > == Physical Plan ==
>>> >> > *org.apache.spark.sql.AnalysisException: nondeterministic
>>> expressions
>>> >> are
>>> >> > only allowed in
>>> >> > Project, Filter, Aggregate or Window, found:*
>>> >> > (((CASE WHEN (a.`nav_tcdt` IS NULL) THEN CAST(((rand(9) * CAST(1000
>>> AS
>>> >> > DOUBLE)) - CAST(99L AS DOUBLE)) AS STRING) ELSE
>>> a.`nav_tcdt`
>>> >> END
>>> >> =
>>> >> > c.`site_categ_id`) AND (CAST(a.`nav_tcd` AS INT) = 9)) AND
>>> >> (c.`cur_flag`
>>> >> =
>>> >> > 1))
>>> >> > in operator Join LeftOuter, (((CASE WHEN isnull(nav_tcdt#25) THEN
>>> >> > cast(((rand(9) * cast(1000 as double)) - cast(99 as
>>> double))
>>> as
>>> >> > string) ELSE nav_tcdt#25 END = site_categ_id#80) &&
>>> (cast(nav_tcd#26
>>> as
>>> >> int)
>>> >> > = 9)) && (cur_flag#77 = 1))
>>> >> >   ;;
>>> >> > GlobalLimit 10
>>> >> > +- LocalLimit 10
>>> >> >   +- Aggregate [date_id#7, CASE WHEN (cast(city_id#10 as string) IN
>>> >> > (cast(19596 as string),cast(20134 as string),cast(10997 as string))
>>> &&
>>> >> > nav_t

Re: [SQL] Syntax "case when" doesn't be supported in JOIN

2017-07-14 Thread Liang-Chi Hsieh

A possible workaround is to add the rand column into tbl1 with a projection
before the join.

SELECT a.col1
FROM (
  SELECT col1,
CASE
 WHEN col2 IS NULL
   THEN cast(rand(9)*1000 - 99 as string)
 ELSE
   col2
END AS col2
FROM tbl1) a
LEFT OUTER JOIN tbl2 b
ON a.col2 = b.col3;



Chang Chen wrote
> Hi Wenchen
> 
> Yes. We also find this error is caused by Rand. However, this is classic
> way to solve data skew in Hive.  Is there any equivalent way in Spark?
> 
> Thanks
> Chang
> 
> On Thu, Jul 13, 2017 at 8:25 PM, Wenchen Fan 

> cloud0fan@

>  wrote:
> 
>> It’s not about case when, but about rand(). Non-deterministic expressions
>> are not allowed in join condition.
>>
>> > On 13 Jul 2017, at 6:43 PM, wangshuang 

> cn_wss@

>  wrote:
>> >
>> > I'm trying to execute hive sql on spark sql (Also on spark
>> thriftserver), For
>> > optimizing data skew, we use "case when" to handle null.
>> > Simple sql as following:
>> >
>> >
>> > SELECT a.col1
>> > FROM tbl1 a
>> > LEFT OUTER JOIN tbl2 b
>> > ON
>> > * CASE
>> >   WHEN a.col2 IS NULL
>> >   TNEN cast(rand(9)*1000 - 99 as string)
>> >   ELSE
>> >   a.col2 END *
>> >   = b.col3;
>> >
>> >
>> > But I get the error:
>> >
>> > == Physical Plan ==
>> > *org.apache.spark.sql.AnalysisException: nondeterministic expressions
>> are
>> > only allowed in
>> > Project, Filter, Aggregate or Window, found:*
>> > (((CASE WHEN (a.`nav_tcdt` IS NULL) THEN CAST(((rand(9) * CAST(1000 AS
>> > DOUBLE)) - CAST(99L AS DOUBLE)) AS STRING) ELSE a.`nav_tcdt`
>> END
>> =
>> > c.`site_categ_id`) AND (CAST(a.`nav_tcd` AS INT) = 9)) AND
>> (c.`cur_flag`
>> =
>> > 1))
>> > in operator Join LeftOuter, (((CASE WHEN isnull(nav_tcdt#25) THEN
>> > cast(((rand(9) * cast(1000 as double)) - cast(99 as double)) as
>> > string) ELSE nav_tcdt#25 END = site_categ_id#80) && (cast(nav_tcd#26 as
>> int)
>> > = 9)) && (cur_flag#77 = 1))
>> >   ;;
>> > GlobalLimit 10
>> > +- LocalLimit 10
>> >   +- Aggregate [date_id#7, CASE WHEN (cast(city_id#10 as string) IN
>> > (cast(19596 as string),cast(20134 as string),cast(10997 as string)) &&
>> > nav_tcdt#25 RLIKE ^[0-9]+$) THEN city_id#10 ELSE nav_tpa_id#21 END],
>> > [date_id#7]
>> >  +- Filter (date_id#7 = 2017-07-12)
>> > +- Join LeftOuter, (((CASE WHEN isnull(nav_tcdt#25) THEN
>> > cast(((rand(9) * cast(1000 as double)) - cast(99 as double)) as
>> > string) ELSE nav_tcdt#25 END = site_categ_id#80) && (cast(nav_tcd#26 as
>> int)
>> > = 9)) && (cur_flag#77 = 1))
>> >:- SubqueryAlias a
>> >:  +- SubqueryAlias tmp_lifan_trfc_tpa_hive
>> >: +- CatalogRelation `tmp`.`tmp_lifan_trfc_tpa_hive`,
>> > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [date_id#7,
>> chanl_id#8L,
>> > pltfm_id#9, city_id#10, sessn_id#11, gu_id#12,
>> nav_refer_page_type_id#13,
>> > nav_refer_page_value#14, nav_refer_tpa#15, nav_refer_tpa_id#16,
>> > nav_refer_tpc#17, nav_refer_tpi#18, nav_page_type_id#19,
>> nav_page_value#20,
>> > nav_tpa_id#21, nav_tpa#22, nav_tpc#23, nav_tpi#24, nav_tcdt#25,
>> nav_tcd#26,
>> > nav_tci#27, nav_tce#28, detl_refer_page_type_id#29,
>> > detl_refer_page_value#30, ... 33 more fields]
>> >+- SubqueryAlias c
>> >   +- SubqueryAlias dim_site_categ_ext
>> >  +- CatalogRelation `dw`.`dim_site_categ_ext`,
>> > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe,
>> [site_categ_skid#64L,
>> > site_categ_type#65, site_categ_code#66, site_categ_name#67,
>> > site_categ_parnt_skid#68L, site_categ_kywrd#69, leaf_flg#70L,
>> sort_seq#71L,
>> > site_categ_srch_name#72, vsbl_flg#73, delet_flag#74, etl_batch_id#75L,
>> > updt_time#76, cur_flag#77, bkgrnd_categ_skid#78L, bkgrnd_categ_id#79L,
>> > site_categ_id#80, site_categ_parnt_id#81]
>> >
>> > Does spark sql not support syntax "case when" in JOIN?  Additional, my
>> spark
>> > version is 2.2.0.
>> > Any help would be greatly appreciated.
>> >
>> >
>> >
>> >

Re: [VOTE] Apache Spark 2.2.0 (RC6)

2017-07-03 Thread Liang-Chi Hsieh
+1


Sameer Agarwal wrote
> +1
> 
> On Mon, Jul 3, 2017 at 6:08 AM, Wenchen Fan 

> cloud0fan@

>  wrote:
> 
>> +1
>>
>> On 3 Jul 2017, at 8:22 PM, Nick Pentreath 

> nick.pentreath@

> 
>> wrote:
>>
>> +1 (binding)
>>
>> On Mon, 3 Jul 2017 at 11:53 Yanbo Liang 

> ybliang8@

>  wrote:
>>
>>> +1
>>>
>>> On Mon, Jul 3, 2017 at 5:35 AM, Herman van Hövell tot Westerflier <
>>> 

> hvanhovell@

>> wrote:
>>>
>>>> +1
>>>>
>>>> On Sun, Jul 2, 2017 at 11:32 PM, Ricardo Almeida <
>>>> 

> ricardo.almeida@

>> wrote:
>>>>
>>>>> +1 (non-binding)
>>>>>
>>>>> Built and tested with -Phadoop-2.7 -Dhadoop.version=2.7.3 -Pyarn
>>>>> -Phive -Phive-thriftserver -Pscala-2.11 on
>>>>>
>>>>>- macOS 10.12.5 Java 8 (build 1.8.0_131)
>>>>>- Ubuntu 17.04, Java 8 (OpenJDK 1.8.0_111)
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On 1 Jul 2017 02:45, "Michael Armbrust" 

> michael@

>  wrote:
>>>>>
>>>>> Please vote on releasing the following candidate as Apache Spark
>>>>> version 2.2.0. The vote is open until Friday, July 7th, 2017 at 18:00
>>>>> PST and passes if a majority of at least 3 +1 PMC votes are cast.
>>>>>
>>>>> [ ] +1 Release this package as Apache Spark 2.2.0
>>>>> [ ] -1 Do not release this package because ...
>>>>>
>>>>>
>>>>> To learn more about Apache Spark, please see https://spark.apache.org/
>>>>>
>>>>> The tag to be voted on is v2.2.0-rc6
>>>>> https://github.com/apache/spark/tree/v2.2.0-rc6;
>>>>> (a2c7b2133cfee7f
>>>>> a9abfaa2bfbfb637155466783)
>>>>>
>>>>> List of JIRA tickets resolved can be found with this filter
>>>>> https://issues.apache.org/jira/browse/SPARK-20134?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%202.2.0;
>>>>> .
>>>>>
>>>>> The release files, including signatures, digests, etc. can be found
>>>>> at:
>>>>> https://home.apache.org/~pwendell/spark-releases/spark-2.2.0-rc6-bin/
>>>>>
>>>>> Release artifacts are signed with the following key:
>>>>> https://people.apache.org/keys/committer/pwendell.asc
>>>>>
>>>>> The staging repository for this release can be found at:
>>>>> https://repository.apache.org/content/repositories/orgapachespark-1245/
>>>>>
>>>>> The documentation corresponding to this release can be found at:
>>>>> https://people.apache.org/~pwendell/spark-releases/spark-
>>>>> 2.2.0-rc6-docs/
>>>>>
>>>>>
>>>>> *FAQ*
>>>>>
>>>>> *How can I help test this release?*
>>>>>
>>>>> If you are a Spark user, you can help us test this release by taking
>>>>> an
>>>>> existing Spark workload and running on this release candidate, then
>>>>> reporting any regressions.
>>>>>
>>>>> *What should happen to JIRA tickets still targeting 2.2.0?*
>>>>>
>>>>> Committers should look at those and triage. Extremely important bug
>>>>> fixes, documentation, and API tweaks that impact compatibility should
>>>>> be
>>>>> worked on immediately. Everything else please retarget to 2.3.0 or
>>>>> 2.2.1.
>>>>>
>>>>> *But my bug isn't fixed!??!*
>>>>>
>>>>> In order to make timely releases, we will typically not hold the
>>>>> release unless the bug in question is a regression from 2.1.1.
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>
> 
> 
> -- 
> Sameer Agarwal
> Software Engineer | Databricks Inc.
> http://cs.berkeley.edu/~sameerag





-
Liang-Chi Hsieh | @viirya 
Spark Technology Center 
http://www.spark.tc/ 

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] Apache Spark 2.2.0 (RC4)

2017-06-19 Thread Liang-Chi Hsieh

I mean it is not a bug that had been fixed before this feature was added. Of
course, the kryo serializer with 2000+ partitions was working before this
feature.


Koert Kuipers wrote
> If a feature added recently breaks using kryo serializer with 2000+
> partitions then how can it not be a regression? I mean I use kryo with
> more
> than 2000 partitions all the time, and it worked before. Or was I simply
> not hitting this bug because there are other conditions that also need to
> be satisfied besides kryo and 2000+ partitions?
> 
> On Jun 19, 2017 2:20 AM, "Liang-Chi Hsieh" 

> viirya@

>  wrote:
> 
> 
> I think it's not. This is a feature added recently.
> 
> 
> Hyukjin Kwon wrote
>> Is this a regression BTW? I am just curious.
>>
>> On 19 Jun 2017 1:18 pm, "Liang-Chi Hsieh" 
> 
>> viirya@
> 
>>  wrote:
>>
>> -1. When using kyro serializer and partition number is greater than 2000.
>> There seems a NPE issue needed to fix.
>>
>> SPARK-21133 https://issues.apache.org/jira/browse/SPARK-21133;
>>
>>
>>
>>
>> -
>> Liang-Chi Hsieh | @viirya
>> Spark Technology Center
>> http://www.spark.tc/
>> --
>> View this message in context: http://apache-spark-
>> developers-list.1001551.n3.nabble.com/VOTE-Apache-Spark-
>> 2-2-0-RC4-tp21677p21784.html
>> Sent from the Apache Spark Developers List mailing list archive at
>> Nabble.com.
>>
>> -
>> To unsubscribe e-mail:
> 
>> dev-unsubscribe@.apache
> 
> 
> 
> 
> 
> -
> Liang-Chi Hsieh | @viirya
> Spark Technology Center
> http://www.spark.tc/
> --
> View this message in context: http://apache-spark-
> developers-list.1001551.n3.nabble.com/VOTE-Apache-Spark-
> 2-2-0-RC4-tp21677p21786.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
> 
> -
> To unsubscribe e-mail: 

> dev-unsubscribe@.apache





-
Liang-Chi Hsieh | @viirya 
Spark Technology Center 
http://www.spark.tc/ 

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] Apache Spark 2.2.0 (RC4)

2017-06-19 Thread Liang-Chi Hsieh

I think it's not. This is a feature added recently.


Hyukjin Kwon wrote
> Is this a regression BTW? I am just curious.
> 
> On 19 Jun 2017 1:18 pm, "Liang-Chi Hsieh" 

> viirya@

>  wrote:
> 
> -1. When using kyro serializer and partition number is greater than 2000.
> There seems a NPE issue needed to fix.
> 
> SPARK-21133 https://issues.apache.org/jira/browse/SPARK-21133;
> 
> 
> 
> 
> -
> Liang-Chi Hsieh | @viirya
> Spark Technology Center
> http://www.spark.tc/
> --
> View this message in context: http://apache-spark-
> developers-list.1001551.n3.nabble.com/VOTE-Apache-Spark-
> 2-2-0-RC4-tp21677p21784.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
> 
> ---------
> To unsubscribe e-mail: 

> dev-unsubscribe@.apache





-
Liang-Chi Hsieh | @viirya 
Spark Technology Center 
http://www.spark.tc/ 

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] Apache Spark 2.2.0 (RC4)

2017-06-18 Thread Liang-Chi Hsieh
-1. When using the kryo serializer and the partition number is greater than
2000, there seems to be an NPE issue that needs to be fixed.

SPARK-21133 <https://issues.apache.org/jira/browse/SPARK-21133>  




-
Liang-Chi Hsieh | @viirya 
Spark Technology Center 
http://www.spark.tc/ 

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Fwd: [SparkSQL] Project using NamedExpression

2017-03-28 Thread Liang-Chi Hsieh

I am not sure why you want to transform rows in the dataframe using
mapPartitions like that.

If you want to project the rows with some expressions, you can use an API
like selectExpr and let Spark SQL resolve the expressions. To resolve
expressions manually, you need to (at least) deal with a resolver and
transform the expressions recursively with the LogicalPlan.resolve API.
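
As a rough sketch (projectRows is a hypothetical stand-in for the apply method
in the snippet quoted below, reusing its dataFrame and selectExpressions
parameters; queryExecution.toRdd yields RDD[InternalRow], which are UnsafeRows
at runtime), letting the analyzer do the resolution could look like this:

import scala.collection.JavaConverters._
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.catalyst.InternalRow

// Let Spark SQL parse, resolve and project the expressions, then take the
// internal rows of the executed plan instead of building an UnsafeProjection
// against unresolved expressions by hand.
def projectRows(dataFrame: DataFrame,
                selectExpressions: java.util.List[String]): RDD[InternalRow] = {
  val projected = dataFrame.selectExpr(selectExpressions.asScala: _*)
  projected.queryExecution.toRdd
}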


Aviral Agarwal wrote
> Hi ,
> Can you please point me on how to resolve the expression ?
> I was looking into LogicalPlan.Resolve expression() that takes a Partial
> Function but I am not sure how to use that.
> 
> Thanks,
> Aviral Agarwal
> 
> On Mar 24, 2017 09:20, "Liang-Chi Hsieh" 

> viirya@

>  wrote:
> 
> 
> Hi,
> 
> You need to resolve the expressions before passing into creating
> UnsafeProjection.
> 
> 
> 
> Aviral Agarwal wrote
>> Hi guys,
>>
>> I want transform Row using NamedExpression.
>>
>> Below is the code snipped that I am using :
>>
>>
>> def apply(dataFrame: DataFrame, selectExpressions:
>> java.util.List[String]): RDD[UnsafeRow] = {
>>
>> val exprArray = selectExpressions.map(s =>
>>   Column(SqlParser.parseExpression(s)).named
>> )
>>
>> val inputSchema = dataFrame.logicalPlan.output
>>
>> val transformedRDD = dataFrame.mapPartitions(
>>   iter => {
>> val project = UnsafeProjection.create(exprArray,inputSchema)
>> iter.map{
>>   row =>
>> project(InternalRow.fromSeq(row.toSeq))
>> }
>> })
>>
>> transformedRDD
>>   }
>>
>>
>> The problem is that expression becomes unevaluable :
>>
>> Caused by: java.lang.UnsupportedOperationException: Cannot evaluate
>> expression: 'a
>> at org.apache.spark.sql.catalyst.expressions.Unevaluable$class.
>> genCode(Expression.scala:233)
>> at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.g
>> enCode(unresolved.scala:53)
>> at org.apache.spark.sql.catalyst.expressions.Expression$$anonfu
>> n$gen$2.apply(Expression.scala:106)
>> at org.apache.spark.sql.catalyst.expressions.Expression$$anonfu
>> n$gen$2.apply(Expression.scala:102)
>> at scala.Option.getOrElse(Option.scala:120)
>> at org.apache.spark.sql.catalyst.expressions.Expression.gen(Exp
>> ression.scala:102)
>> at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenCon
>> text$$anonfun$generateExpressions$1.apply(CodeGenerator.scala:464)
>> at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenCon
>> text$$anonfun$generateExpressions$1.apply(CodeGenerator.scala:464)
>> at scala.collection.TraversableLike$$anonfun$map$1.apply(
>> TraversableLike.scala:244)
>> at scala.collection.TraversableLike$$anonfun$map$1.apply(
>> TraversableLike.scala:244)
>> at scala.collection.mutable.ResizableArray$class.foreach(Resiza
>> bleArray.scala:59)
>> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.
>> scala:47)
>> at scala.collection.TraversableLike$class.map(TraversableLike.
>> scala:244)
>> at
>> scala.collection.AbstractTraversable.map(Traversable.scala:105)
>> at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenCon
>> text.generateExpressions(CodeGenerator.scala:464)
>> at org.apache.spark.sql.catalyst.expressions.codegen.GenerateUn
>> safeProjection$.createCode(GenerateUnsafeProjection.scala:281)
>> at org.apache.spark.sql.catalyst.expressions.codegen.GenerateUn
>> safeProjection$.create(GenerateUnsafeProjection.scala:324)
>> at org.apache.spark.sql.catalyst.expressions.codegen.GenerateUn
>> safeProjection$.create(GenerateUnsafeProjection.scala:317)
>> at org.apache.spark.sql.catalyst.expressions.codegen.GenerateUn
>> safeProjection$.create(GenerateUnsafeProjection.scala:32)
>> at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenera
>> tor.generate(CodeGenerator.scala:635)
>> at org.apache.spark.sql.catalyst.expressions.UnsafeProjection$.
>> create(Projection.scala:125)
>> at org.apache.spark.sql.catalyst.expressions.UnsafeProjection$.
>> create(Projection.scala:135)
>> at org.apache.spark.sql.ScalaTransform$$anonfun$3.apply(
>> ScalaTransform.scala:31)
>> at org.apache.spark.sql.ScalaTransform$$anonfun$3.apply(
>> ScalaTransform.scala:30)
>> at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$
>> apply$20.apply(RDD.scala:710)
>

Re: Fwd: [SparkSQL] Project using NamedExpression

2017-03-23 Thread Liang-Chi Hsieh

Hi,

You need to resolve the expressions before passing them in to create the
UnsafeProjection.



Aviral Agarwal wrote
> Hi guys,
> 
> I want transform Row using NamedExpression.
> 
> Below is the code snipped that I am using :
> 
> 
> def apply(dataFrame: DataFrame, selectExpressions:
> java.util.List[String]): RDD[UnsafeRow] = {
> 
> val exprArray = selectExpressions.map(s =>
>   Column(SqlParser.parseExpression(s)).named
> )
> 
> val inputSchema = dataFrame.logicalPlan.output
> 
> val transformedRDD = dataFrame.mapPartitions(
>   iter => {
> val project = UnsafeProjection.create(exprArray,inputSchema)
> iter.map{
>   row =>
> project(InternalRow.fromSeq(row.toSeq))
> }
> })
> 
> transformedRDD
>   }
> 
> 
> The problem is that expression becomes unevaluable :
> 
> Caused by: java.lang.UnsupportedOperationException: Cannot evaluate
> expression: 'a
> at org.apache.spark.sql.catalyst.expressions.Unevaluable$class.
> genCode(Expression.scala:233)
> at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.g
> enCode(unresolved.scala:53)
> at org.apache.spark.sql.catalyst.expressions.Expression$$anonfu
> n$gen$2.apply(Expression.scala:106)
> at org.apache.spark.sql.catalyst.expressions.Expression$$anonfu
> n$gen$2.apply(Expression.scala:102)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.sql.catalyst.expressions.Expression.gen(Exp
> ression.scala:102)
> at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenCon
> text$$anonfun$generateExpressions$1.apply(CodeGenerator.scala:464)
> at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenCon
> text$$anonfun$generateExpressions$1.apply(CodeGenerator.scala:464)
> at scala.collection.TraversableLike$$anonfun$map$1.apply(
> TraversableLike.scala:244)
> at scala.collection.TraversableLike$$anonfun$map$1.apply(
> TraversableLike.scala:244)
> at scala.collection.mutable.ResizableArray$class.foreach(Resiza
> bleArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.
> scala:47)
> at scala.collection.TraversableLike$class.map(TraversableLike.
> scala:244)
> at scala.collection.AbstractTraversable.map(Traversable.scala:105)
> at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenCon
> text.generateExpressions(CodeGenerator.scala:464)
> at org.apache.spark.sql.catalyst.expressions.codegen.GenerateUn
> safeProjection$.createCode(GenerateUnsafeProjection.scala:281)
> at org.apache.spark.sql.catalyst.expressions.codegen.GenerateUn
> safeProjection$.create(GenerateUnsafeProjection.scala:324)
> at org.apache.spark.sql.catalyst.expressions.codegen.GenerateUn
> safeProjection$.create(GenerateUnsafeProjection.scala:317)
> at org.apache.spark.sql.catalyst.expressions.codegen.GenerateUn
> safeProjection$.create(GenerateUnsafeProjection.scala:32)
> at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenera
> tor.generate(CodeGenerator.scala:635)
> at org.apache.spark.sql.catalyst.expressions.UnsafeProjection$.
> create(Projection.scala:125)
> at org.apache.spark.sql.catalyst.expressions.UnsafeProjection$.
> create(Projection.scala:135)
> at org.apache.spark.sql.ScalaTransform$$anonfun$3.apply(
> ScalaTransform.scala:31)
> at org.apache.spark.sql.ScalaTransform$$anonfun$3.apply(
> ScalaTransform.scala:30)
> at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$
> apply$20.apply(RDD.scala:710)
> at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$
> apply$20.apply(RDD.scala:710)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsR
> DD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.sca
> la:66)
> at org.apache.spark.scheduler.Task.run(Task.scala:89)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.
> scala:214)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPool
> Executor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoo
> lExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> 
> 
> This might be because the Expression is unresolved.
> 
> Any help would be appreciated.
> 
> Thanks and Regards,
> Aviral Agarwal





-
Liang-Chi Hsieh | @viirya 
Spark Technology Center 
http://www.spark.tc/ 

Re: how to construct parameter for model.transform() from datafile

2017-03-14 Thread Liang-Chi Hsieh

Just found that you can specify the number of features when loading a libsvm
source:

val df = spark.read.option("numFeatures", "100")
  .format("libsvm").load(path)  // path: your libsvm data file
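
For the use case in the thread below, a sketch of aligning the feature
dimension between training and scoring (the 804202 dimension, the training
path and the lineToVector helper come from the quoted messages; adjust the
dimension to the real maximum feature index + 1):

import org.apache.spark.ml.linalg.Vectors

// Pin the feature dimension when loading the training data ...
val numFeatures = 804202
val training = spark.read.option("numFeatures", numFeatures.toString)
  .format("libsvm").load("/tmp/ljhn1829/aplus/training_data3")

// ... and use the same dimension when building the vectors to score, so that
// model.transform sees vectors whose size matches the trained model.
def lineToVector(line: String) = {
  val pairs = line.split(" ").map { s =>
    val Array(i, v) = s.split(":")
    (i.toInt, v.toDouble)
  }
  Vectors.sparse(numFeatures, pairs.toSeq)
}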



Liang-Chi Hsieh wrote
> As the libsvm format can't specify number of features, and looks like
> NaiveBayes doesn't have such parameter, if your training/testing data is
> sparse, the number of features inferred from the data files can be
> inconsistent.
> 
> We may need to fix this.
> 
> Before a fixing going into NaiveBayes, currently a workaround is to align
> the number of features between training and testing data before fitting
> the model.
> 
> jinhong lu wrote
>> After train the mode, I got the result look like this:
>> 
>> 
>>  scala>  predictionResult.show()
>> 
>> +-++++--+
>>  |label|features|   rawPrediction|
>> probability|prediction|
>> 
>> +-++++--+
>>  |  0.0|(144109,[100],[2.0])|[-12.246737725034...|[0.96061209556737...|  
>> 
>> 0.0|
>>  |  0.0|(144109,[100],[2.0])|[-12.246737725034...|[0.96061209556737...|  
>> 
>> 0.0|
>>  |  0.0|(144109,[100],[24...|[-146.81612388602...|[9.73704654529197...|  
>> 
>> 1.0|
>> 
>> And then, I transform() the data by these code:
>> 
>>  import org.apache.spark.ml.linalg.Vectors
>>  import org.apache.spark.ml.linalg.Vector
>>  import scala.collection.mutable
>> 
>> def lineToVector(line:String ):Vector={
>>  val seq = new mutable.Queue[(Int,Double)]
>>  val content = line.split(" ");
>>  for( s <- content){
>>val index = s.split(":")(0).toInt
>>val value = s.split(":")(1).toDouble
>> seq += ((index,value))
>>  }
>>  return Vectors.sparse(144109, seq)
>>}
>> 
>>   val df = sc.sequenceFile[org.apache.hadoop.io.LongWritable,
>> org.apache.hadoop.io.Text]("/data/gamein/gameall_sdc/wh/gameall.db/edt_udid_label_format/ds=20170312/001006_0").map(line=>line._2).map(line
>> =>
>> (line.toString.split("\t")(0),lineToVector(line.toString.split("\t")(1.toDF("udid",
>> "features")
>>   val predictionResult = model.transform(df)
>>   predictionResult.show()
>> 
>> 
>> But I got the error look like this:
>> 
>>  Caused by: java.lang.IllegalArgumentException: requirement failed: You
>> may not write an element to index 804201 because the declared size of
>> your vector is 144109
>>   at scala.Predef$.require(Predef.scala:224)
>>   at org.apache.spark.ml.linalg.Vectors$.sparse(Vectors.scala:219)
>>   at lineToVector(
>> 
>> :55)
>>   at $anonfun$4.apply(
>> 
>> :50)
>>   at $anonfun$4.apply(
>> 
>> :50)
>>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>>   at
>> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(generated.java:84)
>>   at
>> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>>   at
>> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>>   at
>> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
>>   at
>> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
>>   at
>> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>>   at
>> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>> 
>> So I change
>> 
>>  return Vectors.sparse(144109, seq)
>> 
>> to 
>> 
>>  return Vectors.sparse(804202, seq)
>> 
>> Another error occurs:
>> 
>>  Caused by: java.lang.IllegalArgumentException: requirement failed: The
>> columns of A don't match the number of elements of x. A: 144109, x:
>> 804202
>>at scala.Predef$.require(Predef.scala:224)
>>at org.apache.spark.ml.linalg.BLAS$.gemv(BLAS.scala:521)
>>at
>> org.apache.spark.ml.linalg.Matrix$class.multiply(Matrices.scala:110)
>>at org.apache.spark.ml.li

Re: how to construct parameter for model.transform() from datafile

2017-03-14 Thread Liang-Chi Hsieh
9
>> 31607:19
>>  0 19109:7 29705:4 123305:32
>>  0 15309:1 43005:1 108509:1
>>  1 604:1 6401:1 6503:1 15207:4 31607:40
>>  0 1807:19
>>  0 301:14 501:1 1502:14 2507:12 123305:4
>>  0 607:14 19109:460 123305:448
>>  0 5406:14 7209:4 10509:3 19109:6 24706:10 26106:4 31409:1 123305:48
>> 128209:1
>>  1 1606:1 2306:3 3905:19 4408:3 4506:8 8707:3 19109:50 24809:1 26509:2
>> 27709:2 56509:8 122705:62 123305:31 124005:2
>> 
>> And then I train the model by spark:
>> 
>>  import org.apache.spark.ml.classification.NaiveBayes
>>  import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
>>  import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
>>  import org.apache.spark.sql.SparkSession
>> 
>>  val spark =
>> SparkSession.builder.appName("NaiveBayesExample").getOrCreate()
>>  val data =
>> spark.read.format("libsvm").load("/tmp/ljhn1829/aplus/training_data3")
>>  val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3),
>> seed = 1234L)
>>  //val model = new NaiveBayes().fit(trainingData)
>>  val model = new
>> NaiveBayes().setThresholds(Array(10.0,1.0)).fit(trainingData)
>>  val predictions = model.transform(testData)
>>  predictions.show()
>> 
>> 
>> OK, I have got my model by the cole above, but how can I use this model
>> to predict the classfication of other data like these:
>> 
>>  ID1 509:2 5102:4 25909:1 31709:4 121905:19
>>  ID2 800201:1
>>  ID3 116005:4
>>  ID4 800201:1
>>  ID5 19109:1  21708:1 23208:1 49809:1 88609:1
>>  ID6 800201:1
>>  ID7 43505:7 106405:7
>> 
>> I know I can use the transform() method, but how to contrust the
>> parameter for transform() method?
>> 
>> 
>> 
>> 
>> 
>> Thanks,
>> lujinhong
>> 
> 
> Thanks,
> lujinhong
> 
> 
> -
> To unsubscribe e-mail: 

> dev-unsubscribe@.apache





-
Liang-Chi Hsieh | @viirya 
Spark Technology Center 
http://www.spark.tc/ 

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: How to cache SparkPlan.execute for reusing?

2017-03-03 Thread Liang-Chi Hsieh

Not sure what you mean by "its parents have to reuse it by creating new
RDDs".

As SparkPlan.execute returns a new RDD every time, you can't expect the cached
RDD to be reused automatically, even if you reuse the SparkPlan in several
queries.

Btw, are there any existing ways to reuse a SparkPlan?



summerDG wrote
> Thank you very much. The reason why the output is empty is that the query
> involves join. I forgot to mention it in the question. So even I succeed
> in caching the RDD, the following SparkPlans in the query will not reuse
> it.
> If there is a SparkPlan of the query, which has several "parent" nodes,
> its "parents" have to reuse it by creating new RDDs?





-
Liang-Chi Hsieh | @viirya 
Spark Technology Center 
http://www.spark.tc/ 

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: How to cache SparkPlan.execute for reusing?

2017-03-02 Thread Liang-Chi Hsieh

Internally, in each partition of the resulting RDD[InternalRow], you will get
the same mutable UnsafeRow instance while iterating over the rows, so a plain
RDD.cache doesn't work for it: you would get output in which all rows are the
same. Not sure why you get empty output, though.

Dataset.cache() is the API intended for caching SQL query results. Even if you
really cache the RDD[InternalRow] via RDD.cache with the trick of copying each
row (with a significant performance penalty), a new query (plan) will not
automatically reuse the cached RDD, because new RDDs will be created.
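
For reference, a rough sketch of the copy trick mentioned above (plan stands
in for whatever SparkPlan is being executed; this is illustrative rather than
recommended):

// Each partition iterator reuses one mutable UnsafeRow, so every row has to
// be copied before caching; otherwise the cached partition ends up holding
// many references to the same row object.
val cachedRows = plan.execute().map(_.copy()).cache()
cachedRows.count()  // materialize the cache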


summerDG wrote
> We are optimizing the Spark SQL for adaptive execution. So the SparkPlan
> maybe reused for strategy choice. But we find that  once the result of
> SparkPlan.execute, RDD[InternalRow], is cached using RDD.cache, the query
> output is empty.
> 1. How to cache the result of SparkPlan.execute?
> 2. Why is RDD.cache invalid for RDD[InternalRow]?





-----
Liang-Chi Hsieh | @viirya 
Spark Technology Center 
http://www.spark.tc/ 

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Implementation of RNN/LSTM in Spark

2017-02-28 Thread Liang-Chi Hsieh

Yeah, I'd agree with Nick.

To have an implementation of RNN/LSTM in Spark, you would need a comprehensive
abstraction of neural networks that is general enough to represent the
computation (think of Torch, Keras, TensorFlow, MXNet, Caffe, etc.), and you
would have to modify the current computation engine to work with various
computing units such as GPUs. I don't think we will have such a thing in Spark
in the near future.

There are many efforts to integrate Spark with specialized frameworks that
handle this abstraction and parallel computation well. The best approach, I
think, is to look at these efforts and contribute to them if possible.


Nick Pentreath wrote
> The short answer is there is none and highly unlikely to be inside of
> Spark
> MLlib any time in the near future.
> 
> The best bets are to look at other DL libraries - for JVM there is
> Deeplearning4J and BigDL (there are others but these seem to be the most
> comprehensive I have come across) - that run on Spark. Also there are
> various flavours of TensorFlow / Caffe on Spark. And of course the libs
> such as Torch, Keras, Tensorflow, MXNet, Caffe etc. Some of them have Java
> or Scala APIs and some form of Spark integration out there in the
> community
> (in varying states of development).
> 
> Integrations with Spark are a bit patchy currently but include the
> "XOnSpark" flavours mentioned above and TensorFrames (again, there may be
> others).
> 
> On Thu, 23 Feb 2017 at 14:23 n1kt0 

> nikita.balyschew@

>  wrote:
> 
>> Hi,
>> can anyone tell me what the current status about RNNs in Spark is?
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-developers-list.1001551.n3.nabble.com/Implementation-of-RNN-LSTM-in-Spark-tp14866p21060.html
>> Sent from the Apache Spark Developers List mailing list archive at
>> Nabble.com.
>>
>> ---------
>> To unsubscribe e-mail: 

> dev-unsubscribe@.apache

>>
>>





-
Liang-Chi Hsieh | @viirya 
Spark Technology Center 
http://www.spark.tc/ 

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: The driver hangs at DataFrame.rdd in Spark 2.1.0

2017-02-26 Thread Liang-Chi Hsieh





-
Liang-Chi Hsieh | @viirya 
Spark Technology Center 
http://www.spark.tc/ 

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: The driver hangs at DataFrame.rdd in Spark 2.1.0

2017-02-26 Thread Liang-Chi Hsieh
park.sql.catalyst.optimizer.InferFiltersFromConstraints$.apply(Optimizer.scala:605)
> org.apache.spark.sql.catalyst.optimizer.InferFiltersFromConstraints$.apply(Optimizer.scala:604)
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:85)
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:82)
> scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
> scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
> scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:35)
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:82)
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:74)
> scala.collection.immutable.List.foreach(List.scala:381)
> org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:74)
> org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:73)
> => holding
> Monitor(org.apache.spark.sql.execution.QueryExecution@184471747})
> org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:73)
> org.apache.spark.sql.execution.QueryExecution$$anonfun$toString$2.apply(QueryExecution.scala:230)
> org.apache.spark.sql.execution.QueryExecution$$anonfun$toString$2.apply(QueryExecution.scala:230)
> org.apache.spark.sql.execution.QueryExecution.stringOrError(QueryExecution.scala:107)
> org.apache.spark.sql.execution.QueryExecution.toString(QueryExecution.scala:230)
> org.apache.spark.sql.Dataset.rdd$lzycompute(Dataset.scala:2544)
> org.apache.spark.sql.Dataset.rdd(Dataset.scala:2544)...
> 
> 
> The CPU usage of the driver remains 100% like this:
> 
> 
> 
> I didn't find this issue in Spark 1.6.2, what causes this in Spark 2.1.0? 
> 
> 
> Any help is greatly appreciated!
> 
> 
> Best,
> Stan
> 
> 25FEF70B@52873242.80CAAD58 (27K)
> http://apache-spark-developers-list.1001551.n3.nabble.com/attachment/21050/0/25FEF70B%4052873242.80CAAD58;





-
Liang-Chi Hsieh | @viirya 
Spark Technology Center 
http://www.spark.tc/ 

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: A DataFrame cache bug

2017-02-26 Thread Liang-Chi Hsieh


Hi Gen,

I submitted a PR to fix the issue of refreshByPath:
https://github.com/apache/spark/pull/17064

Thanks.



tgbaggio wrote
> Hi, The example that I provided is not very clear. And I add a more clear
> example in jira.
> 
> Thanks
> 
> Cheers
> Gen
> 
> On Wed, Feb 22, 2017 at 3:47 PM, gen tang 

> gen.tang86@

>  wrote:
> 
>> Hi Kazuaki Ishizaki
>>
>> Thanks a lot for your help. It works. However, a more strange bug appears
>> as follows:
>>
>> import org.apache.spark.sql.DataFrame
>> import org.apache.spark.sql.SparkSession
>>
>> def f(path: String, spark: SparkSession): DataFrame = {
>>   val data = spark.read.option("mergeSchema", "true").parquet(path)
>>   println(data.count)
>>   val df = data.filter("id>10")
>>   df.cache
>>   println(df.count)
>>   val df1 = df.filter("id>11")
>>   df1.cache
>>   println(df1.count)
>>   df1
>> }
>>
>> val dir = "/tmp/test"
>> spark.range(100).write.mode("overwrite").parquet(dir)
>> spark.catalog.refreshByPath(dir)
>> f(dir, spark).count // output 88 which is correct
>>
>> spark.range(1000).write.mode("overwrite").parquet(dir)
>> spark.catalog.refreshByPath(dir)
>> f(dir, spark).count // output 88 which is incorrect
>>
>> If we move refreshByPath into f(), just before spark.read. The whole code
>> works fine.
>>
>> Any idea? Thanks a lot
>>
>> Cheers
>> Gen
>>
>>
>> On Wed, Feb 22, 2017 at 2:22 PM, Kazuaki Ishizaki 

> ISHIZAKI@.ibm

> 
>> wrote:
>>
>>> Hi,
>>> Thank you for pointing out the JIRA.
>>> I think that this JIRA suggests you to insert
>>> "spark.catalog.refreshByPath(dir)".
>>>
>>> val dir = "/tmp/test"
>>> spark.range(100).write.mode("overwrite").parquet(dir)
>>> val df = spark.read.parquet(dir)
>>> df.count // output 100 which is correct
>>> f(df).count // output 89 which is correct
>>>
>>> spark.range(1000).write.mode("overwrite").parquet(dir)
>>> spark.catalog.refreshByPath(dir)  // insert a NEW statement
>>> val df1 = spark.read.parquet(dir)
>>> df1.count // output 1000 which is correct, in fact other operation
>>> expect
>>> df1.filter("id>10") return correct result.
>>> f(df1).count // output 89 which is incorrect
>>>
>>> Regards,
>>> Kazuaki Ishizaki
>>>
>>>
>>>
>>> From:gen tang 

> gen.tang86@

> 
>>> To:

> dev@.apache

>>> Date:2017/02/22 15:02
>>> Subject:Re: A DataFrame cache bug
>>> --
>>>
>>>
>>>
>>> Hi All,
>>>
>>> I might find a related issue on jira:
>>>
>>> *https://issues.apache.org/jira/browse/SPARK-15678*
>>> https://issues.apache.org/jira/browse/SPARK-15678;
>>>
>>> This issue is closed, may be we should reopen it.
>>>
>>> Thanks
>>>
>>> Cheers
>>> Gen
>>>
>>>
>>> On Wed, Feb 22, 2017 at 1:57 PM, gen tang <*

> gen.tang86@

> *
>>> 

> gen.tang86@

> > wrote:
>>> Hi All,
>>>
>>> I found a strange bug which is related with reading data from a updated
>>> path and cache operation.
>>> Please consider the following code:
>>>
>>> import org.apache.spark.sql.DataFrame
>>>
>>> def f(data: DataFrame): DataFrame = {
>>>   val df = data.filter("id>10")
>>>   df.cache
>>>   df.count
>>>   df
>>> }
>>>
>>> f(spark.range(100).asInstanceOf[DataFrame]).count // output 89 which is
>>> correct
>>> f(spark.range(1000).asInstanceOf[DataFrame]).count // output 989 which
>>> is correct
>>>
>>> val dir = "/tmp/test"
>>> spark.range(100).write.mode("overwrite").parquet(dir)
>>> val df = spark.read.parquet(dir)
>>> df.count // output 100 which is correct
>>> f(df).count // output 89 which is correct
>>>
>>> spark.range(1000).write.mode("overwrite").parquet(dir)
>>> val df1 = spark.read.parquet(dir)
>>> df1.count // output 1000 which is correct, in fact other operation
>>> expect
>>> df1.filter("id>10") return correct result.
>>> f(df1).count // output 89 which is incorrect
>>>
>>> In fact when we use df1.filter("id>10"), spark will however use old
>>> cached dataFrame
>>>
>>> Any idea? Thanks a lot
>>>
>>> Cheers
>>> Gen
>>>
>>>
>>>
>>





-
Liang-Chi Hsieh | @viirya 
Spark Technology Center 
http://www.spark.tc/ 

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: How to checkpoint and RDD after a stage and before reaching an action?

2017-02-05 Thread Liang-Chi Hsieh

Hi Leo,

The checkpointing of an RDD is performed after a job that uses this RDD has
completed. Since you have only one job, rdd1 will only be checkpointed after
that job has finished.

To checkpoint rdd1 earlier, you can simply materialize it (and maybe cache it
to avoid recomputation), e.g., with rdd1.count, after calling
rdd1.checkpoint().
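
Concretely, a small sketch based on the numbered snippet quoted below (rdd1,
rdd2 and the output path are the hypothetical pieces from that snippet):

rdd1.cache()       // keep rdd1 around so the checkpoint job doesn't recompute the lineage
rdd1.checkpoint()
rdd1.count()       // extra action: runs a job and writes the checkpoint before the join

rdd1.join(rdd2)
  .saveAsObjectFile(outputPath)  // outputPath is a placeholder for the real destination

This splits the work into two jobs, which is essentially what you achieved by
manually saving to HDFS, but it reuses the checkpoint machinery instead.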



leo9r wrote
> Hi,
> 
> I have a 1-action job (saveAsObjectFile at the end), that includes several
> stages. One of those stages is an expensive join "rdd1.join(rdd2)". I
> would like to checkpoint rdd1 right before the join to improve the
> stability of the job. However, what I'm seeing is that the job gets
> executed all the way to the end (saveAsObjectFile) without doing any
> checkpointing, and then re-runing the computation to checkpoint rdd1 (when
> I see the files saved to the checkpoint directory). I have no issue with
> recomputing, given that I'm not caching rdd1, but the fact that the
> checkpointing of rdd1 happens after the join brings no benefit because the
> whole DAG is executed in one piece and the job fails. If that is actually
> what is happening, what would be the best approach to solve this? 
> What I'm currently doing is to manually save rdd1 to HDFS right after the
> filter in line (4) and then load it back right before the join in line
> (11). That prevents the job from failing by splitting it into 2 jobs (ie.
> 2 actions). My expectations was that rdd1.checkpoint in line (8) was going
> to have the same effect but without the hassle of manually saving and
> loading intermediate files.
> 
> ///
> 
> (1)   val rdd1 = loadData1
> (2) .map
> (3) .groupByKey
> (4) .filter
> (5)
> (6)   val rdd2 = loadData2
> (7)
> (8)   rdd1.checkpoint()
> (9)
> (10)  rdd1
> (11).join(rdd2)
> (12).saveAsObjectFile(...)
> 
> /
> 
> Thanks in advance,
> Leo





-
Liang-Chi Hsieh | @viirya 
Spark Technology Center 
http://www.spark.tc/ 

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [SQL][ML] Pipeline performance regression between 1.6 and 2.x

2017-02-02 Thread Liang-Chi Hsieh

Hi Maciej,

FYI, this fix is submitted at https://github.com/apache/spark/pull/16785.


Liang-Chi Hsieh wrote
> Hi Maciej,
> 
> After looking into the details of the time spent on preparing the executed
> plan, the cause of the significant difference between 1.6 and current
> codebase when running the example, is the optimization process to generate
> constraints.
> 
> There seems few operations in generating constraints which are not
> optimized. Plus the fact the query plan grows continuously, the time spent
> on generating constraints increases more and more.
> 
> I am trying to reduce the time cost. Although not as low as 1.6 because we
> can't remove the process of generating constraints, it is significantly
> lower than current codebase (74294 ms -> 2573 ms).
> 
> 385 ms
> 107 ms
> 46 ms
> 58 ms
> 64 ms
> 105 ms
> 86 ms
> 122 ms
> 115 ms
> 114 ms
> 100 ms
> 109 ms
> 169 ms
> 196 ms
> 174 ms
> 212 ms
> 290 ms
> 254 ms
> 318 ms
> 405 ms
> 347 ms
> 443 ms
> 432 ms
> 500 ms
> 544 ms
> 619 ms
> 697 ms
> 683 ms
> 807 ms
> 802 ms
> 960 ms
> 1010 ms
> 1155 ms
> 1251 ms
> 1298 ms
> 1388 ms
> 1503 ms
> 1613 ms
> 2279 ms
> 2349 ms
> 2573 ms
> 
> Liang-Chi Hsieh wrote
>> Hi Maciej,
>> 
>> Thanks for the info you provided.
>> 
>> I tried to run the same example on 1.6 and on the current branch, and
>> recorded the time spent on preparing the executed plan in each case.
>> 
>> Current branch:
>> 
>> 292 ms   
>>   
>> 95 ms 
>> 57 ms
>> 34 ms
>> 128 ms
>> 120 ms
>> 63 ms
>> 106 ms
>> 179 ms
>> 159 ms
>> 235 ms
>> 260 ms
>> 334 ms
>> 464 ms
>> 547 ms 
>> 719 ms
>> 942 ms
>> 1130 ms
>> 1928 ms
>> 1751 ms
>> 2159 ms
>> 2767 ms
>>  ms
>> 4175 ms
>> 5106 ms
>> 6269 ms
>> 7683 ms
>> 9210 ms
>> 10931 ms
>> 13237 ms
>> 15651 ms
>> 19222 ms
>> 23841 ms
>> 26135 ms
>> 31299 ms
>> 38437 ms
>> 47392 ms
>> 51420 ms
>> 60285 ms
>> 69840 ms
>> 74294 ms
>> 
>> 1.6:
>> 
>> 3 ms
>> 4 ms
>> 10 ms
>> 4 ms
>> 17 ms
>> 8 ms
>> 12 ms
>> 21 ms
>> 15 ms
>> 15 ms
>> 19 ms
>> 23 ms
>> 28 ms
>> 28 ms
>> 58 ms
>> 39 ms
>> 43 ms
>> 61 ms
>> 56 ms
>> 60 ms
>> 81 ms
>> 73 ms
>> 100 ms
>> 91 ms
>> 96 ms
>> 116 ms
>> 111 ms
>> 140 ms
>> 127 ms
>> 142 ms
>> 148 ms
>> 165 ms
>> 171 ms
>> 198 ms
>> 200 ms
>> 233 ms
>> 237 ms
>> 253 ms
>> 256 ms
>> 271 ms
>> 292 ms
>> 452 ms
>> 
>> Although both take more time after each iteration due to the growing
>> query plan, it is obvious that the current branch takes much more time
>> than the 1.6 branch. The optimizer and query planning in the current
>> branch are much more complicated than in 1.6.
>> zero323 wrote
>>> Hi Liang-Chi,
>>> 
>>> Thank you for your answer and PR, but I think I wasn't specific enough.
>>> In hindsight I should have illustrated this better. What really
>>> troubles me here is a pattern of growing delays. Difference between
>>> 1.6.3 (roughly 20s runtime since the first job):
>>> 
>>> 
>>> 1.6 timeline
>>> 
>>> vs 2.1.0 (45 minutes or so in a bad case):
>>> 
>>> 2.1.0 timeline
>>> 
>>> The code is just an example and it is intentionally dumb. You can easily
>>> mask this with caching or by using significantly larger data sets. So I
>>> guess the question I am really interested in is: what changed between
>>> 1.6.3 and 2.x (this is more or less consistent across 2.0, 2.1 and
>>> current master) to cause this, and, more importantly, is it a feature or
>>> is it a bug? I admit, I chose a lazy path here and didn't spend much time
>>> (yet) trying to dig deeper.
>>> 
>>> I can see a bit higher memory usage, a bit more intensive GC activity,
>>> but nothing I would really blame for this behavior, and duration of
>>> individual jobs is comparable with some favor of 2.x. Neither
>>> StringIndexer nor OneHotEncoder changed much in 2.x. They used RDDs for
>>> fitting in 1.6 and, as far as I can tell, they still do tha

Re: [SQL][ML] Pipeline performance regression between 1.6 and 2.x

2017-02-02 Thread Liang-Chi Hsieh

Hi Maciej,

After looking into the details of the time spent on preparing the executed
plan, the cause of the significant difference between 1.6 and the current
codebase when running the example is the optimization process that generates
constraints.

There seem to be a few operations in constraint generation that are not
optimized. Combined with the fact that the query plan grows continuously,
the time spent on generating constraints increases more and more.

I am trying to reduce the time cost. Although it cannot be as low as 1.6,
because we can't remove constraint generation entirely, it is significantly
lower than in the current codebase (74294 ms -> 2573 ms).

385 ms
107 ms
46 ms
58 ms
64 ms
105 ms
86 ms
122 ms
115 ms
114 ms
100 ms
109 ms
169 ms
196 ms
174 ms
212 ms
290 ms
254 ms
318 ms
405 ms
347 ms
443 ms
432 ms
500 ms
544 ms
619 ms
697 ms
683 ms
807 ms
802 ms
960 ms
1010 ms
1155 ms
1251 ms
1298 ms
1388 ms
1503 ms
1613 ms
2279 ms
2349 ms
2573 ms



Liang-Chi Hsieh wrote
> Hi Maciej,
> 
> Thanks for the info you provided.
> 
> I tried to run the same example on 1.6 and on the current branch, and
> recorded the time spent on preparing the executed plan in each case.
> 
> Current branch:
> 
> 292 ms
>  
> 95 ms 
> 57 ms
> 34 ms
> 128 ms
> 120 ms
> 63 ms
> 106 ms
> 179 ms
> 159 ms
> 235 ms
> 260 ms
> 334 ms
> 464 ms
> 547 ms 
> 719 ms
> 942 ms
> 1130 ms
> 1928 ms
> 1751 ms
> 2159 ms
> 2767 ms
>  ms
> 4175 ms
> 5106 ms
> 6269 ms
> 7683 ms
> 9210 ms
> 10931 ms
> 13237 ms
> 15651 ms
> 19222 ms
> 23841 ms
> 26135 ms
> 31299 ms
> 38437 ms
> 47392 ms
> 51420 ms
> 60285 ms
> 69840 ms
> 74294 ms
> 
> 1.6:
> 
> 3 ms
> 4 ms
> 10 ms
> 4 ms
> 17 ms
> 8 ms
> 12 ms
> 21 ms
> 15 ms
> 15 ms
> 19 ms
> 23 ms
> 28 ms
> 28 ms
> 58 ms
> 39 ms
> 43 ms
> 61 ms
> 56 ms
> 60 ms
> 81 ms
> 73 ms
> 100 ms
> 91 ms
> 96 ms
> 116 ms
> 111 ms
> 140 ms
> 127 ms
> 142 ms
> 148 ms
> 165 ms
> 171 ms
> 198 ms
> 200 ms
> 233 ms
> 237 ms
> 253 ms
> 256 ms
> 271 ms
> 292 ms
> 452 ms
> 
> Although both take more time after each iteration due to the growing
> query plan, it is obvious that the current branch takes much more time
> than the 1.6 branch. The optimizer and query planning in the current
> branch are much more complicated than in 1.6.
> zero323 wrote
>> Hi Liang-Chi,
>> 
>> Thank you for your answer and PR, but I think I wasn't specific enough.
>> In hindsight I should have illustrated this better. What really
>> troubles me here is a pattern of growing delays. Difference between
>> 1.6.3 (roughly 20s runtime since the first job):
>> 
>> 
>> 1.6 timeline
>> 
>> vs 2.1.0 (45 minutes or so in a bad case):
>> 
>> 2.1.0 timeline
>> 
>> The code is just an example and it is intentionally dumb. You can easily
>> mask this with caching or by using significantly larger data sets. So I
>> guess the question I am really interested in is: what changed between
>> 1.6.3 and 2.x (this is more or less consistent across 2.0, 2.1 and
>> current master) to cause this, and, more importantly, is it a feature or
>> is it a bug? I admit, I chose a lazy path here and didn't spend much time
>> (yet) trying to dig deeper.
>> 
>> I can see a bit higher memory usage, a bit more intensive GC activity,
>> but nothing I would really blame for this behavior, and duration of
>> individual jobs is comparable with some favor of 2.x. Neither
>> StringIndexer nor OneHotEncoder changed much in 2.x. They used RDDs for
>> fitting in 1.6 and, as far as I can tell, they still do that in 2.x. And
>> the problem doesn't look that related to the data processing part in the
>> first place.
>> 
>> 
>> On 02/02/2017 07:22 AM, Liang-Chi Hsieh wrote:
>>> Hi Maciej,
>>>
>>> FYI, the PR is at https://github.com/apache/spark/pull/16775.
>>>
>>>
>>> Liang-Chi Hsieh wrote
>>>> Hi Maciej,
>>>>
>>>> Basically the fitting algorithm in Pipeline is an iterative operation.
>>>> Running iterative algorithm on Dataset would have RDD lineages and
>>>> query
>>>> plans that grow fast. Without cache and checkpoint, it gets slower when
>>>> the iteration number increases.
>>>>
>>>> I think it is why when you run a Pipeline with long stages, it gets
>>>> much
>>>> longer time to finish. As I think i

Re: [SQL][ML] Pipeline performance regression between 1.6 and 2.x

2017-02-02 Thread Liang-Chi Hsieh

Hi Maciej,

Thanks for the info you provided.

I tried to run the same example on 1.6 and on the current branch, and
recorded the time spent on preparing the executed plan in each case.

Current branch:

292 ms  
   
95 ms 
57 ms
34 ms
128 ms
120 ms
63 ms
106 ms
179 ms
159 ms
235 ms
260 ms
334 ms
464 ms
547 ms 
719 ms
942 ms
1130 ms
1928 ms
1751 ms
2159 ms
2767 ms
 ms
4175 ms
5106 ms
6269 ms
7683 ms
9210 ms
10931 ms
13237 ms
15651 ms
19222 ms
23841 ms
26135 ms
31299 ms
38437 ms
47392 ms
51420 ms
60285 ms
69840 ms
74294 ms

1.6:

3 ms
4 ms
10 ms
4 ms
17 ms
8 ms
12 ms
21 ms
15 ms
15 ms
19 ms
23 ms
28 ms
28 ms
58 ms
39 ms
43 ms
61 ms
56 ms
60 ms
81 ms
73 ms
100 ms
91 ms
96 ms
116 ms
111 ms
140 ms
127 ms
142 ms
148 ms
165 ms
171 ms
198 ms
200 ms
233 ms
237 ms
253 ms
256 ms
271 ms
292 ms
452 ms

Although both take more time after each iteration due to the growing query
plan, it is obvious that the current branch takes much more time than the 1.6
branch. The optimizer and query planning in the current branch are much more
complicated than in 1.6.


zero323 wrote
> Hi Liang-Chi,
> 
> Thank you for your answer and PR, but I think I wasn't specific enough.
> In hindsight I should have illustrated this better. What really
> troubles me here is a pattern of growing delays. Difference between
> 1.6.3 (roughly 20s runtime since the first job):
> 
> 
> 1.6 timeline
> 
> vs 2.1.0 (45 minutes or so in a bad case):
> 
> 2.1.0 timeline
> 
> The code is just an example and it is intentionally dumb. You can easily
> mask this with caching or by using significantly larger data sets. So I
> guess the question I am really interested in is: what changed between
> 1.6.3 and 2.x (this is more or less consistent across 2.0, 2.1 and
> current master) to cause this, and, more importantly, is it a feature or
> is it a bug? I admit, I chose a lazy path here and didn't spend much time
> (yet) trying to dig deeper.
> 
> I can see a bit higher memory usage, a bit more intensive GC activity,
> but nothing I would really blame for this behavior, and duration of
> individual jobs is comparable with some favor of 2.x. Neither
> StringIndexer nor OneHotEncoder changed much in 2.x. They used RDDs for
> fitting in 1.6 and, as far as I can tell, they still do that in 2.x. And
> the problem doesn't look that related to the data processing part in the
> first place.
> 
> 
> On 02/02/2017 07:22 AM, Liang-Chi Hsieh wrote:
>> Hi Maciej,
>>
>> FYI, the PR is at https://github.com/apache/spark/pull/16775.
>>
>>
>> Liang-Chi Hsieh wrote
>>> Hi Maciej,
>>>
>>> Basically, fitting a Pipeline is an iterative operation. Running an
>>> iterative algorithm on a Dataset builds up RDD lineages and query plans
>>> that grow quickly. Without cache and checkpoint, it gets slower as the
>>> iteration number increases.
>>>
>>> I think that is why a Pipeline with a long chain of stages takes much
>>> longer to finish. Since it is not uncommon to have a long chain of stages
>>> in a Pipeline, we should improve this. I will submit a PR for this.
>>> zero323 wrote
>>>> Hi everyone,
>>>>
>>>> While experimenting with ML pipelines I experience a significant
>>>> performance regression when switching from 1.6.x to 2.x.
>>>>
>>>> import org.apache.spark.ml.{Pipeline, PipelineStage}
>>>> import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer,
>>>> VectorAssembler}
>>>>
>>>> val df = (1 to 40).foldLeft(Seq((1, "foo"), (2, "bar"), (3,
>>>> "baz")).toDF("id", "x0"))((df, i) => df.withColumn(s"x$i", $"x0"))
>>>> val indexers = df.columns.tail.map(c => new StringIndexer()
>>>>   .setInputCol(c)
>>>>   .setOutputCol(s"${c}_indexed")
>>>>   .setHandleInvalid("skip"))
>>>>
>>>> val encoders = indexers.map(indexer => new OneHotEncoder()
>>>>   .setInputCol(indexer.getOutputCol)
>>>>   .setOutputCol(s"${indexer.getOutputCol}_encoded")
>>>>   .setDropLast(true))
>>>>
>>>> val assembler = new
>>>> VectorAssembler().setInputCols(encoders.map(_.getOutputCol))
>>>> val stages: Array[PipelineStage] = indexers ++ encoders :+ assembler
>>>>
>>>> new Pipeline().setStages(stages).fit(df).transform(df).show
>>>>
>>>> Task execution time is comparable and executors

Re: [SQL][ML] Pipeline performance regression between 1.6 and 2.x

2017-02-02 Thread Liang-Chi Hsieh

Thanks Nick for pointing it out. I totally agree.

In the 1.6 codebase, Pipeline actually uses DataFrame instead of Dataset,
because the two were not merged yet in 1.6.

In 2.x, StringIndexer and OneHotEncoder call ".rdd" on the Dataset, which
deserializes the rows.

In 1.6, as they use DataFrame, there is no extra deserialization cost.

I think this could cause some regression. Since Maciej didn't show how much
performance regression was observed, I can't judge whether this is the root
cause. But this is my initial idea after checking the Pipeline code in 1.6
and in the current branch.
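
For what it's worth, here is a small illustration of that cost (df and the
"label" column are made-up names, not the actual StringIndexer code): going
through .rdd converts the data back into external Row objects, while a pure
DataFrame aggregation avoids that per-row deserialization step.

// Deserializes every row into an external Row object before counting.
val countsViaRdd = df.select("label").rdd
  .map(_.getString(0))
  .countByValue()

// Stays on the DataFrame path; no per-row deserialization into Row objects.
val countsViaDf = df.groupBy("label").count()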



Nick Pentreath wrote
> Hi Maciej
> 
> If you're seeing a regression from 1.6 -> 2.0 *both using DataFrames *then
> that seems to point to some other underlying issue as the root cause.
> 
> Even though adding checkpointing should help, we should understand why
> it's
> different between 1.6 and 2.0?
> 
> 
> On Thu, 2 Feb 2017 at 08:22 Liang-Chi Hsieh 

> viirya@

>  wrote:
> 
>>
>> Hi Maciej,
>>
>> FYI, the PR is at https://github.com/apache/spark/pull/16775.
>>
>>
>> Liang-Chi Hsieh wrote
>> > Hi Maciej,
>> >
>> > Basically, fitting a Pipeline is an iterative operation. Running an
>> > iterative algorithm on a Dataset builds up RDD lineages and query plans
>> > that grow quickly. Without cache and checkpoint, it gets slower as the
>> > iteration number increases.
>> >
>> > I think that is why a Pipeline with a long chain of stages takes much
>> > longer to finish. Since it is not uncommon to have a long chain of
>> > stages in a Pipeline, we should improve this. I will submit a PR for
>> > this.
>> > zero323 wrote
>> >> Hi everyone,
>> >>
>> >> While experimenting with ML pipelines I experience a significant
>> >> performance regression when switching from 1.6.x to 2.x.
>> >>
>> >> import org.apache.spark.ml.{Pipeline, PipelineStage}
>> >> import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer,
>> >> VectorAssembler}
>> >>
>> >> val df = (1 to 40).foldLeft(Seq((1, "foo"), (2, "bar"), (3,
>> >> "baz")).toDF("id", "x0"))((df, i) => df.withColumn(s"x$i", $"x0"))
>> >> val indexers = df.columns.tail.map(c => new StringIndexer()
>> >>   .setInputCol(c)
>> >>   .setOutputCol(s"${c}_indexed")
>> >>   .setHandleInvalid("skip"))
>> >>
>> >> val encoders = indexers.map(indexer => new OneHotEncoder()
>> >>   .setInputCol(indexer.getOutputCol)
>> >>   .setOutputCol(s"${indexer.getOutputCol}_encoded")
>> >>   .setDropLast(true))
>> >>
>> >> val assembler = new
>> >> VectorAssembler().setInputCols(encoders.map(_.getOutputCol))
>> >> val stages: Array[PipelineStage] = indexers ++ encoders :+ assembler
>> >>
>> >> new Pipeline().setStages(stages).fit(df).transform(df).show
>> >>
>> >> Task execution time is comparable and executors are most of the time
>> >> idle so it looks like it is a problem with the optimizer. Is it a
>> known
>> >> issue? Are there any changes I've missed, that could lead to this
>> >> behavior?
>> >>
>> >> --
>> >> Best,
>> >> Maciej
>> >>
>> >>
>> >> -
>> >> To unsubscribe e-mail:
>>
>> >> dev-unsubscribe@.apache
>>
>>
>>
>>
>>
>> -
>> Liang-Chi Hsieh | @viirya
>> Spark Technology Center
>> http://www.spark.tc/
>> --
>> View this message in context:
>> http://apache-spark-developers-list.1001551.n3.nabble.com/SQL-ML-Pipeline-performance-regression-between-1-6-and-2-x-tp20803p20822.html
>> Sent from the Apache Spark Developers List mailing list archive at
>> Nabble.com.
>>
>> -
>> To unsubscribe e-mail: 

> dev-unsubscribe@.apache

>>
>>





-
Liang-Chi Hsieh | @viirya 
Spark Technology Center 
http://www.spark.tc/ 
--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/SQL-ML-Pipeline-performance-regression-between-1-6-and-2-x-tp20803p20825.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: approx_percentile computation

2017-02-01 Thread Liang-Chi Hsieh

Hi,

You don't need to run approxPercentile against a list. Since it is an
aggregation function, you can simply run it in the same aggregation as
collect_list:

// Just to illustrate the idea; ApproximatePercentile and Literal are
// internal Catalyst classes, and v1 here stands for the column's expression.
val approxPercentile = new ApproximatePercentile(v1, Literal(percentage))
val agg_approx_percentile = Column(approxPercentile.toAggregateExpression())

df.groupBy(k1, k2, k3).agg(collect_list(v1), agg_approx_percentile)
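
If you prefer to stay on the public API, the same thing can be expressed with
the percentile_approx SQL function (a sketch, assuming the column names from
your example and a Spark version where percentile_approx is available as a
built-in SQL function):

import org.apache.spark.sql.functions.{collect_list, expr}

val result = df.groupBy("k1", "k2", "k3")
  .agg(
    collect_list("v1").as("v1_list"),
    expr("percentile_approx(v1, array(0.1, 0.5, 0.9))").as("v1_quantiles"))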



Rishi wrote
> I need to compute quantiles in Spark on a numeric field after a group-by
> operation. Is there a way to apply approxPercentile on an aggregated list
> instead of a column?
> 
> E.g. The Dataframe looks like
> 
> k1 | k2 | k3 | v1
> 
> a1 | b1 | c1 | 879
> 
> a2 | b2 | c2 | 769
> 
> a1 | b1 | c1 | 129
> 
> a2 | b2 | c2 | 323
> I need to first run groupBy (k1, k2, k3) and collect_list(v1), and then
> compute quantiles [10th, 50th...] on list of v1's





-
Liang-Chi Hsieh | @viirya 
Spark Technology Center 
http://www.spark.tc/ 
--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/approx-percentile-computation-tp20820p20823.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [SQL][ML] Pipeline performance regression between 1.6 and 2.x

2017-02-01 Thread Liang-Chi Hsieh

Hi Maciej,

FYI, the PR is at https://github.com/apache/spark/pull/16775.


Liang-Chi Hsieh wrote
> Hi Maciej,
> 
> Basically, fitting a Pipeline is an iterative operation. Running an
> iterative algorithm on a Dataset builds up RDD lineages and query plans
> that grow quickly. Without cache and checkpoint, it gets slower as the
> iteration number increases.
> 
> I think that is why a Pipeline with a long chain of stages takes much
> longer to finish. Since it is not uncommon to have a long chain of stages
> in a Pipeline, we should improve this. I will submit a PR for this.
> zero323 wrote
>> Hi everyone,
>> 
>> While experimenting with ML pipelines I experience a significant
>> performance regression when switching from 1.6.x to 2.x.
>> 
>> import org.apache.spark.ml.{Pipeline, PipelineStage}
>> import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer,
>> VectorAssembler}
>> 
>> val df = (1 to 40).foldLeft(Seq((1, "foo"), (2, "bar"), (3,
>> "baz")).toDF("id", "x0"))((df, i) => df.withColumn(s"x$i", $"x0"))
>> val indexers = df.columns.tail.map(c => new StringIndexer()
>>   .setInputCol(c)
>>   .setOutputCol(s"${c}_indexed")
>>   .setHandleInvalid("skip"))
>> 
>> val encoders = indexers.map(indexer => new OneHotEncoder()
>>   .setInputCol(indexer.getOutputCol)
>>   .setOutputCol(s"${indexer.getOutputCol}_encoded")
>>   .setDropLast(true))
>> 
>> val assembler = new
>> VectorAssembler().setInputCols(encoders.map(_.getOutputCol))
>> val stages: Array[PipelineStage] = indexers ++ encoders :+ assembler
>> 
>> new Pipeline().setStages(stages).fit(df).transform(df).show
>> 
>> Task execution time is comparable and executors are most of the time
>> idle so it looks like it is a problem with the optimizer. Is it a known
>> issue? Are there any changes I've missed, that could lead to this
>> behavior?
>> 
>> -- 
>> Best,
>> Maciej
>> 
>> 
>> -
>> To unsubscribe e-mail: 

>> dev-unsubscribe@.apache





-
Liang-Chi Hsieh | @viirya 
Spark Technology Center 
http://www.spark.tc/ 
--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/SQL-ML-Pipeline-performance-regression-between-1-6-and-2-x-tp20803p20822.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [SQL][ML] Pipeline performance regression between 1.6 and 2.x

2017-02-01 Thread Liang-Chi Hsieh

Hi Maciej,

Basically, fitting a Pipeline is an iterative operation. Running an iterative
algorithm on a Dataset builds up RDD lineages and query plans that grow
quickly. Without cache and checkpoint, it gets slower as the iteration number
increases.

I think that is why a Pipeline with a long chain of stages takes much longer
to finish. Since it is not uncommon to have a long chain of stages in a
Pipeline, we should improve this. I will submit a PR for this.
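
To make the lineage-growth point concrete, here is a minimal standalone
sketch (not the Pipeline code itself; spark is assumed to be an existing
SparkSession and the columns are made up):

import spark.implicits._

var df = spark.range(10).toDF("x0")
for (i <- 1 to 40) {
  // Each iteration wraps the previous query plan, so the analysis and
  // optimization work grows with the iteration count, much like fitting
  // many Pipeline stages in a row.
  df = df.withColumn(s"x$i", $"x0" + i)
}
// Without something to cut the lineage along the way (e.g., caching or
// checkpointing intermediate results), planning each action gets slower
// as the plan grows deeper.
df.count()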


zero323 wrote
> Hi everyone,
> 
> While experimenting with ML pipelines I experience a significant
> performance regression when switching from 1.6.x to 2.x.
> 
> import org.apache.spark.ml.{Pipeline, PipelineStage}
> import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer,
> VectorAssembler}
> 
> val df = (1 to 40).foldLeft(Seq((1, "foo"), (2, "bar"), (3,
> "baz")).toDF("id", "x0"))((df, i) => df.withColumn(s"x$i", $"x0"))
> val indexers = df.columns.tail.map(c => new StringIndexer()
>   .setInputCol(c)
>   .setOutputCol(s"${c}_indexed")
>   .setHandleInvalid("skip"))
> 
> val encoders = indexers.map(indexer => new OneHotEncoder()
>   .setInputCol(indexer.getOutputCol)
>   .setOutputCol(s"${indexer.getOutputCol}_encoded")
>   .setDropLast(true))
> 
> val assembler = new
> VectorAssembler().setInputCols(encoders.map(_.getOutputCol))
> val stages: Array[PipelineStage] = indexers ++ encoders :+ assembler
> 
> new Pipeline().setStages(stages).fit(df).transform(df).show
> 
> Task execution time is comparable and executors are most of the time
> idle so it looks like it is a problem with the optimizer. Is it a known
> issue? Are there any changes I've missed, that could lead to this
> behavior?
> 
> -- 
> Best,
> Maciej
> 
> 
> -
> To unsubscribe e-mail: 

> dev-unsubscribe@.apache





-
Liang-Chi Hsieh | @viirya 
Spark Technology Center 
http://www.spark.tc/ 
--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/SQL-ML-Pipeline-performance-regression-between-1-6-and-2-x-tp20803p20821.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org


