RE: [VOTE] SPIP: Row-level operations in Data Source V2

2021-11-12 Thread Liang Chi Hsieh
I’d vote my +1 first. On 2021/11/13 02:25:05 "L. C. Hsieh" wrote: > Hi all, > > I’d like to start a vote for SPIP: Row-level operations in Data Source V2. > > The proposal is to add support for executing row-level operations > such as DELETE, UPDATE, MERGE for v2 tables (SPARK-35801). The >

[VOTE][RESULT] SPIP: Storage Partitioned Join for Data Source V2

2021-11-02 Thread Liang Chi Hsieh
Hi all, The vote passed with the following 9 +1 votes and no -1 or +0 votes: Liang-Chi Hsieh* Russell Spitzer Dongjoon Hyun* Huaxin Gao Ryan Blue DB Tsai* Holden Karau* Cheng Su Wenchen Fan* * = binding Thank you guys all for your feedback and votes

Re: [VOTE] Release Spark 3.0.3 (RC1)

2021-06-18 Thread Liang-Chi Hsieh
+1. Docs look good. Binary looks good. Ran a simple test and some TPC-DS queries. Thanks for working on this! wuyi wrote > Please vote on releasing the following candidate as Apache Spark version > 3.0.3. > > The vote is open until Jun 21st 3AM (PST) and passes if a majority +1 PMC > votes are

Re: Apache Spark 3.2 Expectation

2021-06-16 Thread Liang-Chi Hsieh
ing. > > https://github.com/apache/spark/graphs/commit-activity > > Bests, > Dongjoon. > > > On Wed, Jun 16, 2021 at 9:19 AM Liang-Chi Hsieh > viirya@ > wrote: > >> First, thanks for being volunteer as the release manager of Spark 3.2.0, >>

Re: Apache Spark 3.2 Expectation

2021-06-16 Thread Liang-Chi Hsieh
https://issues.apache.org/jira/browse/SPARK-10816; >- Add RocksDB StateStore as external module >https://issues.apache.org/jira/browse/SPARK-34198; > > > I wonder whether we should postpone the branch cut date. > cc Min Shen, Yi Wu, Max Gekk, Huaxin Gao, Jungtaek Lim, Yuan

Re: Apache Spark 3.0.3 Release?

2021-06-08 Thread Liang-Chi Hsieh
+1. Thank you! Liang-Chi Dongjoon Hyun-2 wrote > +1, Thank you! :) > > Bests, > Dongjoon. > > On Tue, Jun 8, 2021 at 9:05 PM Kent Yao > yaooqinn@ > wrote: > >> +1. Thanks, Yi ~ >> >> Bests, >> *Kent Yao * >> @ Data Science Center, Hangzhou Research Institute, NetEase Corp. >> *a spark

Re: [ANNOUNCE] Apache Spark 3.1.2 released

2021-06-01 Thread Liang-Chi Hsieh
Thank you, Dongjoon! Takeshi Yamamuro wrote > Thank you, Dongjoon! > > On Wed, Jun 2, 2021 at 2:29 PM Xiao Li > lixiao@ > wrote: > >> Thank you! >> >> Xiao >> >> On Tue, Jun 1, 2021 at 9:29 PM Hyukjin Kwon > gurwls223@ > wrote: >> >>> awesome! >>> >>> On Wed, Jun 2, 2021 at 9:59 AM, Dongjoon

Re: [VOTE] Release Spark 3.1.2 (RC1)

2021-05-25 Thread Liang-Chi Hsieh
+1 (non-binding) Binary and docs look good. JIRA tickets look good. Ran simple tasks. Thank you, Dongjoon! Hyukjin Kwon wrote > +1 > > On Wed, May 26, 2021 at 9:00 AM, Cheng Su > chengsu@.com > wrote: > >> +1 (non-binding) >> >> >> >> Checked the related commits in commit history manually. >>

Re: Resolves too old JIRAs as incomplete

2021-05-20 Thread Liang-Chi Hsieh
+1 Thanks Takeshi! Prashant Sharma wrote > +1 > > On Thu, May 20, 2021 at 7:08 PM Wenchen Fan > cloud0fan@ > wrote: > >> +1 >> >> On Thu, May 20, 2021 at 11:59 AM Dongjoon Hyun > dongjoon.hyun@ > >> wrote: >> >>> +1. >>> >>> Thank you, Takeshi. >>> >>> On Wed, May 19, 2021 at 7:49 PM

[ANNOUNCE] Apache Spark 2.4.8 released

2021-05-17 Thread Liang-Chi Hsieh
We are happy to announce the availability of Spark 2.4.8! Spark 2.4.8 is a maintenance release containing stability, correctness, and security fixes. This release is based on the branch-2.4 maintenance branch of Spark. We strongly recommend all 2.4 users to upgrade to this stable release. To

Re: Apache Spark 3.1.2 Release?

2021-05-17 Thread Liang-Chi Hsieh
+1 sounds good. Thanks Dongjoon for volunteering on this! Liang-Chi Dongjoon Hyun-2 wrote > Hi, All. > > Since Apache Spark 3.1.1 tag creation (Feb 21), > new 172 patches including 9 correctness patches and 4 K8s patches arrived > at branch-3.1. > > Shall we make a new release, Apache Spark

[VOTE][RESULT] Release Spark 2.4.8 (RC4)

2021-05-14 Thread Liang-Chi Hsieh
The vote passes. Thanks to all who helped with the release! (* = binding) +1: - Dongjoon Hyun * - Takeshi Yamamuro - Maxim Gekk - John Zhuge - Hyukjin Kwon * - Kent Yao - Sean Owen * - Kousuke Saruta - Holden Karau * - Wenchen Fan * - Mridul Muralidharan * - Ismaël Mejía +0: None -1: None

Re: [VOTE] Release Spark 2.4.8 (RC4)

2021-05-11 Thread Liang-Chi Hsieh
The staging repository for this release can be accessed now too: https://repository.apache.org/content/repositories/orgapachespark-1383/ Thanks for the guidance. Liang-Chi Hsieh wrote > Seems it is closed now after clicking close button in the UI. -- Sent from: http://apache-sp

Re: [VOTE] Release Spark 2.4.8 (RC4)

2021-05-11 Thread Liang-Chi Hsieh
Seems it is closed now after clicking close button in the UI. Sean Owen-2 wrote > Is there a separate process that pushes to maven central? That's what we > have to have in the end. > > On Tue, May 11, 2021, 12:31 PM Liang-Chi Hsieh > viirya@ > wrote: > >> I d

Re: [VOTE] Release Spark 2.4.8 (RC4)

2021-05-11 Thread Liang-Chi Hsieh
Oh, I see. We cannot release from it while it is still in open status. Okay, let me try to close it manually via the UI. Sean Owen-2 wrote > Is there a separate process that pushes to maven central? That's what we > have to have in the end. > > On Tue, May 11, 2021, 12:31 PM Liang-Chi Hsie

Re: [VOTE] Release Spark 2.4.8 (RC4)

2021-05-11 Thread Liang-Chi Hsieh
I don't know what will happen if I manually close it now. Not sure if the current status causes a problem? If not, maybe leave it as it is? Sean Owen-2 wrote > Hm, yes I see it at > http://pool.sks-keyservers.net/pks/lookup?search=0x653c2301fea493ee=on=index > but not on keyserver.ubuntu.com for

Re: [VOTE] Release Spark 2.4.8 (RC4)

2021-05-11 Thread Liang-Chi Hsieh
I did upload my public key to https://dist.apache.org/repos/dist/dev/spark/KEYS. I also uploaded it to a public keyserver before cutting RC1. I just tried searching for the public key and can find it. cloud0fan wrote > [image: image.png] > > I checked the log in

Re: [VOTE] Release Spark 2.4.8 (RC4)

2021-05-10 Thread Liang-Chi Hsieh
Yea, I don't know why it happens. I remember RC1 also has the same issue. But RC2 and RC3 don't. Does it affect the RC? John Zhuge wrote > Got this error when browsing the staging repository: > > 404 - Repository "orgapachespark-1383 (staging: open)" > [id=orgapachespark-1383] exists but is

[VOTE] Release Spark 2.4.8 (RC4)

2021-05-09 Thread Liang-Chi Hsieh
Please vote on releasing the following candidate as Apache Spark version 2.4.8. The vote is open until May 14th at 9AM PST and passes if a majority +1 PMC votes are cast, with a minimum of 3 +1 votes. [ ] +1 Release this package as Apache Spark 2.4.8 [ ] -1 Do not release this package because

RE: [VOTE] Release Spark 2.4.8 (RC3)

2021-05-04 Thread Liang-Chi Hsieh
> United States > From: Liang-Chi Hsieh > viirya@ > To: > dev@.apache > Date: 04/30/2021 03:12 PM > Subject:

Re: [VOTE] Release Spark 2.4.8 (RC3)

2021-04-30 Thread Liang-Chi Hsieh
Hi all, Thanks for actively voting. Unfortunately, we found a very ancient bug (SPARK-35278), and the fix (https://github.com/apache/spark/pull/32404) is going to be merged soon. We may fail this RC3. I will go to cut RC4 as soon as the fix is merged. Thank you! -- Sent from:

Re: [DISCUSS] Add RocksDB StateStore

2021-04-28 Thread Liang-Chi Hsieh
I am fine with RocksDB state store as built-in state store. Actually the proposal to have it as external module is to avoid the raised concern in the previous effort. The need to have it as experimental doesn't necessarily mean to have it as external module, I think. They are two things. So I

[VOTE] Release Spark 2.4.8 (RC3)

2021-04-28 Thread Liang-Chi Hsieh
Please vote on releasing the following candidate as Apache Spark version 2.4.8. The vote is open until May 4th at 9AM PST and passes if a majority +1 PMC votes are cast, with a minimum of 3 +1 votes. [ ] +1 Release this package as Apache Spark 2.4.8 [ ] -1 Do not release this package because ...

Re: [VOTE] Release Spark 2.4.8 (RC2)

2021-04-15 Thread Liang-Chi Hsieh
Thanks all for voting. Unfortunately, we found a long-standing correctness bug, SPARK-35080, and 2.4 is affected too. That said, we need to drop RC2 in favor of RC3. The fix is ready for merging at https://github.com/apache/spark/pull/32179. -- Sent from:

[VOTE] Release Spark 2.4.8 (RC2)

2021-04-11 Thread Liang-Chi Hsieh
Please vote on releasing the following candidate as Apache Spark version 2.4.8. The vote is open until Apr 15th at 9AM PST and passes if a majority +1 PMC votes are cast, with a minimum of 3 +1 votes. [ ] +1 Release this package as Apache Spark 2.4.8 [ ] -1 Do not release this package because

Re: [VOTE] Release Spark 2.4.8 (RC1)

2021-04-07 Thread Liang-Chi Hsieh
I'm working on the fix for master. I think the fix is the same for 2.4. Okay. So I think we are in favor of RC2 and RC1 is dropped. Then I will make the fix merged first and then prepare RC2. Thank you. Liang-Chi Mridul Muralidharan wrote > Do we have a fix for this in 3.x/master which can

Re: [VOTE] Release Spark 2.4.8 (RC1)

2021-04-07 Thread Liang-Chi Hsieh
Thanks for voting. While I was running the release script to cut RC1, I found a nested column pruning bug, SPARK-34963, and unfortunately it exists in 2.4.7 too. As RC1 has already been cut, I will continue this vote. The bug looks like a corner case to me, and it has not been reported yet since we support

[VOTE] Release Spark 2.4.8 (RC1)

2021-04-07 Thread Liang-Chi Hsieh
Please vote on releasing the following candidate as Apache Spark version 2.4.8. The vote is open until Apr 10th at 9AM PST and passes if a majority +1 PMC votes are cast, with a minimum of 3 +1 votes. [ ] +1 Release this package as Apache Spark 2.4.8 [ ] -1 Do not release this package because

Re: Apache Spark 2.4.8 (and EOL of 2.4)

2021-04-05 Thread Liang-Chi Hsieh
Hi Mingjia, Thanks for fixing it. I can see it is included. Liang-Chi mingjia-2 wrote > Hi, All. > > I fixed SPARK-32708 > https://issues.apache.org/jira/browse/SPARK-32708#; > a while ago after 2.4.7 release. > PR:https://github.com/apache/spark/pull/29564 > > Since it's not listed as

Re: Apache Spark 2.4.8 (and EOL of 2.4)

2021-04-04 Thread Liang-Chi Hsieh
Thanks Hyukjin and Dongjoon! :) Then I will start RC. Dongjoon Hyun-2 wrote > Given that Maven passed already with that profile and you tested locally, > I'm +1 for starting RC. > > Thanks, > Dongjoon. > > On Sun, Apr 4, 2021 at 2:24 AM Hyukjin Kwon > gurwls223@ > wrote: > >> I would

Re: Apache Spark 2.4.8 (and EOL of 2.4)

2021-04-04 Thread Liang-Chi Hsieh
Hi devs, Currently no open issues or ongoing issues targeting 2.4. On QA test dashboard, only spark-branch-2.4-test-sbt-hadoop-2.6 is in red status. The failed test is org.apache.spark.sql.streaming.StreamingQueryManagerSuite.awaitAnyTermination with timeout and resetTerminated. It looks a flaky

Re: Welcoming six new Apache Spark committers

2021-03-26 Thread Liang-Chi Hsieh
Congrats! Welcome! Matei Zaharia wrote > Hi all, > > The Spark PMC recently voted to add several new committers. Please join me > in welcoming them to their new role! Our new committers are: > > - Maciej Szymkiewicz (contributor to PySpark) > - Max Gekk (contributor to Spark SQL) > - Kent Yao

Re: Apache Spark 2.4.8 (and EOL of 2.4)

2021-03-26 Thread Liang-Chi Hsieh
An update on the current status: the only remaining issue for 2.4 is [SPARK-34855][CORE]spark context - avoid using local lazy val for callSite. We are waiting for the author to submit a PR for the 2.4 branch. Liang-Chi Hsieh wrote > Thank you so much, Takeshi! > > > Takeshi Yamamuro

Re: [VOTE] SPIP: Support pandas API layer on PySpark

2021-03-26 Thread Liang-Chi Hsieh
+1 (non-binding) rxin wrote > +1. Would open up a huge persona for Spark. > > On Fri, Mar 26 2021 at 11:30 AM, Bryan Cutler < > cutlerb@ > > wrote: > >> >> +1 (non-binding) >> >> >> On Fri, Mar 26, 2021 at 9:49 AM Maciej < > mszymkiewicz@ > > wrote: >> >> >>> +1 (nonbinding)

Re: Apache Spark 2.4.8 (and EOL of 2.4)

2021-03-15 Thread Liang-Chi Hsieh
Thank you so much, Takeshi! Takeshi Yamamuro wrote > Hi, viirya > > I'm looking now into "SPARK-34607: Add `Utils.isMemberClass` to fix a > malformed class name error > on jdk8u" . > > Bests, > Takeshi > > > Takeshi Yamamuro -- Sent from:

Re: Apache Spark 2.4.8 (and EOL of 2.4)

2021-03-15 Thread Liang-Chi Hsieh
An update on the current status: there are three tickets targeting 2.4 that are still ongoing. SPARK-34719: Correctly resolve the view query with duplicated column names SPARK-34607: Add `Utils.isMemberClass` to fix a malformed class name error on jdk8u SPARK-34726: Fix collectToPython timeouts

Re: [DISCUSS] Support pandas API layer on PySpark

2021-03-13 Thread Liang-Chi Hsieh
From a Python developer's perspective, this direction makes sense to me. As pandas is almost the standard library in this area, if PySpark supports the pandas API out of the box, usability would be at a higher level. For maintenance cost, IIUC, there are some Spark committers in the

Re: [build system] github fetches timing out

2021-03-10 Thread Liang-Chi Hsieh
Thanks Shane for looking at it! shane knapp ☠ wrote > ...and just like that, overnight the builds started successfully git > fetching! > > -- > Shane Knapp > Computer Guy / Voice of Reason > UC Berkeley EECS Research / RISELab Staff Technical Lead > https://rise.cs.berkeley.edu -- Sent

Re: [build system] github fetches timing out

2021-03-10 Thread Liang-Chi Hsieh
-- Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/ - To unsubscribe e-mail: dev-unsubscr...@spark.apache.org

Re: Apache Spark 2.4.8 (and EOL of 2.4)

2021-03-09 Thread Liang-Chi Hsieh
I just contacted Shane, and it seems there is an ongoing issue with GitHub fetches timing out on Jenkins. As a result, the QA test is currently unavailable. I guess it is unsafe to make a release cut without reliable QA test results. I may defer the cut until the QA test comes back if there is no objection.

Re: [VOTE] SPIP: Add FunctionCatalog

2021-03-09 Thread Liang-Chi Hsieh
+1 (non-binding). Thanks for the work! Erik Krogen wrote > +1 from me (non-binding) > > On Tue, Mar 9, 2021 at 9:27 AM huaxin gao > huaxin.gao11@ > wrote: > >> +1 (non-binding) -- Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

Re: Apache Spark 2.4.8 (and EOL of 2.4)

2021-03-09 Thread Liang-Chi Hsieh
Hi devs, I was going to cut the branch yesterday. I'd like to share current progress. I hit a problem during a dry run of the release script. Fixed it in SPARK-34672. The latest dry run looks okay, as build, docs, and publish all succeeded. But the last step (push the tag) has a fatal error, I'm not sure

Re: Apache Spark 2.4.8 (and EOL of 2.4)

2021-03-08 Thread Liang-Chi Hsieh
Thank you Dongjoon. I'm going to cut the branch now. Hopefully I can make it soon (need to get familiar with the process as first time :) ) Liang-Chi Dongjoon Hyun-2 wrote > Thank you, Liang-Chi! Next Monday sounds good. > > To All. Please ping Liang-Chi if you have a missed backport. > >

Re: Apache Spark 2.4.8 (and EOL of 2.4)

2021-03-04 Thread Liang-Chi Hsieh
Thanks all for the input. If there is no objection, I am going to cut the branch next Monday. Thanks. Liang-Chi Takeshi Yamamuro wrote > +1 for releasing 2.4.8 and thanks, Liang-chi, for volunteering. > Btw, anyone roughly know how many v2.4 users still are based on some stats > (e.g., # of

Re: [DISCUSS] SPIP: FunctionCatalog

2021-03-04 Thread Liang-Chi Hsieh
Yeah, in short this is a great compromise approach and I do like to see this proposal move forward to next step. This discussion is valuable. Chao Sun wrote > +1 on Dongjoon's proposal. Great to see this is getting moved forward and > thanks everyone for the insightful discussion! > > > >

Re: Apache Spark 2.4.8 (and EOL of 2.4)

2021-03-03 Thread Liang-Chi Hsieh
Thanks Dongjoon! +1 and I volunteer to do the release of 2.4.8 if it passes. Liang-Chi -- Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/ - To unsubscribe e-mail: dev-unsubscr...@spark.apache.org

Re: [DISCUSS] Add RocksDB StateStore

2021-02-13 Thread Liang-Chi Hsieh
for the inputs and discussion. Liang-Chi Hsieh -- Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/ - To unsubscribe e-mail: dev-unsubscr...@spark.apache.org

Re: [DISCUSS] SPIP: FunctionCatalog

2021-02-12 Thread Liang-Chi Hsieh
Basically I think the proposal makes sense to me and I'd like to support the SPIP as it looks like we have strong need for the important feature. Thanks Ryan for working on this and I do also look forward to Wenchen's implementation. Thanks for the discussion too. Actually I think the

Re: [DISCUSS] Add RocksDB StateStore

2021-02-07 Thread Liang-Chi Hsieh
Thank you for the inputs, Yikun! Let's take these inputs into account when we are ready to have the RocksDB state store in Spark SS. Yikun Jiang wrote > I worked on some work about rocksdb multi-arch support and version upgrade > on > Kafka/Storm/Flink[1][2][3].To avoid these issues happened in spark again, > I

[DISCUSS] Add RocksDB StateStore

2021-02-02 Thread Liang-Chi Hsieh
RocksDB StateStore into sql core module 2. not okay for 1, but okay to add RocksDB StateStore as external module 3. either 1 or 2 is okay 4. not okay to add RocksDB StateStore, no matter into sql core or as external module Please let us know if you have some thoughts. Thank you. Liang-Chi Hsieh

Re: [DISCUSS] Disable streaming query with possible correctness issue by default

2020-11-11 Thread Liang-Chi Hsieh
park 3.1.0. >>> >>> Bests, >>> Dongjoon. >>> >>> On Tue, Nov 10, 2020 at 6:17 AM Tom Graves > tgraves_cs@.com > >>> wrote: >>> >>>> +1 since its a correctness issue, I think its ok to change the behavior

[DISCUSS] Disable streaming query with possible correctness issue by default

2020-11-06 Thread Liang-Chi Hsieh
is, this changes current behavior and by default it will break some existing streaming queries. But I think it is pretty easy to disable the check with the new config. In the PR currently there is no objection but suggestion to hear more voices. Please let me know if you have some thoughts. Thanks. Liang-Chi

Re: Welcoming some new Apache Spark committers

2020-07-14 Thread Liang-Chi Hsieh
Congrats! Welcome all! Dongjoon Hyun-2 wrote > Welcome everyone! :D > > Bests, > Dongjoon. > > On Tue, Jul 14, 2020 at 11:21 AM Xiao Li > lixiao@ > wrote: > >> Welcome, Dilip, Huaxin and Jungtaek! >> >> Xiao >> >> On Tue, Jul 14, 2020 at 11:02 AM Holden Karau > holden@ > >> wrote: >>

Re: [discuss] SparkR CRAN feasibility check server problem

2018-12-12 Thread Liang-Chi Hsieh
Just got reply from CRAN admin. It should be fixed now. Hyukjin Kwon wrote > Thanks, Liang-chi. > > On Thu, 13 Dec 2018, 8:29 am Liang-Chi Hsieh > viirya@ > wrote: > > >> Sorry for late. There is a malformed record at CRAN package page again. >> I've >

Re: [discuss] SparkR CRAN feasibility check server problem

2018-12-12 Thread Liang-Chi Hsieh
. Thanks. Liang-Chi Hsieh wrote > Thanks for letting me know! I will look into it and ask CRAN admin for > help. > > > Hyukjin Kwon wrote >> Looks it's happening again. Liang-Chi, do you mind if I ask it again? >> >> FYI, R 3.4 is officially deprecated as of &

Re: [discuss] SparkR CRAN feasibility check server problem

2018-12-12 Thread Liang-Chi Hsieh
R version. From what I see, mostly because >>> of fixes and packages support, most users of R are fairly up to date? So >>> perhaps 3.4 as min version is reasonable esp. for Spark 3. >>> >>> Are we getting traction with CRAN sysadmin? It seems like this has been >

Re: [discuss] SparkR CRAN feasibility check server problem

2018-11-10 Thread Liang-Chi Hsieh
Yeah, thanks Hyukjin Kwon for bringing this up for discussion. I don't know how widely higher versions of R are used across the R community. If R version 3.1.x is not very commonly used, I think we can discuss upgrading the minimum R version in the next Spark version. If we ended up with not upgrading,

Re: welcome a new batch of committers

2018-10-03 Thread Liang-Chi Hsieh
Congratulations to all new committers! rxin wrote > Hi all, > > The Apache Spark PMC has recently voted to add several new committers to > the project, for their contributions: > > - Shane Knapp (contributor to infra) > - Dongjoon Hyun (contributor to ORC support and other parts of Spark) >

Re: Select top (100) percent equivalent in spark

2018-09-05 Thread Liang-Chi Hsieh
Thanks for pinging me. It seems to me we should not make assumptions about the value of the spark.sql.execution.topKSortFallbackThreshold config. Once it is changed, the global sort + limit can currently produce a wrong result. I will make a PR for this. cloud0fan wrote > + Liang-Chi and Herman, > > I

Re: Revisiting Online serving of Spark models?

2018-06-11 Thread Liang-Chi Hsieh
Hi, It'd be great if there can be any sharing of the offline discussion. Thanks! Holden Karau wrote > We’re by the registration sign going to start walking over at 4:05 > > On Wed, Jun 6, 2018 at 2:43 PM Maximiliano Felice < > maximilianofelice@ >> wrote: > >> Hi! >> >> Do we meet at the

Re: Accessing Hive Tables in Spark

2018-04-12 Thread Liang-Chi Hsieh
Seems like Spark can't access hive-site.xml under cluster mode. One solution is to add the config `spark.yarn.dist.files=/path/to/hive-site.xml` to your spark-defaults.conf. And don't forget to call `enableHiveSupport()` on `SparkSession`. Tushar Singhal wrote > Hi Everyone, > > I was

Re: Welcome Zhenhua Wang as a Spark committer

2018-04-05 Thread Liang-Chi Hsieh
Congratulations! Zhenhua Wang -- Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/ - To unsubscribe e-mail: dev-unsubscr...@spark.apache.org

Re: Welcoming some new committers

2018-03-02 Thread Liang-Chi Hsieh
ckson (contributor to MLlib and PySpark) >> >> Please join me in welcoming Anirudh, Bryan, Cody, Erik, Matt and Seth as > committers! >> >> Matei >> - >> To unsubscribe e-mail: > dev-unsubscribe@.apache >> >

Re: How to print plan of Structured Streaming DataFrame

2017-11-20 Thread Liang-Chi Hsieh
reaming sources must > be executed with writeStream.start()”? > > > Thanks > Chang - Liang-Chi Hsieh | @viirya Spark Technology Center http://www.spark.tc/ -- Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/ - To unsubscribe e-mail: dev-unsubscr...@spark.apache.org

Re: Welcoming Tejas Patil as a Spark committer

2017-09-30 Thread Liang-Chi Hsieh
Tejas! > > Matei > > - > To unsubscribe e-mail: > dev-unsubscribe@.apache - Liang-Chi Hsieh | @viirya Spark Technology Center http://www.spark.tc/ -- Sent from: http://apache-spark-d

Re: [VOTE][SPIP] SPARK-21190: Vectorized UDFs in Python

2017-09-11 Thread Liang-Chi Hsieh
def plus(v1, v2): >> > return v1 + v2 >> > >> > or we can define as: >> > >> > plus = pandas_udf(lambda v1, v2: v1 + v2, DoubleType()) >> > >> > We can use it similar to row-by-row UDFs: >> > >> > df.withColumn('sum'

Re: [VOTE][SPIP] SPARK-21190: Vectorized UDFs in Python

2017-09-11 Thread Liang-Chi Hsieh
def plus(v1, v2): >> > return v1 + v2 >> > >> > or we can define as: >> > >> > plus = pandas_udf(lambda v1, v2: v1 + v2, DoubleType()) >> > >> > We can use it similar to row-by-row UDFs: >> > >> > df.withColumn('sum'

Re: Spark 2.2.0 - Odd Hive SQL Warnings

2017-09-03 Thread Liang-Chi Hsieh
un.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:497) > at > org.apache.spark.sql.hive.client.Shim_v0_12.alterTable(HiveShim.scala:399) > at > org.apache.spark.sql.hive.client.HiveClientImp

Re: Welcoming Saisai (Jerry) Shao as a committer

2017-08-28 Thread Liang-Chi Hsieh
Jerry) Shao as a commi >>> tter. Saisai has been contributing to many areas of the >>> project for a long time, so it’s great to see him join. >>> Join me in thanking and congratulating him! >>> >>> Matei >>> --

Re: Welcoming Hyukjin Kwon and Sameer Agarwal as committers

2017-08-08 Thread Liang-Chi Hsieh
>>>>> On Mon, Aug 7, 2017 at 8:53 AM, Matei Zaharia < >>>>>>> > matei.zaharia@ >> wrote: >>>>>>> > Hi everyone, >>>>>>> > >>>>>>> > The Spark PMC recently voted to add Hyukjin Kwon and Sameer >>>>>>> Agarwal >>>>>>> as committers. Join me in con

Re: the uniqueSource in StreamExecution, where is it be changed please?

2017-08-05 Thread Liang-Chi Hsieh
> here is the code link: > > > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala > > > > Would you like help understand it please? > > > Thanks. > Robin - Liang-Chi Hsie

Re: Question, Flaky tests: pyspark.sql.tests.ArrowTests tests in Jenkins worker 5(?)

2017-08-05 Thread Liang-Chi Hsieh
.. > > > I sounds environment problem apparently due to missing hashtable (which I > believe should have been compiled and importable properly). > > I suspect few possibilities such as a bug somewhere or unsuccessful manual > build from Pandas source but I am unable to

Re: Speeding up Catalyst engine

2017-07-24 Thread Liang-Chi Hsieh
ps://github.com/apache/spark/blob/8cd9cdf17a7a4ad6f2eecd7c4b388ca363c20982/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L918 > > Shouldn't lines 925-927 be before 920-922 ? > > 2) https://issues.apache.org/jira/browse/SPARK-20392 > > Is it safe to use it on top of 2.2.0 ? > > Regards, >

Re: [SQL] Syntax "case when" doesn't be supported in JOIN

2017-07-17 Thread Liang-Chi Hsieh
a and b > in the whole pipeline, even if the result isn't deterministic, but the > computation is correct. > > Thanks > Chang > > > On Mon, Jul 17, 2017 at 10:49 PM, Liang-Chi Hsieh > viirya@ > wrote: > >> >> IIUC, the evaluation orde

Re: [SQL] Syntax "case when" doesn't be supported in JOIN

2017-07-17 Thread Liang-Chi Hsieh
>>> does the same thing for GroupBy non-deterministic. From Map-Reduce point >>> of >>> view, Join is also GroupBy in essence . >>> >>> @Liang Chi Hsieh >>> https://plus.google.com/u/0/103179362592085650735?prsrc=4; >>> >>> in wh

Re: [SQL] Syntax "case when" doesn't be supported in JOIN

2017-07-17 Thread Liang-Chi Hsieh
of > view, Join is also GroupBy in essence . > > @Liang Chi Hsieh > https://plus.google.com/u/0/103179362592085650735?prsrc=4; > > in which situation, semantics will be changed? > > Thanks > Chang > > On Mon, Jul 17, 2017 at 3:29 PM, Liang-Chi Hsieh > vi

Re: [SQL] Syntax "case when" doesn't be supported in JOIN

2017-07-17 Thread Liang-Chi Hsieh
dious since we have lots of Hive SQL being migrated to Spark. >> And >> this workaround is equivalent to insert a Project between Join operator >> and its child. >> >> Why not do it in PullOutNondeterministic? >> >> Thanks >> Chang >> >>

Re: [SQL] Syntax "case when" doesn't be supported in JOIN

2017-07-14 Thread Liang-Chi Hsieh
ch_name#72, vsbl_flg#73, delet_flag#74, etl_batch_id#75L, >> > updt_time#76, cur_flag#77, bkgrnd_categ_skid#78L, bkgrnd_categ_id#79L, >> > site_categ_id#80, site_categ_parnt_id#81] >> > >> > Does spark sql not support syntax "case when" in JOIN? Additional, my >> spark

Re: [VOTE] Apache Spark 2.2.0 (RC6)

2017-07-03 Thread Liang-Chi Hsieh
>> The documentation corresponding to this release can be found at: >>>>> https://people.apache.org/~pwendell/spark-releases/spark- >>>>> 2.2.0-rc6-docs/ >>>>> >>>>> >>>>> *FAQ* >>>>> >>>>> *H

Re: [VOTE] Apache Spark 2.2.0 (RC4)

2017-06-19 Thread Liang-Chi Hsieh
mean I use kryo with > more > than 2000 partitions all the time, and it worked before. Or was I simply > not hitting this bug because there are other conditions that also need to > be satisfied besides kryo and 2000+ partitions? > > On Jun 19, 2017 2:20 AM, "Liang-Chi

Re: [VOTE] Apache Spark 2.2.0 (RC4)

2017-06-19 Thread Liang-Chi Hsieh
I think it's not. This is a feature added recently. Hyukjin Kwon wrote > Is this a regression BTW? I am just curious. > > On 19 Jun 2017 1:18 pm, "Liang-Chi Hsieh" > viirya@ > wrote: > > -1. When using Kryo serializer and partition number is greater than

Re: [VOTE] Apache Spark 2.2.0 (RC4)

2017-06-18 Thread Liang-Chi Hsieh
-1. When using the Kryo serializer with more than 2000 partitions, there seems to be an NPE issue that needs to be fixed: SPARK-21133 <https://issues.apache.org/jira/browse/SPARK-21133> - Liang-Chi Hsieh | @viirya Spark Technology Center http://www.spark.tc/ -- View this message in c

Re: Fwd: [SparkSQL] Project using NamedExpression

2017-03-28 Thread Liang-Chi Hsieh
e how to use that. > > Thanks, > Aviral Agarwal > > On Mar 24, 2017 09:20, "Liang-Chi Hsieh" > viirya@ > wrote: > > > Hi, > > You need to resolve the expressions before passing into creating > UnsafeProjection. > > > > Aviral Agarwal wrote

Re: Fwd: [SparkSQL] Project using NamedExpression

2017-03-23 Thread Liang-Chi Hsieh
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPool > Executor.java:1142) > at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoo > lExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > > > This might be because the Expression is

Re: how to construct parameter for model.transform() from datafile

2017-03-14 Thread Liang-Chi Hsieh
Just found that you can specify number of features when loading libsvm source: val df = spark.read.option("numFeatures", "100").format("libsvm") Liang-Chi Hsieh wrote > As the libsvm format can't specify number of features, and looks like > NaiveBayes
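
The point in the message above, that a LIBSVM file carries no feature count of its own, can be sketched in plain Python. This is only an illustration under stated assumptions, not Spark's reader: the helper names `parse_libsvm_line` and `infer_num_features` and the sample lines are hypothetical. It shows why a training file and a test file can end up with different inferred dimensions unless the count is given explicitly, as the `numFeatures` option in the message does.

```python
def parse_libsvm_line(line):
    """Parse one 'label idx:val idx:val ...' line into (label, {index: value})."""
    parts = line.split()
    label = float(parts[0])
    features = {}
    for token in parts[1:]:
        index, value = token.split(":")
        features[int(index)] = float(value)  # LIBSVM indices are 1-based
    return label, features


def infer_num_features(lines):
    """Guess the dimensionality as the largest feature index seen (hypothetical helper)."""
    parsed = (parse_libsvm_line(line) for line in lines)
    return max((max(feats) for _, feats in parsed if feats), default=0)


# Hypothetical sample data: the test file never mentions index 100.
train = ["1 1:0.5 100:2.0", "0 3:1.5"]
test = ["1 2:0.7 40:1.0"]

print(infer_num_features(train))
print(infer_num_features(test))
```

Because the format is sparse, the two files yield different inferred sizes here, which is the kind of mismatch that fixing the feature count up front is meant to avoid.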

Re: how to construct parameter for model.transform() from datafile

2017-03-14 Thread Liang-Chi Hsieh
> val model = new >> NaiveBayes().setThresholds(Array(10.0,1.0)).fit(trainingData) >> val predictions = model.transform(testData) >> predictions.show() >> >> >> OK, I have got my model by the code above, but how can I use this model >> to predict the

Re: How to cache SparkPlan.execute for reusing?

2017-03-03 Thread Liang-Chi Hsieh
n of the query, which has several "parent" nodes, > its "parents" have to reuse it by creating new RDDs? - Liang-Chi Hsieh | @viirya Spark Technology Center http://www.spark.tc/ -- View this message in context: http://apache-spark-developers-list.1001551.n3.nabbl

Re: How to cache SparkPlan.execute for reusing?

2017-03-02 Thread Liang-Chi Hsieh
or RDD[InternalRow]? ----- Liang-Chi Hsieh | @viirya Spark Technology Center http://www.spark.tc/ -- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/How-to-cache-SparkPlan-execute-for-reusing-tp21097p21098.html Sent from the Apache Spark Developers List ma

Re: Implementation of RNN/LSTM in Spark

2017-02-28 Thread Liang-Chi Hsieh
Implementation-of-RNN-LSTM-in-Spark-tp14866p21060.html >> Sent from the Apache Spark Developers List mailing list archive at >> Nabble.com. >> >> --------- >> To unsubscribe e-mail: > dev-unsubscribe

Re: The driver hangs at DataFrame.rdd in Spark 2.1.0

2017-02-26 Thread Liang-Chi Hsieh
ply to this email, your message will be added > to the > discussion below: > > http://apache-spark-developers-list.1001551.n3.nabble.com/Re-The-driver-hangs-at-DataFrame-rdd-in-Spark-2-1-0-tp21052p21069.html > > To start a new topic under Apache Spark Develo

Re: The driver hangs at DataFrame.rdd in Spark 2.1.0

2017-02-26 Thread Liang-Chi Hsieh
set.rdd$lzycompute(Dataset.scala:2544) > org.apache.spark.sql.Dataset.rdd(Dataset.scala:2544)... > > > The CPU usage of the driver remains 100% like this: > > > > I didn't find this issue in Spark 1.6.2, what causes this in Spark 2.1.0? > > > Any help is greatl

Re: A DataFrame cache bug

2017-02-26 Thread Liang-Chi Hsieh
ark.sql.DataFrame >>> >>> def f(data: DataFrame): DataFrame = { >>> val df = data.filter("id>10") >>> df.cache >>> df.count >>> df >>> } >>> >>> f(spark.range(100).asInstanceOf[DataFrame]).count // output

Re: How to checkpoint and RDD after a stage and before reaching an action?

2017-02-05 Thread Liang-Chi Hsieh
> (5) > (6) val rdd2 = loadData2 > (7) > (8) rdd1.checkpoint() > (9) > (10) rdd1 > (11).join(rdd2) > (12).saveAsObjectFile(...) > > / > > Thanks in advance, > Leo - Liang-Chi Hsieh | @viirya Spark Techn

Re: [SQL][ML] Pipeline performance regression between 1.6 and 2.x

2017-02-02 Thread Liang-Chi Hsieh
Hi Maciej, FYI, this fix is submitted at https://github.com/apache/spark/pull/16785. Liang-Chi Hsieh wrote > Hi Maciej, > > After looking into the details of the time spent on preparing the executed > plan, the cause of the significant difference between 1.6 and current >

Re: [SQL][ML] Pipeline performance regression between 1.6 and 2.x

2017-02-02 Thread Liang-Chi Hsieh
503 ms 1613 ms 2279 ms 2349 ms 2573 ms Liang-Chi Hsieh wrote > Hi Maciej, > > Thanks for the info you provided. > > I tried to run the same example with 1.6 and current branch and record the > difference between the time cost on preparing the executed plan. > >

Re: [SQL][ML] Pipeline performance regression between 1.6 and 2.x

2017-02-02 Thread Liang-Chi Hsieh
ged much in 2.x. They used RDDs for > fitting in 1.6 and, as far as I can tell, they still do that in 2.x. And > the problem doesn't look that related to the data processing part in the > first place. > > > On 02/02/2017 07:22 AM, Liang-Chi Hsieh wrote: >> Hi Maciej, >> &

Re: [SQL][ML] Pipeline performance regression between 1.6 and 2.x

2017-02-02 Thread Liang-Chi Hsieh
en 1.6 and 2.0? > > > On Thu, 2 Feb 2017 at 08:22 Liang-Chi Hsieh > viirya@ > wrote: > >> >> Hi Maciej, >> >> FYI, the PR is at https://github.com/apache/spark/pull/16775. >> >> >> Liang-Chi Hsieh wrote >> > Hi Macie

Re: approx_percentile computation

2017-02-01 Thread Liang-Chi Hsieh
. The Dataframe looks like > > k1 | k2 | k3 | v1 > > a1 | b1 | c1 | 879 > > a2 | b2 | c2 | 769 > > a1 | b1 | c1 | 129 > > a2 | b2 | c2 | 323 > I need to first run groupBy (k1, k2, k3) and collect_list(v1), and then > compute quantiles [10th, 50th...] on list

Re: [SQL][ML] Pipeline performance regression between 1.6 and 2.x

2017-02-01 Thread Liang-Chi Hsieh
Hi Maciej, FYI, the PR is at https://github.com/apache/spark/pull/16775. Liang-Chi Hsieh wrote > Hi Maciej, > > Basically the fitting algorithm in Pipeline is an iterative operation. > Running iterative algorithm on Dataset would have RDD lineages and query > plans that gro

Re: [SQL][ML] Pipeline performance regression between 1.6 and 2.x

2017-02-01 Thread Liang-Chi Hsieh
t of the time > idle so it looks like it is a problem with the optimizer. Is it a known > issue? Are there any changes I've missed, that could lead to this > behavior? > > -- > Best, > Maciej > > > - > To unsubscribe e-mail: > dev-unsubscribe@.apache -
