Re: Select top (100) percent equivalent in spark

2018-09-05 Thread Liang-Chi Hsieh
Thanks for pinging me. Seems to me we should not make assumptions about the value of the spark.sql.execution.topKSortFallbackThreshold config. Once it is changed, the global sort + limit can produce wrong results for now. I will make a PR for this. cloud0fan wrote > + Liang-Chi and Herman, > > I th
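For context, a minimal sketch of the sort + limit pattern this thread is about; only the config key comes from the message above, the table and column names are illustrative:

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.functions.col

  val spark = SparkSession.builder().appName("top-n-sketch").getOrCreate()
  // When the limit is at or below this threshold, Spark may plan a top-K sort
  // instead of a full global sort, which is what the correctness concern is about.
  spark.conf.set("spark.sql.execution.topKSortFallbackThreshold", "10000")
  // The T-SQL "SELECT TOP (N) ... ORDER BY" pattern maps to a global sort plus LIMIT.
  val topN = spark.table("sales").orderBy(col("amount").desc).limit(100)
  topN.show()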

Re: welcome a new batch of committers

2018-10-03 Thread Liang-Chi Hsieh
Congratulations to all new committers! rxin wrote > Hi all, > > The Apache Spark PMC has recently voted to add several new committers to > the project, for their contributions: > > - Shane Knapp (contributor to infra) > - Dongjoon Hyun (contributor to ORC support and other parts of Spark) > -

Re: [discuss] SparkR CRAN feasibility check server problem

2018-11-10 Thread Liang-Chi Hsieh
Yeah, thanks Hyukjin Kwon for bringing this up for discussion. I don't know how widely higher versions of R are used across the R community. If R version 3.1.x is not very commonly used, I think we can discuss upgrading the minimum R version in the next Spark version. If we ended up not upgrading, w

Re: [discuss] SparkR CRAN feasibility check server problem

2018-12-12 Thread Liang-Chi Hsieh
at point about min R version. From what I see, mostly because >>> of fixes and packages support, most users of R are fairly up to date? So >>> perhaps 3.4 as min version is reasonable esp. for Spark 3. >>> >>> Are we getting traction with CRAN sysadmin? It s

Re: [discuss] SparkR CRAN feasibility check server problem

2018-12-12 Thread Liang-Chi Hsieh
admin. Thanks. Liang-Chi Hsieh wrote > Thanks for letting me know! I will look into it and ask CRAN admin for > help. > > > Hyukjin Kwon wrote >> Looks it's happening again. Liang-Chi, do you mind if I ask it again? >> >> FYI, R 3.4 is officially deprecat

Re: [discuss] SparkR CRAN feasibility check server problem

2018-12-12 Thread Liang-Chi Hsieh
Just got reply from CRAN admin. It should be fixed now. Hyukjin Kwon wrote > Thanks, Liang-chi. > > On Thu, 13 Dec 2018, 8:29 am Liang-Chi Hsieh < > viirya@ > wrote: > > > >> Sorry for late. There is a malformed record at CRAN package page again. >&g

Re: Welcoming some new Apache Spark committers

2020-07-14 Thread Liang-Chi Hsieh
Congrats! Welcome all! Dongjoon Hyun-2 wrote > Welcome everyone! :D > > Bests, > Dongjoon. > > On Tue, Jul 14, 2020 at 11:21 AM Xiao Li < > lixiao@ > > wrote: > >> Welcome, Dilip, Huaxin and Jungtaek! >> >> Xiao >> >> On Tue, Jul 14, 2020 at 11:02 AM Holden Karau < > holden@ > > >> wrote:

[DISCUSS] Disable streaming query with possible correctness issue by default

2020-11-06 Thread Liang-Chi Hsieh
this changes current behavior and by default it will break some existing streaming queries. But I think it is pretty easy to disable the check with the new config. In the PR there is currently no objection, but there is a suggestion to hear more voices. Please let me know if you have some thoughts. Thanks. Lian
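For reference, a hedged sketch of opting out of such a check through a SQL conf; the config key shown is the one I associate with this work and should be treated as illustrative if your Spark version differs:

  val spark = org.apache.spark.sql.SparkSession.builder().getOrCreate()
  // Restore the old behavior for a query the new check would otherwise reject
  // (config key assumed/illustrative):
  spark.conf.set("spark.sql.streaming.statefulOperator.checkCorrectness.enabled", "false")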

Re: [DISCUSS] Disable streaming query with possible correctness issue by default

2020-11-11 Thread Liang-Chi Hsieh
t; +1 for Apache Spark 3.1.0. >>> >>> Bests, >>> Dongjoon. >>> >>> On Tue, Nov 10, 2020 at 6:17 AM Tom Graves < > tgraves_cs@.com > > >>> wrote: >>> >>>> +1 since its a correctness issue, I think its ok to ch

[DISCUSS] Add RocksDB StateStore

2021-02-02 Thread Liang-Chi Hsieh
o add RocksDB StateStore into sql core module 2. not okay for 1, but okay to add RocksDB StateStore as external module 3. either 1 or 2 is okay 4. not okay to add RocksDB StateStore, no matter into sql core or as external module Please let us know if you have some thoughts. Thank you. Liang-Chi

Re: [DISCUSS] Add RocksDB StateStore

2021-02-07 Thread Liang-Chi Hsieh
Thank you for the inputs! Yikun. Let's take these inputs when we are ready to have rocksdb state store in Spark SS. Yikun Jiang wrote > I worked on some work about rocksdb multi-arch support and version upgrade > on > Kafka/Storm/Flink[1][2][3].To avoid these issues happened in spark again, > I >

Re: [DISCUSS] SPIP: FunctionCatalog

2021-02-12 Thread Liang-Chi Hsieh
Basically I think the proposal makes sense to me and I'd like to support the SPIP, as it looks like we have a strong need for this important feature. Thanks Ryan for working on this, and I also look forward to Wenchen's implementation. Thanks for the discussion too. Actually I think the SupportsInv
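To make the SPIP concrete, a hedged sketch of a scalar function under the FunctionCatalog proposal; the interface and package names are the ones that eventually landed in Spark 3.2 and were still under discussion at the time of this thread:

  import org.apache.spark.sql.catalyst.InternalRow
  import org.apache.spark.sql.connector.catalog.functions.ScalarFunction
  import org.apache.spark.sql.types.{DataType, IntegerType}

  // A two-argument integer addition exposed through the catalog.
  object IntAdd extends ScalarFunction[Int] {
    override def name(): String = "int_add"
    override def inputTypes(): Array[DataType] = Array(IntegerType, IntegerType)
    override def resultType(): DataType = IntegerType
    override def produceResult(input: InternalRow): Int =
      input.getInt(0) + input.getInt(1)
  }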

Re: [DISCUSS] Add RocksDB StateStore

2021-02-13 Thread Liang-Chi Hsieh
n for the inputs and discussion. Liang-Chi Hsieh -- Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/ - To unsubscribe e-mail: dev-unsubscr...@spark.apache.org

Re: Apache Spark 2.4.8 (and EOL of 2.4)

2021-03-03 Thread Liang-Chi Hsieh
Thanks Dongjoon! +1 and I volunteer to do the release of 2.4.8 if it passes. Liang-Chi -- Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/ - To unsubscribe e-mail: dev-unsubscr...@spark.apache.org

Re: [DISCUSS] SPIP: FunctionCatalog

2021-03-04 Thread Liang-Chi Hsieh
Yeah, in short this is a great compromise approach and I do like to see this proposal move forward to next step. This discussion is valuable. Chao Sun wrote > +1 on Dongjoon's proposal. Great to see this is getting moved forward and > thanks everyone for the insightful discussion! > > > > On

Re: Apache Spark 2.4.8 (and EOL of 2.4)

2021-03-04 Thread Liang-Chi Hsieh
Thanks all for the input. If there is no objection, I am going to cut the branch next Monday. Thanks. Liang-Chi Takeshi Yamamuro wrote > +1 for releasing 2.4.8 and thanks, Liang-chi, for volunteering. > Btw, anyone roughly know how many v2.4 users still are based on some stats > (e.g., # of v

Re: Apache Spark 2.4.8 (and EOL of 2.4)

2021-03-08 Thread Liang-Chi Hsieh
Thank you Dongjoon. I'm going to cut the branch now. Hopefully I can make it soon (need to get familiar with the process as first time :) ) Liang-Chi Dongjoon Hyun-2 wrote > Thank you, Liang-Chi! Next Monday sounds good. > > To All. Please ping Liang-Chi if you have a missed backport. > > B

Re: Apache Spark 2.4.8 (and EOL of 2.4)

2021-03-09 Thread Liang-Chi Hsieh
Hi devs, I cut the branch yesterday. I'd like to share the current progress. I hit a problem during a dry run of the release script and fixed it in SPARK-34672. The latest dry run looks okay, as build, docs, and publish all succeed. But the last step (push the tag) has a fatal error; I'm not sure if

Re: [VOTE] SPIP: Add FunctionCatalog

2021-03-09 Thread Liang-Chi Hsieh
+1 (non-binding). Thanks for the work! Erik Krogen wrote > +1 from me (non-binding) > > On Tue, Mar 9, 2021 at 9:27 AM huaxin gao < > huaxin.gao11@ > > wrote: > >> +1 (non-binding) -- Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/ --

Re: Apache Spark 2.4.8 (and EOL of 2.4)

2021-03-09 Thread Liang-Chi Hsieh
I just contacted Shane and it seems there is an ongoing GitHub fetch timeout issue on Jenkins. That being said, the QA test is currently unavailable. I guess it is unsafe to make a release cut due to the lack of a reliable QA test result. I may defer the cut until the QA test comes back, if there is no objection. WD

Re: [build system] github fetches timing out

2021-03-10 Thread Liang-Chi Hsieh
-- Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/ - To unsubscribe e-mail: dev-unsubscr...@spark.apache.org

Re: [build system] github fetches timing out

2021-03-10 Thread Liang-Chi Hsieh
Thanks Shane for looking at it! shane knapp ☠ wrote > ...and just like that, overnight the builds started successfully git > fetching! > > -- > Shane Knapp > Computer Guy / Voice of Reason > UC Berkeley EECS Research / RISELab Staff Technical Lead > https://rise.cs.berkeley.edu -- Sent fro

Re: [DISCUSS] Support pandas API layer on PySpark

2021-03-13 Thread Liang-Chi Hsieh
From a Python developer's perspective, this direction makes sense to me. As pandas is almost the standard library in the related area, if PySpark supports the pandas API out of the box, usability would be at a higher level. As for maintenance cost, IIUC, there are some Spark committers in the commun

Re: Apache Spark 2.4.8 (and EOL of 2.4)

2021-03-15 Thread Liang-Chi Hsieh
To update with current status. There are three tickets targeting 2.4 that are still ongoing. SPARK-34719: Correctly resolve the view query with duplicated column names SPARK-34607: Add `Utils.isMemberClass` to fix a malformed class name error on jdk8u SPARK-34726: Fix collectToPython timeouts SP

Re: Apache Spark 2.4.8 (and EOL of 2.4)

2021-03-15 Thread Liang-Chi Hsieh
Thank you so much, Takeshi! Takeshi Yamamuro wrote > Hi, viirya > > I'm looking now into "SPARK-34607: Add `Utils.isMemberClass` to fix a > malformed class name error > on jdk8u" . > > Bests, > Takeshi > > > Takeshi Yamamuro -- Sent from: http://apache-spark-developers-list.1001551.n3.na

Re: [VOTE] SPIP: Support pandas API layer on PySpark

2021-03-26 Thread Liang-Chi Hsieh
+1 (non-binding) rxin wrote > +1. Would open up a huge persona for Spark. > > On Fri, Mar 26 2021 at 11:30 AM, Bryan Cutler < > cutlerb@ > > wrote: > >> >> +1 (non-binding) >> >> >> On Fri, Mar 26, 2021 at 9:49 AM Maciej < > mszymkiewicz@ > > wrote: >> >> >>> +1 (nonbinding) -

Re: Apache Spark 2.4.8 (and EOL of 2.4)

2021-03-26 Thread Liang-Chi Hsieh
To update the current status. The only remaining issue for 2.4 is: [SPARK-34855][CORE] spark context - avoid using local lazy val for callSite. We are waiting for the author to submit a PR for the 2.4 branch. Liang-Chi Hsieh wrote > Thank you so much, Takeshi! > > > Takeshi Yamamuro

Re: Welcoming six new Apache Spark committers

2021-03-26 Thread Liang-Chi Hsieh
Congrats! Welcome! Matei Zaharia wrote > Hi all, > > The Spark PMC recently voted to add several new committers. Please join me > in welcoming them to their new role! Our new committers are: > > - Maciej Szymkiewicz (contributor to PySpark) > - Max Gekk (contributor to Spark SQL) > - Kent Yao (

Re: Apache Spark 2.4.8 (and EOL of 2.4)

2021-04-04 Thread Liang-Chi Hsieh
Hi devs, Currently there are no open or ongoing issues targeting 2.4. On the QA test dashboard, only spark-branch-2.4-test-sbt-hadoop-2.6 is in red status. The failed test is org.apache.spark.sql.streaming.StreamingQueryManagerSuite.awaitAnyTermination with timeout and resetTerminated. It looks like a flaky

Re: Apache Spark 2.4.8 (and EOL of 2.4)

2021-04-04 Thread Liang-Chi Hsieh
Thanks Hyukjin and Dongjoon! :) Then I will start RC. Dongjoon Hyun-2 wrote > Given that Maven passed already with that profile and you tested locally, > I'm +1 for starting RC. > > Thanks, > Dongjoon. > > On Sun, Apr 4, 2021 at 2:24 AM Hyukjin Kwon < > gurwls223@ > > wrote: > >> I would +

Re: Apache Spark 2.4.8 (and EOL of 2.4)

2021-04-05 Thread Liang-Chi Hsieh
Hi Mingjia, Thanks for fixing it. I can see it is included. Liang-Chi mingjia-2 wrote > Hi, All. > > I fixed SPARK-32708 > ; > a while ago after 2.4.7 release. > PR:https://github.com/apache/spark/pull/29564 > > Since it's not listed as

[VOTE] Release Spark 2.4.8 (RC1)

2021-04-07 Thread Liang-Chi Hsieh
Please vote on releasing the following candidate as Apache Spark version 2.4.8. The vote is open until Apr 10th at 9AM PST and passes if a majority +1 PMC votes are cast, with a minimum of 3 +1 votes. [ ] +1 Release this package as Apache Spark 2.4.8 [ ] -1 Do not release this package because .

Re: [VOTE] Release Spark 2.4.8 (RC1)

2021-04-07 Thread Liang-Chi Hsieh
Thanks for voting. A while after I started running the release script to cut RC1, I found a nested column pruning bug, SPARK-34963, and unfortunately it exists in 2.4.7 too. As RC1 is already cut, I will continue this voting. The bug looks like a corner case to me, and it has not been reported yet since we support ne

Re: [VOTE] Release Spark 2.4.8 (RC1)

2021-04-07 Thread Liang-Chi Hsieh
I'm working on the fix for master. I think the fix is the same for 2.4. Okay, so I think we are in favor of RC2 and RC1 is dropped. Then I will get the fix merged first and then prepare RC2. Thank you. Liang-Chi Mridul Muralidharan wrote > Do we have a fix for this in 3.x/master which can b

[VOTE] Release Spark 2.4.8 (RC2)

2021-04-11 Thread Liang-Chi Hsieh
Please vote on releasing the following candidate as Apache Spark version 2.4.8. The vote is open until Apr 15th at 9AM PST and passes if a majority +1 PMC votes are cast, with a minimum of 3 +1 votes. [ ] +1 Release this package as Apache Spark 2.4.8 [ ] -1 Do not release this package because ...

Re: [VOTE] Release Spark 2.4.8 (RC2)

2021-04-15 Thread Liang-Chi Hsieh
Thanks all for voting. Unfortunately, we found a long-standing correctness bug, SPARK-35080, and 2.4 is affected too. That is to say, we need to drop RC2 in favor of RC3. The fix is ready for merging at https://github.com/apache/spark/pull/32179. -- Sent from: http://apache-spark-developers-list

[VOTE] Release Spark 2.4.8 (RC3)

2021-04-28 Thread Liang-Chi Hsieh
Please vote on releasing the following candidate as Apache Spark version 2.4.8. The vote is open until May 4th at 9AM PST and passes if a majority +1 PMC votes are cast, with a minimum of 3 +1 votes. [ ] +1 Release this package as Apache Spark 2.4.8 [ ] -1 Do not release this package because ...

Re: [DISCUSS] Add RocksDB StateStore

2021-04-28 Thread Liang-Chi Hsieh
I am fine with the RocksDB state store as a built-in state store. Actually, the proposal to have it as an external module was to avoid the concern raised in the previous effort. The need to have it as experimental doesn't necessarily mean it has to be an external module, I think. They are two separate things. So I don'
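For reference, a hedged sketch of how a built-in RocksDB provider is switched on once available; the class and config names are the ones that later shipped in Spark 3.2 and were still only a proposal at the time of this thread:

  val spark = org.apache.spark.sql.SparkSession.builder().getOrCreate()
  // Point structured streaming state storage at the RocksDB-backed provider.
  spark.conf.set(
    "spark.sql.streaming.stateStore.providerClass",
    "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider")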

Re: [VOTE] Release Spark 2.4.8 (RC3)

2021-04-30 Thread Liang-Chi Hsieh
Hi all, Thanks for actively voting. Unfortunately, we found a very old bug (SPARK-35278), and the fix (https://github.com/apache/spark/pull/32404) is going to be merged soon. We may fail this RC3. I will cut RC4 as soon as the fix is merged. Thank you! -- Sent from: http://apache-sp

RE: [VOTE] Release Spark 2.4.8 (RC3)

2021-05-04 Thread Liang-Chi Hsieh
> > > United States > > > > > > > > > > > > > > > From: Liang-Chi Hsieh < > viirya@ > > > To: > dev@.apache > Date: 04/30/2021 03:12 PM > S

[VOTE] Release Spark 2.4.8 (RC4)

2021-05-09 Thread Liang-Chi Hsieh
Please vote on releasing the following candidate as Apache Spark version 2.4.8. The vote is open until May 14th at 9AM PST and passes if a majority +1 PMC votes are cast, with a minimum of 3 +1 votes. [ ] +1 Release this package as Apache Spark 2.4.8 [ ] -1 Do not release this package because ...

Re: [VOTE] Release Spark 2.4.8 (RC4)

2021-05-09 Thread Liang-Chi Hsieh
Yea, I don't know why it happens. I remember RC1 also has the same issue. But RC2 and RC3 don't. Does it affect the RC? John Zhuge wrote > Got this error when browsing the staging repository: > > 404 - Repository "orgapachespark-1383 (staging: open)" > [id=orgapachespark-1383] exists but is no

Re: [VOTE] Release Spark 2.4.8 (RC4)

2021-05-11 Thread Liang-Chi Hsieh
I did upload my public key to https://dist.apache.org/repos/dist/dev/spark/KEYS. I also uploaded it to a public keyserver before cutting RC1. I also just tried to search for the public key and can find it. cloud0fan wrote > [image: image.png] > > I checked the log in https://repository.apache.org/#sta

Re: [VOTE] Release Spark 2.4.8 (RC4)

2021-05-11 Thread Liang-Chi Hsieh
I don't know what will happen if I manually close it now. Not sure if the current status causes a problem? If not, maybe leave it as it is? Sean Owen-2 wrote > Hm, yes I see it at > http://pool.sks-keyservers.net/pks/lookup?search=0x653c2301fea493ee&fingerprint=on&op=index > but not on keyserver.ub

Re: [VOTE] Release Spark 2.4.8 (RC4)

2021-05-11 Thread Liang-Chi Hsieh
Oh, I see. We cannot do a release on it as it is still in open status. Okay, let me try to close it manually via the UI. Sean Owen-2 wrote > Is there a separate process that pushes to maven central? That's what we > have to have in the end. > > On Tue, May 11, 2021, 12:31 PM Liang-Chi

Re: [VOTE] Release Spark 2.4.8 (RC4)

2021-05-11 Thread Liang-Chi Hsieh
Seems it is closed now after clicking close button in the UI. Sean Owen-2 wrote > Is there a separate process that pushes to maven central? That's what we > have to have in the end. > > On Tue, May 11, 2021, 12:31 PM Liang-Chi Hsieh < > viirya@ > > wrote: > &

Re: [VOTE] Release Spark 2.4.8 (RC4)

2021-05-11 Thread Liang-Chi Hsieh
The staging repository for this release can be accessed now too: https://repository.apache.org/content/repositories/orgapachespark-1383/ Thanks for the guidance. Liang-Chi Hsieh wrote > Seems it is closed now after clicking close button in the UI. -- Sent from: http://apache-sp

[VOTE][RESULT] Release Spark 2.4.8 (RC4)

2021-05-14 Thread Liang-Chi Hsieh
The vote passes. Thanks to all who helped with the release! (* = binding) +1: - Dongjoon Hyun * - Takeshi Yamamuro - Maxim Gekk - John Zhuge - Hyukjin Kwon * - Kent Yao - Sean Owen * - Kousuke Saruta - Holden Karau * - Wenchen Fan * - Mridul Muralidharan * - Ismaël Mejía +0: None -1: None

Re: Apache Spark 3.1.2 Release?

2021-05-17 Thread Liang-Chi Hsieh
+1 sounds good. Thanks Dongjoon for volunteering on this! Liang-Chi Dongjoon Hyun-2 wrote > Hi, All. > > Since Apache Spark 3.1.1 tag creation (Feb 21), > new 172 patches including 9 correctness patches and 4 K8s patches arrived > at branch-3.1. > > Shall we make a new release, Apache Spark 3

[ANNOUNCE] Apache Spark 2.4.8 released

2021-05-17 Thread Liang-Chi Hsieh
We are happy to announce the availability of Spark 2.4.8! Spark 2.4.8 is a maintenance release containing stability, correctness, and security fixes. This release is based on the branch-2.4 maintenance branch of Spark. We strongly recommend all 2.4 users to upgrade to this stable release. To dow

Re: Resolves too old JIRAs as incomplete

2021-05-20 Thread Liang-Chi Hsieh
+1 Thanks Takeshi! Prashant Sharma wrote > +1 > > On Thu, May 20, 2021 at 7:08 PM Wenchen Fan < > cloud0fan@ > > wrote: > >> +1 >> >> On Thu, May 20, 2021 at 11:59 AM Dongjoon Hyun < > dongjoon.hyun@ > > >> wrote: >> >>> +1. >>> >>> Thank you, Takeshi. >>> >>> On Wed, May 19, 2021 at 7:49

Re: [VOTE] Release Spark 3.1.2 (RC1)

2021-05-25 Thread Liang-Chi Hsieh
+1 (non-binding) Binary and docs look good. JIRA tickets look good. Ran simple tasks. Thank you, Dongjoon! Hyukjin Kwon wrote > +1 > > On Wed, May 26, 2021 at 9:00 AM, Cheng Su < > chengsu@.com > > wrote: > >> +1 (non-binding) >> >> >> >> Checked the related commits in commit history manually. >>

Re: [ANNOUNCE] Apache Spark 3.1.2 released

2021-06-01 Thread Liang-Chi Hsieh
Thank you, Dongjoon! Takeshi Yamamuro wrote > Thank you, Dongjoon! > > On Wed, Jun 2, 2021 at 2:29 PM Xiao Li < > lixiao@ > > wrote: > >> Thank you! >> >> Xiao >> >> On Tue, Jun 1, 2021 at 9:29 PM Hyukjin Kwon < > gurwls223@ > > wrote: >> >>> awesome! >>> >>> On Wed, Jun 2, 2021 at 9:59 AM, Dongj

Re: Apache Spark 3.0.3 Release?

2021-06-08 Thread Liang-Chi Hsieh
+1. Thank you! Liang-Chi Dongjoon Hyun-2 wrote > +1, Thank you! :) > > Bests, > Dongjoon. > > On Tue, Jun 8, 2021 at 9:05 PM Kent Yao < > yaooqinn@ > > wrote: > >> +1. Thanks, Yi ~ >> >> Bests, >> *Kent Yao * >> @ Data Science Center, Hangzhou Research Institute, NetEase Corp. >> *a spark

Re: Apache Spark 3.2 Expectation

2021-06-16 Thread Liang-Chi Hsieh
ssionization (session window) ><https://issues.apache.org/jira/browse/SPARK-10816>; >- Add RocksDB StateStore as external module ><https://issues.apache.org/jira/browse/SPARK-34198>; > > > I wonder whether we should postpone the branch cut date. > cc Min S

Re: Apache Spark 3.2 Expectation

2021-06-16 Thread Liang-Chi Hsieh
imited backporting. > > https://github.com/apache/spark/graphs/commit-activity > > Bests, > Dongjoon. > > > On Wed, Jun 16, 2021 at 9:19 AM Liang-Chi Hsieh < > viirya@ > > wrote: > >> First, thanks for being volunteer as the release manager

Re: [VOTE] Release Spark 3.0.3 (RC1)

2021-06-18 Thread Liang-Chi Hsieh
+1. Docs looks good. Binary looks good. Ran simple test and some tpcds queries. Thanks for working on this! wuyi wrote > Please vote on releasing the following candidate as Apache Spark version > 3.0.3. > > The vote is open until Jun 21th 3AM (PST) and passes if a majority +1 PMC > votes are c

[VOTE][RESULT] SPIP: Storage Partitioned Join for Data Source V2

2021-11-02 Thread Liang Chi Hsieh
Hi all, The vote passed with the following 9 +1 votes and no -1 or +0 votes: Liang-Chi Hsieh* Russell Spitzer Dongjoon Hyun* Huaxin Gao Ryan Blue DB Tsai* Holden Karau* Cheng Su Wenchen Fan* * = binding Thank you guys all for your feedback and votes

RE: [VOTE] SPIP: Row-level operations in Data Source V2

2021-11-12 Thread Liang Chi Hsieh
I’d vote my +1 first. On 2021/11/13 02:25:05 "L. C. Hsieh" wrote: > Hi all, > > I’d like to start a vote for SPIP: Row-level operations in Data Source V2. > > The proposal is to add support for executing row-level operations > such as DELETE, UPDATE, MERGE for v2 tables (SPARK-35801). The > exec

Re: java.lang.IllegalStateException: There is no space for new record

2016-12-09 Thread Liang-Chi Hsieh
Hi Nick, I think it is due to a bug in UnsafeKVExternalSorter. I created a Jira and a PR for this bug: https://issues.apache.org/jira/browse/SPARK-18800 - Liang-Chi Hsieh | @viirya Spark Technology Center -- View this message in context: http://apache-spark-developers-list.1001551

Re: Document Similarity -Spark Mllib

2016-12-10 Thread Liang-Chi Hsieh
o. You can also adjust the threshold when using DIMSUM. [1] Reza Bosagh Zadeh and Gunnar Carlsson, "Dimension Independent Matrix Square using MapReduce (DIMSUM)" [2] Reza Bosagh Zadeh and Ashish Goel, "Dimension Independent Similarity Computation" - Liang-Chi Hsieh | @
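A hedged sketch of the DIMSUM-based similarity being referenced, assuming an existing SparkContext sc; the vectors are illustrative:

  import org.apache.spark.mllib.linalg.Vectors
  import org.apache.spark.mllib.linalg.distributed.RowMatrix

  val rows = sc.parallelize(Seq(
    Vectors.dense(1.0, 0.0, 2.0),
    Vectors.dense(0.0, 3.0, 1.0)))
  val mat = new RowMatrix(rows)
  // A higher threshold samples more aggressively: cheaper, but approximate.
  val approx = mat.columnSimilarities(threshold = 0.5)
  val exact  = mat.columnSimilarities()   // brute force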

Re: Question about SPARK-11374 (skip.header.line.count)

2016-12-11 Thread Liang-Chi Hsieh
Hi Dongjoon, I know some people only use Spark SQL with SQL syntax, not the Dataset API. So I think it would be useful to provide a way to do this in SQL. - Liang-Chi Hsieh | @viirya Spark Technology Center -- View this message in context: http://apache-spark-developers-list.1001551.n3

Re: Why don't we imp some adaptive learning rate methods, such as adadelat, adam?

2016-12-11 Thread Liang-Chi Hsieh
Hi, There is a plan to add this into Spark ML. Please check out https://issues.apache.org/jira/browse/SPARK-18023. You can also follow this jira to get the latest update. - Liang-Chi Hsieh | @viirya Spark Technology Center http://www.spark.tc/ -- View this message in context: http

Re: Document Similarity -Spark Mllib

2016-12-13 Thread Liang-Chi Hsieh
sure the problem is exactly at columnSimilarities? E.g, val exact = mat.columnSimilarities(0.5) val exactCount = exact.entries.count - Liang-Chi Hsieh | @viirya Spark Technology Center http://www.spark.tc/ -- View this message in context: http://apache-spark-developers-list.1001551

Re: Document Similarity -Spark Mllib

2016-12-15 Thread Liang-Chi Hsieh
like it might not work much better than brute force even if you set a higher threshold. - Liang-Chi Hsieh | @viirya Spark Technology Center http://www.spark.tc/ -- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Document-Similarity-Spark-Mllib

Re: Mistake in Apache Spark Java.

2016-12-16 Thread Liang-Chi Hsieh
Hi, I tried your example with latest Spark master branch and branch-2.0. It works well. - Liang-Chi Hsieh | @viirya Spark Technology Center http://www.spark.tc/ -- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Mistake-in-Apache-Spark-Java

Re: Aggregating over sorted data

2016-12-18 Thread Liang-Chi Hsieh
as a user-defined JVM object as a buffer to hold the input data for your aggregate function. But you may need to write the necessary encoder for the buffer object. If you really need this feature, you may open a Jira to ask others' opinion about it. - Liang-Chi Hsieh | @viirya S
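A hedged sketch of that idea using the typed Aggregator API, with a plain Seq[String] as the JVM buffer and a Kryo encoder for it; the input type and ordering logic are illustrative:

  import org.apache.spark.sql.{Encoder, Encoders}
  import org.apache.spark.sql.expressions.Aggregator

  // Collects the letters of each group; the buffer is an ordinary JVM object,
  // so it needs its own encoder (here via Kryo).
  val collectLetters = new Aggregator[(Int, String), Seq[String], Seq[String]] {
    def zero: Seq[String] = Seq.empty
    def reduce(buf: Seq[String], in: (Int, String)): Seq[String] = buf :+ in._2
    def merge(b1: Seq[String], b2: Seq[String]): Seq[String] = b1 ++ b2
    def finish(buf: Seq[String]): Seq[String] = buf.sorted  // impose whatever order you need
    def bufferEncoder: Encoder[Seq[String]] = Encoders.kryo[Seq[String]]
    def outputEncoder: Encoder[Seq[String]] = Encoders.kryo[Seq[String]]
  }
  // Usage on a Dataset[(Int, String)]:  ds.groupByKey(_._1).agg(collectLetters.toColumn)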

Re: Reduce memory usage of UnsafeInMemorySorter

2016-12-20 Thread Liang-Chi Hsieh
if it is the root cause and fix it then. - Liang-Chi Hsieh | @viirya Spark Technology Center http://www.spark.tc/ -- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/java-lang-IllegalStateException-There-is-no-space-for-new-record-tp20108p20298.html

Re: Null pointer exception with RDD while computing a method, creating dataframe.

2016-12-20 Thread Liang-Chi Hsieh
Hi, You can't invoke any RDD actions/transformations inside another transformation. They must be invoked by the driver. If I understand your purpose correctly, you can partition your data (i.e., `partitionBy`) when writing out to parquet files. - Liang-Chi Hsieh | @viirya
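A minimal sketch of the suggested write-time partitioning, assuming df is the DataFrame built on the driver; the column name and path are illustrative:

  // Partition at write time instead of calling RDD operations inside a transformation.
  df.write
    .partitionBy("country_code")
    .parquet("/tmp/output/by_country")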

Re: Aggregating over sorted data

2016-12-20 Thread Liang-Chi Hsieh
collect_list('letter)).show()
+------+--------------------+
|number|collect_list(letter)|
+------+--------------------+
|     3|           [a, b, c]|
|     1|           [a, b, c]|
|     2|           [a, b, c]|
+------+--------------------+
I think it should let you do aggrega

Re: Launching multiple spark jobs within a main spark job.

2016-12-21 Thread Liang-Chi Hsieh
on the node which the driver runs. It looks weird. Looks like you try to fetch some data first and do some jobs on the data. Can't you just do those jobs in the main driver as Spark actions with its API? - Liang-Chi Hsieh | @viirya Spark Technology Center http://www.spark.tc/ -- View

Re: Launching multiple spark jobs within a main spark job.

2016-12-21 Thread Liang-Chi Hsieh
sure the node running the driver has enough resources to run them. I am not sure if you can use `SparkLauncher` to submit them in different mode, e.g., main driver in client mode, others in cluster mode. Worth trying. - Liang-Chi Hsieh | @viirya Spark Technology Center http://www.spark.tc
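For reference, a hedged sketch of submitting a child job with SparkLauncher from the node running the main driver; the jar path, main class, and master are illustrative:

  import org.apache.spark.launcher.SparkLauncher

  val handle = new SparkLauncher()
    .setAppResource("/path/to/child-job.jar")
    .setMainClass("com.example.ChildJob")
    .setMaster("yarn")
    .setDeployMode("cluster")   // the child can run in cluster mode while the parent driver stays put
    .startApplication()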

Re: Launching multiple spark jobs within a main spark job.

2016-12-21 Thread Liang-Chi Hsieh
properly installed in all nodes in the cluster, because those Spark jobs will be launched at the node which the main driver is running on. - Liang-Chi Hsieh | @viirya Spark Technology Center http://www.spark.tc/ -- View this message in context: http://apache-spark-developers-list.1001551.n3

Re: Null pointer exception with RDD while computing a method, creating dataframe.

2016-12-21 Thread Liang-Chi Hsieh
_code").parquet(.....) - Liang-Chi Hsieh | @viirya Spark Technology Center http://www.spark.tc/ -- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Null-pointer-exception-with-RDD-while-computing-a-method-creating-dataframe-tp20308p20328.html Sent from the Apa

Re: Aggregating over sorted data

2016-12-21 Thread Liang-Chi Hsieh
. - Liang-Chi Hsieh | @viirya Spark Technology Center http://www.spark.tc/ -- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Aggregating-over-sorted-data-tp1p20331.html Sent from the Apache Spark Developers List mailing list archive at Nabble.com

Re: Aggregating over sorted data

2016-12-22 Thread Liang-Chi Hsieh
h in spark-sorted. see here: > https://github.com/tresata/spark-sorted/blob/master/src/main/scala/com/tresata/spark/sorted/sql/GroupSortedDataset.scala > > On Wed, Dec 21, 2016 at 9:44 PM, Liang-Chi Hsieh < > viirya@ > > wrote: > >> >> I agreed that to make su

Re: stratified sampling scales poorly

2016-12-22 Thread Liang-Chi Hsieh
= Map(“key1”-> fraction, “key2”-> fraction, …, “keyn”-> > fraction). > > I have a question is that why stratified sampling scales poorly with > different sampling fractions in this context? meanwhile simple random > sampling scales well with different sampling fractions (I ra
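A hedged sketch of the stratified sampling being asked about, assuming an existing SparkContext sc; keys and fractions are illustrative:

  val data = sc.parallelize(Seq(("key1", 1), ("key1", 2), ("key2", 3), ("key2", 4)))
  val fractions = Map("key1" -> 0.5, "key2" -> 0.1)
  // Approximate per-stratum sampling; sampleByKeyExact hits the exact counts but costs more.
  val approxSample = data.sampleByKey(withReplacement = false, fractions)
  val exactSample  = data.sampleByKeyExact(withReplacement = false, fractions)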

Re: java.lang.AssertionError: assertion failed

2016-12-22 Thread Liang-Chi Hsieh
-18986. - Liang-Chi Hsieh | @viirya Spark Technology Center http://www.spark.tc/ -- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/java-lang-AssertionError-assertion-failed-tp20277p20338.html Sent from the Apache Spark Developers List mailing list

Re: MapOutputTracker.getMapSizesByExecutorId and mutation on the driver?

2016-12-23 Thread Liang-Chi Hsieh
spark/MapOutputTracker.scala#L133 > > Pozdrawiam, > Jacek Laskowski > > https://medium.com/@jaceklaskowski/ > Mastering Apache Spark 2.0 https://bit.ly/mastering-apache-spark > Follow me at https://twitter.com/jaceklaskowski > > -------

Re: Shuffle intermidiate results not being cached

2016-12-26 Thread Liang-Chi Hsieh
union x2))) ... Your first example just does two aggregation operations. But your second example, like the above, does these aggregation operations for each iteration. So the time of the second example grows as the number of iterations increases. - Liang-Chi Hsieh | @viirya Spark Technology Center http
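One way to mitigate this, sketched under the assumption that the batches are a Seq[DataFrame]; the column names follow the quoted example, the rest is illustrative:

  import org.apache.spark.sql.functions.{count, sum}

  // Cache the accumulated DataFrame so each iteration's aggregation does not
  // recompute the whole union lineage from scratch.
  var allDF = batches.head
  for (batch <- batches.tail) {
    val previous = allDF
    allDF = previous.union(batch).cache()
    allDF.groupBy("cat1").agg(sum("v"), count("cat2")).show()
    previous.unpersist()
  }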

RE: Shuffle intermidiate results not being cached

2016-12-27 Thread Liang-Chi Hsieh
slower than batch. > Is there a correct way to do such an aggregation on streaming data (using > dataframes rather than RDD operations). > Assaf. > > > > From: Liang-Chi Hsieh [via Apache Spark Developers List] [mailto: > ml-node+s1001551n20361h80@.nabble > ] >

RE: Shuffle intermidiate results not being cached

2016-12-28 Thread Liang-Chi Hsieh
performance wise) was to write a custom > UDAF which does the window internally. This was still 8 times lower > throughput than batch and required a lot of coding and is not a general > solution. > > I am looking for an approach to improve the performance even more > (preferably to eith

RE: Shuffle intermidiate results not being cached

2016-12-28 Thread Liang-Chi Hsieh
} println(s"Took $timeLen miliseconds") totalTime += timeLen } val timeLen2 = time { val grouped2 = allDF.groupBy("cat1").agg(sum($"v"), count($"cat2")) grouped2.show() } totalTime += timeLen2 println(s"Overall time was $totalTime miliseconds

Re: Error: at sqlContext.createDataFrame with RDD and Schema

2016-12-28 Thread Liang-Chi Hsieh
cala.tools.nsc.interpreter.IMain.interpret(IMain.scala:565) >> at scala.tools.nsc.interpreter.ILoop.interpretStartingWith( >> ILoop.scala:807) >> at scala.tools.nsc.interpreter.ILoop.command(ILoop.scala:681) >> at scala.tools.nsc.interpreter.ILoop.processLine(ILoop.scala:395) >> at scala.to

Re: Why is spark.shuffle.sort.bypassMergeThreshold 200?

2016-12-28 Thread Liang-Chi Hsieh
//bit.ly/mastering-apache-spark > Follow me at https://twitter.com/jaceklaskowski > > --------- > To unsubscribe e-mail: > dev-unsubscribe@.apache - Liang-Chi Hsieh | @viirya Spark Technology Center http://www.spark.tc/ -- View this message

Re: repeated unioning of dataframes take worse than O(N^2) time

2016-12-30 Thread Liang-Chi Hsieh
n Union of 100 Range plans, there are 5049 Range(s) needed to go through. For 200 Range plans, it becomes 20099. You can see it is not a linear relation. ----- Liang-Chi Hsieh | @viirya Spark Technology Center http://www.spark.tc/ -- View this message in context: http://apache-spark-devel
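Those counts match a quadratic build-up: growing the union one DataFrame at a time re-walks a plan of 2 + 3 + ... + N nodes, i.e. N(N+1)/2 - 1, which is 5049 for N = 100 and 20099 for N = 200. A hedged sketch of a common workaround is to union the underlying RDDs in one shot (names and sizes illustrative):

  import org.apache.spark.sql.DataFrame

  val dfs: Seq[DataFrame] = (1 to 100).map(i => spark.range(i).toDF("id"))
  // dfs.reduce(_ union _) analyzes an ever-growing plan at each step;
  // unioning the underlying RDDs once keeps the work linear.
  val merged = spark.createDataFrame(
    spark.sparkContext.union(dfs.map(_.rdd)),
    dfs.head.schema)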

Re: context.runJob() was suspended in getPreferredLocations() function

2017-01-01 Thread Liang-Chi Hsieh
errors and no > outputs. > > What is the reason for it? > > Thanks, > Fei - Liang-Chi Hsieh | @viirya Spark Technology Center http://www.spark.tc/ -- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/context-runJob-was-suspended-in-ge

Re: What is mainly different from a UDT and a spark internal type that ExpressionEncoder recognized?

2017-01-03 Thread Liang-Chi Hsieh
t;> >> >> -- >> View this message in context: http://apache-spark- >> developers-list.1001551.n3.nabble.com/What-is-mainly- >> different-from-a-UDT-and-a-spark-internal-type-that- >> ExpressionEncoder-recognized-tp20370.html >> Sent from the Apache Spark Developers List mailing list archive at >> Nabble.com. >> >>

Re: Skip Corrupted Parquet blocks / footer.

2017-01-04 Thread Liang-Chi Hsieh
pparquetdata/corruptblock.0 is not a > Parquet file. expected magic number at tail [80, 65, 82, 49] but found > [65, 82, 49, 10] > at > org.apache.parquet.hadoop.ParquetFileReader.readAllFootersInParallel(ParquetFileReader.java:248) > > > Please let me know if I am missing

Re: Skip Corrupted Parquet blocks / footer.

2017-01-04 Thread Liang-Chi Hsieh
Forgot to say: another option is that we can replace readAllFootersInParallel with our own parallel reading logic, so we can ignore corrupt files. Liang-Chi Hsieh wrote > Hi, > > The method readAllFootersInParallel is implemented in Parquet's > ParquetFileReader. So
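Separately from changing the footer-reading logic, a hedged sketch of the scan-time option that may already help, assuming a Spark version that has it; it may not cover every footer-reading path discussed here:

  // Skip unreadable files at scan time instead of failing the whole job.
  spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")
  val df = spark.read.parquet("/path/with/possibly/corrupt/files")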

Re: Quick request: prolific PR openers, review your open PRs

2017-01-04 Thread Liang-Chi Hsieh
s/yanboliang>; 10 > cloud-fan <https://spark-prs.appspot.com/users/cloud-fan>; 8 > jerryshao <https://spark-prs.appspot.com/users/jerryshao>; 8 - Liang-Chi Hsieh | @viirya Spark Technology Center http://www.spark.tc/ -- View this message in context: http://apache

Re: Converting an InternalRow to a Row

2017-01-04 Thread Liang-Chi Hsieh
System.out.println("Round trip: " + roundTrip.size()); > } > > The code fails at the line encoder.fromRow() with the exception: >> Caused by: java.lang.UnsupportedOperationException: Cannot evaluate > expression: getcolumnbyordinal(0, IntegerType) > > --- >

Re: Skip Corrupted Parquet blocks / footer.

2017-01-04 Thread Liang-Chi Hsieh
2, 49, 10] > at > org.apache.parquet.hadoop.ParquetFileReader.readAllFootersInParallel(ParquetFileReader.java:248) > > > Please let me know if I am missing anything. - Liang-Chi Hsieh | @viirya Spark Technology Center http://www.spark.tc/ -- View this message in con

Re: Converting an InternalRow to a Row

2017-01-05 Thread Liang-Chi Hsieh
tribute::toAttribute). >> collect(Collectors.toList()); >> val encoder = >> RowEncoder.apply(schema).resolveAndBind(ScalaUtils.scalaSeq(attributes), >> SimpleAnalyzer$.MODULE$); >> >> >> --- >> Regards, >> Andy >> >> On Thu, Jan 5, 2017

Re: Parquet patch release

2017-01-07 Thread Liang-Chi Hsieh
estions, please follow that thread on the Parquet list. > > Thanks! > > rb > > -- > Ryan Blue > Software Engineer > Netflix - Liang-Chi Hsieh | @viirya Spark Technology Center http://www.spark.tc/ -- View this message in context: http://apache-spark-devel

Re: Converting an InternalRow to a Row

2017-01-07 Thread Liang-Chi Hsieh
nternalRow > class TopKDataType extends UserDefinedType > > { > private final ExpressionEncoder > > unboundedEncoder; > private final List > > data; > >public Row[] rows() { > val encoder = resolveAndBind(this.unboundedEncoder); > >

Re: handling of empty partitions

2017-01-08 Thread Liang-Chi Hsieh
such empty partitions? > http://stackoverflow.com/questions/41474175/spark-mappartitionswithindex-handling-empty-partitions > > Kind regards, > Georg - Liang-Chi Hsieh | @viirya Spark Technology Center http://www.spark.tc/ -- View this message in context: http://apache-spark-develope

Re: A note about MLlib's StandardScaler

2017-01-08 Thread Liang-Chi Hsieh
4 >> -0.19984, 0.970241]) >> >> This result has std=sqrt(2/3) >> >> Instead it should have resulted other 3 vectors that form std=1 for each >> column. >> >> Adding another vector (4 total) results in 4 scaled vectors that form >> std= sqrt(3/4) ins
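For what it's worth, the quoted numbers are consistent with StandardScaler dividing by the corrected (sample) standard deviation, i.e. the n - 1 denominator: after scaling, the population standard deviation of each column is sqrt((n - 1) / n), which gives sqrt(2/3) for 3 vectors and sqrt(3/4) for 4 vectors, exactly as observed.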

Re: handling of empty partitions

2017-01-09 Thread Liang-Chi Hsieh
n you do mapPartitions, it just gives you an empty iterator as input. You can do what you need. You already return a None when you find an empty iterator in preparing "toCarry". So I was wondering what you want to ask in the previous reply. geoHeil wrote > Thanks a lot, Ho
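A hedged sketch of the empty-partition handling discussed here, loosely following the "toCarry" idea from the linked question, assuming an existing SparkContext sc; names and data are illustrative:

  // Three elements over four partitions guarantees at least one empty partition.
  val rdd = sc.parallelize(Seq("a", "b", "c"), numSlices = 4)
  val toCarry = rdd.mapPartitionsWithIndex { (idx, iter) =>
    // An empty partition just shows up as an empty iterator: return None for it.
    val last = if (iter.hasNext) Some(iter.toSeq.last) else None
    Iterator((idx, last))
  }.collectAsMap()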

Re: scala.MatchError: scala.collection.immutable.Range.Inclusive from catalyst.ScalaReflection.serializerFor?

2017-01-09 Thread Liang-Chi Hsieh
.com/@jaceklaskowski/ > Mastering Apache Spark 2.0 https://bit.ly/mastering-apache-spark > Follow me at https://twitter.com/jaceklaskowski > > - > To unsubscribe e-mail: > dev-unsubscribe@.apache - Liang-Chi Hsieh | @viirya Spark
