[GitHub] [spark] ion-elgreco commented on pull request #38624: [SPARK-40559][PYTHON] Add applyInArrow to groupBy and cogroup

2023-09-22 Thread via GitHub
ion-elgreco commented on PR #38624: URL: https://github.com/apache/spark/pull/38624#issuecomment-1731949210 @HyukjinKwon since @igorghi has shown with his tests that it's not possible to use repartition().mapInArrow to mimic groupbyApply, would it now make sense to add groupbyApplyInArrow?

[GitHub] [spark] ion-elgreco commented on pull request #38624: [SPARK-40559][PYTHON] Add applyInArrow to groupBy and cogroup

2023-09-12 Thread via GitHub
ion-elgreco commented on PR #38624: URL: https://github.com/apache/spark/pull/38624#issuecomment-1716592989 > @EnricoMi @HyukjinKwon Can we get some traction on this PR to either close as `wont-merge` or evaluate what needs to be done so it can be merged? Repartition is not working,

[GitHub] [spark] ion-elgreco commented on pull request #38624: [SPARK-40559][PYTHON] Add applyInArrow to groupBy and cogroup

2023-08-21 Thread via GitHub
ion-elgreco commented on PR #38624: URL: https://github.com/apache/spark/pull/38624#issuecomment-1686654192 > Arrow was considered as an internal format initially, and that's the whole reason why pandas came up first. In fact, the number of pandas users is (much) higher given some stats I

[GitHub] [spark] ion-elgreco commented on pull request #38624: [SPARK-40559][PYTHON] Add applyInArrow to groupBy and cogroup

2023-08-21 Thread via GitHub
ion-elgreco commented on PR #38624: URL: https://github.com/apache/spark/pull/38624#issuecomment-1685790842 > `mapInArrow` is marked as a developer API, and my initial intention was to avoid adding the arrow version of that everywhere - in theory `mapInArrow` can cover all the cases except

[GitHub] [spark] ion-elgreco commented on pull request #38624: [SPARK-40559][PYTHON] Add applyInArrow to groupBy and cogroup

2023-08-20 Thread via GitHub
ion-elgreco commented on PR #38624: URL: https://github.com/apache/spark/pull/38624#issuecomment-1685688494 > I get that `cogroup` might not be possible tho. But we can just convert pandas back to arrow batches easily. Is this really required for some scenario? IIRC this is only useful for

[GitHub] [spark] ion-elgreco commented on pull request #38624: [SPARK-40559][PYTHON] Add applyInArrow to groupBy and cogroup

2023-08-19 Thread via GitHub
ion-elgreco commented on PR #38624: URL: https://github.com/apache/spark/pull/38624#issuecomment-1685039400 @dongjoon-hyun @zhengruifeng @allisonwang-db @xinrong-meng @HyukjinKwon Are there any updates on this PR? This would be a very useful feature for scaling other data frame libraries

[GitHub] [spark] ion-elgreco commented on pull request #38624: [SPARK-40559][PYTHON] Add applyInArrow to groupBy and cogroup

2023-08-02 Thread via GitHub
ion-elgreco commented on PR #38624: URL: https://github.com/apache/spark/pull/38624#issuecomment-1662931308 Looking forward to seeing this PR get merged :)