ion-elgreco commented on PR #38624:
URL: https://github.com/apache/spark/pull/38624#issuecomment-1731949210
@HyukjinKwon since @igorghi's tests have shown that it's not possible to
use `repartition().mapInArrow` to mimic `groupbyApply`, would it now make
sense to add `groupbyApplyInArrow`?
--
ion-elgreco commented on PR #38624:
URL: https://github.com/apache/spark/pull/38624#issuecomment-1716592989
> @EnricoMi @HyukjinKwon Can we get some traction on this PR, to either close
it as `wont-merge` or evaluate what needs to be done so it can be merged?
Repartition is not working,
ion-elgreco commented on PR #38624:
URL: https://github.com/apache/spark/pull/38624#issuecomment-1686654192
> Arrow was considered as an internal format initially, and that's the whole
reason why pandas came up first. In fact, the number of pandas users is (much)
higher given some stats I
ion-elgreco commented on PR #38624:
URL: https://github.com/apache/spark/pull/38624#issuecomment-1685790842
> `mapInArrow` is marked as a developer API, and my initial intention was to
avoid adding the arrow version of that everywhere - in theory `mapInArrow` can
cover all the cases except
ion-elgreco commented on PR #38624:
URL: https://github.com/apache/spark/pull/38624#issuecomment-1685688494
> I get that `cogroup` might not be possible, though. But we can just convert
pandas back to arrow batches easily. Is this really required for some scenario?
IIRC this is only useful for
ion-elgreco commented on PR #38624:
URL: https://github.com/apache/spark/pull/38624#issuecomment-1685039400
@dongjoon-hyun @zhengruifeng @allisonwang-db @xinrong-meng @HyukjinKwon
Are there any updates on this PR? This would be a very useful feature for
scaling other data frame libraries
ion-elgreco commented on PR #38624:
URL: https://github.com/apache/spark/pull/38624#issuecomment-1662931308
Looking forward to seeing this PR get merged :)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL