[GitHub] [spark] WeichenXu123 commented on pull request #40724: [SPARK-43081] [ML] [CONNECT] Add torch distributor data loader that loads data from spark partition data

2023-04-26 Thread via GitHub
WeichenXu123 commented on PR #40724: URL: https://github.com/apache/spark/pull/40724#issuecomment-1524671277 > @mengxr raised another suggestion: use petastorm to load data from DBFS / HDFS / ... (so that the torch distributor has a simpler interface). But there’s a shortcoming tha

[GitHub] [spark] WeichenXu123 commented on pull request #40724: [SPARK-43081] [ML] [CONNECT] Add torch distributor data loader that loads data from spark partition data

2023-04-12 Thread via GitHub
WeichenXu123 commented on PR #40724: URL: https://github.com/apache/spark/pull/40724#issuecomment-1505144497 @mengxr raised another suggestion: use petastorm to load data from DBFS / HDFS / ... (so that the torch distributor has a simpler interface). But there’s a shortcoming that

[GitHub] [spark] WeichenXu123 commented on pull request #40724: [SPARK-43081] [ML] [CONNECT] Add torch distributor data loader that loads data from spark partition data

2023-04-12 Thread via GitHub
WeichenXu123 commented on PR #40724: URL: https://github.com/apache/spark/pull/40724#issuecomment-1505144052 > what if there are two input datasets, one for training and one for validation? We can add an "is_validation" boolean column to mark whether each row is for training or for validation. -
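
As a rough illustration of the "is_validation" idea in the comment above, here is a minimal plain-Python sketch. It is not the PR's implementation: the helper names `tag_rows` and `split_rows`, the dict-based row layout, and the sample feature names are all hypothetical, and only the "is_validation" column name comes from the comment. It tags two datasets with a boolean flag, combines them into one collection, and separates them again on the consumer side.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical row representation: one dict per record.
Row = Dict[str, object]


def tag_rows(rows: Iterable[Row], is_validation: bool) -> List[Row]:
    """Return copies of the rows with an added "is_validation" flag."""
    return [{**row, "is_validation": is_validation} for row in rows]


def split_rows(rows: Iterable[Row]) -> Tuple[List[Row], List[Row]]:
    """Separate a combined dataset back into (training, validation) rows."""
    train: List[Row] = []
    val: List[Row] = []
    for row in rows:
        # Route each row by its flag; the flag itself stays on the row.
        (val if row["is_validation"] else train).append(row)
    return train, val


# Made-up sample data, used only to exercise the two helpers.
train_rows = [{"x": 1.0, "y": 0}, {"x": 2.0, "y": 1}]
val_rows = [{"x": 3.0, "y": 1}]

# One combined dataset carries both splits, distinguished by the flag.
combined = tag_rows(train_rows, False) + tag_rows(val_rows, True)
train_part, val_part = split_rows(combined)
```

With a Spark DataFrame, the same tagging would presumably be done by adding a literal boolean column to each input before a union, so that a single dataset can be handed to the loader.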