thvasilo commented on issue #33188: URL: https://github.com/apache/arrow/issues/33188#issuecomment-1606127594
We currently create the data using Spark 3.1. Specifically, many of SparkML's feature-processing algorithms produce vector columns as output, which we then convert to plain Python float lists and save to Parquet. Depending on the parallelism we choose for Spark, we can end up in the situation described above. I'm not sure whether we can explicitly use LargeList to save our data; is there a pyarrow API for that?
