Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22275#discussion_r219404072

    --- Diff: python/pyspark/sql/tests.py ---
    @@ -4434,6 +4434,12 @@ def test_timestamp_dst(self):
                 self.assertPandasEqual(pdf, df_from_python.toPandas())
                 self.assertPandasEqual(pdf, df_from_pandas.toPandas())

    +    def test_toPandas_batch_order(self):
    +        df = self.spark.range(64, numPartitions=8).toDF("a")
    +        with self.sql_conf({"spark.sql.execution.arrow.maxRecordsPerBatch": 4}):
    +            pdf, pdf_arrow = self._toPandas_arrow_toggle(df)
    +            self.assertPandasEqual(pdf, pdf_arrow)
    --- End diff --

    hm, is this test case "enough" to reliably trigger a possible problem, given that any failure would be random? Would increasing the number of batches, or the number of records per batch, increase the chance of hitting a streaming-order or concurrency issue?
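The ordering concern behind this question can be sketched without Spark. The model below is a hypothetical illustration (not Spark's actual collect path): each partition produces one index-tagged batch, batches arrive in arbitrary order, and the driver reassembles them by index. With only a few batches, a random arrival order frequently happens to already be sorted, so a single test run can pass even if reassembly were broken; more batches make such an accidental pass far less likely (probability 1/n! for n batches).

```python
import random

def stream_batches_out_of_order(num_batches, records_per_batch, seed=None):
    # Model of a concurrent collect: each partition emits one batch
    # tagged with its partition index; arrival order is arbitrary.
    batches = [
        (i, list(range(i * records_per_batch, (i + 1) * records_per_batch)))
        for i in range(num_batches)
    ]
    random.Random(seed).shuffle(batches)
    return batches

def reassemble(batches):
    # Driver-side fix-up: sort received batches by their original index
    # before concatenating, so the result matches partition order.
    ordered = sorted(batches, key=lambda b: b[0])
    return [row for _, rows in ordered for row in rows]

# 8 batches of 4 records mirrors range(64, numPartitions=8) with
# maxRecordsPerBatch=4 in spirit; reassembly restores 0..31 here.
result = reassemble(stream_batches_out_of_order(8, 4, seed=0))
```

Under this model, bumping `num_batches` directly shrinks the odds that an unordered arrival masquerades as a pass, which is the intuition behind the reviewer's suggestion.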