[jira] [Commented] (SPARK-25344) Break large PySpark unittests into smaller files
[ https://issues.apache.org/jira/browse/SPARK-25344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16690830#comment-16690830 ]

Apache Spark commented on SPARK-25344:

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/23077

> Break large PySpark unittests into smaller files
>
>                 Key: SPARK-25344
>                 URL: https://issues.apache.org/jira/browse/SPARK-25344
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>    Affects Versions: 2.4.0
>            Reporter: Imran Rashid
>            Assignee: Hyukjin Kwon
>            Priority: Major
>
> We've got a ton of tests in one humongous tests.py file, rather than breaking
> it out into smaller files.
> Having one huge file doesn't seem great for code organization, and it also
> makes the test parallelization in run-tests.py not work as well. On my
> laptop, tests.py takes 150s, and the next longest test file takes only 20s.
> There are similarly large files in other pyspark modules, e.g. sql/tests.py,
> ml/tests.py, mllib/tests.py, streaming/tests.py.
> It seems that at least for some of these files, it's already broken into
> independent test classes, so it shouldn't be too hard to just move them into
> their own files.
> We could pick up one example and follow. The current style looks closer to
> NumPy structure and looks easier to follow.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
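The parallelization point in the issue description can be illustrated with a small sketch. It assumes a greedy longest-first scheduler that runs each test file whole on the next free worker, which is not necessarily how run-tests.py assigns work. Only the 150 s and 20 s timings come from the description; the other durations, the worker count, and the even split into ten files are made up for illustration.

```python
import heapq


def makespan(durations, workers):
    """Wall-clock time when each file runs whole on the next free worker.

    Assumed scheduling policy (hypothetical, for illustration only):
    start the slowest files first, always on the worker that frees up earliest.
    """
    finish_times = [0.0] * workers
    heapq.heapify(finish_times)
    for duration in sorted(durations, reverse=True):
        earliest = heapq.heappop(finish_times)
        heapq.heappush(finish_times, earliest + duration)
    return max(finish_times)


# 150 s and 20 s are from the issue description; 15, 10 and 5 are invented.
one_big_file = [150, 20, 15, 10, 5]

# Same total work, with the 150 s file split into ten 15 s files (assumed).
split_files = [15] * 10 + [20, 15, 10, 5]

print(makespan(one_big_file, workers=4))  # 150.0
print(makespan(split_files, workers=4))   # 55.0
```

With one 150 s file, no number of workers can bring the run below 150 s, because a single file is the unit of work. After the split, the same 200 s of tests finish in 55 s on four workers under these assumed numbers.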
[jira] [Commented] (SPARK-25344) Break large PySpark unittests into smaller files
[ https://issues.apache.org/jira/browse/SPARK-25344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16685613#comment-16685613 ]

Bryan Cutler commented on SPARK-25344:

[~hyukjin.kwon] no problem, I can take on ML and MLlib
[jira] [Commented] (SPARK-25344) Break large PySpark unittests into smaller files
[ https://issues.apache.org/jira/browse/SPARK-25344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16685494#comment-16685494 ]

Hyukjin Kwon commented on SPARK-25344:

Oh yea. Will do as soon as it gets merged.
[jira] [Commented] (SPARK-25344) Break large PySpark unittests into smaller files
[ https://issues.apache.org/jira/browse/SPARK-25344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16685485#comment-16685485 ]

Imran Rashid commented on SPARK-25344:

Sounds good to me. It makes sense to break the work up into smaller chunks, so you don't have one giant change that is hard to merge / constantly has merge conflicts.

I know I'm a broken record on this, but I think we should tell dev@ about the new organization and get all new tests to follow the new pattern. That makes it a lot easier to keep refactoring incrementally.
[jira] [Commented] (SPARK-25344) Break large PySpark unittests into smaller files
[ https://issues.apache.org/jira/browse/SPARK-25344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16685394#comment-16685394 ]

Hyukjin Kwon commented on SPARK-25344:

Hey [~bryanc], once the first attempt is merged, would you mind taking a look at some of the sub-tasks? In particular, I would appreciate it if you have a chance to take a look at ML and MLlib.
[jira] [Commented] (SPARK-25344) Break large PySpark unittests into smaller files
[ https://issues.apache.org/jira/browse/SPARK-25344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16684910#comment-16684910 ]

Hyukjin Kwon commented on SPARK-25344:

[~irashid] and [~bryanc], let me open a PR that breaks tests in SQL into small files first. We can see if that makes sense and then let's finish up other modules as well.