[jira] [Commented] (SPARK-25344) Break large PySpark unittests into smaller files

2018-11-18 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16690830#comment-16690830
 ] 

Apache Spark commented on SPARK-25344:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/23077

> Break large PySpark unittests into smaller files
>
> Key: SPARK-25344
> URL: https://issues.apache.org/jira/browse/SPARK-25344
> Project: Spark
> Issue Type: Improvement
> Components: PySpark
> Affects Versions: 2.4.0
> Reporter: Imran Rashid
> Assignee: Hyukjin Kwon
> Priority: Major
>
> We've got a ton of tests in one humongous tests.py file, rather than breaking 
> it out into smaller files.
> Having one huge file doesn't seem great for code organization, and it also 
> makes the test parallelization in run-tests.py work less well. On my 
> laptop, tests.py takes 150s, and the next longest test file takes only 20s. 
> There are similarly large files in other pyspark modules, e.g. sql/tests.py, 
> ml/tests.py, mllib/tests.py, streaming/tests.py.
> It seems that at least some of these files are already broken into 
> independent test classes, so it shouldn't be too hard to just move them into 
> their own files.
> We could pick one example and follow it. The current style looks closer to 
> the NumPy structure and looks easier to follow.
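
To illustrate the parallelization point in the description, here is a minimal
sketch (not the actual scheduling logic of run-tests.py) that assigns each test
file to the least-loaded worker, longest file first. Only the 150s and 20s
figures come from the description; the other durations, the worker count of 4,
and the split of the 150s file into five 30s files are hypothetical.

```python
import heapq


def makespan(durations, workers):
    """Wall-clock time when each file goes to the least-loaded worker,
    longest file first. A simplified model, not run-tests.py itself."""
    loads = [0] * workers
    heapq.heapify(loads)
    for d in sorted(durations, reverse=True):
        lightest = heapq.heappop(loads)
        heapq.heappush(loads, lightest + d)
    return max(loads)


# 150s and 20s are from the issue description; the rest are hypothetical.
monolithic = [150, 20, 15, 10, 5]

# Hypothetical: the 150s file split into five 30s files.
split = [30] * 5 + [20, 15, 10, 5]

print(makespan(monolithic, 4))  # 150
print(makespan(split, 4))       # 60
```

With one 150s file the run can never finish faster than 150s, however many
workers are available; once the file is split, the work spreads across the
workers.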



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25344) Break large PySpark unittests into smaller files

2018-11-13 Thread Bryan Cutler (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16685613#comment-16685613
 ] 

Bryan Cutler commented on SPARK-25344:
--

[~hyukjin.kwon] no problem, I can take on ML and MLlib




[jira] [Commented] (SPARK-25344) Break large PySpark unittests into smaller files

2018-11-13 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16685494#comment-16685494
 ] 

Hyukjin Kwon commented on SPARK-25344:
--

Oh yea. Will do as soon as it gets merged.




[jira] [Commented] (SPARK-25344) Break large PySpark unittests into smaller files

2018-11-13 Thread Imran Rashid (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16685485#comment-16685485
 ] 

Imran Rashid commented on SPARK-25344:
--

sounds good to me.  it makes sense to break the work up into smaller chunks, so 
you don't have one giant change that is hard to merge / constantly has merge 
conflicts.

I know I'm a broken record on this, but I think we should tell dev@ about the 
new organization and get all new tests to follow the new pattern.  that makes 
it a lot easier to keep refactoring incrementally.




[jira] [Commented] (SPARK-25344) Break large PySpark unittests into smaller files

2018-11-13 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16685394#comment-16685394
 ] 

Hyukjin Kwon commented on SPARK-25344:
--

Hey [~bryanc], once the first attempt gets merged, mind if I ask you to take a 
look at some of the sub-tasks? In particular, I would appreciate it if you have 
a chance to take a look at ML and MLlib.




[jira] [Commented] (SPARK-25344) Break large PySpark unittests into smaller files

2018-11-13 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16684910#comment-16684910
 ] 

Hyukjin Kwon commented on SPARK-25344:
--

[~irashid] and [~bryanc], let me open a PR that breaks tests in SQL into small 
files first. We can see if that makes sense and then let's finish up other 
modules as well.
