[jira] [Commented] (SPARK-20468) Refactor the ALS code

2017-04-27 Thread Daniel Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15986259#comment-15986259
 ] 

Daniel Li commented on SPARK-20468:
---

Great, thanks Sean.  How should I go forward with the documentation?  Detail 
proposed documentation in the ticket before starting a PR, or just go ahead and 
create a PR?

> Refactor the ALS code
> -
>
> Key: SPARK-20468
> URL: https://issues.apache.org/jira/browse/SPARK-20468
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.1.0
>Reporter: Daniel Li
>Priority: Minor
>  Labels: documentation, readability, refactoring
>
> The current ALS implementation ({{org.apache.spark.ml.recommendation}}) is 
> quite the beast --- 21 classes, traits, and objects across 1,500+ lines, all 
> in one file.  Here are some things I think could improve the clarity and 
> maintainability of the code:
> *  The file can be split into more manageable parts.  In particular, {{ALS}}, 
> {{ALSParams}}, {{ALSModel}}, and {{ALSModelParams}} can be in separate files 
> for better readability.
> *  Certain parts can be encapsulated or moved to clarify the intent.  For 
> example:
> **  The {{ALS.train}} method is currently defined in the {{ALS}} companion 
> object, and it seems to take 12 individual parameters that are all members of 
> the {{ALS}} class.  This method can be made an instance method.
> **  The code that creates in-blocks and out-blocks in the body of 
> {{ALS.train}}, along with the {{partitionRatings}} and {{makeBlocks}} methods 
> in the {{ALS}} companion object, can be moved into a separate case class that 
> holds the blocks.  This has the added benefit of allowing us to write 
> specific Scaladoc to explain the logic behind these block objects, as their 
> usage is certainly nontrivial yet is fundamental to the implementation.
> **  The {{KeyWrapper}} and {{UncompressedInBlockSort}} classes could be 
> hidden within {{UncompressedInBlock}} to clarify the scope of their usage.
> **  Various builder classes could be encapsulated in the companion objects of 
> the classes they build.
> *  The code can be formatted more clearly.  For example:
> **  Certain methods such as {{ALS.train}} and {{ALS.makeBlocks}} can be 
> formatted more clearly and have comments added explaining the reasoning 
> behind key parts.  That these methods form the core of the ALS logic makes 
> this doubly important for maintainability.
> **  Parts of the code that use {{while}} loops with manually incremented 
> counters can be rewritten as {{for}} loops.
> **  Where non-idiomatic Scala code is used that doesn't improve performance 
> much, clearer code can be substituted.  (This in particular should be done 
> very carefully if at all, as it's apparent the original author spent much 
> time and pains optimizing the code to significantly improve its runtime 
> profile.)
> *  The documentation (both Scaladocs and inline comments) can be clarified 
> where needed and expanded where incomplete.  This is especially important for 
> parts of the code that are written imperatively for performance, as these 
> parts don't benefit from the intuitive self-documentation of Scala's 
> higher-level language abstractions.  Specifically, I'd like to add 
> documentation fully explaining the key functionality of the in-block and 
> out-block objects, their purpose, how they relate to the overall ALS 
> algorithm, and how they are calculated in such a way that new maintainers can 
> ramp up much more quickly.
> The above is not a complete enumeration of improvements but a high-level 
> survey.  All of these improvements will, I believe, add up to make the code 
> easier to understand, extend, and maintain.  This issue will track the 
> progress of this refactoring so that going forward, authors will have an 
> easier time maintaining this part of the project.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20468) Refactor the ALS code

2017-04-27 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15986191#comment-15986191
 ] 

Sean Owen commented on SPARK-20468:
---

OK, then this should be closed. Documentation is probably always a win, if you 
are confident you can do so. Splitting files up is neither here nor there, 
wouldn't say it's that important for this code. I'd generally check in with the 
author of the file if you can before proposing to track this as a task.

> Refactor the ALS code
> -
>
> Key: SPARK-20468
> URL: https://issues.apache.org/jira/browse/SPARK-20468
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.1.0
>Reporter: Daniel Li
>Priority: Minor
>  Labels: documentation, readability, refactoring
>
> The current ALS implementation ({{org.apache.spark.ml.recommendation}}) is 
> quite the beast --- 21 classes, traits, and objects across 1,500+ lines, all 
> in one file.  Here are some things I think could improve the clarity and 
> maintainability of the code:
> *  The file can be split into more manageable parts.  In particular, {{ALS}}, 
> {{ALSParams}}, {{ALSModel}}, and {{ALSModelParams}} can be in separate files 
> for better readability.
> *  Certain parts can be encapsulated or moved to clarify the intent.  For 
> example:
> **  The {{ALS.train}} method is currently defined in the {{ALS}} companion 
> object, and it seems to take 12 individual parameters that are all members of 
> the {{ALS}} class.  This method can be made an instance method.
> **  The code that creates in-blocks and out-blocks in the body of 
> {{ALS.train}}, along with the {{partitionRatings}} and {{makeBlocks}} methods 
> in the {{ALS}} companion object, can be moved into a separate case class that 
> holds the blocks.  This has the added benefit of allowing us to write 
> specific Scaladoc to explain the logic behind these block objects, as their 
> usage is certainly nontrivial yet is fundamental to the implementation.
> **  The {{KeyWrapper}} and {{UncompressedInBlockSort}} classes could be 
> hidden within {{UncompressedInBlock}} to clarify the scope of their usage.
> **  Various builder classes could be encapsulated in the companion objects of 
> the classes they build.
> *  The code can be formatted more clearly.  For example:
> **  Certain methods such as {{ALS.train}} and {{ALS.makeBlocks}} can be 
> formatted more clearly and have comments added explaining the reasoning 
> behind key parts.  That these methods form the core of the ALS logic makes 
> this doubly important for maintainability.
> **  Parts of the code that use {{while}} loops with manually incremented 
> counters can be rewritten as {{for}} loops.
> **  Where non-idiomatic Scala code is used that doesn't improve performance 
> much, clearer code can be substituted.  (This in particular should be done 
> very carefully if at all, as it's apparent the original author spent much 
> time and pains optimizing the code to significantly improve its runtime 
> profile.)
> *  The documentation (both Scaladocs and inline comments) can be clarified 
> where needed and expanded where incomplete.  This is especially important for 
> parts of the code that are written imperatively for performance, as these 
> parts don't benefit from the intuitive self-documentation of Scala's 
> higher-level language abstractions.  Specifically, I'd like to add 
> documentation fully explaining the key functionality of the in-block and 
> out-block objects, their purpose, how they relate to the overall ALS 
> algorithm, and how they are calculated in such a way that new maintainers can 
> ramp up much more quickly.
> The above is not a complete enumeration of improvements but a high-level 
> survey.  All of these improvements will, I believe, add up to make the code 
> easier to understand, extend, and maintain.  This issue will track the 
> progress of this refactoring so that going forward, authors will have an 
> easier time maintaining this part of the project.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20468) Refactor the ALS code

2017-04-26 Thread Daniel Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15985888#comment-15985888
 ] 

Daniel Li commented on SPARK-20468:
---

Thanks for the guidance, [~srowen].  I've read the Contributing guide now; my 
apologies for not reading it through sooner.

I've closed the PR and created a few new Jira issues to propose documentation 
additions and internal code clarity improvements: SPARK-20484, SPARK-20485, and 
SPARK-20486.  How do you want to proceed?

> Refactor the ALS code
> -
>
> Key: SPARK-20468
> URL: https://issues.apache.org/jira/browse/SPARK-20468
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.1.0
>Reporter: Daniel Li
>Priority: Minor
>  Labels: documentation, readability, refactoring
>
> The current ALS implementation ({{org.apache.spark.ml.recommendation}}) is 
> quite the beast --- 21 classes, traits, and objects across 1,500+ lines, all 
> in one file.  Here are some things I think could improve the clarity and 
> maintainability of the code:
> *  The file can be split into more manageable parts.  In particular, {{ALS}}, 
> {{ALSParams}}, {{ALSModel}}, and {{ALSModelParams}} can be in separate files 
> for better readability.
> *  Certain parts can be encapsulated or moved to clarify the intent.  For 
> example:
> **  The {{ALS.train}} method is currently defined in the {{ALS}} companion 
> object, and it seems to take 12 individual parameters that are all members of 
> the {{ALS}} class.  This method can be made an instance method.
> **  The code that creates in-blocks and out-blocks in the body of 
> {{ALS.train}}, along with the {{partitionRatings}} and {{makeBlocks}} methods 
> in the {{ALS}} companion object, can be moved into a separate case class that 
> holds the blocks.  This has the added benefit of allowing us to write 
> specific Scaladoc to explain the logic behind these block objects, as their 
> usage is certainly nontrivial yet is fundamental to the implementation.
> **  The {{KeyWrapper}} and {{UncompressedInBlockSort}} classes could be 
> hidden within {{UncompressedInBlock}} to clarify the scope of their usage.
> **  Various builder classes could be encapsulated in the companion objects of 
> the classes they build.
> *  The code can be formatted more clearly.  For example:
> **  Certain methods such as {{ALS.train}} and {{ALS.makeBlocks}} can be 
> formatted more clearly and have comments added explaining the reasoning 
> behind key parts.  That these methods form the core of the ALS logic makes 
> this doubly important for maintainability.
> **  Parts of the code that use {{while}} loops with manually incremented 
> counters can be rewritten as {{for}} loops.
> **  Where non-idiomatic Scala code is used that doesn't improve performance 
> much, clearer code can be substituted.  (This in particular should be done 
> very carefully if at all, as it's apparent the original author spent much 
> time and pains optimizing the code to significantly improve its runtime 
> profile.)
> *  The documentation (both Scaladocs and inline comments) can be clarified 
> where needed and expanded where incomplete.  This is especially important for 
> parts of the code that are written imperatively for performance, as these 
> parts don't benefit from the intuitive self-documentation of Scala's 
> higher-level language abstractions.  Specifically, I'd like to add 
> documentation fully explaining the key functionality of the in-block and 
> out-block objects, their purpose, how they relate to the overall ALS 
> algorithm, and how they are calculated in such a way that new maintainers can 
> ramp up much more quickly.
> The above is not a complete enumeration of improvements but a high-level 
> survey.  All of these improvements will, I believe, add up to make the code 
> easier to understand, extend, and maintain.  This issue will track the 
> progress of this refactoring so that going forward, authors will have an 
> easier time maintaining this part of the project.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20468) Refactor the ALS code

2017-04-26 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15984365#comment-15984365
 ] 

Sean Owen commented on SPARK-20468:
---

Please read http://spark.apache.org/contributing.html first, we don't assign 
issues at this stage.

A lot of this is just taste, so I'm not sure it clearly improves readability.
A major problem with some of these suggestions, if I understand correctly, is 
that you'd be breaking an API.
I think the while loops are on purpose to avoid a little overhead of a for loop.

I'd start significantly smaller with things like a proposed doc cleanup, or 
clear improvements to internal-only elements.
Please also weigh the cost: reviewer time, obstacle to back-ports.

> Refactor the ALS code
> -
>
> Key: SPARK-20468
> URL: https://issues.apache.org/jira/browse/SPARK-20468
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.1.0
>Reporter: Daniel Li
>Priority: Minor
>  Labels: documentation, readability, refactoring
>
> The current ALS implementation ({{org.apache.spark.ml.recommendation}}) is 
> quite the beast --- 21 classes, traits, and objects across 1,500+ lines, all 
> in one file.  Here are some things I think could improve the clarity and 
> maintainability of the code:
> *  The file can be split into more manageable parts.  In particular, {{ALS}}, 
> {{ALSParams}}, {{ALSModel}}, and {{ALSModelParams}} can be in separate files 
> for better readability.
> *  Certain parts can be encapsulated or moved to clarify the intent.  For 
> example:
> **  The {{ALS.train}} method is currently defined in the {{ALS}} companion 
> object, and it seems to take 12 individual parameters that are all members of 
> the {{ALS}} class.  This method can be made an instance method.
> **  The code that creates in-blocks and out-blocks in the body of 
> {{ALS.train}}, along with the {{partitionRatings}} and {{makeBlocks}} methods 
> in the {{ALS}} companion object, can be moved into a separate case class that 
> holds the blocks.  This has the added benefit of allowing us to write 
> specific Scaladoc to explain the logic behind these block objects, as their 
> usage is certainly nontrivial yet is fundamental to the implementation.
> **  The {{KeyWrapper}} and {{UncompressedInBlockSort}} classes could be 
> hidden within {{UncompressedInBlock}} to clarify the scope of their usage.
> **  Various builder classes could be encapsulated in the companion objects of 
> the classes they build.
> *  The code can be formatted more clearly.  For example:
> **  Certain methods such as {{ALS.train}} and {{ALS.makeBlocks}} can be 
> formatted more clearly and have comments added explaining the reasoning 
> behind key parts.  That these methods form the core of the ALS logic makes 
> this doubly important for maintainability.
> **  Parts of the code that use {{while}} loops with manually incremented 
> counters can be rewritten as {{for}} loops.
> **  Where non-idiomatic Scala code is used that doesn't improve performance 
> much, clearer code can be substituted.  (This in particular should be done 
> very carefully if at all, as it's apparent the original author spent much 
> time and pains optimizing the code to significantly improve its runtime 
> profile.)
> *  The documentation (both Scaladocs and inline comments) can be clarified 
> where needed and expanded where incomplete.  This is especially important for 
> parts of the code that are written imperatively for performance, as these 
> parts don't benefit from the intuitive self-documentation of Scala's 
> higher-level language abstractions.  Specifically, I'd like to add 
> documentation fully explaining the key functionality of the in-block and 
> out-block objects, their purpose, how they relate to the overall ALS 
> algorithm, and how they are calculated in such a way that new maintainers can 
> ramp up much more quickly.
> The above is not a complete enumeration of improvements but a high-level 
> survey.  All of these improvements will, I believe, add up to make the code 
> easier to understand, extend, and maintain.  This issue will track the 
> progress of this refactoring so that going forward, authors will have an 
> easier time maintaining this part of the project.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20468) Refactor the ALS code

2017-04-26 Thread Daniel Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15984350#comment-15984350
 ] 

Daniel Li commented on SPARK-20468:
---

I'd appreciate if someone with permissions could assign this issue to me.

> Refactor the ALS code
> -
>
> Key: SPARK-20468
> URL: https://issues.apache.org/jira/browse/SPARK-20468
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.1.0
>Reporter: Daniel Li
>Priority: Minor
>  Labels: documentation, readability, refactoring
>
> The current ALS implementation ({{org.apache.spark.ml.recommendation}}) is 
> quite the beast --- 21 classes, traits, and objects across 1,500+ lines, all 
> in one file.  Here are some things I think could improve the clarity and 
> maintainability of the code:
> *  The file can be split into more manageable parts.  In particular, {{ALS}}, 
> {{ALSParams}}, {{ALSModel}}, and {{ALSModelParams}} can be in separate files 
> for better readability.
> *  Certain parts can be encapsulated or moved to clarify the intent.  For 
> example:
> **  The {{ALS.train}} method is currently defined in the {{ALS}} companion 
> object, and it seems to take 12 individual parameters that are all members of 
> the {{ALS}} class.  This method can be made an instance method.
> **  The code that creates in-blocks and out-blocks in the body of 
> {{ALS.train}}, along with the {{partitionRatings}} and {{makeBlocks}} methods 
> in the {{ALS}} companion object, can be moved into a separate case class that 
> holds the blocks.  This has the added benefit of allowing us to write 
> specific Scaladoc to explain the logic behind these block objects, as their 
> usage is certainly nontrivial yet is fundamental to the implementation.
> **  The {{KeyWrapper}} and {{UncompressedInBlockSort}} classes could be 
> hidden within {{UncompressedInBlock}} to clarify the scope of their usage.
> **  Various builder classes could be encapsulated in the companion objects of 
> the classes they build.
> *  The code can be formatted more clearly.  For example:
> **  Certain methods such as {{ALS.train}} and {{ALS.makeBlocks}} can be 
> formatted more clearly and have comments added explaining the reasoning 
> behind key parts.  That these methods form the core of the ALS logic makes 
> this doubly important for maintainability.
> **  Parts of the code that use {{while}} loops with manually incremented 
> counters can be rewritten as {{for}} loops.
> **  Where non-idiomatic Scala code is used that doesn't improve performance 
> much, clearer code can be substituted.  (This in particular should be done 
> very carefully if at all, as it's apparent the original author spent much 
> time and pains optimizing the code to significantly improve its runtime 
> profile.)
> *  The documentation (both Scaladocs and inline comments) can be clarified 
> where needed and expanded where incomplete.  This is especially important for 
> parts of the code that are written imperatively for performance, as these 
> parts don't benefit from the intuitive self-documentation of Scala's 
> higher-level language abstractions.  Specifically, I'd like to add 
> documentation fully explaining the key functionality of the in-block and 
> out-block objects, their purpose, how they relate to the overall ALS 
> algorithm, and how they are calculated in such a way that new maintainers can 
> ramp up much more quickly.
> The above is not a complete enumeration of improvements but a high-level 
> survey.  All of these improvements will, I believe, add up to make the code 
> easier to understand, extend, and maintain.  This issue will track the 
> progress of this refactoring so that going forward, authors will have an 
> easier time maintaining this part of the project.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20468) Refactor the ALS code

2017-04-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15984346#comment-15984346
 ] 

Apache Spark commented on SPARK-20468:
--

User 'danielyli' has created a pull request for this issue:
https://github.com/apache/spark/pull/17767

> Refactor the ALS code
> -
>
> Key: SPARK-20468
> URL: https://issues.apache.org/jira/browse/SPARK-20468
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.1.0
>Reporter: Daniel Li
>Priority: Minor
>  Labels: documentation, readability, refactoring
>
> The current ALS implementation ({{org.apache.spark.ml.recommendation}}) is 
> quite the beast --- 21 classes, traits, and objects across 1,500+ lines, all 
> in one file.  Here are some things I think could improve the clarity and 
> maintainability of the code:
> *  The file can be split into more manageable parts.  In particular, {{ALS}}, 
> {{ALSParams}}, {{ALSModel}}, and {{ALSModelParams}} can be in separate files 
> for better readability.
> *  Certain parts can be encapsulated or moved to clarify the intent.  For 
> example:
> **  The {{ALS.train}} method is currently defined in the {{ALS}} companion 
> object, and it seems to take 12 individual parameters that are all members of 
> the {{ALS}} class.  This method can be made an instance method.
> **  The code that creates in-blocks and out-blocks in the body of 
> {{ALS.train}}, along with the {{partitionRatings}} and {{makeBlocks}} methods 
> in the {{ALS}} companion object, can be moved into a separate case class that 
> holds the blocks.  This has the added benefit of allowing us to write 
> specific Scaladoc to explain the logic behind these block objects, as their 
> usage is certainly nontrivial yet is fundamental to the implementation.
> **  The {{KeyWrapper}} and {{UncompressedInBlockSort}} classes could be 
> hidden within {{UncompressedInBlock}} to clarify the scope of their usage.
> **  Various builder classes could be encapsulated in the companion objects of 
> the classes they build.
> *  The code can be formatted more clearly.  For example:
> **  Certain methods such as {{ALS.train}} and {{ALS.makeBlocks}} can be 
> formatted more clearly and have comments added explaining the reasoning 
> behind key parts.  That these methods form the core of the ALS logic makes 
> this doubly important for maintainability.
> **  Parts of the code that use {{while}} loops with manually incremented 
> counters can be rewritten as {{for}} loops.
> **  Where non-idiomatic Scala code is used that doesn't improve performance 
> much, clearer code can be substituted.  (This in particular should be done 
> very carefully if at all, as it's apparent the original author spent much 
> time and pains optimizing the code to significantly improve its runtime 
> profile.)
> *  The documentation (both Scaladocs and inline comments) can be clarified 
> where needed and expanded where incomplete.  This is especially important for 
> parts of the code that are written imperatively for performance, as these 
> parts don't benefit from the intuitive self-documentation of Scala's 
> higher-level language abstractions.  Specifically, I'd like to add 
> documentation fully explaining the key functionality of the in-block and 
> out-block objects, their purpose, how they relate to the overall ALS 
> algorithm, and how they are calculated in such a way that new maintainers can 
> ramp up much more quickly.
> The above is not a complete enumeration of improvements but a high-level 
> survey.  All of these improvements will, I believe, add up to make the code 
> easier to understand, extend, and maintain.  This issue will track the 
> progress of this refactoring so that going forward, authors will have an 
> easier time maintaining this part of the project.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org