[jira] [Commented] (SPARK-14283) Avoid sort in randomSplit when possible
[ https://issues.apache.org/jira/browse/SPARK-14283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15245462#comment-15245462 ]

Apache Spark commented on SPARK-14283:
--------------------------------------

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/12453

> Avoid sort in randomSplit when possible
> ---------------------------------------
>
> Key: SPARK-14283
> URL: https://issues.apache.org/jira/browse/SPARK-14283
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Reporter: Joseph K. Bradley
>
> Dataset.randomSplit sorts each partition in order to guarantee an ordering
> and make randomSplit deterministic given the seed. Since randomSplit is used
> a fair amount in ML, it would be great to avoid the sort when possible.
> Are there cases when it could be avoided?

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-14283) Avoid sort in randomSplit when possible
[ https://issues.apache.org/jira/browse/SPARK-14283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15244568#comment-15244568 ]

zhengruifeng commented on SPARK-14283:
--------------------------------------

[~josephkb] I can work on this.
There should be a version of randomSplit that avoids the local sort, which is meaningless in ML.
The calls in ML should then pass an extra param to avoid the local sort, IMO.

[jira] [Commented] (SPARK-14283) Avoid sort in randomSplit when possible
[ https://issues.apache.org/jira/browse/SPARK-14283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15223637#comment-15223637 ]

Bo Meng commented on SPARK-14283:
---------------------------------

Could you please provide more details, such as test cases, use cases, etc.?
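
As a rough illustration of the issue description in this thread, the following is a minimal pure-Python sketch of why a per-partition sort can make a seeded split deterministic. It is not Spark's implementation. It assumes each split is produced by a separate pass over the partition, drawing one random number per row from the same seed and keeping rows whose draw falls in that split's range. The names `sample_range`, `random_split`, `unstable_partition` and `materialize` are hypothetical and exist only for this example.

```python
import random


def sample_range(rows, lo, hi, seed, sort_first):
    """Keep the rows whose per-row random draw falls in [lo, hi)."""
    if sort_first:
        # Fix the row order so the i-th draw always meets the same row.
        rows = sorted(rows)
    rng = random.Random(seed)  # same seed => same sequence of draws
    return [row for row in rows if lo <= rng.random() < hi]


def random_split(materialize, weights, seed, sort_first=True):
    """Split one partition into len(weights) pieces.

    `materialize` is a hypothetical callback returning the partition's rows;
    it is called once per split, mimicking a partition that is recomputed.
    """
    total = float(sum(weights))
    bounds = [0.0]
    for w in weights:
        bounds.append(bounds[-1] + w / total)
    bounds[-1] = 1.0  # guard against floating-point drift in the last bound
    return [
        sample_range(materialize(), lo, hi, seed, sort_first)
        for lo, hi in zip(bounds, bounds[1:])
    ]


def unstable_partition(rows, order_seed=0):
    """Return a callback that yields the same rows in a new order each call."""
    order_rng = random.Random(order_seed)

    def materialize():
        shuffled = list(rows)
        order_rng.shuffle(shuffled)
        return shuffled

    return materialize


if __name__ == "__main__":
    rows = list(range(20))
    train, test = random_split(unstable_partition(rows), [0.8, 0.2], seed=42)
    # With the sort, the two splits always partition the rows exactly.
    print(sorted(train + test) == rows)
    noisy = random_split(
        unstable_partition(rows), [0.8, 0.2], seed=42, sort_first=False
    )
    # Without it, rows may be duplicated or dropped across the splits.
    print(sorted(noisy[0] + noisy[1]) == rows)
```

Under these assumptions, the sort could be skipped whenever the row order within a partition is already stable across recomputations, which is the kind of case the issue asks about.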