[jira] [Assigned] (SPARK-17072) generate table level stats:stats generation/storing/loading
[ https://issues.apache.org/jira/browse/SPARK-17072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17072: Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-17072) generate table level stats:stats generation/storing/loading
[ https://issues.apache.org/jira/browse/SPARK-17072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15427692#comment-15427692 ] Apache Spark commented on SPARK-17072: User 'wzhfy' has created a pull request for this issue: https://github.com/apache/spark/pull/14712 > generate table level stats:stats generation/storing/loading > Key: SPARK-17072 > URL: https://issues.apache.org/jira/browse/SPARK-17072 > Project: Spark > Issue Type: Sub-task > Components: Optimizer > Affects Versions: 2.0.0 > Reporter: Ron Hu > > We need to generate, store, and load statistics information into/from the metastore.
[jira] [Assigned] (SPARK-17072) generate table level stats:stats generation/storing/loading
[ https://issues.apache.org/jira/browse/SPARK-17072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17072: Assignee: Apache Spark
[jira] [Commented] (SPARK-16581) Making JVM backend calling functions public
[ https://issues.apache.org/jira/browse/SPARK-16581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15427687#comment-15427687 ] Felix Cheung commented on SPARK-16581: I think JVM<->R is closely related to RBackend? We are not trying to build a library that works generically with any JVM from R (like py4j), only with the JVM that Spark is running in, via a custom socket protocol. There might come a time when we want to operate from an R shell while working with multiple JVM backends (or remote backends), or want more control over recycling the backend process, not completely dissimilar to cleanup.jobj, etc. In addition to connecting to a remote JVM, we might want to expose the JVM-side RBackend API to allow re-using an existing Spark JVM process (several Spark JIRAs in the past) for cases like Spark Job Server (persisted Spark session) and Apache Toree (incubating) / Livy (cross-language support) (e.g. https://issues.cloudera.org/projects/LIVY/issues/LIVY-194). Some of these could change how callJMethod/invokeJava work, what parameters are required, and so on. Of course, all of this could be very far off :) > Making JVM backend calling functions public > Key: SPARK-16581 > URL: https://issues.apache.org/jira/browse/SPARK-16581 > Project: Spark > Issue Type: Sub-task > Components: SparkR > Reporter: Shivaram Venkataraman > > As described in the design doc in SPARK-15799, to help packages that need to call into the JVM, it would be good to expose some of the R -> JVM functions we have. > As part of this we could also rename and reformat the functions to make them more user friendly.
[jira] [Comment Edited] (SPARK-15816) SQL server based on Postgres protocol
[ https://issues.apache.org/jira/browse/SPARK-15816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15427671#comment-15427671 ] Takeshi Yamamuro edited comment on SPARK-15816 at 8/19/16 6:07 AM: [~sarutak][~dobashim] I just posted the design doc; it is currently under review by saruta-san and dobashi-san. was (Author: maropu): [~sarutak] I just posted the design doc; it is currently under review by saruta-san. > SQL server based on Postgres protocol > Key: SPARK-15816 > URL: https://issues.apache.org/jira/browse/SPARK-15816 > Project: Spark > Issue Type: Improvement > Components: SQL > Reporter: Reynold Xin > Attachments: New_SQL_Server_for_Spark.pdf > > At Spark Summit today this idea came up in a discussion: it would be great to investigate the possibility of implementing a new SQL server using Postgres' protocol, in lieu of Hive ThriftServer 2. I'm creating this ticket to track the idea, in case others have feedback. > This server could have a simpler architecture, and would allow users to leverage the wide range of tools that are already available for Postgres (and many commercial database systems based on Postgres). > Some of the problems we'd need to figure out are: > 1. What is the Postgres protocol? Is there official documentation for it? > 2. How difficult would it be to implement that protocol in Spark (on the JVM in particular)? > 3. How does data type mapping work? > 4. How do system commands work? Would Spark need to support all of Postgres' commands? > 5. Any restrictions in supporting nested data?
[jira] [Commented] (SPARK-15816) SQL server based on Postgres protocol
[ https://issues.apache.org/jira/browse/SPARK-15816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15427671#comment-15427671 ] Takeshi Yamamuro commented on SPARK-15816: [~sarutak] I just posted the design doc; it is currently under review by saruta-san.
[jira] [Updated] (SPARK-15816) SQL server based on Postgres protocol
[ https://issues.apache.org/jira/browse/SPARK-15816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-15816: Attachment: New_SQL_Server_for_Spark.pdf
[jira] [Commented] (SPARK-17140) Add initial model to MultinomialLogisticRegression
[ https://issues.apache.org/jira/browse/SPARK-17140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15427664#comment-15427664 ] Seth Hendrickson commented on SPARK-17140: I can take this one. > Add initial model to MultinomialLogisticRegression > Key: SPARK-17140 > URL: https://issues.apache.org/jira/browse/SPARK-17140 > Project: Spark > Issue Type: Sub-task > Components: ML > Reporter: Seth Hendrickson > > We should add initial model support to Multinomial logistic regression.
[jira] [Commented] (SPARK-16822) Support latex in scaladoc with MathJax
[ https://issues.apache.org/jira/browse/SPARK-16822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15427661#comment-15427661 ] Apache Spark commented on SPARK-16822: User 'jagadeesanas2' has created a pull request for this issue: https://github.com/apache/spark/pull/14711 > Support latex in scaladoc with MathJax > Key: SPARK-16822 > URL: https://issues.apache.org/jira/browse/SPARK-16822 > Project: Spark > Issue Type: Improvement > Components: Documentation > Reporter: Shuai Lin > Assignee: Shuai Lin > Priority: Minor > Fix For: 2.1.0 > > The scaladoc of some classes (mainly ml/mllib classes) includes math formulas, but they currently render very poorly, e.g. [the doc of the LogisticGradient class|https://spark.apache.org/docs/2.0.0-preview/api/scala/index.html#org.apache.spark.mllib.optimization.LogisticGradient]. We can improve this by including the MathJax JavaScript in the scaladoc pages, much like what we do for the markdown docs.
[jira] [Created] (SPARK-17151) Decide how to handle inferring number of classes in Multinomial logistic regression
Seth Hendrickson created SPARK-17151: Summary: Decide how to handle inferring number of classes in Multinomial logistic regression Key: SPARK-17151 URL: https://issues.apache.org/jira/browse/SPARK-17151 Project: Spark Issue Type: Sub-task Reporter: Seth Hendrickson Priority: Minor This JIRA is to discuss how the number of label classes should be inferred in multinomial logistic regression. Currently, MLOR checks the dataframe metadata and if the number of classes is not specified then it uses the maximum value seen in the label column. If the labels are not properly indexed, then this can cause a large number of zero coefficients and potentially produce instabilities in model training.
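For context, a minimal sketch (not part of the proposal; it assumes a DataFrame {{df}} with a raw {{label}} column) of how properly indexed labels carry class-count metadata that MLOR could read instead of guessing from the maximum label value:
{code}
import org.apache.spark.ml.feature.StringIndexer

// StringIndexer produces a dense, 0-based label index and attaches nominal-attribute
// metadata (including the number of distinct labels) to the output column.
val indexer = new StringIndexer()
  .setInputCol("label")          // assumed raw label column
  .setOutputCol("indexedLabel")
// val indexed = indexer.fit(df).transform(df)  // df is assumed to exist
{code}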
[jira] [Updated] (SPARK-16216) CSV data source does not write date and timestamp correctly
[ https://issues.apache.org/jira/browse/SPARK-16216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-16216: Target Version/s: 2.0.1, 2.1.0 Priority: Blocker (was: Major) > CSV data source does not write date and timestamp correctly > Key: SPARK-16216 > URL: https://issues.apache.org/jira/browse/SPARK-16216 > Project: Spark > Issue Type: Improvement > Components: SQL > Affects Versions: 2.0.0 > Reporter: Hyukjin Kwon > Priority: Blocker > Labels: releasenotes > > Currently, the CSV data source writes {{DateType}} and {{TimestampType}} as below:
{code}
++
|date|
++
|14406372|
|14144598|
|14540400|
++
{code}
> It would be nicer if it wrote dates and timestamps as formatted strings, just like the JSON data source does. > Also, the CSV data source currently supports a {{dateFormat}} option to read dates and timestamps in a custom format. It would be better if this option could be applied when writing as well.
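A hedged sketch of the requested behavior (the write-side {{dateFormat}} option is what this ticket proposes, not an existing API at the time; {{df}} is an assumed DataFrame with a date column and the output path is hypothetical):
{code}
// Desired: dates/timestamps written as formatted strings, mirroring the read-side option.
df.write
  .option("dateFormat", "yyyy-MM-dd HH:mm:ss")  // proposed write-side counterpart
  .option("header", "true")
  .csv("/tmp/dates_out")
{code}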
[jira] [Commented] (SPARK-16533) Spark application not handling preemption messages
[ https://issues.apache.org/jira/browse/SPARK-16533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15427625#comment-15427625 ] Apache Spark commented on SPARK-16533: User 'angolon' has created a pull request for this issue: https://github.com/apache/spark/pull/14710 > Spark application not handling preemption messages > Key: SPARK-16533 > URL: https://issues.apache.org/jira/browse/SPARK-16533 > Project: Spark > Issue Type: Bug > Components: EC2, Input/Output, Optimizer, Scheduler, Spark Submit, YARN > Affects Versions: 1.6.0 > Environment: Yarn version: Hadoop 2.7.1-amzn-0; AWS EMR cluster running 1 x r3.8xlarge (Master) and 52 x r3.8xlarge (Core); Spark version: 1.6.0; Scala version: 2.10.5; Java version: 1.8.0_51; Input size: ~10 TB; Input coming from S3. Queue configuration: Dynamic allocation: enabled; Preemption: enabled; Q1: 70% capacity with max of 100%; Q2: 30% capacity with max of 100%. Job configuration: Driver memory = 10g; Executor cores = 6; Executor memory = 10g; Deploy mode = cluster; Master = yarn; maxResultSize = 4g; Shuffle manager = hash > Reporter: Lucas Winkelmann > > Here is the scenario: > I launch job 1 into Q1 and allow it to grow to 100% cluster utilization. > I wait between 15-30 mins (for this job to complete with 100% of the cluster available takes about 1 hr, so job 1 is between 25-50% complete). Note that if I wait less time the issue sometimes does not occur; it appears to happen only after job 1 is at least 25% complete. > I launch job 2 into Q2 and preemption occurs on Q1, shrinking job 1 to 70% of cluster utilization. > At this point job 1 basically halts progress while job 2 continues to execute as normal and finishes. Job 1 then either: > - Fails its attempt and restarts. By the time this attempt fails, the other job is already complete, meaning the second attempt has full cluster availability and finishes. > - Remains at its current progress and simply does not finish (I have waited ~6 hrs before finally killing the application). > Looking into the error log there is this constant error message: > WARN NettyRpcEndpointRef: Error sending message [message = RemoveExecutor(454,Container container_1468422920649_0001_01_000594 on host: ip-NUMBERS.ec2.internal was preempted.)] in X attempts > My observations have led me to believe that the application master does not know about this container being killed and continuously asks the container to remove the executor, until it eventually fails the attempt or just keeps trying to remove the executor indefinitely. > I have done much digging online for anyone else experiencing this issue but have come up with nothing.
[jira] [Assigned] (SPARK-16533) Spark application not handling preemption messages
[ https://issues.apache.org/jira/browse/SPARK-16533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16533: Assignee: Apache Spark
[jira] [Assigned] (SPARK-16533) Spark application not handling preemption messages
[ https://issues.apache.org/jira/browse/SPARK-16533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16533: Assignee: (was: Apache Spark)
[jira] [Assigned] (SPARK-17150) Support SQL generation for inline tables
[ https://issues.apache.org/jira/browse/SPARK-17150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17150: Assignee: (was: Apache Spark)
[jira] [Assigned] (SPARK-17150) Support SQL generation for inline tables
[ https://issues.apache.org/jira/browse/SPARK-17150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17150: Assignee: Apache Spark
[jira] [Commented] (SPARK-17150) Support SQL generation for inline tables
[ https://issues.apache.org/jira/browse/SPARK-17150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15427589#comment-15427589 ] Apache Spark commented on SPARK-17150: User 'petermaxlee' has created a pull request for this issue: https://github.com/apache/spark/pull/14709 > Support SQL generation for inline tables > Key: SPARK-17150 > URL: https://issues.apache.org/jira/browse/SPARK-17150 > Project: Spark > Issue Type: New Feature > Components: SQL > Reporter: Peter Lee > > Inline tables currently do not support SQL generation, and as a result a view that depends on inline tables would fail.
[jira] [Created] (SPARK-17150) Support SQL generation for inline tables
Peter Lee created SPARK-17150: Summary: Support SQL generation for inline tables Key: SPARK-17150 URL: https://issues.apache.org/jira/browse/SPARK-17150 Project: Spark Issue Type: New Feature Components: SQL Reporter: Peter Lee Inline tables currently do not support SQL generation, and as a result a view that depends on inline tables would fail.
[jira] [Commented] (SPARK-17145) Object with many fields causes Seq Serialization Bug
[ https://issues.apache.org/jira/browse/SPARK-17145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15427565#comment-15427565 ] Liwei Lin commented on SPARK-17145: Hi [~abdulla16], can you try https://github.com/apache/spark/pull/14698 out and see if it solves your problem? Thanks! > Object with many fields causes Seq Serialization Bug > Key: SPARK-17145 > URL: https://issues.apache.org/jira/browse/SPARK-17145 > Project: Spark > Issue Type: Bug > Affects Versions: 2.0.0 > Environment: OS: OSX El Capitan 10.11.6 > Reporter: Abdulla Al-Qawasmeh > > The unit test here (https://gist.github.com/abdulla16/433faf7df59fce11a5fff284bac0d945) describes the problem. > It looks like Spark is having problems serializing a Scala Seq when it's part of an object with many fields (I'm not 100% sure it's a serialization problem). The deserialized Seq ends up with as many items as the original Seq; however, all the items are copies of the last item in the original Seq. > The object that I used in my unit test (as an example) is a Tuple5. However, I've seen this behavior in other types of objects. > Reducing MyClass5 to only two fields (field34 and field35) causes the unit test to pass.
[jira] [Commented] (SPARK-17137) Add compressed support for multinomial logistic regression coefficients
[ https://issues.apache.org/jira/browse/SPARK-17137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15427563#comment-15427563 ] Yanbo Liang commented on SPARK-17137: I think we should provide a transparent interface to users rather than exposing a param to control whether to output dense or sparse coefficients. Spark MLlib {{Vector.compressed}} returns a vector in either dense or sparse format, whichever uses less storage. I would like to do the performance tests for this issue. Thanks! > Add compressed support for multinomial logistic regression coefficients > Key: SPARK-17137 > URL: https://issues.apache.org/jira/browse/SPARK-17137 > Project: Spark > Issue Type: Sub-task > Components: ML > Reporter: Seth Hendrickson > Priority: Minor > > For sparse coefficients in MLOR, such as when using high L1 regularization, it may be more efficient to store coefficients in a compressed format. We can add this option to MLOR and perhaps do some performance tests to verify improvements.
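A minimal illustration of the {{Vector.compressed}} behavior referenced above (standalone sketch using the MLlib linalg types, not MLOR code itself):
{code}
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// A mostly-zero coefficient vector, as strong L1 regularization tends to produce.
val dense: Vector = Vectors.dense(0.0, 0.0, 3.5, 0.0, 0.0, 0.0, 1.2, 0.0)

// compressed picks whichever representation (dense or sparse) uses less storage,
// so callers never need a flag to choose the format explicitly.
val compact: Vector = dense.compressed
println(compact)  // sparse here, since only 2 of 8 entries are non-zero
{code}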
[jira] [Commented] (SPARK-17149) array.sql for testing array related functions
[ https://issues.apache.org/jira/browse/SPARK-17149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15427557#comment-15427557 ] Apache Spark commented on SPARK-17149: User 'petermaxlee' has created a pull request for this issue: https://github.com/apache/spark/pull/14708 > array.sql for testing array related functions > Key: SPARK-17149 > URL: https://issues.apache.org/jira/browse/SPARK-17149 > Project: Spark > Issue Type: Sub-task > Components: SQL > Reporter: Peter Lee
[jira] [Assigned] (SPARK-17149) array.sql for testing array related functions
[ https://issues.apache.org/jira/browse/SPARK-17149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17149: Assignee: (was: Apache Spark)
[jira] [Assigned] (SPARK-17149) array.sql for testing array related functions
[ https://issues.apache.org/jira/browse/SPARK-17149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17149: Assignee: Apache Spark
[jira] [Commented] (SPARK-16914) NodeManager crash when Spark is registering executor information into leveldb
[ https://issues.apache.org/jira/browse/SPARK-16914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15427556#comment-15427556 ] cen yuhai commented on SPARK-16914: [~jerryshao] Hi saisai, I think SPARK-14963 is useless here because getRecoveryPath will choose the first directory in "yarn.nodemanager.local-dirs"; it should pick a directory at random. > NodeManager crash when Spark is registering executor information into leveldb > Key: SPARK-16914 > URL: https://issues.apache.org/jira/browse/SPARK-16914 > Project: Spark > Issue Type: Bug > Components: Shuffle > Affects Versions: 1.6.2 > Reporter: cen yuhai > >
{noformat}
Stack: [0x7fb5b53de000,0x7fb5b54df000], sp=0x7fb5b54dcba8, free space=1018k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
C [libc.so.6+0x896b1] memcpy+0x11
Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
j org.fusesource.leveldbjni.internal.NativeDB$DBJNI.Put(JLorg/fusesource/leveldbjni/internal/NativeWriteOptions;Lorg/fusesource/leveldbjni/internal/NativeSlice;Lorg/fusesource/leveldbjni/internal/NativeSlice;)J+0
j org.fusesource.leveldbjni.internal.NativeDB.put(Lorg/fusesource/leveldbjni/internal/NativeWriteOptions;Lorg/fusesource/leveldbjni/internal/NativeSlice;Lorg/fusesource/leveldbjni/internal/NativeSlice;)V+11
j org.fusesource.leveldbjni.internal.NativeDB.put(Lorg/fusesource/leveldbjni/internal/NativeWriteOptions;Lorg/fusesource/leveldbjni/internal/NativeBuffer;Lorg/fusesource/leveldbjni/internal/NativeBuffer;)V+18
j org.fusesource.leveldbjni.internal.NativeDB.put(Lorg/fusesource/leveldbjni/internal/NativeWriteOptions;[B[B)V+36
j org.fusesource.leveldbjni.internal.JniDB.put([B[BLorg/iq80/leveldb/WriteOptions;)Lorg/iq80/leveldb/Snapshot;+28
j org.fusesource.leveldbjni.internal.JniDB.put([B[B)V+10
j org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.registerExecutor(Ljava/lang/String;Ljava/lang/String;Lorg/apache/spark/network/shuffle/protocol/ExecutorShuffleInfo;)V+61
J 8429 C2 org.apache.spark.network.server.TransportRequestHandler.handle(Lorg/apache/spark/network/protocol/RequestMessage;)V (100 bytes) @ 0x7fb5f27ff6cc [0x7fb5f27fdde0+0x18ec]
J 8371 C2 org.apache.spark.network.server.TransportChannelHandler.channelRead0(Lio/netty/channel/ChannelHandlerContext;Ljava/lang/Object;)V (10 bytes) @ 0x7fb5f242df20 [0x7fb5f242de80+0xa0]
J 6853 C2 io.netty.channel.SimpleChannelInboundHandler.channelRead(Lio/netty/channel/ChannelHandlerContext;Ljava/lang/Object;)V (74 bytes) @ 0x7fb5f215587c [0x7fb5f21557e0+0x9c]
J 5872 C2 io.netty.handler.timeout.IdleStateHandler.channelRead(Lio/netty/channel/ChannelHandlerContext;Ljava/lang/Object;)V (42 bytes) @ 0x7fb5f2183268 [0x7fb5f2183100+0x168]
J 5849 C2 io.netty.handler.codec.MessageToMessageDecoder.channelRead(Lio/netty/channel/ChannelHandlerContext;Ljava/lang/Object;)V (158 bytes) @ 0x7fb5f2191524 [0x7fb5f218f5a0+0x1f84]
J 5941 C2 org.apache.spark.network.util.TransportFrameDecoder.channelRead(Lio/netty/channel/ChannelHandlerContext;Ljava/lang/Object;)V (170 bytes) @ 0x7fb5f220a230 [0x7fb5f2209fc0+0x270]
J 7747 C2 io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read()V (363 bytes) @ 0x7fb5f264465c [0x7fb5f2644140+0x51c]
J 8008% C2 io.netty.channel.nio.NioEventLoop.run()V (162 bytes) @ 0x7fb5f26f6764 [0x7fb5f26f63c0+0x3a4]
j io.netty.util.concurrent.SingleThreadEventExecutor$2.run()V+13
j java.lang.Thread.run()V+11
v ~StubRoutines::call_stub
{noformat}
> The target code in Spark is in ExternalShuffleBlockResolver:
{code}
  /** Registers a new Executor with all the configuration we need to find its shuffle files. */
  public void registerExecutor(
      String appId,
      String execId,
      ExecutorShuffleInfo executorInfo) {
    AppExecId fullId = new AppExecId(appId, execId);
    logger.info("Registered executor {} with {}", fullId, executorInfo);
    try {
      if (db != null) {
        byte[] key = dbAppExecKey(fullId);
        byte[] value = mapper.writeValueAsString(executorInfo).getBytes(Charsets.UTF_8);
        db.put(key, value);
      }
    } catch (Exception e) {
      logger.error("Error saving registered executors", e);
    }
    executors.put(fullId, executorInfo);
  }
{code}
> There is a problem with disk1
[jira] [Created] (SPARK-17149) array.sql for testing array related functions
Peter Lee created SPARK-17149: Summary: array.sql for testing array related functions Key: SPARK-17149 URL: https://issues.apache.org/jira/browse/SPARK-17149 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Peter Lee
[jira] [Created] (SPARK-17148) NodeManager exit because of exception “Executor is not registered”
cen yuhai created SPARK-17148: Summary: NodeManager exit because of exception “Executor is not registered” Key: SPARK-17148 URL: https://issues.apache.org/jira/browse/SPARK-17148 Project: Spark Issue Type: Bug Components: Shuffle Affects Versions: 1.6.2 Environment: hadoop 2.7.2, spark 1.6.2 Reporter: cen yuhai
{noformat}
java.lang.RuntimeException: Executor is not registered (appId=application_1467288504738_1341061, execId=423)
    at org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.getBlockData(ExternalShuffleBlockResolver.java:183)
    at org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.handleMessage(ExternalShuffleBlockHandler.java:85)
    at org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.receive(ExternalShuffleBlockHandler.java:72)
    at org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:149)
    at org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:102)
    at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:104)
    at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51)
    at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
    at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
    at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
    at org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:86)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
    at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846)
    at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
    at java.lang.Thread.run(Thread.java:745)
{noformat}
[jira] [Commented] (SPARK-17136) Design optimizer interface for ML algorithms
[ https://issues.apache.org/jira/browse/SPARK-17136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15427539#comment-15427539 ] Yanbo Liang commented on SPARK-17136: I would like to know whether users' own optimizers would follow some standard API similar to breeze {{LBFGS}}, or something else? > Design optimizer interface for ML algorithms > Key: SPARK-17136 > URL: https://issues.apache.org/jira/browse/SPARK-17136 > Project: Spark > Issue Type: Sub-task > Components: ML > Reporter: Seth Hendrickson > > We should consider designing an interface that allows users to use their own optimizers in some of the ML algorithms, similar to MLlib.
[jira] [Comment Edited] (SPARK-17139) Add model summary for MultinomialLogisticRegression
[ https://issues.apache.org/jira/browse/SPARK-17139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15427519#comment-15427519 ] Weichen Xu edited comment on SPARK-17139 at 8/19/16 3:05 AM: I will work on it and create a PR when the dependent algorithm is merged, thanks. was (Author: weichenxu123): I will work on it and create a PR soon, thanks. > Add model summary for MultinomialLogisticRegression > Key: SPARK-17139 > URL: https://issues.apache.org/jira/browse/SPARK-17139 > Project: Spark > Issue Type: Sub-task > Components: ML > Reporter: Seth Hendrickson > > Add a model summary to multinomial logistic regression using the same interface as in other ML models.
[jira] [Comment Edited] (SPARK-17138) Python API for multinomial logistic regression
[ https://issues.apache.org/jira/browse/SPARK-17138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15427518#comment-15427518 ] Weichen Xu edited comment on SPARK-17138 at 8/19/16 3:06 AM: I will work on it and create a PR when the dependent algorithm is merged, thanks. was (Author: weichenxu123): I will work on it and create a PR soon, thanks. > Python API for multinomial logistic regression > Key: SPARK-17138 > URL: https://issues.apache.org/jira/browse/SPARK-17138 > Project: Spark > Issue Type: Sub-task > Components: ML > Reporter: Seth Hendrickson > > Once [SPARK-7159|https://issues.apache.org/jira/browse/SPARK-7159] is merged, we should add a Python API for it.
[jira] [Comment Edited] (SPARK-17134) Use level 2 BLAS operations in LogisticAggregator
[ https://issues.apache.org/jira/browse/SPARK-17134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15427529#comment-15427529 ] Yanbo Liang edited comment on SPARK-17134 at 8/19/16 3:04 AM: This is interesting. We are also trying to use BLAS to accelerate linear algebra operations in other algorithms such as {{KMeans/ALS}}, and I have some basic performance test results. I would like to contribute to this task after SPARK-7159 is finished. Thanks! was (Author: yanboliang): This is interesting. We are also trying to use BLAS to accelerate linear algebra operations in other algorithms such as {{KMeans/ALS}}, and I have some basic performance test results. I would like to contribute to this task. Thanks! > Use level 2 BLAS operations in LogisticAggregator > Key: SPARK-17134 > URL: https://issues.apache.org/jira/browse/SPARK-17134 > Project: Spark > Issue Type: Sub-task > Components: ML > Reporter: Seth Hendrickson > > Multinomial logistic regression uses the LogisticAggregator class for gradient updates. We should look into refactoring MLOR to use level 2 BLAS operations for the updates. Performance testing should be done to show improvements.
[jira] [Commented] (SPARK-17134) Use level 2 BLAS operations in LogisticAggregator
[ https://issues.apache.org/jira/browse/SPARK-17134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15427529#comment-15427529 ] Yanbo Liang commented on SPARK-17134: This is interesting. We are also trying to use BLAS to accelerate linear algebra operations in other algorithms such as {{KMeans/ALS}}, and I have some basic performance test results. I would like to contribute to this task. Thanks! > Use level 2 BLAS operations in LogisticAggregator > Key: SPARK-17134 > URL: https://issues.apache.org/jira/browse/SPARK-17134 > Project: Spark > Issue Type: Sub-task > Components: ML > Reporter: Seth Hendrickson > > Multinomial logistic regression uses the LogisticAggregator class for gradient updates. We should look into refactoring MLOR to use level 2 BLAS operations for the updates. Performance testing should be done to show improvements.
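A small sketch of what "level 2 BLAS operations" means for this aggregator, using breeze (a dependency Spark already has); this is illustrative only and not the actual LogisticAggregator code, and the sizes and values are made up:
{code}
import breeze.linalg.{DenseMatrix, DenseVector}

val numClasses = 3
val numFeatures = 4

// Per-instance quantities: class-wise multipliers (residuals) and the feature vector.
val multipliers = DenseVector(0.1, -0.3, 0.2)
val x = DenseVector(1.0, 2.0, 0.5, -1.0)

// Element-wise (level-1 style) gradient update: one scaled addition per class/feature pair.
val gradLoop = DenseMatrix.zeros[Double](numClasses, numFeatures)
for (k <- 0 until numClasses; j <- 0 until numFeatures) {
  gradLoop(k, j) += multipliers(k) * x(j)
}

// The same rank-1 update expressed as a single outer product, which maps onto one
// level-2 BLAS call (ger) instead of many scalar updates.
val gradBlas: DenseMatrix[Double] = multipliers * x.t

assert((gradLoop - gradBlas).data.forall(v => math.abs(v) < 1e-12))
{code}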
[jira] [Commented] (SPARK-17139) Add model summary for MultinomialLogisticRegression
[ https://issues.apache.org/jira/browse/SPARK-17139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15427519#comment-15427519 ] Weichen Xu commented on SPARK-17139: I will work on it and create a PR soon, thanks.
[jira] [Commented] (SPARK-17138) Python API for multinomial logistic regression
[ https://issues.apache.org/jira/browse/SPARK-17138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15427518#comment-15427518 ] Weichen Xu commented on SPARK-17138: I will work on it and create a PR soon, thanks.
[jira] [Updated] (SPARK-16947) Support type coercion and foldable expression for inline tables
[ https://issues.apache.org/jira/browse/SPARK-16947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-16947: Fix Version/s: 2.0.1 > Support type coercion and foldable expression for inline tables > Key: SPARK-16947 > URL: https://issues.apache.org/jira/browse/SPARK-16947 > Project: Spark > Issue Type: Improvement > Components: SQL > Affects Versions: 2.1.0 > Reporter: Herman van Hovell > Assignee: Peter Lee > Fix For: 2.0.1, 2.1.0 > > Inline tables were added to Spark SQL in 2.0, e.g. {{select * from values (1, 'A'), (2, 'B') as tbl(a, b)}}. > This is currently implemented using a {{LocalRelation}}, and this relation is created during parsing. This has several weaknesses: you can only use simple expressions in such a plan, and type coercion is based on the first row in the relation, with all subsequent values cast to its type. The latter violates the principle of least surprise. > I would like to rewrite this into a union of projects; each of these projects would contain a single table row. We apply better type coercion rules to a union, and we should be able to rewrite this into a local relation during optimization.
[jira] [Commented] (SPARK-17069) Expose spark.range() as table-valued function in SQL
[ https://issues.apache.org/jira/browse/SPARK-17069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15427469#comment-15427469 ] Reynold Xin commented on SPARK-17069: I've also backported this into branch-2.0 since it is a small testing util. > Expose spark.range() as table-valued function in SQL > Key: SPARK-17069 > URL: https://issues.apache.org/jira/browse/SPARK-17069 > Project: Spark > Issue Type: New Feature > Components: SQL > Reporter: Eric Liang > Assignee: Eric Liang > Priority: Minor > Fix For: 2.0.1, 2.1.0 > > The idea here is to create the spark.range( x ) equivalent in SQL, so we can do something like
{noformat}
select count(*) from range(1)
{noformat}
> This would be useful for sql-only testing and benchmarks.
[jira] [Updated] (SPARK-17069) Expose spark.range() as table-valued function in SQL
[ https://issues.apache.org/jira/browse/SPARK-17069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-17069: Fix Version/s: 2.0.1
[jira] [Created] (SPARK-17147) Spark Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets
Robert Conrad created SPARK-17147: Summary: Spark Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets Key: SPARK-17147 URL: https://issues.apache.org/jira/browse/SPARK-17147 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 2.0.0 Reporter: Robert Conrad When Kafka does log compaction, offsets often end up with gaps, meaning the next requested offset will frequently not be offset+1. The logic in KafkaRDD & CachedKafkaConsumer has a baked-in assumption that the next offset will always be exactly 1 above the previous offset. I have worked around this problem by changing CachedKafkaConsumer to use the returned record's offset, from {{nextOffset = offset + 1}} to {{nextOffset = record.offset + 1}}, and changing KafkaRDD from {{requestOffset += 1}} to {{requestOffset = r.offset() + 1}} (I also had to change some assert logic in CachedKafkaConsumer). There's a strong possibility that I have misconstrued how to use the streaming Kafka consumer, and I'm happy to close this out if that's the case. If, however, it is supposed to support non-consecutive offsets (e.g. due to log compaction), I am also happy to contribute a PR.
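For illustration, a simplified sketch of the idea using the plain Kafka 0.10 consumer API rather than Spark's internal classes ({{readRange}}, the topic-partition, and the offset bounds are hypothetical, and empty-poll/timeout handling is omitted):
{code}
import java.util.Collections
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition

def readRange(consumer: KafkaConsumer[String, String],
              tp: TopicPartition, from: Long, until: Long): Vector[String] = {
  consumer.assign(Collections.singletonList(tp))
  consumer.seek(tp, from)
  var nextOffset = from
  val out = Vector.newBuilder[String]
  while (nextOffset < until) {
    val it = consumer.poll(1000L).records(tp).iterator()
    while (it.hasNext && nextOffset < until) {
      val r = it.next()
      out += r.value()
      // Advance from the record actually returned, not nextOffset + 1:
      // compacted topics legitimately skip offsets.
      nextOffset = r.offset() + 1
    }
  }
  out.result()
}
{code}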
[jira] [Resolved] (SPARK-16947) Support type coercion and foldable expression for inline tables
[ https://issues.apache.org/jira/browse/SPARK-16947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-16947. Resolution: Fixed
[jira] [Updated] (SPARK-16947) Support type coercion and foldable expression for inline tables
[ https://issues.apache.org/jira/browse/SPARK-16947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-16947: Fix Version/s: 2.1.0
[jira] [Commented] (SPARK-16922) Query with Broadcast Hash join fails due to executor OOM in Spark 2.0
[ https://issues.apache.org/jira/browse/SPARK-16922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15427429#comment-15427429 ] Sital Kedia commented on SPARK-16922: Kryo > Query with Broadcast Hash join fails due to executor OOM in Spark 2.0 > Key: SPARK-16922 > URL: https://issues.apache.org/jira/browse/SPARK-16922 > Project: Spark > Issue Type: Bug > Components: Shuffle > Affects Versions: 2.0.0 > Reporter: Sital Kedia > > A query which used to work in Spark 1.6 fails with executor OOM in 2.0. > Stack trace -
{code}
    at org.apache.spark.unsafe.types.UTF8String.getBytes(UTF8String.java:229)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator$agg_VectorizedHashMap.hash$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator$agg_VectorizedHashMap.findOrInsert(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:161)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
    at org.apache.spark.scheduler.Task.run(Task.scala:85)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
{code}
> Query plan in Spark 1.6
{code}
== Physical Plan ==
TungstenAggregate(key=[field1#101], functions=[(sum((field2#74 / 100.0)),mode=Final,isDistinct=false)], output=[field1#101,field3#3])
+- TungstenExchange hashpartitioning(field1#101,200), None
   +- TungstenAggregate(key=[field1#101], functions=[(sum((field2#74 / 100.0)),mode=Partial,isDistinct=false)], output=[field1#101,sum#111])
      +- Project [field1#101,field2#74]
         +- BroadcastHashJoin [field5#63L], [cast(cast(field4#97 as decimal(20,0)) as bigint)], BuildRight
            :- ConvertToUnsafe
            :  +- HiveTableScan [field2#74,field5#63L], MetastoreRelation foo, table1, Some(a), [(ds#57 >= 2013-10-01),(ds#57 <= 2013-12-31)]
            +- ConvertToUnsafe
               +- HiveTableScan [field1#101,field4#97], MetastoreRelation foo, table2, Some(b)
{code}
> Query plan in 2.0
{code}
== Physical Plan ==
*HashAggregate(keys=[field1#160], functions=[sum((field2#133 / 100.0))])
+- Exchange hashpartitioning(field1#160, 200)
   +- *HashAggregate(keys=[field1#160], functions=[partial_sum((field2#133 / 100.0))])
      +- *Project [field2#133, field1#160]
         +- *BroadcastHashJoin [field5#122L], [cast(cast(field4#156 as decimal(20,0)) as bigint)], Inner, BuildRight
            :- *Filter isnotnull(field5#122L)
            :  +- HiveTableScan [field5#122L, field2#133], MetastoreRelation foo, table1, a, [isnotnull(ds#116), (ds#116 >= 2013-10-01), (ds#116 <= 2013-12-31)]
            +- BroadcastExchange HashedRelationBroadcastMode(List(cast(cast(input[0, string, false] as decimal(20,0)) as bigint)))
               +- *Filter isnotnull(field4#156)
                  +- HiveTableScan [field4#156, field1#160], MetastoreRelation foo, table2, b
{code}
[jira] [Commented] (SPARK-17090) Make tree aggregation level in linear/logistic regression configurable
[ https://issues.apache.org/jira/browse/SPARK-17090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15427425#comment-15427425 ] Qian Huang commented on SPARK-17090: Gotcha. I will do the API first. > Make tree aggregation level in linear/logistic regression configurable > -- > > Key: SPARK-17090 > URL: https://issues.apache.org/jira/browse/SPARK-17090 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Seth Hendrickson >Priority: Minor > > Linear/logistic regression use treeAggregate with default aggregation depth > for collecting coefficient gradient updates to the driver. For high > dimensional problems, this can cause an OOM error on the driver. We should make > it configurable, perhaps via an expert param, so that users can avoid this > problem if their data has many features. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
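A minimal Scala sketch of the mechanism behind this proposal (not the MLlib internals): {{RDD.treeAggregate}} already takes a {{depth}} argument, and the idea is essentially to surface it as an expert param. The object name, partition count and dimension below are made up for illustration.
{code}
import org.apache.spark.sql.SparkSession

object TreeAggDepthSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("treeAggregate-depth").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    val dim = 1000                          // stand-in for the number of features
    val vectors = sc.parallelize(1 to 10000, numSlices = 100)
      .map(i => Array.fill(dim)(i.toDouble))

    // depth = 2 is the RDD default; a larger depth adds intermediate combine
    // stages, so fewer partial aggregates reach the driver at the same time.
    val summed = vectors.treeAggregate(new Array[Double](dim))(
      (acc, v) => { var i = 0; while (i < dim) { acc(i) += v(i); i += 1 }; acc },
      (a, b) => { var i = 0; while (i < dim) { a(i) += b(i); i += 1 }; a },
      depth = 4)

    println(summed.take(3).mkString(", "))
    spark.stop()
  }
}
{code}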
[jira] [Commented] (SPARK-17141) MinMaxScaler behaves weird when min and max have the same value and some values are NaN
[ https://issues.apache.org/jira/browse/SPARK-17141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15427409#comment-15427409 ] Alberto Bonsanto commented on SPARK-17141: -- Raw data:
| id|chicken|jam|roast beef|
|  1|NaN|2.0| 2.0|
|  2|2.0|0.0| 2.0|
|  3|NaN|0.0| 2.0|
|  4|2.0|1.0| -2.0|
|  5|2.0|2.0| 2.0|
|  6|2.0|2.0| NaN|
After assembling and normalizing, as you can see, {{Double.NaN}} values are replaced with {{0.5}}:
|id |chicken|jam|roast beef|features |featuresNorm |
|1 |NaN|2.0|2.0 |[NaN,2.0,2.0] |[0.5,1.0,1.0]|
|2 |2.0|0.0|2.0 |[2.0,0.0,2.0] |[0.5,0.0,1.0]|
|3 |NaN|0.0|2.0 |[NaN,0.0,2.0] |[0.5,0.0,1.0]|
|4 |2.0|1.0|-2.0 |[2.0,1.0,-2.0]|[0.5,0.5,0.0]|
|5 |2.0|2.0|2.0 |[2.0,2.0,2.0] |[0.5,1.0,1.0]|
|6 |2.0|2.0|NaN |[2.0,2.0,NaN] |[0.5,1.0,NaN]|
> MinMaxScaler behaves weird when min and max have the same value and some > values are NaN > --- > > Key: SPARK-17141 > URL: https://issues.apache.org/jira/browse/SPARK-17141 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 1.6.2, 2.0.0 > Environment: Databricks Community, Spark 2.0 + Scala 2.10 >Reporter: Alberto Bonsanto >Priority: Trivial > > When you have a {{DataFrame}} with a column named {{features}}, which is a > {{DenseVector}}, and the *maximum* and *minimum* have the same value and some values are > {{Double.NaN}}, they get replaced by 0.5, and they should remain with the same > value, I believe. > I know how to fix it, but I haven't ever made a pull request. You can check > the bug in this > [notebook|https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/2485090270202665/3126465289264547/8589256059752547/latest.html] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
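A short Scala sketch of the setup from the comment above, for reproducing it outside the linked notebook (column names follow the comment, with the space in "roast beef" replaced by an underscore):
{code}
import org.apache.spark.ml.feature.{MinMaxScaler, VectorAssembler}
import org.apache.spark.sql.SparkSession

object MinMaxScalerNaNSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("minmax-nan").master("local[*]").getOrCreate()
    import spark.implicits._

    // Same values as the raw-data table in the comment.
    val df = Seq(
      (1, Double.NaN, 2.0, 2.0),
      (2, 2.0, 0.0, 2.0),
      (3, Double.NaN, 0.0, 2.0),
      (4, 2.0, 1.0, -2.0),
      (5, 2.0, 2.0, 2.0),
      (6, 2.0, 2.0, Double.NaN)
    ).toDF("id", "chicken", "jam", "roast_beef")

    val assembled = new VectorAssembler()
      .setInputCols(Array("chicken", "jam", "roast_beef"))
      .setOutputCol("features")
      .transform(df)

    val scaler = new MinMaxScaler().setInputCol("features").setOutputCol("featuresNorm")
    scaler.fit(assembled).transform(assembled).show(truncate = false)

    spark.stop()
  }
}
{code}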
[jira] [Created] (SPARK-17146) Add RandomizedSearch to the CrossValidator API
Manoj Kumar created SPARK-17146: --- Summary: Add RandomizedSearch to the CrossValidator API Key: SPARK-17146 URL: https://issues.apache.org/jira/browse/SPARK-17146 Project: Spark Issue Type: Improvement Reporter: Manoj Kumar Hi, I would like to add randomized search support for the Cross-Validator API. It should be quite straightforward to add with the present abstractions. Here is the proposed API (names are up for debate). Proposed classes: "ParamSamplerBuilder" or a "ParamRandomizedBuilder" that returns an Array of ParamMaps. Proposed methods: "addBounds", "addSampler", "setNumIter". Code example:
{code}
def sampler(): Double = { Math.pow(10.0, -5 + Random.nextFloat * (5 - (-5))) }

val paramGrid = new ParamRandomizedBuilder()
  .addSampler(lr.regParam, sampler)
  .addBounds(lr.elasticNetParam, 0.0, 1.0)
  .setNumIter(10)
  .build()
{code}
Let me know your thoughts! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
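For comparison, a hedged sketch of how randomized search can already be expressed with the current tuning API, by sampling {{ParamMap}}s directly and handing them to {{CrossValidator}}; {{ParamRandomizedBuilder}} does not exist yet, and the sampler and bounds below are illustrative stand-ins for what the proposed builder would generate.
{code}
import scala.util.Random

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.tuning.CrossValidator

object RandomizedSearchSketch {
  // Log-uniform sample over roughly [1e-5, 1e5), as in the proposal's sampler.
  def regParamSample(): Double = math.pow(10.0, -5 + Random.nextDouble() * 10)

  def main(args: Array[String]): Unit = {
    val lr = new LogisticRegression()
    val numIter = 10

    // What the proposed builder would emit: randomly sampled ParamMaps.
    val paramMaps: Array[ParamMap] = Array.fill(numIter) {
      ParamMap(
        lr.regParam -> regParamSample(),
        lr.elasticNetParam -> Random.nextDouble())   // uniform in [0, 1)
    }

    val cv = new CrossValidator()
      .setEstimator(lr)
      .setEvaluator(new BinaryClassificationEvaluator())
      .setEstimatorParamMaps(paramMaps)
      .setNumFolds(3)
    // cv.fit(trainingData) would then select the best of the sampled settings.
  }
}
{code}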
[jira] [Commented] (SPARK-17143) pyspark unable to create UDF: java.lang.RuntimeException: org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a directory: /tmp tmp
[ https://issues.apache.org/jira/browse/SPARK-17143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15427394#comment-15427394 ] Andrew Davidson commented on SPARK-17143: - See email from user's group. I was able to find a work around. Not sure how hdfs:///tmp/ got created or how the permissions got messed up ## NICE CATCH!!! Many thanks. I spent all day on this bug The error msg report /tmp. I did not think to look on hdfs. [ec2-user@ip-172-31-22-140 notebooks]$ hadoop fs -ls hdfs:///tmp/ Found 1 items -rw-r--r-- 3 ec2-user supergroup418 2016-04-13 22:49 hdfs:///tmp [ec2-user@ip-172-31-22-140 notebooks]$ I have no idea how hdfs:///tmp got created. I deleted it. This causes a bunch of exceptions. These exceptions has useful message. I was able to fix the problem as follows $ hadoop fs -rmr hdfs:///tmp Now I run the notebook. It creates hdfs:///tmp/hive but the permission are wrong $ hadoop fs -chmod 777 hdfs:///tmp/hive From: Felix Cheung Date: Thursday, August 18, 2016 at 3:37 PM To: Andrew Davidson , "user @spark" Subject: Re: pyspark unable to create UDF: java.lang.RuntimeException: org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a directory: /tmp tmp Do you have a file called tmp at / on HDFS? > pyspark unable to create UDF: java.lang.RuntimeException: > org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a > directory: /tmp tmp > --- > > Key: SPARK-17143 > URL: https://issues.apache.org/jira/browse/SPARK-17143 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.1 > Environment: spark version: 1.6.1 > python version: 3.4.3 (default, Apr 1 2015, 18:10:40) > [GCC 4.8.2 20140120 (Red Hat 4.8.2-16)] >Reporter: Andrew Davidson > Attachments: udfBug.html, udfBug.ipynb > > > For unknown reason I can not create UDF when I run the attached notebook on > my cluster. I get the following error > Py4JJavaError: An error occurred while calling > None.org.apache.spark.sql.hive.HiveContext. > : java.lang.RuntimeException: > org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a > directory: /tmp tmp > The notebook runs fine on my Mac > In general I am able to run non UDF spark code with out any trouble > I start the notebook server as the user “ec2-user" and uses master URL > spark://ec2-51-215-120-63.us-west-1.compute.amazonaws.com:6066 > I found the following message in the notebook server log file. I have log > level set to warn > 16/08/18 21:38:45 WARN ObjectStore: Version information not found in > metastore. 
hive.metastore.schema.verification is not enabled so recording the > schema version 1.2.0 > 16/08/18 21:38:45 WARN ObjectStore: Failed to get database default, returning > NoSuchObjectException > The cluster was originally created using > spark-1.6.1-bin-hadoop2.6/ec2/spark-ec2 > #from pyspark.sql import SQLContext, HiveContext > #sqlContext = SQLContext(sc) > > #from pyspark.sql import DataFrame > #from pyspark.sql import functions > > from pyspark.sql.types import StringType > from pyspark.sql.functions import udf > > print("spark version: {}".format(sc.version)) > > import sys > print("python version: {}".format(sys.version)) > spark version: 1.6.1 > python version: 3.4.3 (default, Apr 1 2015, 18:10:40) > [GCC 4.8.2 20140120 (Red Hat 4.8.2-16)] > # functions.lower() raises > # py4j.Py4JException: Method lower([class java.lang.String]) does not exist > # work around define a UDF > toLowerUDFRetType = StringType() > #toLowerUDF = udf(lambda s : s.lower(), toLowerUDFRetType) > toLowerUDF = udf(lambda s : s.lower(), StringType()) > You must build Spark with Hive. Export 'SPARK_HIVE=true' and run build/sbt > assembly > Py4JJavaErrorTraceback (most recent call last) > in () > 4 toLowerUDFRetType = StringType() > 5 #toLowerUDF = udf(lambda s : s.lower(), toLowerUDFRetType) > > 6 toLowerUDF = udf(lambda s : s.lower(), StringType()) > /root/spark/python/pyspark/sql/functions.py in udf(f, returnType) >1595 [Row(slen=5), Row(slen=3)] >1596 """ > -> 1597 return UserDefinedFunction(f, returnType) >1598 >1599 blacklist = ['map', 'since', 'ignore_unicode_prefix'] > /root/spark/python/pyspark/sql/functions.py in __init__(self, func, > returnType, name) >1556 self.returnType = returnType >1557 self._broadcast = None > -> 1558 self._judf = self._create_judf(name) >1559 >1560 def _create_judf(self, name): > /root/spark/python/pyspark/sql/functions.py in _create_judf(self, name) >1567 pickled_command, broadcast_vars, env, includes =
[jira] [Created] (SPARK-17145) Object with many fields causes Seq Serialization Bug
Abdulla Al-Qawasmeh created SPARK-17145: --- Summary: Object with many fields causes Seq Serialization Bug Key: SPARK-17145 URL: https://issues.apache.org/jira/browse/SPARK-17145 Project: Spark Issue Type: Bug Affects Versions: 2.0.0 Environment: OS: OSX El Capitan 10.11.6 Reporter: Abdulla Al-Qawasmeh The unit test here (https://gist.github.com/abdulla16/433faf7df59fce11a5fff284bac0d945) describes the problem. It looks like Spark is having problems serializing a Scala Seq when it's part of an object with many fields (I'm not 100% sure it's a serialization problem). The deserialized Seq ends up with as many items as the original Seq; however, all the items are copies of the last item in the original Seq. The object that I used in my unit test (as an example) is a Tuple5. However, I've seen this behavior in other types of objects. Reducing MyClass5 to only two fields (field34 and field35) causes the unit test to pass. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
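A rough Scala sketch of the shape being described (the linked gist is the actual unit test); the {{Wide}} class and its field names are invented here, and the comment in the code only restates the reported symptom rather than asserting it:
{code}
import org.apache.spark.sql.SparkSession

// An object with many fields, one of which is a Seq.
case class Wide(f1: Int, f2: Int, f3: Int, f4: Int, f5: Int,
                f6: Int, f7: Int, f8: Int, f9: Int, f10: Int,
                items: Seq[Int])

object SeqEncoderSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("seq-encoder").master("local[*]").getOrCreate()
    import spark.implicits._

    val row = Wide(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, items = Seq(1, 2, 3))
    val back = Seq(row).toDS().collect().head

    // Per the report, the collected Seq can come back as N copies of its last
    // element when the enclosing object carries many fields.
    println(back.items)
    spark.stop()
  }
}
{code}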
[jira] [Assigned] (SPARK-17144) Removal of useless CreateHiveTableAsSelectLogicalPlan
[ https://issues.apache.org/jira/browse/SPARK-17144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17144: Assignee: (was: Apache Spark) > Removal of useless CreateHiveTableAsSelectLogicalPlan > - > > Key: SPARK-17144 > URL: https://issues.apache.org/jira/browse/SPARK-17144 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Xiao Li > > {{CreateHiveTableAsSelectLogicalPlan}} is dead code after refactoring. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17144) Removal of useless CreateHiveTableAsSelectLogicalPlan
[ https://issues.apache.org/jira/browse/SPARK-17144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15427383#comment-15427383 ] Apache Spark commented on SPARK-17144: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/14707 > Removal of useless CreateHiveTableAsSelectLogicalPlan > - > > Key: SPARK-17144 > URL: https://issues.apache.org/jira/browse/SPARK-17144 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Xiao Li > > {{CreateHiveTableAsSelectLogicalPlan}} is dead code after refactoring. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17144) Removal of useless CreateHiveTableAsSelectLogicalPlan
[ https://issues.apache.org/jira/browse/SPARK-17144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17144: Assignee: Apache Spark > Removal of useless CreateHiveTableAsSelectLogicalPlan > - > > Key: SPARK-17144 > URL: https://issues.apache.org/jira/browse/SPARK-17144 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Xiao Li >Assignee: Apache Spark > > {{CreateHiveTableAsSelectLogicalPlan}} is dead code after refactoring. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17144) Removal of useless CreateHiveTableAsSelectLogicalPlan
Xiao Li created SPARK-17144: --- Summary: Removal of useless CreateHiveTableAsSelectLogicalPlan Key: SPARK-17144 URL: https://issues.apache.org/jira/browse/SPARK-17144 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.1.0 Reporter: Xiao Li {{CreateHiveTableAsSelectLogicalPlan}} is dead code after refactoring. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17081) Empty strings not preserved which causes SQLException: mismatching column value count
[ https://issues.apache.org/jira/browse/SPARK-17081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15427380#comment-15427380 ] Xiao Li commented on SPARK-17081: - Can you try to reproduce it in Spark 2.0? Thanks! > Empty strings not preserved which causes SQLException: mismatching column > value count > - > > Key: SPARK-17081 > URL: https://issues.apache.org/jira/browse/SPARK-17081 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.1 >Reporter: Ian Hellstrom > Labels: dataframe, empty, jdbc, null, sql > > When writing a DataFrame that contains empty strings as values to an RDBMS, > the query that is generated does not have the correct column count: > {code} > CREATE TABLE demo(foo INTEGER, bar VARCHAR(10)); > - > case class Record(foo: Int, bar: String) > val data = sc.parallelize(List(Record(1, ""))).toDF > data.write.mode("append").jdbc(...) > {code} > This causes: > {code} > java.sql.SQLException: Column count doesn't match value count at row 1 > {code} > Proposal: leave empty strings as they are or convert these to NULL (although > that may not be what's intended by the user, so make this configurable). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
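A hedged Scala sketch of the NULL-conversion side of the proposal, applied before the JDBC write; the connection URL, table name and credentials are placeholders:
{code}
import java.util.Properties

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, when}

object EmptyStringJdbcSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("empty-string-jdbc").master("local[*]").getOrCreate()
    import spark.implicits._

    val data = Seq((1, "")).toDF("foo", "bar")

    // Convert empty strings to NULL before writing; whether this (or keeping
    // the empty string) is the right behavior is exactly the configurability
    // question raised in the issue.
    val cleaned = data.withColumn("bar", when(col("bar") === "", null).otherwise(col("bar")))

    val props = new Properties()
    props.setProperty("user", "demo")
    props.setProperty("password", "demo")
    cleaned.write.mode("append").jdbc("jdbc:mysql://localhost:3306/demo", "demo", props)
  }
}
{code}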
[jira] [Commented] (SPARK-17143) pyspark unable to create UDF: java.lang.RuntimeException: org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a directory: /tmp tmp
[ https://issues.apache.org/jira/browse/SPARK-17143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15427278#comment-15427278 ] Andrew Davidson commented on SPARK-17143: - given the exception metioned an issue with /tmp I decide to track how /tmp changed when run my cell # no spark jobs are running [ec2-user@ip-172-31-22-140 notebooks]$ !ls ls /tmp/ hsperfdata_ec2-user hsperfdata_root pip_build_ec2-user [ec2-user@ip-172-31-22-140 notebooks]$ # start notebook server $ nohup startIPythonNotebook.sh > startIPythonNotebook.sh.out & [ec2-user@ip-172-31-22-140 notebooks]$ !ls ls /tmp/ hsperfdata_ec2-user hsperfdata_root pip_build_ec2-user [ec2-user@ip-172-31-22-140 notebooks]$ # start the udfBug notebook [ec2-user@ip-172-31-22-140 notebooks]$ ls /tmp/ hsperfdata_ec2-user hsperfdata_root libnetty-transport-native-epoll818283657820702.so pip_build_ec2-user [ec2-user@ip-172-31-22-140 notebooks]$ # execute cell that define UDF [ec2-user@ip-172-31-22-140 notebooks]$ ls /tmp/ hsperfdata_ec2-user hsperfdata_root libnetty-transport-native-epoll818283657820702.so pip_build_ec2-user spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9 [ec2-user@ip-172-31-22-140 notebooks]$ [ec2-user@ip-172-31-22-140 notebooks]$ find /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/ /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/ /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/db.lck /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/log /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/log/log.ctrl /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/log/log1.dat /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/log/README_DO_NOT_TOUCH_FILES.txt /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/log/logmirror.ctrl /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/service.properties /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/README_DO_NOT_TOUCH_FILES.txt /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0 /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c230.dat /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c4b0.dat /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c241.dat /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c3a1.dat /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c180.dat /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c2b1.dat /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c7b1.dat /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c311.dat /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c880.dat /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c541.dat /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c9f1.dat /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c20.dat /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c590.dat /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c721.dat /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c470.dat /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c441.dat /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c8e1.dat /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c361.dat /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/ca1.dat /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c421.dat 
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c331.dat /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c461.dat /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c5d0.dat /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c851.dat /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c621.dat /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c101.dat /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c3d1.dat /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c891.dat /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c1b1.dat /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c641.dat /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c871.dat /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c6a1.dat /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/cb1.dat /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/ca01.dat /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c391.dat /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c7f1.dat /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c1a1.dat /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c41.dat /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c990.dat /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63
[jira] [Commented] (SPARK-16922) Query with Broadcast Hash join fails due to executor OOM in Spark 2.0
[ https://issues.apache.org/jira/browse/SPARK-16922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15427264#comment-15427264 ] Davies Liu commented on SPARK-16922: Which serializer are you using? java serializer or Kyro? > Query with Broadcast Hash join fails due to executor OOM in Spark 2.0 > - > > Key: SPARK-16922 > URL: https://issues.apache.org/jira/browse/SPARK-16922 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 2.0.0 >Reporter: Sital Kedia > > A query which used to work in Spark 1.6 fails with executor OOM in 2.0. > Stack trace - > {code} > at > org.apache.spark.unsafe.types.UTF8String.getBytes(UTF8String.java:229) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator$agg_VectorizedHashMap.hash$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator$agg_VectorizedHashMap.findOrInsert(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:161) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} > Query plan in Spark 1.6 > {code} > == Physical Plan == > TungstenAggregate(key=[field1#101], functions=[(sum((field2#74 / > 100.0)),mode=Final,isDistinct=false)], output=[field1#101,field3#3]) > +- TungstenExchange hashpartitioning(field1#101,200), None >+- TungstenAggregate(key=[field1#101], functions=[(sum((field2#74 / > 100.0)),mode=Partial,isDistinct=false)], output=[field1#101,sum#111]) > +- Project [field1#101,field2#74] > +- BroadcastHashJoin [field5#63L], [cast(cast(field4#97 as > decimal(20,0)) as bigint)], BuildRight > :- ConvertToUnsafe > : +- HiveTableScan [field2#74,field5#63L], MetastoreRelation > foo, table1, Some(a), [(ds#57 >= 2013-10-01),(ds#57 <= 2013-12-31)] > +- ConvertToUnsafe >+- HiveTableScan [field1#101,field4#97], MetastoreRelation > foo, table2, Some(b) > {code} > Query plan in 2.0 > {code} > == Physical Plan == > *HashAggregate(keys=[field1#160], functions=[sum((field2#133 / 100.0))]) > +- Exchange hashpartitioning(field1#160, 200) >+- *HashAggregate(keys=[field1#160], functions=[partial_sum((field2#133 / > 100.0))]) > +- *Project [field2#133, field1#160] > +- *BroadcastHashJoin [field5#122L], [cast(cast(field4#156 as > decimal(20,0)) as bigint)], Inner, BuildRight > :- *Filter isnotnull(field5#122L) > : +- HiveTableScan [field5#122L, field2#133], MetastoreRelation > foo, table1, a, [isnotnull(ds#116), (ds#116 >= 2013-10-01), (ds#116 <= > 2013-12-31)] > +- BroadcastExchange > 
HashedRelationBroadcastMode(List(cast(cast(input[0, string, false] as > decimal(20,0)) as bigint))) >+- *Filter isnotnull(field4#156) > +- HiveTableScan [field4#156, field1#160], > MetastoreRelation foo, table2, b > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
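For context on the serializer question above, a minimal sketch of how the serializer is chosen: {{spark.serializer}} defaults to the Java serializer, and Kryo has to be enabled explicitly when the session (or SparkConf) is built. The app name and master below are illustrative.
{code}
import org.apache.spark.sql.SparkSession

object SerializerConfigSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("serializer-config")
      .master("local[*]")
      .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .getOrCreate()

    // Prints the serializer actually in effect for this application.
    println(spark.sparkContext.getConf.get("spark.serializer"))
    spark.stop()
  }
}
{code}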
[jira] [Updated] (SPARK-17143) pyspark unable to create UDF: java.lang.RuntimeException: org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a directory: /tmp tmp
[ https://issues.apache.org/jira/browse/SPARK-17143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Davidson updated SPARK-17143: Attachment: udfBug.html This html version of the notebook shows the output when run in my data center > pyspark unable to create UDF: java.lang.RuntimeException: > org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a > directory: /tmp tmp > --- > > Key: SPARK-17143 > URL: https://issues.apache.org/jira/browse/SPARK-17143 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.1 > Environment: spark version: 1.6.1 > python version: 3.4.3 (default, Apr 1 2015, 18:10:40) > [GCC 4.8.2 20140120 (Red Hat 4.8.2-16)] >Reporter: Andrew Davidson > Attachments: udfBug.html, udfBug.ipynb > > > For unknown reason I can not create UDF when I run the attached notebook on > my cluster. I get the following error > Py4JJavaError: An error occurred while calling > None.org.apache.spark.sql.hive.HiveContext. > : java.lang.RuntimeException: > org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a > directory: /tmp tmp > The notebook runs fine on my Mac > In general I am able to run non UDF spark code with out any trouble > I start the notebook server as the user “ec2-user" and uses master URL > spark://ec2-51-215-120-63.us-west-1.compute.amazonaws.com:6066 > I found the following message in the notebook server log file. I have log > level set to warn > 16/08/18 21:38:45 WARN ObjectStore: Version information not found in > metastore. hive.metastore.schema.verification is not enabled so recording the > schema version 1.2.0 > 16/08/18 21:38:45 WARN ObjectStore: Failed to get database default, returning > NoSuchObjectException > The cluster was originally created using > spark-1.6.1-bin-hadoop2.6/ec2/spark-ec2 > #from pyspark.sql import SQLContext, HiveContext > #sqlContext = SQLContext(sc) > > #from pyspark.sql import DataFrame > #from pyspark.sql import functions > > from pyspark.sql.types import StringType > from pyspark.sql.functions import udf > > print("spark version: {}".format(sc.version)) > > import sys > print("python version: {}".format(sys.version)) > spark version: 1.6.1 > python version: 3.4.3 (default, Apr 1 2015, 18:10:40) > [GCC 4.8.2 20140120 (Red Hat 4.8.2-16)] > # functions.lower() raises > # py4j.Py4JException: Method lower([class java.lang.String]) does not exist > # work around define a UDF > toLowerUDFRetType = StringType() > #toLowerUDF = udf(lambda s : s.lower(), toLowerUDFRetType) > toLowerUDF = udf(lambda s : s.lower(), StringType()) > You must build Spark with Hive. 
Export 'SPARK_HIVE=true' and run build/sbt > assembly > Py4JJavaErrorTraceback (most recent call last) > in () > 4 toLowerUDFRetType = StringType() > 5 #toLowerUDF = udf(lambda s : s.lower(), toLowerUDFRetType) > > 6 toLowerUDF = udf(lambda s : s.lower(), StringType()) > /root/spark/python/pyspark/sql/functions.py in udf(f, returnType) >1595 [Row(slen=5), Row(slen=3)] >1596 """ > -> 1597 return UserDefinedFunction(f, returnType) >1598 >1599 blacklist = ['map', 'since', 'ignore_unicode_prefix'] > /root/spark/python/pyspark/sql/functions.py in __init__(self, func, > returnType, name) >1556 self.returnType = returnType >1557 self._broadcast = None > -> 1558 self._judf = self._create_judf(name) >1559 >1560 def _create_judf(self, name): > /root/spark/python/pyspark/sql/functions.py in _create_judf(self, name) >1567 pickled_command, broadcast_vars, env, includes = > _prepare_for_python_RDD(sc, command, self) >1568 ctx = SQLContext.getOrCreate(sc) > -> 1569 jdt = ctx._ssql_ctx.parseDataType(self.returnType.json()) >1570 if name is None: >1571 name = f.__name__ if hasattr(f, '__name__') else > f.__class__.__name__ > /root/spark/python/pyspark/sql/context.py in _ssql_ctx(self) > 681 try: > 682 if not hasattr(self, '_scala_HiveContext'): > --> 683 self._scala_HiveContext = self._get_hive_ctx() > 684 return self._scala_HiveContext > 685 except Py4JError as e: > /root/spark/python/pyspark/sql/context.py in _get_hive_ctx(self) > 690 > 691 def _get_hive_ctx(self): > --> 692 return self._jvm.HiveContext(self._jsc.sc()) > 693 > 694 def refreshTable(self, tableName): > /root/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py in > __call__(self, *args) >1062 answer = self._gateway_client.send_command(command) >1063 return_value = get_return_value( > -> 10
[jira] [Commented] (SPARK-16922) Query with Broadcast Hash join fails due to executor OOM in Spark 2.0
[ https://issues.apache.org/jira/browse/SPARK-16922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15427259#comment-15427259 ] Sital Kedia commented on SPARK-16922: - >> Could you also try to disable the dense mode? I tried disabling the dense mode, that did not help either. > Query with Broadcast Hash join fails due to executor OOM in Spark 2.0 > - > > Key: SPARK-16922 > URL: https://issues.apache.org/jira/browse/SPARK-16922 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 2.0.0 >Reporter: Sital Kedia > > A query which used to work in Spark 1.6 fails with executor OOM in 2.0. > Stack trace - > {code} > at > org.apache.spark.unsafe.types.UTF8String.getBytes(UTF8String.java:229) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator$agg_VectorizedHashMap.hash$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator$agg_VectorizedHashMap.findOrInsert(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:161) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} > Query plan in Spark 1.6 > {code} > == Physical Plan == > TungstenAggregate(key=[field1#101], functions=[(sum((field2#74 / > 100.0)),mode=Final,isDistinct=false)], output=[field1#101,field3#3]) > +- TungstenExchange hashpartitioning(field1#101,200), None >+- TungstenAggregate(key=[field1#101], functions=[(sum((field2#74 / > 100.0)),mode=Partial,isDistinct=false)], output=[field1#101,sum#111]) > +- Project [field1#101,field2#74] > +- BroadcastHashJoin [field5#63L], [cast(cast(field4#97 as > decimal(20,0)) as bigint)], BuildRight > :- ConvertToUnsafe > : +- HiveTableScan [field2#74,field5#63L], MetastoreRelation > foo, table1, Some(a), [(ds#57 >= 2013-10-01),(ds#57 <= 2013-12-31)] > +- ConvertToUnsafe >+- HiveTableScan [field1#101,field4#97], MetastoreRelation > foo, table2, Some(b) > {code} > Query plan in 2.0 > {code} > == Physical Plan == > *HashAggregate(keys=[field1#160], functions=[sum((field2#133 / 100.0))]) > +- Exchange hashpartitioning(field1#160, 200) >+- *HashAggregate(keys=[field1#160], functions=[partial_sum((field2#133 / > 100.0))]) > +- *Project [field2#133, field1#160] > +- *BroadcastHashJoin [field5#122L], [cast(cast(field4#156 as > decimal(20,0)) as bigint)], Inner, BuildRight > :- *Filter isnotnull(field5#122L) > : +- HiveTableScan [field5#122L, field2#133], MetastoreRelation > foo, table1, a, [isnotnull(ds#116), (ds#116 >= 2013-10-01), (ds#116 <= > 2013-12-31)] 
> +- BroadcastExchange > HashedRelationBroadcastMode(List(cast(cast(input[0, string, false] as > decimal(20,0)) as bigint))) >+- *Filter isnotnull(field4#156) > +- HiveTableScan [field4#156, field1#160], > MetastoreRelation foo, table2, b > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17143) pyspark unable to create UDF: java.lang.RuntimeException: org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a directory: /tmp tmp
[ https://issues.apache.org/jira/browse/SPARK-17143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Davidson updated SPARK-17143: Attachment: udfBug.ipynb The attached notebook demonstrated the reported bug. Note it includes the output when run on my mac book pro. The bug report contains the stack trace when the same code is run in my data center > pyspark unable to create UDF: java.lang.RuntimeException: > org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a > directory: /tmp tmp > --- > > Key: SPARK-17143 > URL: https://issues.apache.org/jira/browse/SPARK-17143 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.1 > Environment: spark version: 1.6.1 > python version: 3.4.3 (default, Apr 1 2015, 18:10:40) > [GCC 4.8.2 20140120 (Red Hat 4.8.2-16)] >Reporter: Andrew Davidson > Attachments: udfBug.ipynb > > > For unknown reason I can not create UDF when I run the attached notebook on > my cluster. I get the following error > Py4JJavaError: An error occurred while calling > None.org.apache.spark.sql.hive.HiveContext. > : java.lang.RuntimeException: > org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a > directory: /tmp tmp > The notebook runs fine on my Mac > In general I am able to run non UDF spark code with out any trouble > I start the notebook server as the user “ec2-user" and uses master URL > spark://ec2-51-215-120-63.us-west-1.compute.amazonaws.com:6066 > I found the following message in the notebook server log file. I have log > level set to warn > 16/08/18 21:38:45 WARN ObjectStore: Version information not found in > metastore. hive.metastore.schema.verification is not enabled so recording the > schema version 1.2.0 > 16/08/18 21:38:45 WARN ObjectStore: Failed to get database default, returning > NoSuchObjectException > The cluster was originally created using > spark-1.6.1-bin-hadoop2.6/ec2/spark-ec2 > #from pyspark.sql import SQLContext, HiveContext > #sqlContext = SQLContext(sc) > > #from pyspark.sql import DataFrame > #from pyspark.sql import functions > > from pyspark.sql.types import StringType > from pyspark.sql.functions import udf > > print("spark version: {}".format(sc.version)) > > import sys > print("python version: {}".format(sys.version)) > spark version: 1.6.1 > python version: 3.4.3 (default, Apr 1 2015, 18:10:40) > [GCC 4.8.2 20140120 (Red Hat 4.8.2-16)] > # functions.lower() raises > # py4j.Py4JException: Method lower([class java.lang.String]) does not exist > # work around define a UDF > toLowerUDFRetType = StringType() > #toLowerUDF = udf(lambda s : s.lower(), toLowerUDFRetType) > toLowerUDF = udf(lambda s : s.lower(), StringType()) > You must build Spark with Hive. 
Export 'SPARK_HIVE=true' and run build/sbt > assembly > Py4JJavaErrorTraceback (most recent call last) > in () > 4 toLowerUDFRetType = StringType() > 5 #toLowerUDF = udf(lambda s : s.lower(), toLowerUDFRetType) > > 6 toLowerUDF = udf(lambda s : s.lower(), StringType()) > /root/spark/python/pyspark/sql/functions.py in udf(f, returnType) >1595 [Row(slen=5), Row(slen=3)] >1596 """ > -> 1597 return UserDefinedFunction(f, returnType) >1598 >1599 blacklist = ['map', 'since', 'ignore_unicode_prefix'] > /root/spark/python/pyspark/sql/functions.py in __init__(self, func, > returnType, name) >1556 self.returnType = returnType >1557 self._broadcast = None > -> 1558 self._judf = self._create_judf(name) >1559 >1560 def _create_judf(self, name): > /root/spark/python/pyspark/sql/functions.py in _create_judf(self, name) >1567 pickled_command, broadcast_vars, env, includes = > _prepare_for_python_RDD(sc, command, self) >1568 ctx = SQLContext.getOrCreate(sc) > -> 1569 jdt = ctx._ssql_ctx.parseDataType(self.returnType.json()) >1570 if name is None: >1571 name = f.__name__ if hasattr(f, '__name__') else > f.__class__.__name__ > /root/spark/python/pyspark/sql/context.py in _ssql_ctx(self) > 681 try: > 682 if not hasattr(self, '_scala_HiveContext'): > --> 683 self._scala_HiveContext = self._get_hive_ctx() > 684 return self._scala_HiveContext > 685 except Py4JError as e: > /root/spark/python/pyspark/sql/context.py in _get_hive_ctx(self) > 690 > 691 def _get_hive_ctx(self): > --> 692 return self._jvm.HiveContext(self._jsc.sc()) > 693 > 694 def refreshTable(self, tableName): > /root/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py in > __call__(self, *args) >1062 answ
[jira] [Commented] (SPARK-16922) Query with Broadcast Hash join fails due to executor OOM in Spark 2.0
[ https://issues.apache.org/jira/browse/SPARK-16922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15427250#comment-15427250 ] Sital Kedia commented on SPARK-16922: - The failure is deterministic, we are reproducing the issue for every run of the job (Its not only one job, there are multiple jobs that are failing because of this). For now, we have made a change to not use the LongHashedRelation to workaround this issue. > Query with Broadcast Hash join fails due to executor OOM in Spark 2.0 > - > > Key: SPARK-16922 > URL: https://issues.apache.org/jira/browse/SPARK-16922 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 2.0.0 >Reporter: Sital Kedia > > A query which used to work in Spark 1.6 fails with executor OOM in 2.0. > Stack trace - > {code} > at > org.apache.spark.unsafe.types.UTF8String.getBytes(UTF8String.java:229) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator$agg_VectorizedHashMap.hash$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator$agg_VectorizedHashMap.findOrInsert(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:161) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} > Query plan in Spark 1.6 > {code} > == Physical Plan == > TungstenAggregate(key=[field1#101], functions=[(sum((field2#74 / > 100.0)),mode=Final,isDistinct=false)], output=[field1#101,field3#3]) > +- TungstenExchange hashpartitioning(field1#101,200), None >+- TungstenAggregate(key=[field1#101], functions=[(sum((field2#74 / > 100.0)),mode=Partial,isDistinct=false)], output=[field1#101,sum#111]) > +- Project [field1#101,field2#74] > +- BroadcastHashJoin [field5#63L], [cast(cast(field4#97 as > decimal(20,0)) as bigint)], BuildRight > :- ConvertToUnsafe > : +- HiveTableScan [field2#74,field5#63L], MetastoreRelation > foo, table1, Some(a), [(ds#57 >= 2013-10-01),(ds#57 <= 2013-12-31)] > +- ConvertToUnsafe >+- HiveTableScan [field1#101,field4#97], MetastoreRelation > foo, table2, Some(b) > {code} > Query plan in 2.0 > {code} > == Physical Plan == > *HashAggregate(keys=[field1#160], functions=[sum((field2#133 / 100.0))]) > +- Exchange hashpartitioning(field1#160, 200) >+- *HashAggregate(keys=[field1#160], functions=[partial_sum((field2#133 / > 100.0))]) > +- *Project [field2#133, field1#160] > +- *BroadcastHashJoin [field5#122L], [cast(cast(field4#156 as > decimal(20,0)) as bigint)], Inner, BuildRight > :- *Filter isnotnull(field5#122L) > : 
+- HiveTableScan [field5#122L, field2#133], MetastoreRelation > foo, table1, a, [isnotnull(ds#116), (ds#116 >= 2013-10-01), (ds#116 <= > 2013-12-31)] > +- BroadcastExchange > HashedRelationBroadcastMode(List(cast(cast(input[0, string, false] as > decimal(20,0)) as bigint))) >+- *Filter isnotnull(field4#156) > +- HiveTableScan [field4#156, field1#160], > MetastoreRelation foo, table2, b > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
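Not the reporter's fix (their change avoids {{LongHashedRelation}} inside Spark itself); a configuration-level way to sidestep the broadcast hash join path entirely, shown here only as a sketch, is to disable auto-broadcast so the planner falls back to a sort-merge join:
{code}
import org.apache.spark.sql.SparkSession

object DisableBroadcastJoinSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("no-broadcast-join").master("local[*]").getOrCreate()

    // -1 disables auto-broadcast, so joins that would have used
    // BroadcastHashJoin are planned as SortMergeJoin instead.
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1L)

    spark.stop()
  }
}
{code}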
[jira] [Created] (SPARK-17143) pyspark unable to create UDF: java.lang.RuntimeException: org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a directory: /tmp tmp
Andrew Davidson created SPARK-17143: --- Summary: pyspark unable to create UDF: java.lang.RuntimeException: org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a directory: /tmp tmp Key: SPARK-17143 URL: https://issues.apache.org/jira/browse/SPARK-17143 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.6.1 Environment: spark version: 1.6.1 python version: 3.4.3 (default, Apr 1 2015, 18:10:40) [GCC 4.8.2 20140120 (Red Hat 4.8.2-16)] Reporter: Andrew Davidson For unknown reason I can not create UDF when I run the attached notebook on my cluster. I get the following error Py4JJavaError: An error occurred while calling None.org.apache.spark.sql.hive.HiveContext. : java.lang.RuntimeException: org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a directory: /tmp tmp The notebook runs fine on my Mac In general I am able to run non UDF spark code with out any trouble I start the notebook server as the user “ec2-user" and uses master URL spark://ec2-51-215-120-63.us-west-1.compute.amazonaws.com:6066 I found the following message in the notebook server log file. I have log level set to warn 16/08/18 21:38:45 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0 16/08/18 21:38:45 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException The cluster was originally created using spark-1.6.1-bin-hadoop2.6/ec2/spark-ec2 #from pyspark.sql import SQLContext, HiveContext #sqlContext = SQLContext(sc) #from pyspark.sql import DataFrame #from pyspark.sql import functions from pyspark.sql.types import StringType from pyspark.sql.functions import udf print("spark version: {}".format(sc.version)) import sys print("python version: {}".format(sys.version)) spark version: 1.6.1 python version: 3.4.3 (default, Apr 1 2015, 18:10:40) [GCC 4.8.2 20140120 (Red Hat 4.8.2-16)] # functions.lower() raises # py4j.Py4JException: Method lower([class java.lang.String]) does not exist # work around define a UDF toLowerUDFRetType = StringType() #toLowerUDF = udf(lambda s : s.lower(), toLowerUDFRetType) toLowerUDF = udf(lambda s : s.lower(), StringType()) You must build Spark with Hive. 
Export 'SPARK_HIVE=true' and run build/sbt assembly Py4JJavaErrorTraceback (most recent call last) in () 4 toLowerUDFRetType = StringType() 5 #toLowerUDF = udf(lambda s : s.lower(), toLowerUDFRetType) > 6 toLowerUDF = udf(lambda s : s.lower(), StringType()) /root/spark/python/pyspark/sql/functions.py in udf(f, returnType) 1595 [Row(slen=5), Row(slen=3)] 1596 """ -> 1597 return UserDefinedFunction(f, returnType) 1598 1599 blacklist = ['map', 'since', 'ignore_unicode_prefix'] /root/spark/python/pyspark/sql/functions.py in __init__(self, func, returnType, name) 1556 self.returnType = returnType 1557 self._broadcast = None -> 1558 self._judf = self._create_judf(name) 1559 1560 def _create_judf(self, name): /root/spark/python/pyspark/sql/functions.py in _create_judf(self, name) 1567 pickled_command, broadcast_vars, env, includes = _prepare_for_python_RDD(sc, command, self) 1568 ctx = SQLContext.getOrCreate(sc) -> 1569 jdt = ctx._ssql_ctx.parseDataType(self.returnType.json()) 1570 if name is None: 1571 name = f.__name__ if hasattr(f, '__name__') else f.__class__.__name__ /root/spark/python/pyspark/sql/context.py in _ssql_ctx(self) 681 try: 682 if not hasattr(self, '_scala_HiveContext'): --> 683 self._scala_HiveContext = self._get_hive_ctx() 684 return self._scala_HiveContext 685 except Py4JError as e: /root/spark/python/pyspark/sql/context.py in _get_hive_ctx(self) 690 691 def _get_hive_ctx(self): --> 692 return self._jvm.HiveContext(self._jsc.sc()) 693 694 def refreshTable(self, tableName): /root/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py in __call__(self, *args) 1062 answer = self._gateway_client.send_command(command) 1063 return_value = get_return_value( -> 1064 answer, self._gateway_client, None, self._fqn) 1065 1066 for temp_arg in temp_args: /root/spark/python/pyspark/sql/utils.py in deco(*a, **kw) 43 def deco(*a, **kw): 44 try: ---> 45 return f(*a, **kw) 46 except py4j.protocol.Py4JJavaError as e: 47 s = e.java_exception.toString() /root/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name) 306 raise Py4JJavaError( 307 "An error occurred
[jira] [Comment Edited] (SPARK-16922) Query with Broadcast Hash join fails due to executor OOM in Spark 2.0
[ https://issues.apache.org/jira/browse/SPARK-16922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15427241#comment-15427241 ] Davies Liu edited comment on SPARK-16922 at 8/18/16 9:58 PM: - Is this failure determistic or not? Happened on every task or some or them? Could you also try to disable the dense mode? was (Author: davies): Is this failure determistic or not? Happened on every task or some or them? > Query with Broadcast Hash join fails due to executor OOM in Spark 2.0 > - > > Key: SPARK-16922 > URL: https://issues.apache.org/jira/browse/SPARK-16922 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 2.0.0 >Reporter: Sital Kedia > > A query which used to work in Spark 1.6 fails with executor OOM in 2.0. > Stack trace - > {code} > at > org.apache.spark.unsafe.types.UTF8String.getBytes(UTF8String.java:229) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator$agg_VectorizedHashMap.hash$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator$agg_VectorizedHashMap.findOrInsert(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:161) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} > Query plan in Spark 1.6 > {code} > == Physical Plan == > TungstenAggregate(key=[field1#101], functions=[(sum((field2#74 / > 100.0)),mode=Final,isDistinct=false)], output=[field1#101,field3#3]) > +- TungstenExchange hashpartitioning(field1#101,200), None >+- TungstenAggregate(key=[field1#101], functions=[(sum((field2#74 / > 100.0)),mode=Partial,isDistinct=false)], output=[field1#101,sum#111]) > +- Project [field1#101,field2#74] > +- BroadcastHashJoin [field5#63L], [cast(cast(field4#97 as > decimal(20,0)) as bigint)], BuildRight > :- ConvertToUnsafe > : +- HiveTableScan [field2#74,field5#63L], MetastoreRelation > foo, table1, Some(a), [(ds#57 >= 2013-10-01),(ds#57 <= 2013-12-31)] > +- ConvertToUnsafe >+- HiveTableScan [field1#101,field4#97], MetastoreRelation > foo, table2, Some(b) > {code} > Query plan in 2.0 > {code} > == Physical Plan == > *HashAggregate(keys=[field1#160], functions=[sum((field2#133 / 100.0))]) > +- Exchange hashpartitioning(field1#160, 200) >+- *HashAggregate(keys=[field1#160], functions=[partial_sum((field2#133 / > 100.0))]) > +- *Project [field2#133, field1#160] > +- *BroadcastHashJoin [field5#122L], [cast(cast(field4#156 as > decimal(20,0)) as bigint)], Inner, BuildRight > :- *Filter isnotnull(field5#122L) > : +- 
HiveTableScan [field5#122L, field2#133], MetastoreRelation > foo, table1, a, [isnotnull(ds#116), (ds#116 >= 2013-10-01), (ds#116 <= > 2013-12-31)] > +- BroadcastExchange > HashedRelationBroadcastMode(List(cast(cast(input[0, string, false] as > decimal(20,0)) as bigint))) >+- *Filter isnotnull(field4#156) > +- HiveTableScan [field4#156, field1#160], > MetastoreRelation foo, table2, b > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16922) Query with Broadcast Hash join fails due to executor OOM in Spark 2.0
[ https://issues.apache.org/jira/browse/SPARK-16922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15427241#comment-15427241 ] Davies Liu commented on SPARK-16922: Is this failure determistic or not? Happened on every task or some or them? > Query with Broadcast Hash join fails due to executor OOM in Spark 2.0 > - > > Key: SPARK-16922 > URL: https://issues.apache.org/jira/browse/SPARK-16922 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 2.0.0 >Reporter: Sital Kedia > > A query which used to work in Spark 1.6 fails with executor OOM in 2.0. > Stack trace - > {code} > at > org.apache.spark.unsafe.types.UTF8String.getBytes(UTF8String.java:229) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator$agg_VectorizedHashMap.hash$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator$agg_VectorizedHashMap.findOrInsert(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:161) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} > Query plan in Spark 1.6 > {code} > == Physical Plan == > TungstenAggregate(key=[field1#101], functions=[(sum((field2#74 / > 100.0)),mode=Final,isDistinct=false)], output=[field1#101,field3#3]) > +- TungstenExchange hashpartitioning(field1#101,200), None >+- TungstenAggregate(key=[field1#101], functions=[(sum((field2#74 / > 100.0)),mode=Partial,isDistinct=false)], output=[field1#101,sum#111]) > +- Project [field1#101,field2#74] > +- BroadcastHashJoin [field5#63L], [cast(cast(field4#97 as > decimal(20,0)) as bigint)], BuildRight > :- ConvertToUnsafe > : +- HiveTableScan [field2#74,field5#63L], MetastoreRelation > foo, table1, Some(a), [(ds#57 >= 2013-10-01),(ds#57 <= 2013-12-31)] > +- ConvertToUnsafe >+- HiveTableScan [field1#101,field4#97], MetastoreRelation > foo, table2, Some(b) > {code} > Query plan in 2.0 > {code} > == Physical Plan == > *HashAggregate(keys=[field1#160], functions=[sum((field2#133 / 100.0))]) > +- Exchange hashpartitioning(field1#160, 200) >+- *HashAggregate(keys=[field1#160], functions=[partial_sum((field2#133 / > 100.0))]) > +- *Project [field2#133, field1#160] > +- *BroadcastHashJoin [field5#122L], [cast(cast(field4#156 as > decimal(20,0)) as bigint)], Inner, BuildRight > :- *Filter isnotnull(field5#122L) > : +- HiveTableScan [field5#122L, field2#133], MetastoreRelation > foo, table1, a, [isnotnull(ds#116), (ds#116 >= 2013-10-01), (ds#116 <= > 2013-12-31)] > +- BroadcastExchange > 
HashedRelationBroadcastMode(List(cast(cast(input[0, string, false] as > decimal(20,0)) as bigint))) >+- *Filter isnotnull(field4#156) > +- HiveTableScan [field4#156, field1#160], > MetastoreRelation foo, table2, b > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17142) Complex query triggers binding error in HashAggregateExec
[ https://issues.apache.org/jira/browse/SPARK-17142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15427236#comment-15427236 ] Josh Rosen commented on SPARK-17142: Interestingly, this query executes fine if the repeated addition in the SELECT clause is replaced by {{* 2}} instead. > Complex query triggers binding error in HashAggregateExec > - > > Key: SPARK-17142 > URL: https://issues.apache.org/jira/browse/SPARK-17142 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Josh Rosen >Priority: Blocker > > The following example runs successfully on Spark 2.0.0 but fails in the > current master (as of b72bb62d421840f82d663c6b8e3922bd14383fbb, if not > earlier): > {code} > spark.sql("set spark.sql.crossJoin.enabled=true") > sc.parallelize(Seq(0)).toDF("int_col_1").createOrReplaceTempView("table_4") > sc.parallelize(Seq(0)).toDF("bigint_col_2").createOrReplaceTempView("table_2") > val query = """ > SELECT > ((t2.int_col) + (t1.bigint_col_2)) + ((t2.int_col) + (t1.bigint_col_2)) AS > int_col_1 > FROM table_2 t1 > INNER JOIN ( > SELECT > LEAST(IF(False, LAG(0) OVER (ORDER BY t2.int_col_1 DESC), -230), > -991) AS int_col, > (t2.int_col_1) + (t1.int_col_1) AS int_col_2, > (t1.int_col_1) + (t2.int_col_1) AS int_col_3, > t2.int_col_1 > FROM > table_4 t1, > table_4 t2 > GROUP BY > (t1.int_col_1) + (t2.int_col_1), > t2.int_col_1 > ) t2 > WHERE (t2.int_col_3) NOT IN (t2.int_col, t2.int_col_1) > GROUP BY (t2.int_col) + (t1.bigint_col_2) > """ > spark.sql(query).collect() > {code} > This fails with the following exception: > {code} > org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding > attribute, tree: bigint_col_2#65 > at > org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:88) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:87) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:278) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:322) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:320) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:322) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:320) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284) > at > 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:322) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:320) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:268) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:87) > at > org.apache.spark.sql.execution.aggregate.HashAggregateExec$$anonfun$32.apply(HashAggregateExec.scala:455) > at > org.apache.spark.sql.execution.aggregate.HashAggregateExec$$anonfun$32.apply(HashAggregateExec.scala:454) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.Traver
[jira] [Created] (SPARK-17142) Complex query triggers binding error in HashAggregateExec
Josh Rosen created SPARK-17142: -- Summary: Complex query triggers binding error in HashAggregateExec Key: SPARK-17142 URL: https://issues.apache.org/jira/browse/SPARK-17142 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.0 Reporter: Josh Rosen Priority: Blocker The following example runs successfully on Spark 2.0.0 but fails in the current master (as of b72bb62d421840f82d663c6b8e3922bd14383fbb, if not earlier): {code} spark.sql("set spark.sql.crossJoin.enabled=true") sc.parallelize(Seq(0)).toDF("int_col_1").createOrReplaceTempView("table_4") sc.parallelize(Seq(0)).toDF("bigint_col_2").createOrReplaceTempView("table_2") val query = """ SELECT ((t2.int_col) + (t1.bigint_col_2)) + ((t2.int_col) + (t1.bigint_col_2)) AS int_col_1 FROM table_2 t1 INNER JOIN ( SELECT LEAST(IF(False, LAG(0) OVER (ORDER BY t2.int_col_1 DESC), -230), -991) AS int_col, (t2.int_col_1) + (t1.int_col_1) AS int_col_2, (t1.int_col_1) + (t2.int_col_1) AS int_col_3, t2.int_col_1 FROM table_4 t1, table_4 t2 GROUP BY (t1.int_col_1) + (t2.int_col_1), t2.int_col_1 ) t2 WHERE (t2.int_col_3) NOT IN (t2.int_col, t2.int_col_1) GROUP BY (t2.int_col) + (t1.bigint_col_2) """ sql(query).collect() {code} This fails with the following exception: {code} org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute, tree: bigint_col_2#65 at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56) at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:88) at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:87) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:278) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:322) at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179) at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:320) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:322) at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179) at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:320) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:322) at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179) at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:320) at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284) at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:268) at org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:87) at org.apache.spark.sql.execution.aggregate.HashAggregateExec$$anonfun$32.apply(HashAggregateExec.scala:455) at org.apache.spark.sql.execution.aggregate.HashAggregateExec$$anonfun$32.apply(HashAggregateExec.scala:454) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.immutable.List.foreach(List.scala:381) at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at scala.collection.immutable.List.map(List.scala:285) at org.apache.spark.sql.execution.aggregate.HashAggregateExec.generateResultCode(HashAggregateExec.scala:454) at org.apache.spark.sql.execution.aggregate.HashAggregateExec.doProduceWithKeys(HashAggregateExec.scala:538) at org.apache.spark.sql.execution.aggregate.HashAggregateExec.doProduce
[jira] [Updated] (SPARK-17142) Complex query triggers binding error in HashAggregateExec
[ https://issues.apache.org/jira/browse/SPARK-17142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-17142: --- Description: The following example runs successfully on Spark 2.0.0 but fails in the current master (as of b72bb62d421840f82d663c6b8e3922bd14383fbb, if not earlier): {code} spark.sql("set spark.sql.crossJoin.enabled=true") sc.parallelize(Seq(0)).toDF("int_col_1").createOrReplaceTempView("table_4") sc.parallelize(Seq(0)).toDF("bigint_col_2").createOrReplaceTempView("table_2") val query = """ SELECT ((t2.int_col) + (t1.bigint_col_2)) + ((t2.int_col) + (t1.bigint_col_2)) AS int_col_1 FROM table_2 t1 INNER JOIN ( SELECT LEAST(IF(False, LAG(0) OVER (ORDER BY t2.int_col_1 DESC), -230), -991) AS int_col, (t2.int_col_1) + (t1.int_col_1) AS int_col_2, (t1.int_col_1) + (t2.int_col_1) AS int_col_3, t2.int_col_1 FROM table_4 t1, table_4 t2 GROUP BY (t1.int_col_1) + (t2.int_col_1), t2.int_col_1 ) t2 WHERE (t2.int_col_3) NOT IN (t2.int_col, t2.int_col_1) GROUP BY (t2.int_col) + (t1.bigint_col_2) """ spark.sql(query).collect() {code} This fails with the following exception: {code} org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute, tree: bigint_col_2#65 at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56) at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:88) at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:87) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:278) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:322) at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179) at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:320) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:322) at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179) at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:320) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:322) at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179) at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:320) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284) at 
org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:268) at org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:87) at org.apache.spark.sql.execution.aggregate.HashAggregateExec$$anonfun$32.apply(HashAggregateExec.scala:455) at org.apache.spark.sql.execution.aggregate.HashAggregateExec$$anonfun$32.apply(HashAggregateExec.scala:454) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.immutable.List.foreach(List.scala:381) at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at scala.collection.immutable.List.map(List.scala:285) at org.apache.spark.sql.execution.aggregate.HashAggregateExec.generateResultCode(HashAggregateExec.scala:454) at org.apache.spark.sql.execution.aggregate.HashAggregateExec.doProduceWithKeys(HashAggregateExec.scala:538) at org.apache.spark.sql.execution.aggregate.HashAggregateExec.doProduce(HashAggregateExec.scala:145) at org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83) at org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.
[jira] [Commented] (SPARK-17133) Improvements to linear methods in Spark
[ https://issues.apache.org/jira/browse/SPARK-17133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15427144#comment-15427144 ] Xin Ren commented on SPARK-17133: - hi [~sethah] I'd like to help on this, please count me in. Thanks a lot :) > Improvements to linear methods in Spark > --- > > Key: SPARK-17133 > URL: https://issues.apache.org/jira/browse/SPARK-17133 > Project: Spark > Issue Type: Umbrella > Components: ML, MLlib >Reporter: Seth Hendrickson > > This JIRA is for tracking several improvements that we should make to > Linear/Logistic regression in Spark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16508) Fix documentation warnings found by R CMD check
[ https://issues.apache.org/jira/browse/SPARK-16508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15427134#comment-15427134 ] Apache Spark commented on SPARK-16508: -- User 'junyangq' has created a pull request for this issue: https://github.com/apache/spark/pull/14705 > Fix documentation warnings found by R CMD check > --- > > Key: SPARK-16508 > URL: https://issues.apache.org/jira/browse/SPARK-16508 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Shivaram Venkataraman > > A full list of warnings after the fixes in SPARK-16507 is at > https://gist.github.com/shivaram/62866c4ca59c5d34b8963939cf04b5eb -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16904) Removal of Hive Built-in Hash Functions and TestHiveFunctionRegistry
[ https://issues.apache.org/jira/browse/SPARK-16904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15427069#comment-15427069 ] Tejas Patil commented on SPARK-16904: - Is Spark's hashing function semantically equivalent to Hive's? AFAIK, it's not. I think it would be better to have a mode that allows using Hive's hash method. An example case where this would be needed: users running a query in Hive want to switch to Spark. As this happens, they want to verify whether the data produced is the same. Also, for a brief time the pipeline would run in both engines, and upstream consumers of the generated data should not see differences due to running on different engines. > Removal of Hive Built-in Hash Functions and TestHiveFunctionRegistry > > > Key: SPARK-16904 > URL: https://issues.apache.org/jira/browse/SPARK-16904 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li > > Currently, the Hive built-in `hash` function is not being used in Spark since > Spark 2.0. The public interface does not allow users to unregister the Spark > built-in functions. Thus, users will never use Hive's built-in `hash` > function. > The only exception here is `TestHiveFunctionRegistry`, which allows users to > unregister the built-in functions. Thus, we can load Hive's hash function in > the test cases. If we disable it, 10+ test cases will fail because the > results are different from the Hive golden answer files. > This PR is to remove `hash` from the list of `hiveFunctions` in > `HiveSessionCatalog`. It will also remove `TestHiveFunctionRegistry`. This > removal makes us easier to remove `TestHiveSessionState` in the future. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
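As a concrete illustration of the parity-check scenario described above, here is a rough sketch (not Spark's or Hive's internal API; the table name and columns are made up) of how one might compare data produced by the two engines. Because Spark's built-in {{hash}} is Murmur3-based and not semantically equivalent to Hive's {{hash}}, each checksum has to be computed with that engine's own function, which is exactly why a mode exposing Hive's hash inside Spark would help.
{code}
// Spark side of a rough cross-engine checksum. Assumes a table `events`
// with columns (id, payload) that exists in both Hive and Spark.
val sparkChecksum = spark.sql(
  "SELECT SUM(CAST(hash(id, payload) AS BIGINT)) AS checksum FROM events")
sparkChecksum.show()

// Hive side (run in Hive itself, using Hive's own hash()):
//   SELECT SUM(CAST(hash(id, payload) AS BIGINT)) AS checksum FROM events;
// The two numbers are only meaningful to compare if both sides use the same
// hash implementation, hence the request for a Hive-hash mode in Spark.
{code}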
[jira] [Updated] (SPARK-16077) Python UDF may fail because of six
[ https://issues.apache.org/jira/browse/SPARK-16077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-16077: - Fix Version/s: 1.6.3 > Python UDF may fail because of six > -- > > Key: SPARK-16077 > URL: https://issues.apache.org/jira/browse/SPARK-16077 > Project: Spark > Issue Type: Bug > Components: PySpark >Reporter: Davies Liu >Assignee: Davies Liu > Fix For: 1.6.3, 2.0.0 > > > six or other package may break pickle.whichmodule() in pickle: > https://bitbucket.org/gutworth/six/issues/63/importing-six-breaks-pickling -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17141) MinMaxScaler behaves weird when min and max have the same value and some values are NaN
[ https://issues.apache.org/jira/browse/SPARK-17141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15426988#comment-15426988 ] Sean Owen commented on SPARK-17141: --- Summarize the reproduction here? best to put it all here for the record. If you have a small fix and can describe it then someone else can commit it, though I think making a PR is a useful skill and not that hard. Worth taking a shot at it. > MinMaxScaler behaves weird when min and max have the same value and some > values are NaN > --- > > Key: SPARK-17141 > URL: https://issues.apache.org/jira/browse/SPARK-17141 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 1.6.2, 2.0.0 > Environment: Databrick's Community, Spark 2.0 + Scala 2.10 >Reporter: Alberto Bonsanto >Priority: Trivial > > When you have a {{DataFrame}} with a column named {{features}}, which is a > {{DenseVector}} and the *maximum* and *minimum* and some values are > {{Double.NaN}} they get replaced by 0.5, and they should remain with the same > value, I believe. > I know how to fix it, but I haven't ever made a pull request. You can check > the bug in this > [notebook|https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/2485090270202665/3126465289264547/8589256059752547/latest.html] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17141) MinMaxScaler behaves weird when min and max have the same value and some values are NaN
Alberto Bonsanto created SPARK-17141: Summary: MinMaxScaler behaves weird when min and max have the same value and some values are NaN Key: SPARK-17141 URL: https://issues.apache.org/jira/browse/SPARK-17141 Project: Spark Issue Type: Bug Components: ML Affects Versions: 2.0.0, 1.6.2 Environment: Databricks Community, Spark 2.0 + Scala 2.10 Reporter: Alberto Bonsanto Priority: Trivial When you have a {{DataFrame}} with a column named {{features}} that is a {{DenseVector}}, and the *maximum* and *minimum* are the same value while some values are {{Double.NaN}}, the NaN values get replaced by 0.5 when they should remain NaN, I believe. I know how to fix it, but I haven't ever made a pull request. You can check the bug in this [notebook|https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/2485090270202665/3126465289264547/8589256059752547/latest.html] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
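Since the reproduction currently lives only in the linked notebook, here is a minimal self-contained sketch of the reported behavior, responding to the request above to summarize it on the JIRA. The column names and values are illustrative, and it assumes a {{spark}} session as in spark-shell.
{code}
import org.apache.spark.ml.feature.MinMaxScaler
import org.apache.spark.ml.linalg.Vectors

// One feature whose observed min and max are both 1.0, plus a NaN value.
val df = spark.createDataFrame(Seq(
  (0, Vectors.dense(1.0)),
  (1, Vectors.dense(1.0)),
  (2, Vectors.dense(Double.NaN))
)).toDF("id", "features")

val scaler = new MinMaxScaler()
  .setInputCol("features")
  .setOutputCol("scaled")

// Reported behavior: the NaN entry comes back as 0.5 instead of staying NaN,
// because the min == max case maps the feature to the midpoint of the range.
scaler.fit(df).transform(df).show(false)
{code}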
[jira] [Commented] (SPARK-17132) binaryFiles method can't handle paths with embedded commas
[ https://issues.apache.org/jira/browse/SPARK-17132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15426962#comment-15426962 ] Sean Owen commented on SPARK-17132: --- Yeah, that would be a solution. It actually affects all related API methods of SparkContext, not just one. I'm not clear if it's worth adding a bunch to the RDD API now in Spark 2, but it's not out of the question. It should work to escape the commas with \, or at least that's what the Hadoop classes appear to want done. I suppose that's the intended usage, though I also would prefer a more explicit seq argument. > binaryFiles method can't handle paths with embedded commas > -- > > Key: SPARK-17132 > URL: https://issues.apache.org/jira/browse/SPARK-17132 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 1.2.0, 1.2.1, 1.2.2, 1.3.0, 1.3.1, 1.4.0, 1.4.1, 1.5.0, > 1.5.1, 1.5.2, 1.6.0, 1.6.1, 1.6.2, 2.0.0 >Reporter: Maximilian Najork > > A path with an embedded comma is treated as two separate paths by > binaryFiles. Since commas are legal characters in paths, this behavior is > incorrect. I recommend overloading binaryFiles to accept an array of path > strings in addition to a string of comma-separated paths. Since setInputPaths > is already overloaded to accept either form, this should be relatively > low-effort. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
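To make the two usages concrete, a small sketch (made-up paths) of the comma-separated form and of the backslash escaping suggested above; as the next comment notes, the reporter did not get the escaped form to work, so treat it as the theory being discussed rather than a confirmed workaround.
{code}
// Comma-separated list: Hadoop's setInputPaths splits on commas, so this is
// read as two separate paths, /data/a.bin and /data/b.bin.
val two = sc.binaryFiles("/data/a.bin,/data/b.bin")

// A single path that happens to contain a comma, escaped the way the Hadoop
// path-parsing code appears to expect (a literal backslash before the comma).
val one = sc.binaryFiles("/data/file\\,with\\,comma.bin")
{code}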
[jira] [Commented] (SPARK-17132) binaryFiles method can't handle paths with embedded commas
[ https://issues.apache.org/jira/browse/SPARK-17132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15426965#comment-15426965 ] Maximilian Najork commented on SPARK-17132: --- I tried escaping the commas prior to filing this ticket and it still exhibited the behavior. It's possible I was doing something incorrectly. > binaryFiles method can't handle paths with embedded commas > -- > > Key: SPARK-17132 > URL: https://issues.apache.org/jira/browse/SPARK-17132 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 1.2.0, 1.2.1, 1.2.2, 1.3.0, 1.3.1, 1.4.0, 1.4.1, 1.5.0, > 1.5.1, 1.5.2, 1.6.0, 1.6.1, 1.6.2, 2.0.0 >Reporter: Maximilian Najork > > A path with an embedded comma is treated as two separate paths by > binaryFiles. Since commas are legal characters in paths, this behavior is > incorrect. I recommend overloading binaryFiles to accept an array of path > strings in addition to a string of comma-separated paths. Since setInputPaths > is already overloaded to accept either form, this should be relatively > low-effort. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17140) Add initial model to MultinomialLogisticRegression
Seth Hendrickson created SPARK-17140: Summary: Add initial model to MultinomialLogisticRegression Key: SPARK-17140 URL: https://issues.apache.org/jira/browse/SPARK-17140 Project: Spark Issue Type: Sub-task Components: ML Reporter: Seth Hendrickson We should add initial model support to Multinomial logistic regression. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17138) Python API for multinomial logistic regression
Seth Hendrickson created SPARK-17138: Summary: Python API for multinomial logistic regression Key: SPARK-17138 URL: https://issues.apache.org/jira/browse/SPARK-17138 Project: Spark Issue Type: Sub-task Components: ML Reporter: Seth Hendrickson Once [SPARK-7159|https://issues.apache.org/jira/browse/SPARK-7159] is merged, we should make a Python API for it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17139) Add model summary for MultinomialLogisticRegression
Seth Hendrickson created SPARK-17139: Summary: Add model summary for MultinomialLogisticRegression Key: SPARK-17139 URL: https://issues.apache.org/jira/browse/SPARK-17139 Project: Spark Issue Type: Sub-task Components: ML Reporter: Seth Hendrickson Add model summary to multinomial logistic regression using same interface as in other ML models. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17137) Add compressed support for multinomial logistic regression coefficients
Seth Hendrickson created SPARK-17137: Summary: Add compressed support for multinomial logistic regression coefficients Key: SPARK-17137 URL: https://issues.apache.org/jira/browse/SPARK-17137 Project: Spark Issue Type: Sub-task Components: ML Reporter: Seth Hendrickson Priority: Minor For sparse coefficients in MLOR, such as when high L1 regularization, it may be more efficient to store coefficients in compressed format. We can add this option to MLOR and perhaps to do some performance tests to verify improvements. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
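As a rough illustration of the storage trade-off being proposed here, using {{Vector.compressed}} from {{ml.linalg}} (an existing helper, to the best of my knowledge), not the MLOR change itself:
{code}
import org.apache.spark.ml.linalg.{Vector, Vectors}

// A mostly-zero coefficient vector, as strong L1 regularization tends to produce.
val coefficients: Vector = Vectors.dense(0.0, 0.0, 1.5, 0.0, 0.0, 0.0, 0.0, -2.0)

// compressed keeps whichever of the dense or sparse representation is smaller;
// here that is the sparse form, (8,[2,7],[1.5,-2.0]).
val stored = coefficients.compressed
println(stored)
{code}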
[jira] [Created] (SPARK-17136) Design optimizer interface for ML algorithms
Seth Hendrickson created SPARK-17136: Summary: Design optimizer interface for ML algorithms Key: SPARK-17136 URL: https://issues.apache.org/jira/browse/SPARK-17136 Project: Spark Issue Type: Sub-task Components: ML Reporter: Seth Hendrickson We should consider designing an interface that allows users to use their own optimizers in some of the ML algorithms, similar to MLlib. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17133) Improvements to linear methods in Spark
[ https://issues.apache.org/jira/browse/SPARK-17133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Seth Hendrickson updated SPARK-17133: - Description: This JIRA is for tracking several improvements that we should make to Linear/Logistic regression in Spark. (was: This JIRA is for tracking several improvements that we should make to Linear/Logistic regression in Spark. Many of them are follow ups to [SPARK-7159|https://issues.apache.org/jira/browse/SPARK-7159].) > Improvements to linear methods in Spark > --- > > Key: SPARK-17133 > URL: https://issues.apache.org/jira/browse/SPARK-17133 > Project: Spark > Issue Type: Umbrella > Components: ML, MLlib >Reporter: Seth Hendrickson > > This JIRA is for tracking several improvements that we should make to > Linear/Logistic regression in Spark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17135) Consolidate code in linear/logistic regression where possible
Seth Hendrickson created SPARK-17135: Summary: Consolidate code in linear/logistic regression where possible Key: SPARK-17135 URL: https://issues.apache.org/jira/browse/SPARK-17135 Project: Spark Issue Type: Sub-task Components: ML Reporter: Seth Hendrickson Priority: Minor There is shared code between MultinomialLogisticRegression, LogisticRegression, and LinearRegression. We should consolidate where possible. Also, we should move some code out of LogisticRegression.scala into a separate util file or similar. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17090) Make tree aggregation level in linear/logistic regression configurable
[ https://issues.apache.org/jira/browse/SPARK-17090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Seth Hendrickson updated SPARK-17090: - Issue Type: Sub-task (was: Improvement) Parent: SPARK-17133 > Make tree aggregation level in linear/logistic regression configurable > -- > > Key: SPARK-17090 > URL: https://issues.apache.org/jira/browse/SPARK-17090 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Seth Hendrickson >Priority: Minor > > Linear/logistic regression use treeAggregate with default aggregation depth > for collecting coefficient gradient updates to the driver. For high > dimensional problems, this can cause OOM errors on the driver. We should make > it configurable, perhaps via an expert param, so that users can avoid this > problem if their data has many features. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17134) Use level 2 BLAS operations in LogisticAggregator
Seth Hendrickson created SPARK-17134: Summary: Use level 2 BLAS operations in LogisticAggregator Key: SPARK-17134 URL: https://issues.apache.org/jira/browse/SPARK-17134 Project: Spark Issue Type: Sub-task Components: ML Reporter: Seth Hendrickson Multinomial logistic regression uses LogisticAggregator class for gradient updates. We should look into refactoring MLOR to use level 2 BLAS operations for the updates. Performance testing should be done to show improvements. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17133) Improvements to linear methods in Spark
Seth Hendrickson created SPARK-17133: Summary: Improvements to linear methods in Spark Key: SPARK-17133 URL: https://issues.apache.org/jira/browse/SPARK-17133 Project: Spark Issue Type: Umbrella Components: ML, MLlib Reporter: Seth Hendrickson This JIRA is for tracking several improvements that we should make to Linear/Logistic regression in Spark. Many of them are follow ups to [SPARK-7159|https://issues.apache.org/jira/browse/SPARK-7159]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17132) binaryFiles method can't handle paths with embedded commas
Maximilian Najork created SPARK-17132: - Summary: binaryFiles method can't handle paths with embedded commas Key: SPARK-17132 URL: https://issues.apache.org/jira/browse/SPARK-17132 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 2.0.0, 1.6.2, 1.6.1, 1.6.0, 1.5.2, 1.5.1, 1.5.0, 1.4.1, 1.4.0, 1.3.1, 1.3.0, 1.2.2, 1.2.1, 1.2.0 Reporter: Maximilian Najork A path with an embedded comma is treated as two separate paths by binaryFiles. Since commas are legal characters in paths, this behavior is incorrect. I recommend overloading binaryFiles to accept an array of path strings in addition to a string of comma-separated paths. Since setInputPaths is already overloaded to accept either form, this should be relatively low-effort. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16981) For CSV files nullValue is not respected for Date/Time data type
[ https://issues.apache.org/jira/browse/SPARK-16981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lev updated SPARK-16981: Priority: Critical (was: Major) > For CSV files nullValue is not respected for Date/Time data type > > > Key: SPARK-16981 > URL: https://issues.apache.org/jira/browse/SPARK-16981 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Lev >Priority: Critical > > Test case > val struct = StructType(Seq(StructField("col1", StringType, true), > StructField("col2", TimestampType, true), StructField("col3", StringType, true))) > val cq = sqlContext.readStream > .format("csv") > .option("nullValue", " ") > .schema(struct) > .load(s"somepath") > .writeStream > content of the file > "abc", ,"def" > Result: > Exception is thrown: > scala.MatchError: java.lang.IllegalArgumentException: Timestamp format must > be yyyy-mm-dd hh:mm:ss[.fffffffff] (of class > java.lang.IllegalArgumentException) > Code analysis: > Problem is caused by code in castTo method of CSVTypeCast object > For all data types except temporal there is the following check: > if (datum == options.nullValue && nullable) { > null > } > But for temporal types it is missing -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
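For anyone reproducing this without the streaming setup, a batch-mode sketch of the same shape may be easier to run; the file path is hypothetical and the behavior annotations only restate what the report above describes.
{code}
import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("col1", StringType, true),
  StructField("col2", TimestampType, true),
  StructField("col3", StringType, true)))

// The CSV file contains a single line:  "abc", ,"def"
val df = spark.read
  .format("csv")
  .option("nullValue", " ")
  .schema(schema)
  .load("/path/to/somefile.csv")

// Reported behavior: instead of col2 becoming null, parsing the " " cell as a
// timestamp fails, because the nullValue check is skipped for temporal types.
df.show()
{code}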
[jira] [Commented] (SPARK-17090) Make tree aggregation level in linear/logistic regression configurable
[ https://issues.apache.org/jira/browse/SPARK-17090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15426876#comment-15426876 ] DB Tsai commented on SPARK-17090: - Since coming up with a formula for determining the aggregation depth is pretty tricky (it will depend on the driver's memory setting, the dimensionality of the problem, the number of partitions, etc.), this will take longer to discuss and implement properly. Let's get the API done in this PR and set the default value to 2. In a follow-up PR, we can work on the formula part. > Make tree aggregation level in linear/logistic regression configurable > -- > > Key: SPARK-17090 > URL: https://issues.apache.org/jira/browse/SPARK-17090 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Seth Hendrickson >Priority: Minor > > Linear/logistic regression use treeAggregate with default aggregation depth > for collecting coefficient gradient updates to the driver. For high > dimensional problems, this can cause OOM errors on the driver. We should make > it configurable, perhaps via an expert param, so that users can avoid this > problem if their data has many features. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
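For context on the parameter being discussed, a toy sketch of the underlying RDD knob; this is the plain {{treeAggregate}} API, not the expert param this JIRA adds to linear/logistic regression.
{code}
// Toy gradient-like aggregation: sum fixed-size arrays across many partitions.
val grads = sc.parallelize(Seq.fill(1000)(Array.fill(4)(1.0)), numSlices = 200)

val summed = grads.treeAggregate(Array.fill(4)(0.0))(
  seqOp = (acc, g) => { for (i <- acc.indices) acc(i) += g(i); acc },
  combOp = (a, b) => { for (i <- a.indices) a(i) += b(i); a },
  depth = 3 // default is 2; a deeper tree means the driver merges fewer partials at once
)
println(summed.mkString(","))  // 1000.0,1000.0,1000.0,1000.0
{code}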
[jira] [Commented] (SPARK-15694) Implement ScriptTransformation in sql/core
[ https://issues.apache.org/jira/browse/SPARK-15694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15426858#comment-15426858 ] Tejas Patil commented on SPARK-15694: - PR for part #1 : https://github.com/apache/spark/pull/14702 > Implement ScriptTransformation in sql/core > -- > > Key: SPARK-15694 > URL: https://issues.apache.org/jira/browse/SPARK-15694 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin > > ScriptTransformation currently relies on Hive internals. It'd be great if we > can implement a native ScriptTransformation in sql/core module to remove the > extra Hive dependency here. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17130) SparseVectors.apply and SparseVectors.toArray have different returns when creating with a illegal indices
[ https://issues.apache.org/jira/browse/SPARK-17130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15426821#comment-15426821 ] Jon Zhong commented on SPARK-17130: --- Thanks for posting the code. The problem is solved clearly. > SparseVectors.apply and SparseVectors.toArray have different returns when > creating with a illegal indices > - > > Key: SPARK-17130 > URL: https://issues.apache.org/jira/browse/SPARK-17130 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 1.6.2, 2.0.0 > Environment: spark 1.6.1 + scala >Reporter: Jon Zhong >Priority: Minor > > One of my colleagues ran into a bug of SparseVectors. He called the > Vectors.sparse(size: Int, indices: Array[Int], values: Array[Double]) without > noticing that the indices are assumed to be ordered. > The vector he created has all value of 0.0 (without any warning), if we try > to get value via apply method. However, SparseVector.toArray will generates a > array using a method that is order insensitive. Hence, you will get a 0.0 > when you call apply method, while you can get correct result using toArray or > toDense method. The result of SparseVector.toArray is actually misleading. > It could be safer if there is a validation of indices in the constructor or > at least make the returns of apply method and toArray method the same. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15694) Implement ScriptTransformation in sql/core
[ https://issues.apache.org/jira/browse/SPARK-15694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15694: Assignee: (was: Apache Spark) > Implement ScriptTransformation in sql/core > -- > > Key: SPARK-15694 > URL: https://issues.apache.org/jira/browse/SPARK-15694 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin > > ScriptTransformation currently relies on Hive internals. It'd be great if we > can implement a native ScriptTransformation in sql/core module to remove the > extra Hive dependency here. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15694) Implement ScriptTransformation in sql/core
[ https://issues.apache.org/jira/browse/SPARK-15694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15426815#comment-15426815 ] Apache Spark commented on SPARK-15694: -- User 'tejasapatil' has created a pull request for this issue: https://github.com/apache/spark/pull/14702 > Implement ScriptTransformation in sql/core > -- > > Key: SPARK-15694 > URL: https://issues.apache.org/jira/browse/SPARK-15694 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin > > ScriptTransformation currently relies on Hive internals. It'd be great if we > can implement a native ScriptTransformation in sql/core module to remove the > extra Hive dependency here. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15694) Implement ScriptTransformation in sql/core
[ https://issues.apache.org/jira/browse/SPARK-15694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15694: Assignee: Apache Spark > Implement ScriptTransformation in sql/core > -- > > Key: SPARK-15694 > URL: https://issues.apache.org/jira/browse/SPARK-15694 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Apache Spark > > ScriptTransformation currently relies on Hive internals. It'd be great if we > can implement a native ScriptTransformation in sql/core module to remove the > extra Hive dependency here. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16581) Making JVM backend calling functions public
[ https://issues.apache.org/jira/browse/SPARK-16581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15426807#comment-15426807 ] Shivaram Venkataraman commented on SPARK-16581: --- I am not sure the issues are very related though 1. The JVM->R access methods are mostly to call into any Java method (like say in SystemML). I think we have reasonable clarity on what to make public here which is callJMethod and callJStatic. There is also some discussion on supporting custom GC using cleanup.jobj in the SPARK-16611 2. The RDD / RBackend are not directly related to this I think. The RDD ones are about our UDFs not having some features right now and we can continue discussing that in SPARK-16611 or other JIRAs ? > Making JVM backend calling functions public > --- > > Key: SPARK-16581 > URL: https://issues.apache.org/jira/browse/SPARK-16581 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Shivaram Venkataraman > > As described in the design doc in SPARK-15799, to help packages that need to > call into the JVM, it will be good to expose some of the R -> JVM functions > we have. > As a part of this we could also rename, reformat the functions to make them > more user friendly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-16581) Making JVM backend calling functions public
[ https://issues.apache.org/jira/browse/SPARK-16581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman updated SPARK-16581: -- Comment: was deleted (was: I am not sure the issues are very related though 1. The JVM->R access methods are mostly to call into any Java method (like say in SystemML). I think we have reasonable clarity on what to make public here which is callJMethod and callJStatic. There is also some discussion on supporting custom GC using cleanup.jobj in the SPARK-16611 2. The RDD / RBackend are not directly related to this I think. The RDD ones are about our UDFs not having some features right now and we can continue discussing that in SPARK-16611 or other JIRAs ?) > Making JVM backend calling functions public > --- > > Key: SPARK-16581 > URL: https://issues.apache.org/jira/browse/SPARK-16581 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Shivaram Venkataraman > > As described in the design doc in SPARK-15799, to help packages that need to > call into the JVM, it will be good to expose some of the R -> JVM functions > we have. > As a part of this we could also rename, reformat the functions to make them > more user friendly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16581) Making JVM backend calling functions public
[ https://issues.apache.org/jira/browse/SPARK-16581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15426806#comment-15426806 ] Shivaram Venkataraman commented on SPARK-16581: --- I am not sure the issues are very related though 1. The JVM->R access methods are mostly to call into any Java method (like say in SystemML). I think we have reasonable clarity on what to make public here which is callJMethod and callJStatic. There is also some discussion on supporting custom GC using cleanup.jobj in the SPARK-16611 2. The RDD / RBackend are not directly related to this I think. The RDD ones are about our UDFs not having some features right now and we can continue discussing that in SPARK-16611 or other JIRAs ? > Making JVM backend calling functions public > --- > > Key: SPARK-16581 > URL: https://issues.apache.org/jira/browse/SPARK-16581 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Shivaram Venkataraman > > As described in the design doc in SPARK-15799, to help packages that need to > call into the JVM, it will be good to expose some of the R -> JVM functions > we have. > As a part of this we could also rename, reformat the functions to make them > more user friendly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17131) Code generation fails when running SQL expressions against a wide dataset (thousands of columns)
[ https://issues.apache.org/jira/browse/SPARK-17131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15426801#comment-15426801 ] Iaroslav Zeigerman commented on SPARK-17131: Having a different exception when trying to apply mean function to all columns: {code} val allCols = df.columns.map(c => mean(c)) val newDf = df.select(allCols: _*) newDf.show() {code} {noformat} java.io.EOFException at java.io.DataInputStream.readFully(DataInputStream.java:197) at java.io.DataInputStream.readFully(DataInputStream.java:169) at org.codehaus.janino.util.ClassFile.loadAttribute(ClassFile.java:1383) at org.codehaus.janino.util.ClassFile.loadAttributes(ClassFile.java:555) at org.codehaus.janino.util.ClassFile.loadFields(ClassFile.java:518) at org.codehaus.janino.util.ClassFile.(ClassFile.java:185) at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$recordCompilationStats$1.apply(CodeGenerator.scala:914) at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$recordCompilationStats$1.apply(CodeGenerator.scala:912) at scala.collection.Iterator$class.foreach(Iterator.scala:742) at scala.collection.AbstractIterator.foreach(Iterator.scala:1194) at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) at scala.collection.AbstractIterable.foreach(Iterable.scala:54) at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.recordCompilationStats(CodeGenerator.scala:912) at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:884) at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:941) at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:938) at org.spark_project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599) ... {noformat} > Code generation fails when running SQL expressions against a wide dataset > (thousands of columns) > > > Key: SPARK-17131 > URL: https://issues.apache.org/jira/browse/SPARK-17131 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Iaroslav Zeigerman > > When reading the CSV file that contains 1776 columns Spark and Janino fail to > generate the code with message: > {noformat} > Constant pool has grown past JVM limit of 0x > {noformat} > When running a common select with all columns it's fine: > {code} > val allCols = df.columns.map(c => col(c).as(c + "_alias")) > val newDf = df.select(allCols: _*) > newDf.show() > {code} > But when I invoke the describe method: > {code} > newDf.describe(allCols: _*) > {code} > it fails with the following stack trace: > {noformat} > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:889) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:941) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:938) > at > org.spark_project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599) > at > org.spark_project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379) > ... 
30 more > Caused by: org.codehaus.janino.JaninoRuntimeException: Constant pool has > grown past JVM limit of 0x > at > org.codehaus.janino.util.ClassFile.addToConstantPool(ClassFile.java:402) > at > org.codehaus.janino.util.ClassFile.addConstantIntegerInfo(ClassFile.java:300) > at > org.codehaus.janino.UnitCompiler.addConstantIntegerInfo(UnitCompiler.java:10307) > at org.codehaus.janino.UnitCompiler.pushConstant(UnitCompiler.java:8868) > at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4346) > at org.codehaus.janino.UnitCompiler.access$7100(UnitCompiler.java:185) > at > org.codehaus.janino.UnitCompiler$10.visitIntegerLiteral(UnitCompiler.java:3265) > at org.codehaus.janino.Java$IntegerLiteral.accept(Java.java:4321) > at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3290) > at org.codehaus.janino.UnitCompiler.fakeCompile(UnitCompiler.java:2605) > at > org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4362) > at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:3975) > at org.codehaus.janino.UnitCompiler.access$6900(UnitCompiler.ja
[jira] [Commented] (SPARK-6832) Handle partial reads in SparkR JVM to worker communication
[ https://issues.apache.org/jira/browse/SPARK-6832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15426798#comment-15426798 ] Shivaram Venkataraman commented on SPARK-6832: -- I think we can add a new method `readBinFully` and then replace calls to `readBin` with that method. Regarding simulating this -- I think you could try to manually send a signal (using something like kill -s SIGCHLD) to an R process while it is reading a large amount of data using readBin. > Handle partial reads in SparkR JVM to worker communication > -- > > Key: SPARK-6832 > URL: https://issues.apache.org/jira/browse/SPARK-6832 > Project: Spark > Issue Type: Improvement > Components: SparkR >Reporter: Shivaram Venkataraman >Priority: Minor > > After we move to use socket between R worker and JVM, it's possible that > readBin() in R will return partial results (for example, interrupted by > signal). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-17130) SparseVectors.apply and SparseVectors.toArray have different returns when creating with a illegal indices
[ https://issues.apache.org/jira/browse/SPARK-17130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-17130. --- Resolution: Duplicate Oh yeah but along the way the validation is also all moved into the constructor. That was actually the last comment on the PR -- sorry thought that's what you saw and were even responding to. See https://github.com/apache/spark/pull/14555/files#diff-84f492e3a9c1febe833709960069b1b2R553 I think the issue was that Vectors.sparse does validate but new SparseVector() does not? well, both will be validated now. I'll say this is a duplicate because we should definitely resolve both at once. > SparseVectors.apply and SparseVectors.toArray have different returns when > creating with a illegal indices > - > > Key: SPARK-17130 > URL: https://issues.apache.org/jira/browse/SPARK-17130 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 1.6.2, 2.0.0 > Environment: spark 1.6.1 + scala >Reporter: Jon Zhong >Priority: Minor > > One of my colleagues ran into a bug of SparseVectors. He called the > Vectors.sparse(size: Int, indices: Array[Int], values: Array[Double]) without > noticing that the indices are assumed to be ordered. > The vector he created has all value of 0.0 (without any warning), if we try > to get value via apply method. However, SparseVector.toArray will generates a > array using a method that is order insensitive. Hence, you will get a 0.0 > when you call apply method, while you can get correct result using toArray or > toDense method. The result of SparseVector.toArray is actually misleading. > It could be safer if there is a validation of indices in the constructor or > at least make the returns of apply method and toArray method the same. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17130) SparseVectors.apply and SparseVectors.toArray have different returns when creating with a illegal indices
[ https://issues.apache.org/jira/browse/SPARK-17130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15426780#comment-15426780 ] Jon Zhong commented on SPARK-17130: --- Yep, I wrote a comment there but deleted it since I'm not sure whether they are fixing this problem as well. The problem mentioned in SPARK-16965 is more about negative indices. Are they also concerned about unordered indices? > SparseVectors.apply and SparseVectors.toArray have different returns when > creating with a illegal indices > - > > Key: SPARK-17130 > URL: https://issues.apache.org/jira/browse/SPARK-17130 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 1.6.2, 2.0.0 > Environment: spark 1.6.1 + scala >Reporter: Jon Zhong >Priority: Minor > > One of my colleagues ran into a bug of SparseVectors. He called the > Vectors.sparse(size: Int, indices: Array[Int], values: Array[Double]) without > noticing that the indices are assumed to be ordered. > The vector he created has all value of 0.0 (without any warning), if we try > to get value via apply method. However, SparseVector.toArray will generates a > array using a method that is order insensitive. Hence, you will get a 0.0 > when you call apply method, while you can get correct result using toArray or > toDense method. The result of SparseVector.toArray is actually misleading. > It could be safer if there is a validation of indices in the constructor or > at least make the returns of apply method and toArray method the same. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
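To make the unordered-indices case concrete, a small sketch of the discrepancy as described in the report; the behavior annotations reflect the reported 1.6/2.0 behavior before the constructor validation from the PR linked above lands, and the values are made up.
{code}
import org.apache.spark.mllib.linalg.Vectors

// Indices passed out of order; per the report, no warning is raised.
val v = Vectors.sparse(3, Array(2, 0), Array(5.0, 7.0))

// apply() binary-searches the indices array, so it relies on them being sorted:
println(v(0))                    // reported: 0.0, even though index 0 was set to 7.0

// toArray walks the (index, value) pairs directly and is order-insensitive:
println(v.toArray.mkString(",")) // 7.0,0.0,5.0
{code}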
[jira] [Created] (SPARK-17131) Code generation fails when running SQL expressions against a wide dataset (thousands of columns)
Iaroslav Zeigerman created SPARK-17131: -- Summary: Code generation fails when running SQL expressions against a wide dataset (thousands of columns) Key: SPARK-17131 URL: https://issues.apache.org/jira/browse/SPARK-17131 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Iaroslav Zeigerman When reading the CSV file that contains 1776 columns Spark and Janino fail to generate the code with message: {noformat} Constant pool has grown past JVM limit of 0x {noformat} When running a common select with all columns it's fine: {code} val allCols = df.columns.map(c => col(c).as(c + "_alias")) val newDf = df.select(allCols: _*) newDf.show() {code} But when I invoke the describe method: {code} newDf.describe(allCols: _*) {code} it fails with the following stack trace: {noformat} at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:889) at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:941) at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:938) at org.spark_project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599) at org.spark_project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379) ... 30 more Caused by: org.codehaus.janino.JaninoRuntimeException: Constant pool has grown past JVM limit of 0x at org.codehaus.janino.util.ClassFile.addToConstantPool(ClassFile.java:402) at org.codehaus.janino.util.ClassFile.addConstantIntegerInfo(ClassFile.java:300) at org.codehaus.janino.UnitCompiler.addConstantIntegerInfo(UnitCompiler.java:10307) at org.codehaus.janino.UnitCompiler.pushConstant(UnitCompiler.java:8868) at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4346) at org.codehaus.janino.UnitCompiler.access$7100(UnitCompiler.java:185) at org.codehaus.janino.UnitCompiler$10.visitIntegerLiteral(UnitCompiler.java:3265) at org.codehaus.janino.Java$IntegerLiteral.accept(Java.java:4321) at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3290) at org.codehaus.janino.UnitCompiler.fakeCompile(UnitCompiler.java:2605) at org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4362) at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:3975) at org.codehaus.janino.UnitCompiler.access$6900(UnitCompiler.java:185) at org.codehaus.janino.UnitCompiler$10.visitMethodInvocation(UnitCompiler.java:3263) at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:3974) at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3290) at org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4368) at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2662) at org.codehaus.janino.UnitCompiler.access$4400(UnitCompiler.java:185) at org.codehaus.janino.UnitCompiler$7.visitMethodInvocation(UnitCompiler.java:2627) at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:3974) at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2654) at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:1643) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
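The original 1776-column CSV is not attached, but the shape of the failure can be approximated with a synthetic wide DataFrame; this is only a sketch under that assumption, not the reporter's data.
{code}
import org.apache.spark.sql.functions.lit

// Build a DataFrame with 1776 columns; the values are irrelevant, only the
// column count matters for the size of the generated code.
val wide = spark.range(10).select((0 until 1776).map(i => lit(i).as(s"col_$i")): _*)

// A plain select/show works, as in the report:
wide.show(1)

// describe() builds one large aggregation over every column, which is where the
// generated class reportedly exceeds the constant pool limit:
wide.describe(wide.columns: _*).show()
{code}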
[jira] [Commented] (SPARK-17090) Make tree aggregation level in linear/logistic regression configurable
[ https://issues.apache.org/jira/browse/SPARK-17090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15426562#comment-15426562 ] Seth Hendrickson commented on SPARK-17090: -- I'm not working on it. Please feel free to take it! > Make tree aggregation level in linear/logistic regression configurable > -- > > Key: SPARK-17090 > URL: https://issues.apache.org/jira/browse/SPARK-17090 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Seth Hendrickson >Priority: Minor > > Linear/logistic regression use treeAggregate with default aggregation depth > for collecting coefficient gradient updates to the driver. For high > dimensional problems, this can cause OOM errors on the driver. We should make > it configurable, perhaps via an expert param, so that users can avoid this > problem if their data has many features. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17086) QuantileDiscretizer throws InvalidArgumentException (parameter splits given invalid value) on valid data
[ https://issues.apache.org/jira/browse/SPARK-17086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15426549#comment-15426549 ] Barry Becker commented on SPARK-17086: -- I think I agree with the discussion. Here is a summary of the conclusions, just to check my understanding:
- It's fine for approxQuantile to return duplicate splits. It should always return the requested number of quantiles, corresponding to the length of the probabilities array passed to it.
- QuantileDiscretizer, on the other hand, may return fewer than the number of buckets requested. It should not give an error when the number of buckets requested is greater than the number of distinct values. If the call to approxQuantile returns duplicate splits, just discard the duplicates when passing the splits to the Bucketizer. This saves you from having to compute the unique values first in order to check whether that number is less than the requested number of bins.
I think it's fine that QuantileDiscretizer works this way. You want it to be robust and not give errors for edge cases like this. The objective is to return buckets that are as close to equal weight as possible with simple split values. If the data was \[1,1,1,1,1,1,1,1,4,5,10\] and I asked for 10 bins, then I would expect the splits to be \[-Inf, 1, 4, 5, 10, Inf\], even though the median is 1 and approxQuantile returned 1 repeated several times. If I asked for 2 bins, then I think the splits might be \[-Inf, 1, 4, Inf\]. If three bins are requested, would you get \[-Inf, 1, 4, 5, Inf\] or \[-Inf, 1, 4, 10, Inf\]? Maybe in cases like this you should get \[-Inf, 1, 4, 5, 10, Inf\] even though only 3 bins were requested. In other words, if there are only a small number of unique integer values in the data, and the number of bins is slightly less than that number, maybe it should be increased to match it, since that is likely to be more meaningful. For now, just removing duplicates is probably enough.
> QuantileDiscretizer throws InvalidArgumentException (parameter splits given > invalid value) on valid data > > > Key: SPARK-17086 > URL: https://issues.apache.org/jira/browse/SPARK-17086 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.1.0 >Reporter: Barry Becker > > I discovered this bug when working with a build from the master branch (which > I believe is 2.1.0). This used to work fine when running Spark 1.6.2. > I have a dataframe with an "intData" column that has values like
> {code}
> 1 3 2 1 1 2 3 2 2 2 1 3
> {code}
> I have a stage in my pipeline that uses the QuantileDiscretizer to produce equal-weight splits like this:
> {code}
> new QuantileDiscretizer()
>   .setInputCol("intData")
>   .setOutputCol("intData_bin")
>   .setNumBuckets(10)
>   .fit(df)
> {code}
> But when that gets run it (incorrectly) throws this error:
> {code}
> parameter splits given invalid value [-Infinity, 1.0, 1.0, 2.0, 2.0, 3.0, 3.0, Infinity]
> {code}
> I don't think that there should be duplicate splits generated, should there? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
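The deduplication idea from the comment above could look roughly like the sketch below. This is not the actual patch: the helper name and the way QuantileDiscretizer wires its quantile boundaries into the Bucketizer splits are simplified assumptions.
{code}
// Sketch of the proposed behavior: drop duplicate quantile boundaries, then
// build a strictly increasing splits array with -Inf/+Inf sentinels, accepting
// that the result may have fewer buckets than requested.
def toSplits(rawQuantiles: Array[Double]): Array[Double] = {
  val distinctBoundaries = rawQuantiles.distinct.sorted
  Array(Double.NegativeInfinity) ++ distinctBoundaries ++ Array(Double.PositiveInfinity)
}

// e.g. the boundaries (1.0, 1.0, 2.0, 2.0, 3.0, 3.0) from the report collapse to
// splits (-Inf, 1.0, 2.0, 3.0, Inf): fewer buckets than requested, but no error.
{code}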
[jira] [Commented] (SPARK-17130) SparseVectors.apply and SparseVectors.toArray have different returns when created with illegal indices
[ https://issues.apache.org/jira/browse/SPARK-17130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15426490#comment-15426490 ] Sean Owen commented on SPARK-17130: --- Yeah, didn't you just comment on https://github.com/apache/spark/pull/14555? That's already being fixed there. This is a duplicate of SPARK-16965. > SparseVectors.apply and SparseVectors.toArray have different returns when > created with illegal indices > - > > Key: SPARK-17130 > URL: https://issues.apache.org/jira/browse/SPARK-17130 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 1.6.2, 2.0.0 > Environment: Spark 1.6.1 + Scala >Reporter: Jon Zhong >Priority: Minor > > One of my colleagues ran into a bug in SparseVector. He called > Vectors.sparse(size: Int, indices: Array[Int], values: Array[Double]) without > noticing that the indices are assumed to be ordered. > The vector he created returns 0.0 for every element (without any warning) if we > try to get values via the apply method. However, SparseVector.toArray generates an > array using a method that is order-insensitive. Hence, you will get 0.0 > when you call the apply method, while you can get the correct result using the toArray or > toDense methods. The result of SparseVector.toArray is actually misleading. > It would be safer to validate the indices in the constructor, or > at least to make the apply and toArray methods return the same results. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17130) SparseVectors.apply and SparseVectors.toArray have different returns when created with illegal indices
Jon Zhong created SPARK-17130: - Summary: SparseVectors.apply and SparseVectors.toArray have different returns when created with illegal indices Key: SPARK-17130 URL: https://issues.apache.org/jira/browse/SPARK-17130 Project: Spark Issue Type: Bug Components: ML, MLlib Affects Versions: 2.0.0, 1.6.2 Environment: Spark 1.6.1 + Scala Reporter: Jon Zhong Priority: Minor One of my colleagues ran into a bug in SparseVector. He called Vectors.sparse(size: Int, indices: Array[Int], values: Array[Double]) without noticing that the indices are assumed to be ordered. The vector he created returns 0.0 for every element (without any warning) if we try to get values via the apply method. However, SparseVector.toArray generates an array using a method that is order-insensitive. Hence, you will get 0.0 when you call the apply method, while you can get the correct result using the toArray or toDense methods. The result of SparseVector.toArray is actually misleading. It would be safer to validate the indices in the constructor, or at least to make the apply and toArray methods return the same results. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
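To make the mismatch concrete, here is a small sketch. It is illustrative only: the size, indices, and values are assumptions, the exact values returned by apply can vary, and releases that include the SPARK-16965 fix validate the indices up front instead of constructing silently.
{code}
// Illustrative sketch: the indices array is deliberately unsorted, violating
// the documented precondition of Vectors.sparse.
import org.apache.spark.mllib.linalg.Vectors

val v = Vectors.sparse(3, Array(2, 0), Array(5.0, 7.0))

// apply goes through a binary search that assumes sorted indices, so some
// lookups can silently miss and fall back to 0.0 ...
println(v(0))                      // may print 0.0 instead of 7.0
// ... while toArray simply writes each active value at its index, so it
// still reflects what was passed in.
println(v.toArray.mkString(", "))  // 7.0, 0.0, 5.0
{code}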