[jira] [Commented] (SPARK-13331) Spark network encryption optimization
[ https://issues.apache.org/jira/browse/SPARK-13331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15151892#comment-15151892 ]

Dong Chen commented on SPARK-13331:
-----------------------------------

Sorry for the confusion; below is the change this would entail in Spark. Please let me know if anything is still unclear.

SPARK-6229 added SASL encryption to the network library. That encryption supports 3DES, DES, and RC4. This JIRA intends to add AES to the supported algorithms for better performance. The change in Spark would involve:
* Adding code in {{SaslClientBootstrap.doBootstrap()}} and {{SaslRpcHandler.receive()}} to negotiate AES encryption.
* Then updating {{SparkSaslClient}} and {{SparkSaslServer}} to {{wrap}} / {{unwrap}} messages with AES.

SPARK-10771 and this JIRA are similar in that both use JCE or a library to implement AES, but they have different focuses in Spark: SPARK-10771 is about encrypting shuffle data when writing to / reading from disk, while this JIRA is about encrypting data transferred over the wire.

> Spark network encryption optimization
> -------------------------------------
>
> Key: SPARK-13331
> URL: https://issues.apache.org/jira/browse/SPARK-13331
> Project: Spark
> Issue Type: Improvement
> Components: Deploy
> Reporter: Dong Chen
> Priority: Minor
>
> In network/common, SASL with DIGEST-MD5 authentication is used for negotiating a secure communication channel. When the SASL operation mode is "auth-conf", the data transferred on the network is encrypted. The DIGEST-MD5 mechanism supports the following encryption algorithms: 3DES, DES, and RC4. The negotiation procedure will select one of them to encrypt / decrypt the data on the channel.
> However, 3DES and RC4 are relatively slow. We could add code to the negotiation to support AES for better security and performance.
> The proposed solution is: when "auth-conf" is enabled, at the end of the original negotiation, authentication succeeds and a secure channel is built. We could then add one more negotiation step: the client and server negotiate whether they both support AES. If yes, the key and IV used by AES are generated by the server and sent to the client through the already-secure channel. The encryption / decryption handlers are then switched to AES on both the client and server side, and all following data transfer uses AES instead of the originally negotiated algorithm.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
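For reference, a minimal Java/JCE sketch of the kind of {{wrap}} / {{unwrap}} the updated handlers would perform after the key and IV have been exchanged. The class name, the CTR cipher mode, and the 128-bit key size are assumptions for illustration, not Spark's actual implementation:

```java
import java.nio.charset.StandardCharsets;
import java.security.SecureRandom;
import java.util.Arrays;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.IvParameterSpec;

public class AesWrapSketch {
    // Server side: generate the AES key and IV once the SASL "auth-conf"
    // channel is established; (key, iv) would then be sent to the client
    // over that already-secure channel.
    static final SecretKey KEY;
    static final byte[] IV = new byte[16];
    static {
        try {
            KeyGenerator keyGen = KeyGenerator.getInstance("AES");
            keyGen.init(128);
            KEY = keyGen.generateKey();
            new SecureRandom().nextBytes(IV);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    // wrap(): encrypt an outgoing message with AES instead of 3DES/DES/RC4.
    static byte[] wrap(byte[] plaintext) {
        try {
            Cipher c = Cipher.getInstance("AES/CTR/NoPadding");
            c.init(Cipher.ENCRYPT_MODE, KEY, new IvParameterSpec(IV));
            return c.doFinal(plaintext);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    // unwrap(): decrypt on the receiving side with the same key and IV.
    static byte[] unwrap(byte[] ciphertext) {
        try {
            Cipher c = Cipher.getInstance("AES/CTR/NoPadding");
            c.init(Cipher.DECRYPT_MODE, KEY, new IvParameterSpec(IV));
            return c.doFinal(ciphertext);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        byte[] msg = "rpc payload".getBytes(StandardCharsets.UTF_8);
        // Round-trips back to the original bytes.
        System.out.println(Arrays.equals(msg, unwrap(wrap(msg)))); // true
    }
}
```

A real implementation would also need to agree on the cipher mode and handle per-message IVs; this only shows the shape of the handler change.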
[jira] [Commented] (SPARK-12449) Pushing down arbitrary logical plans to data sources
[ https://issues.apache.org/jira/browse/SPARK-12449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15151880#comment-15151880 ]

Max Seiden commented on SPARK-12449:
------------------------------------

Yea, that seems to be the case. There's code in the DataSourceStrategy that specifically resolves aliases, but the filtered-scan case is pretty narrow relative to an expression tree. +1 for a generic way to avoid double execution of operations. On the flip side, a boolean check would drop a neat property of "unhandledFilters", which is that it can accept a subset of what the planner tries to push down.

> Pushing down arbitrary logical plans to data sources
> ----------------------------------------------------
>
> Key: SPARK-12449
> URL: https://issues.apache.org/jira/browse/SPARK-12449
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Reporter: Stephan Kessler
> Attachments: pushingDownLogicalPlans.pdf
>
> With the help of the DataSource API we can pull data from external sources for processing. Implementing interfaces such as {{PrunedFilteredScan}} allows filters and projections to be pushed down, pruning unnecessary fields and rows directly in the data source.
> However, data sources such as SQL engines are capable of doing even more preprocessing, e.g., evaluating aggregates. This is beneficial because it would reduce the amount of data transferred from the source to Spark. The existing interfaces do not allow this kind of processing in the source.
> We propose to add a new interface {{CatalystSource}} that allows deferring the processing of arbitrary logical plans to the data source. We have already presented the details at Spark Summit 2015 Europe [https://spark-summit.org/eu-2015/events/the-pushdown-of-everything/]
> I will add a design document explaining the details.
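The subset property being discussed can be sketched with a toy stand-in (the {{unhandled}} method below is hypothetical, not Spark's actual API): a source reporting its unhandled filters can accept any subset of what is pushed, whereas a boolean capability check is all-or-nothing.

```java
import java.util.List;
import java.util.stream.Collectors;

public class UnhandledFiltersSketch {
    // Hypothetical stand-in for a data source's filter handshake: return the
    // subset of the pushed filters the source can NOT evaluate itself.
    // The planner re-applies exactly these on top of the scan, and only
    // these, so the handled predicate is not executed twice.
    static List<String> unhandled(List<String> pushed) {
        // Toy rule: this source only handles equality predicates on column 'a'.
        return pushed.stream()
                     .filter(f -> !f.startsWith("a ="))
                     .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // "a = 1" is evaluated at the source; a residual Filter keeps "b > 10".
        System.out.println(unhandled(List.of("a = 1", "b > 10"))); // [b > 10]
    }
}
```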
[jira] [Created] (SPARK-13373) Generate code for sort merge join
Davies Liu created SPARK-13373:
-------------------------------

Summary: Generate code for sort merge join
Key: SPARK-13373
URL: https://issues.apache.org/jira/browse/SPARK-13373
Project: Spark
Issue Type: New Feature
Components: SQL
Reporter: Davies Liu
Assignee: Davies Liu
[jira] [Closed] (SPARK-13354) Push filter throughout outer join when the condition can filter out empty row
[ https://issues.apache.org/jira/browse/SPARK-13354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Davies Liu closed SPARK-13354.
------------------------------
Resolution: Duplicate

> Push filter throughout outer join when the condition can filter out empty row
> ------------------------------------------------------------------------------
>
> Key: SPARK-13354
> URL: https://issues.apache.org/jira/browse/SPARK-13354
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Reporter: Davies Liu
> Assignee: Davies Liu
>
> For the query
> {code}
> select * from a left outer join b on a.a = b.a where b.b > 10
> {code}
> the condition `b.b > 10` filters out every row whose b side is empty (null-extended).
> In this case, we should use an inner join instead, and push the filter down into b.
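The equivalence behind this rewrite can be checked on toy data. A hedged sketch (the tables and join simulation below are illustrative, not Catalyst code): a null-rejecting filter on top of a left outer join produces the same rows as an inner join with the filter pushed into the right side.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class OuterJoinFilterSketch {
    // Toy tables for the query in the issue: a(a) and b(a, b).
    static final List<Integer> A = List.of(1, 2, 3);
    static final Map<Integer, Integer> B = Map.of(1, 5, 3, 20); // b.a -> b.b

    // Plan 1: left outer join on a.a = b.a, then Filter(b.b > 10) on top.
    static List<Integer[]> leftJoinThenFilter() {
        List<Integer[]> joined = new ArrayList<>();
        for (int k : A) joined.add(new Integer[] { k, B.get(k) }); // null = no match on the b side
        List<Integer[]> out = new ArrayList<>();
        for (Integer[] row : joined) {
            // b.b > 10 can never hold when b.b is null, so every
            // null-extended row the outer join produced is dropped here.
            if (row[1] != null && row[1] > 10) out.add(row);
        }
        return out;
    }

    // Plan 2: the rewrite, an inner join with the filter pushed into b's scan.
    static List<Integer[]> innerJoinPushedFilter() {
        List<Integer[]> out = new ArrayList<>();
        for (int k : A) {
            Integer bb = B.get(k);
            if (bb == null) continue;                      // inner join: unmatched rows never appear
            if (bb > 10) out.add(new Integer[] { k, bb }); // filter evaluated at the scan
        }
        return out;
    }

    public static void main(String[] args) {
        // Both plans keep only the row (3, 20), so the rewrite is safe
        // and the inner-join plan scans strictly less data.
        System.out.println(leftJoinThenFilter().size());   // 1
        System.out.println(innerJoinPushedFilter().size()); // 1
    }
}
```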
[jira] [Comment Edited] (SPARK-13370) Lexer not handling whitespaces properly
[ https://issues.apache.org/jira/browse/SPARK-13370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15151822#comment-15151822 ]

Herman van Hovell edited comment on SPARK-13370 at 2/18/16 7:35 AM:
--------------------------------------------------------------------

Whitespace is optional. This may sound funny, but the following expression is perfectly legal: {{select 1+1}}. In 99% of cases, input without proper whitespace gets caught by either eager lexer rules or parse rules. -I'll leave it open for discussion, but I think this is a won't fix.-

was (Author: hvanhovell):
Whitespace is optional. This may sound funny, but the following expression is perfectly legal: {{select 1+1}}. In 99% of the time cases without proper whitespace get caught by either eager lexer rules or parse rules. I'll leave it open for discussion, but I think this is a won't fix.

> Lexer not handling whitespaces properly
> ---------------------------------------
>
> Key: SPARK-13370
> URL: https://issues.apache.org/jira/browse/SPARK-13370
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.0
> Reporter: Cheng Lian
>
> I was experimenting with numeric suffixes and came up with the following wrong query string, which should result in a parsing error:
> {code}
> // 1.0L is illegal here
> sqlContext.sql("SELECT 1.0D + 1.0L").show()
> {code}
> However, it gives the following result:
> {noformat}
> +---+
> |  L|
> +---+
> |2.0|
> +---+
> {noformat}
> {{explain}} suggests that the {{L}} is recognized as an alias:
> {noformat}
> sqlContext.sql("SELECT 1.0D + 1.0L").explain(true)
> == Parsed Logical Plan ==
> 'Project [unresolvedalias((1.0 + 1.0) AS L#12, None)]
> +- OneRowRelation$
> {noformat}
> It seems that the lexer recognizes {{1.0}} and {{L}} as two tokens, as if there were a whitespace there.
[jira] [Commented] (SPARK-13370) Lexer not handling whitespaces properly
[ https://issues.apache.org/jira/browse/SPARK-13370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15151870#comment-15151870 ]

Herman van Hovell commented on SPARK-13370:
-------------------------------------------

Thought about this a bit more, and realized we can fix this by explicitly adding a whitespace token to the alias section of the namedExpression rule. I'll submit a PR at the end of the day.

> Lexer not handling whitespaces properly
> ---------------------------------------
>
> Key: SPARK-13370
> URL: https://issues.apache.org/jira/browse/SPARK-13370
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.0
> Reporter: Cheng Lian
>
> I was experimenting with numeric suffixes and came up with the following wrong query string, which should result in a parsing error:
> {code}
> // 1.0L is illegal here
> sqlContext.sql("SELECT 1.0D + 1.0L").show()
> {code}
> However, it gives the following result:
> {noformat}
> +---+
> |  L|
> +---+
> |2.0|
> +---+
> {noformat}
> {{explain}} suggests that the {{L}} is recognized as an alias:
> {noformat}
> sqlContext.sql("SELECT 1.0D + 1.0L").explain(true)
> == Parsed Logical Plan ==
> 'Project [unresolvedalias((1.0 + 1.0) AS L#12, None)]
> +- OneRowRelation$
> {noformat}
> It seems that the lexer recognizes {{1.0}} and {{L}} as two tokens, as if there were a whitespace there.
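The mis-tokenization described above is easy to reproduce with a toy maximal-munch lexer (this is an illustrative sketch, not Spark's actual grammar): double literals may carry a D suffix but not L, so with no whitespace requirement between tokens, "1.0L" silently lexes as a number followed by an identifier instead of raising an error.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SuffixLexerSketch {
    // Toy token set: double literal (optional D suffix), identifier, '+', whitespace.
    static final Pattern TOKEN =
        Pattern.compile("\\d+\\.\\d+D?|[A-Za-z_][A-Za-z0-9_]*|[+]|\\s+");

    static List<String> lex(String input) {
        Matcher m = TOKEN.matcher(input);
        List<String> tokens = new ArrayList<>();
        while (m.find()) {
            String t = m.group();
            if (!t.trim().isEmpty()) tokens.add(t); // drop whitespace tokens
        }
        return tokens;
    }

    public static void main(String[] args) {
        // "1.0L" splits into NUMBER(1.0) then IDENT(L), so the parser later
        // treats L as an alias, which is exactly the reported behavior.
        System.out.println(lex("1.0D + 1.0L")); // [1.0D, +, 1.0, L]
    }
}
```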
[jira] [Updated] (SPARK-13371) TaskSetManager.dequeueSpeculativeTask compares Option[String] and String directly.
[ https://issues.apache.org/jira/browse/SPARK-13371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Guoqiang Li updated SPARK-13371:
--------------------------------
Summary: TaskSetManager.dequeueSpeculativeTask compares Option[String] and String directly. (was: Compare Option[String] and String directly in )

> TaskSetManager.dequeueSpeculativeTask compares Option[String] and String directly.
> ----------------------------------------------------------------------------------
>
> Key: SPARK-13371
> URL: https://issues.apache.org/jira/browse/SPARK-13371
> Project: Spark
> Issue Type: Bug
> Components: Scheduler
> Affects Versions: 1.5.2, 1.6.0
> Reporter: Guoqiang Li
>
> {noformat}
> TaskSetManager.dequeueSpeculativeTask compares Option[String] and String directly.
> {noformat}
> The code:
> https://github.com/apache/spark/blob/87abcf7df921a5937fdb2bae8bfb30bfabc4970a/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L344
> {code}
> if (TaskLocality.isAllowed(locality, TaskLocality.RACK_LOCAL)) {
>   for (rack <- sched.getRackForHost(host)) {
>     for (index <- speculatableTasks if canRunOnHost(index)) {
>       val racks = tasks(index).preferredLocations.map(_.host).map(sched.getRackForHost)
>       // racks: Seq[Option[String]] and rack: String
>       if (racks.contains(rack)) {
>         speculatableTasks -= index
>         return Some((index, TaskLocality.RACK_LOCAL))
>       }
>     }
>   }
> }
> {code}
[jira] [Updated] (SPARK-13371) Compare Option[String] and String directly in
[ https://issues.apache.org/jira/browse/SPARK-13371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Guoqiang Li updated SPARK-13371:
--------------------------------
Summary: Compare Option[String] and String directly in (was: Compare Option[String] and String directly)

> Compare Option[String] and String directly in
> ----------------------------------------------
>
> Key: SPARK-13371
> URL: https://issues.apache.org/jira/browse/SPARK-13371
> Project: Spark
> Issue Type: Bug
> Components: Scheduler
> Affects Versions: 1.5.2, 1.6.0
> Reporter: Guoqiang Li
>
> {noformat}
> TaskSetManager.dequeueSpeculativeTask compares Option[String] and String directly.
> {noformat}
> The code:
> https://github.com/apache/spark/blob/87abcf7df921a5937fdb2bae8bfb30bfabc4970a/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L344
> {code}
> if (TaskLocality.isAllowed(locality, TaskLocality.RACK_LOCAL)) {
>   for (rack <- sched.getRackForHost(host)) {
>     for (index <- speculatableTasks if canRunOnHost(index)) {
>       val racks = tasks(index).preferredLocations.map(_.host).map(sched.getRackForHost)
>       // racks: Seq[Option[String]] and rack: String
>       if (racks.contains(rack)) {
>         speculatableTasks -= index
>         return Some((index, TaskLocality.RACK_LOCAL))
>       }
>     }
>   }
> }
> {code}
[jira] [Updated] (SPARK-13371) Compare Option[String] and String directly
[ https://issues.apache.org/jira/browse/SPARK-13371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Guoqiang Li updated SPARK-13371:
--------------------------------
Description:
{noformat}
TaskSetManager.dequeueSpeculativeTask compares Option[String] and String directly.
{noformat}
The code:
https://github.com/apache/spark/blob/87abcf7df921a5937fdb2bae8bfb30bfabc4970a/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L344
{code}
if (TaskLocality.isAllowed(locality, TaskLocality.RACK_LOCAL)) {
  for (rack <- sched.getRackForHost(host)) {
    for (index <- speculatableTasks if canRunOnHost(index)) {
      val racks = tasks(index).preferredLocations.map(_.host).map(sched.getRackForHost)
      // racks: Seq[Option[String]] and rack: String
      if (racks.contains(rack)) {
        speculatableTasks -= index
        return Some((index, TaskLocality.RACK_LOCAL))
      }
    }
  }
}
{code}

was:
{noformat}
TaskSetManager.dequeueSpeculativeTask compares Option[String] and String directly.
{noformat}
Ths code:
https://github.com/apache/spark/blob/87abcf7df921a5937fdb2bae8bfb30bfabc4970a/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L344
{code}
if (TaskLocality.isAllowed(locality, TaskLocality.RACK_LOCAL)) {
  for (rack <- sched.getRackForHost(host)) {
    for (index <- speculatableTasks if canRunOnHost(index)) {
      val racks = tasks(index).preferredLocations.map(_.host).map(sched.getRackForHost)
      // racks: Seq[Option[String]] and rack: String
      if (racks.contains(rack)) {
        speculatableTasks -= index
        return Some((index, TaskLocality.RACK_LOCAL))
      }
    }
  }
}
{code}

> Compare Option[String] and String directly
> -------------------------------------------
>
> Key: SPARK-13371
> URL: https://issues.apache.org/jira/browse/SPARK-13371
> Project: Spark
> Issue Type: Bug
> Components: Scheduler
> Affects Versions: 1.5.2, 1.6.0
> Reporter: Guoqiang Li
>
> {noformat}
> TaskSetManager.dequeueSpeculativeTask compares Option[String] and String directly.
> {noformat}
> The code:
> https://github.com/apache/spark/blob/87abcf7df921a5937fdb2bae8bfb30bfabc4970a/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L344
> {code}
> if (TaskLocality.isAllowed(locality, TaskLocality.RACK_LOCAL)) {
>   for (rack <- sched.getRackForHost(host)) {
>     for (index <- speculatableTasks if canRunOnHost(index)) {
>       val racks = tasks(index).preferredLocations.map(_.host).map(sched.getRackForHost)
>       // racks: Seq[Option[String]] and rack: String
>       if (racks.contains(rack)) {
>         speculatableTasks -= index
>         return Some((index, TaskLocality.RACK_LOCAL))
>       }
>     }
>   }
> }
> {code}
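The pitfall in the Scala snippet above is that {{Seq.contains}} accepts any type, so comparing a {{Seq[Option[String]]}} against a {{String}} compiles but is always false. A hedged Java analogue, with {{Optional}} standing in for {{Option[String]}} (the rack names are made up):

```java
import java.util.List;
import java.util.Optional;

public class OptionContainsSketch {
    public static void main(String[] args) {
        // Analogue of racks: Seq[Option[String]] built via sched.getRackForHost.
        List<Optional<String>> racks = List.of(Optional.of("rack-1"), Optional.empty());

        // Compiles without any warning, because contains takes Object,
        // but an Optional<String> never equals a plain String: always false,
        // so the RACK_LOCAL branch above can never fire.
        System.out.println(racks.contains("rack-1")); // false

        // The fix is to compare like with like.
        System.out.println(racks.contains(Optional.of("rack-1"))); // true
    }
}
```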
[jira] [Updated] (SPARK-13331) Spark network encryption optimization
[ https://issues.apache.org/jira/browse/SPARK-13331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dong Chen updated SPARK-13331:
------------------------------
Description:
In network/common, SASL with DIGEST-MD5 authentication is used for negotiating a secure communication channel. When the SASL operation mode is "auth-conf", the data transferred on the network is encrypted. The DIGEST-MD5 mechanism supports the following encryption algorithms: 3DES, DES, and RC4. The negotiation procedure will select one of them to encrypt / decrypt the data on the channel.
However, 3DES and RC4 are relatively slow. We could add code to the negotiation to support AES for better security and performance.
The proposed solution is: when "auth-conf" is enabled, at the end of the original negotiation, authentication succeeds and a secure channel is built. We could then add one more negotiation step: the client and server negotiate whether they both support AES. If yes, the key and IV used by AES are generated by the server and sent to the client through the already-secure channel. The encryption / decryption handlers are then switched to AES on both the client and server side, and all following data transfer uses AES instead of the originally negotiated algorithm.

was:
In network/common, SASL with DIGEST-MD5 authentication is used for negotiating a secure communication channel. When SASL operation mode is "auth-conf", the data transferred on the network is encrypted. DIGEST-MD5 mechanism supports following encryption: 3DES, DES, and RC4. The negotiation procedure will select one of them to encrypt / decrypt the data on the channel. However, 3des and rc4 are slow relatively. We could add code in the negotiation to make it support AES for more secure and performance. The proposal is: When "auth-conf" is enabled, at the end of original negotiation, the authentication succeeds and a secure channel is built. We could add one more negotiation step: Client and server negotiate whether they both support AES. If yes, the Key and IV used by AES will be generated by server and sent to client through the already secure channel. Then update the encryption / decryption handler to AES at both client and server side. Following data transfer will use AES instead of original encryption algorithm.

> Spark network encryption optimization
> -------------------------------------
>
> Key: SPARK-13331
> URL: https://issues.apache.org/jira/browse/SPARK-13331
> Project: Spark
> Issue Type: Improvement
> Components: Deploy
> Reporter: Dong Chen
> Priority: Minor
>
> In network/common, SASL with DIGEST-MD5 authentication is used for negotiating a secure communication channel. When the SASL operation mode is "auth-conf", the data transferred on the network is encrypted. The DIGEST-MD5 mechanism supports the following encryption algorithms: 3DES, DES, and RC4. The negotiation procedure will select one of them to encrypt / decrypt the data on the channel.
> However, 3DES and RC4 are relatively slow. We could add code to the negotiation to support AES for better security and performance.
> The proposed solution is: when "auth-conf" is enabled, at the end of the original negotiation, authentication succeeds and a secure channel is built. We could then add one more negotiation step: the client and server negotiate whether they both support AES. If yes, the key and IV used by AES are generated by the server and sent to the client through the already-secure channel. The encryption / decryption handlers are then switched to AES on both the client and server side, and all following data transfer uses AES instead of the originally negotiated algorithm.
[jira] [Commented] (SPARK-12449) Pushing down arbitrary logical plans to data sources
[ https://issues.apache.org/jira/browse/SPARK-12449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15151823#comment-15151823 ]

Evan Chan commented on SPARK-12449:
-----------------------------------

I think in the case of sources.Expressions, by the time they are pushed down all aliases etc. should already have been resolved, so that should not be an issue, right?

Agree that capabilities would be important. If that didn't exist, then the default would be to not compute the expressions and let Spark's default aggregators do it, which means it would be like filtering today, where there is double filtering.

> Pushing down arbitrary logical plans to data sources
> ----------------------------------------------------
>
> Key: SPARK-12449
> URL: https://issues.apache.org/jira/browse/SPARK-12449
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Reporter: Stephan Kessler
> Attachments: pushingDownLogicalPlans.pdf
>
> With the help of the DataSource API we can pull data from external sources for processing. Implementing interfaces such as {{PrunedFilteredScan}} allows filters and projections to be pushed down, pruning unnecessary fields and rows directly in the data source.
> However, data sources such as SQL engines are capable of doing even more preprocessing, e.g., evaluating aggregates. This is beneficial because it would reduce the amount of data transferred from the source to Spark. The existing interfaces do not allow this kind of processing in the source.
> We propose to add a new interface {{CatalystSource}} that allows deferring the processing of arbitrary logical plans to the data source. We have already presented the details at Spark Summit 2015 Europe [https://spark-summit.org/eu-2015/events/the-pushdown-of-everything/]
> I will add a design document explaining the details.
[jira] [Commented] (SPARK-13372) ML LogisticRegression behaves incorrectly when standardization = false && regParam = 0.0
[ https://issues.apache.org/jira/browse/SPARK-13372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15151824#comment-15151824 ]

Apache Spark commented on SPARK-13372:
--------------------------------------

User 'yanboliang' has created a pull request for this issue:
https://github.com/apache/spark/pull/11247

> ML LogisticRegression behaves incorrectly when standardization = false && regParam = 0.0
> -----------------------------------------------------------------------------------------
>
> Key: SPARK-13372
> URL: https://issues.apache.org/jira/browse/SPARK-13372
> Project: Spark
> Issue Type: Bug
> Components: ML
> Reporter: Yanbo Liang
>
> ML LogisticRegression behaves incorrectly when standardization = false && regParam = 0.0
[jira] [Assigned] (SPARK-13372) ML LogisticRegression behaves incorrectly when standardization = false && regParam = 0.0
[ https://issues.apache.org/jira/browse/SPARK-13372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-13372:
------------------------------------
Assignee: (was: Apache Spark)

> ML LogisticRegression behaves incorrectly when standardization = false && regParam = 0.0
> -----------------------------------------------------------------------------------------
>
> Key: SPARK-13372
> URL: https://issues.apache.org/jira/browse/SPARK-13372
> Project: Spark
> Issue Type: Bug
> Components: ML
> Reporter: Yanbo Liang
>
> ML LogisticRegression behaves incorrectly when standardization = false && regParam = 0.0
[jira] [Assigned] (SPARK-13372) ML LogisticRegression behaves incorrectly when standardization = false && regParam = 0.0
[ https://issues.apache.org/jira/browse/SPARK-13372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-13372:
------------------------------------
Assignee: Apache Spark

> ML LogisticRegression behaves incorrectly when standardization = false && regParam = 0.0
> -----------------------------------------------------------------------------------------
>
> Key: SPARK-13372
> URL: https://issues.apache.org/jira/browse/SPARK-13372
> Project: Spark
> Issue Type: Bug
> Components: ML
> Reporter: Yanbo Liang
> Assignee: Apache Spark
>
> ML LogisticRegression behaves incorrectly when standardization = false && regParam = 0.0
[jira] [Comment Edited] (SPARK-13370) Lexer not handling whitespaces properly
[ https://issues.apache.org/jira/browse/SPARK-13370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15151822#comment-15151822 ]

Herman van Hovell edited comment on SPARK-13370 at 2/18/16 6:58 AM:
--------------------------------------------------------------------

Whitespace is optional. This may sound funny, but the following expression is perfectly legal: {{select 1+1}}. In 99% of cases, input without proper whitespace gets caught by either eager lexer rules or parse rules. I'll leave it open for discussion, but I think this is a won't fix.

was (Author: hvanhovell):
Whitespace is optional. This may sound funny, but the following expression is perfectly legal: {{select 1+1}}. In 99% of the time cases without whitespace get caught by either eager lexer rules or parse rules. I'll leave it open for discussion, but I think this is a won't fix.

> Lexer not handling whitespaces properly
> ---------------------------------------
>
> Key: SPARK-13370
> URL: https://issues.apache.org/jira/browse/SPARK-13370
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.0
> Reporter: Cheng Lian
>
> I was experimenting with numeric suffixes and came up with the following wrong query string, which should result in a parsing error:
> {code}
> // 1.0L is illegal here
> sqlContext.sql("SELECT 1.0D + 1.0L").show()
> {code}
> However, it gives the following result:
> {noformat}
> +---+
> |  L|
> +---+
> |2.0|
> +---+
> {noformat}
> {{explain}} suggests that the {{L}} is recognized as an alias:
> {noformat}
> sqlContext.sql("SELECT 1.0D + 1.0L").explain(true)
> == Parsed Logical Plan ==
> 'Project [unresolvedalias((1.0 + 1.0) AS L#12, None)]
> +- OneRowRelation$
> {noformat}
> It seems that the lexer recognizes {{1.0}} and {{L}} as two tokens, as if there were a whitespace there.
[jira] [Commented] (SPARK-13370) Lexer not handling whitespaces properly
[ https://issues.apache.org/jira/browse/SPARK-13370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15151822#comment-15151822 ]

Herman van Hovell commented on SPARK-13370:
-------------------------------------------

Whitespace is optional. This may sound funny, but the following expression is perfectly legal: {{select 1+1}}. In 99% of cases, input without whitespace gets caught by either eager lexer rules or parse rules. I'll leave it open for discussion, but I think this is a won't fix.

> Lexer not handling whitespaces properly
> ---------------------------------------
>
> Key: SPARK-13370
> URL: https://issues.apache.org/jira/browse/SPARK-13370
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.0
> Reporter: Cheng Lian
>
> I was experimenting with numeric suffixes and came up with the following wrong query string, which should result in a parsing error:
> {code}
> // 1.0L is illegal here
> sqlContext.sql("SELECT 1.0D + 1.0L").show()
> {code}
> However, it gives the following result:
> {noformat}
> +---+
> |  L|
> +---+
> |2.0|
> +---+
> {noformat}
> {{explain}} suggests that the {{L}} is recognized as an alias:
> {noformat}
> sqlContext.sql("SELECT 1.0D + 1.0L").explain(true)
> == Parsed Logical Plan ==
> 'Project [unresolvedalias((1.0 + 1.0) AS L#12, None)]
> +- OneRowRelation$
> {noformat}
> It seems that the lexer recognizes {{1.0}} and {{L}} as two tokens, as if there were a whitespace there.
[jira] [Comment Edited] (SPARK-2090) spark-shell input text entry not showing on REPL
[ https://issues.apache.org/jira/browse/SPARK-2090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15151802#comment-15151802 ]

Lantao Jin edited comment on SPARK-2090 at 2/18/16 6:36 AM:
------------------------------------------------------------

Richard is right; this is a permissions problem on the home directory. In many cases the user does not have full permissions on the home directory. For example, we use an LDAP system to log in, but creating any per-user directory is forbidden. This causes the Spark REPL to show none of the input text.

The root cause in the source code is below: Scala loads "user.home" from the system properties to create a default history file:

{code:title=org.apache.spark.repl.SparkJLineReader.scala|borderStyle=solid}
/** Changes the default history file to not collide with the scala repl's. */
private[repl] class SparkJLineHistory extends JLineFileHistory {
  import Properties.userHome

  def defaultFileName = ".spark_history"
  override protected lazy val historyFile = File(Path(userHome) / defaultFileName)
}
{code}

And "userHome" is defined in scala.util.PropertiesTrait:

{code:title=scala.util.PropertiesTrait.scala|borderStyle=solid}
def userHome = propOrEmpty("user.home")
{code}

It is loaded from the library.properties built into scala-library.jar, so in most cases the system default value is used. If the login user has no write permission on that directory, the ".spark_history" file cannot be created and spark-shell cannot show any input text on the REPL.

So I took two steps to resolve this problem (our company uses LDAP accounts to log in to machines and forbids creating LDAP user directories):
{panel}
1. Add export SPARK_HISTORY_OPTS="-Dspark.history.fs.logDirectory=/tmp/$USER" to spark-env.sh
2. Add the VM argument -Duser.home=/tmp/$USER to execution scripts like spark-submit or spark-shell
{panel}
Then it works. Next I will review the Spark trunk code to find out how to resolve this gracefully.

was (Author: cltlfcjin):
Richard is right, this is the permissions problem on home directory. In many cases, the user don't have full permissions with user home. For example, we use LDAP system to login, but it is forbidden to create any user directory. It will lead the Spark REPL show nothing about input text. The source code root cause is below: Scala load "use.home" from system properties to create a default history file, code is:
{code:title=org.apache.spark.repl.SparkJLineReader.scala|borderStyle=solid}
/** Changes the default history file to not collide with the scala repl's. */
private[repl] class SparkJLineHistory extends JLineFileHistory {
  import Properties.userHome

  def defaultFileName = ".spark_history"
  override protected lazy val historyFile = File(Path(userHome) / defaultFileName)
}
{code}
And "userHome" is defined in scala.util.PropertiesTrait
{code:title=scala.util.PropertiesTrait.scala|borderStyle=solid}
def userHome = propOrEmpty("user.home")
{code}
It will load from library.properties build-in scala-library.jar. In most cases, it will use the system default value. If the default value directory has no write permission for the login user. The file ".spark_history" can not be create and spark-shell cat not show any input text on REPL. So, I do two step to resolve this problem(Our company use LDAP account to login a machine and forbid to create any LDAP user directory)
1. add export SPARK_HISTORY_OPTS="-Dspark.history.fs.logDirectory=/tmp/$USER" to spark-env.sh
2. add VM argument -Duser.home=/tmp/$USER to execution script like spark-submit or spark-shell
then it will works. Next I will review the spark trunk code to find how to resolve it gracefully.

> spark-shell input text entry not showing on REPL
> -------------------------------------------------
>
> Key: SPARK-2090
> URL: https://issues.apache.org/jira/browse/SPARK-2090
> Project: Spark
> Issue Type: Bug
> Components: Input/Output, Spark Core
> Affects Versions: 1.0.0
> Environment: Ubuntu 14.04; Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_60)
> Reporter: Richard Conway
> Priority: Critical
> Labels: easyfix, patch
> Fix For: 1.0.0
>
> Original Estimate: 4h
> Remaining Estimate: 4h
>
> spark-shell doesn't allow text to be displayed on input.
> Failed to created SparkJLineReader: java.io.IOException: Permission denied
> Falling back to SimpleReader.
> The driver has 2 workers on 2 virtual machines and is error-free apart from the above line, so I think it may have something to do with the introduction of the new SecurityManager.
> The upshot is that when you type, nothing is displayed on the screen. For example, type "test" at the scala prompt and you won't see the input but the output will show.
> scala> :11: error: package test is not a value
> test
>
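The failure mode in the comment above can be sketched in a few lines of Java. The fallback to java.io.tmpdir below is illustrative only (it mirrors what the -Duser.home=/tmp/$USER workaround achieves by hand); the actual reader just fails with "Permission denied" and Spark drops to SimpleReader:

```java
import java.io.File;

public class HistoryFileSketch {
    // Mirrors the Scala snippet: resolve ".spark_history" under user.home,
    // but fall back to a writable temp dir when home is not writable.
    static File resolveHistoryFile() {
        File home = new File(System.getProperty("user.home", ""));
        File base = home.canWrite() ? home : new File(System.getProperty("java.io.tmpdir"));
        return new File(base, ".spark_history");
    }

    public static void main(String[] args) {
        // Launching with -Duser.home=/tmp/$USER makes user.home itself point
        // at a writable location, so the JLine history file can be created.
        System.out.println(resolveHistoryFile());
    }
}
```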
[jira] [Comment Edited] (SPARK-2090) spark-shell input text entry not showing on REPL
[ https://issues.apache.org/jira/browse/SPARK-2090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15151802#comment-15151802 ] Lantao Jin edited comment on SPARK-2090 at 2/18/16 6:35 AM:

Richard is right, this is a permissions problem on the home directory. In many cases the user doesn't have full permissions on the home directory. For example, we use an LDAP system to log in, but creating any user directory is forbidden. This makes the Spark REPL show none of the input text. The root cause in the source code is as follows: Scala loads "user.home" from the system properties to create a default history file:

{code:title=org.apache.spark.repl.SparkJLineReader.scala|borderStyle=solid}
/** Changes the default history file to not collide with the scala repl's. */
private[repl] class SparkJLineHistory extends JLineFileHistory {
  import Properties.userHome

  def defaultFileName = ".spark_history"
  override protected lazy val historyFile = File(Path(userHome) / defaultFileName)
}
{code}

And "userHome" is defined in scala.util.PropertiesTrait:

{code:title=scala.util.PropertiesTrait.scala|borderStyle=solid}
def userHome = propOrEmpty("user.home")
{code}

It is loaded from the library.properties built into scala-library.jar, so in most cases it takes the system default value. If that default directory is not writable by the login user, the ".spark_history" file cannot be created and spark-shell cannot show any input text on the REPL. So I took two steps to resolve this problem (our company uses LDAP accounts to log in to machines and forbids creating any LDAP user directory):
1. add export SPARK_HISTORY_OPTS="-Dspark.history.fs.logDirectory=/tmp/$USER" to spark-env.sh
2. add the VM argument -Duser.home=/tmp/$USER to an execution script such as spark-submit or spark-shell
Then it works. Next I will review the Spark trunk code to find out how to resolve this gracefully.

was (Author: cltlfcjin):
Richard is right, this is a permissions problem on the home directory. 
In many cases the user doesn't have full permissions on the home directory. For example, we use an LDAP system to log in, but creating any user directory is forbidden. This makes the Spark REPL show none of the input text. The root cause in the source code is as follows: Scala loads "user.home" from the system properties to create a default history file:

{code:title=org.apache.spark.repl.SparkJLineReader.scala|borderStyle=solid}
/** Changes the default history file to not collide with the scala repl's. */
private[repl] class SparkJLineHistory extends JLineFileHistory {
  import Properties.userHome

  def defaultFileName = ".spark_history"
  override protected lazy val historyFile = File(Path(userHome) / defaultFileName)
}
{code}

And "userHome" is defined in scala.util.PropertiesTrait:

{code:title=scala.util.PropertiesTrait.scala|borderStyle=solid}
def userHome = propOrEmpty("user.home")
{code}

It is loaded from the library.properties built into scala-library.jar, so in most cases it takes the system default value. If that default directory is not writable by the login user, the ".spark_history" file cannot be created and spark-shell cannot show any input text on the REPL. So I took two steps to resolve this problem (our company uses LDAP accounts to log in to machines and forbids creating any LDAP user directory):
1. add export SPARK_HISTORY_OPTS="-Dspark.history.fs.logDirectory=/tmp/$USER" to spark-env.sh
2. add the VM argument -Duser.home=/tmp/$USER to an execution script such as spark-submit or spark-shell
Then it works. Next I will review the Spark trunk code to find out how to resolve this gracefully. 
> spark-shell input text entry not showing on REPL > > > Key: SPARK-2090 > URL: https://issues.apache.org/jira/browse/SPARK-2090 > Project: Spark > Issue Type: Bug > Components: Input/Output, Spark Core >Affects Versions: 1.0.0 > Environment: Ubuntu 14.04; Using Scala version 2.10.4 (Java > HotSpot(TM) 64-Bit Server VM, Java 1.7.0_60) >Reporter: Richard Conway >Priority: Critical > Labels: easyfix, patch > Fix For: 1.0.0 > > Original Estimate: 4h > Remaining Estimate: 4h > > spark-shell doesn't allow text to be displayed on input > Failed to created SparkJLineReader: java.io.IOException: Permission denied > Falling back to SimpleReader. > The driver has 2 workers on 2 virtual machines and error free apart from the > above line so I think it may have something to do with the introduction of > the new SecurityManager. > The upshot is that when you type nothing is displayed on the screen. For > example, type "test" at the scala prompt and you won't see the input but the > output will show. >
[jira] [Created] (SPARK-13372) ML LogisticRegression behaves incorrectly when standardization = false && regParam = 0.0
Yanbo Liang created SPARK-13372: --- Summary: ML LogisticRegression behaves incorrectly when standardization = false && regParam = 0.0 Key: SPARK-13372 URL: https://issues.apache.org/jira/browse/SPARK-13372 Project: Spark Issue Type: Bug Components: ML Reporter: Yanbo Liang ML LogisticRegression behaves incorrectly when standardization = false && regParam = 0.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2090) spark-shell input text entry not showing on REPL
[ https://issues.apache.org/jira/browse/SPARK-2090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15151802#comment-15151802 ] Lantao Jin commented on SPARK-2090: ---

Richard is right, this is a permissions problem on the home directory. In many cases the user doesn't have full permissions on the home directory. For example, we use an LDAP system to log in, but creating any user directory is forbidden. This makes the Spark REPL show none of the input text. The root cause in the source code is as follows: Scala loads "user.home" from the system properties to create a default history file:

{code:title=org.apache.spark.repl.SparkJLineReader.scala|borderStyle=solid}
/** Changes the default history file to not collide with the scala repl's. */
private[repl] class SparkJLineHistory extends JLineFileHistory {
  import Properties.userHome

  def defaultFileName = ".spark_history"
  override protected lazy val historyFile = File(Path(userHome) / defaultFileName)
}
{code}

And "userHome" is defined in scala.util.PropertiesTrait:

{code:title=scala.util.PropertiesTrait.scala|borderStyle=solid}
def userHome = propOrEmpty("user.home")
{code}

It is loaded from the library.properties built into scala-library.jar, so in most cases it takes the system default value. If that default directory is not writable by the login user, the ".spark_history" file cannot be created and spark-shell cannot show any input text on the REPL. So I took two steps to resolve this problem (our company uses LDAP accounts to log in to machines and forbids creating any LDAP user directory):
1. add export SPARK_HISTORY_OPTS="-Dspark.history.fs.logDirectory=/tmp/$USER" to spark-env.sh
2. add the VM argument -Duser.home=/tmp/$USER to an execution script such as spark-submit or spark-shell
Then it works. Next I will review the Spark trunk code to find out how to resolve this gracefully. 
> spark-shell input text entry not showing on REPL > > > Key: SPARK-2090 > URL: https://issues.apache.org/jira/browse/SPARK-2090 > Project: Spark > Issue Type: Bug > Components: Input/Output, Spark Core >Affects Versions: 1.0.0 > Environment: Ubuntu 14.04; Using Scala version 2.10.4 (Java > HotSpot(TM) 64-Bit Server VM, Java 1.7.0_60) >Reporter: Richard Conway >Priority: Critical > Labels: easyfix, patch > Fix For: 1.0.0 > > Original Estimate: 4h > Remaining Estimate: 4h > > spark-shell doesn't allow text to be displayed on input > Failed to created SparkJLineReader: java.io.IOException: Permission denied > Falling back to SimpleReader. > The driver has 2 workers on 2 virtual machines and error free apart from the > above line so I think it may have something to do with the introduction of > the new SecurityManager. > The upshot is that when you type nothing is displayed on the screen. For > example, type "test" at the scala prompt and you won't see the input but the > output will show. > scala> :11: error: package test is not a value > test > ^ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
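The two-step workaround above boils down to one decision: use the home directory for the JLine history file only when it is writable, and otherwise fall back to a per-user directory under /tmp. A minimal Python sketch of that fallback check (the function name and fallback path are illustrative, not Spark's actual code):

```python
import os
import getpass

def history_dir(home=None, fallback_base="/tmp"):
    """Return a writable directory for a REPL history file.

    Prefer the user's home directory; fall back to /tmp/<user> when the
    home directory does not exist or is not writable by the login user.
    """
    home = home or os.path.expanduser("~")
    if os.path.isdir(home) and os.access(home, os.W_OK):
        return home
    return os.path.join(fallback_base, getpass.getuser())

# An unwritable (here: nonexistent) home directory triggers the fallback.
fallback = history_dir(home="/nonexistent-ldap-home")
```

A graceful in-Spark fix would apply the same check before creating ".spark_history", instead of requiring users to override user.home by hand.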
[jira] [Created] (SPARK-13371) Compare Option[String] and String directly
Guoqiang Li created SPARK-13371: --- Summary: Compare Option[String] and String directly Key: SPARK-13371 URL: https://issues.apache.org/jira/browse/SPARK-13371 Project: Spark Issue Type: Bug Components: Scheduler Affects Versions: 1.6.0, 1.5.2 Reporter: Guoqiang Li

{noformat}
TaskSetManager.dequeueSpeculativeTask compares Option[String] and String directly.
{noformat}

The code: https://github.com/apache/spark/blob/87abcf7df921a5937fdb2bae8bfb30bfabc4970a/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L344

{code}
if (TaskLocality.isAllowed(locality, TaskLocality.RACK_LOCAL)) {
  for (rack <- sched.getRackForHost(host)) {
    for (index <- speculatableTasks if canRunOnHost(index)) {
      val racks = tasks(index).preferredLocations.map(_.host).map(sched.getRackForHost)
      // racks: Seq[Option[String]] and rack: String
      if (racks.contains(rack)) {
        speculatableTasks -= index
        return Some((index, TaskLocality.RACK_LOCAL))
      }
    }
  }
}
{code}

-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-13333) DataFrame filter + randn + unionAll has bad interaction
[ https://issues.apache.org/jira/browse/SPARK-13333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15151746#comment-15151746 ] Xiao Li edited comment on SPARK-13333 at 2/18/16 5:15 AM: --

Yeah, you are right. This part is an issue. That is why I did not remove the existing rand/randn function with a seed argument. I am unable to find a proper solution for controlling data partitioning and task scheduling when users call rand/randn functions with a specific seed number. At least, we have to let users realize these results are not deterministic. Actually, I think this is also an issue in many of our test cases. Many test cases use seeds under the assumption that rand/randn(seed) can provide a deterministic result.

was (Author: smilegator):
Yeah, you are right. This part is an issue. That is why I did not remove the existing rand/randn function with a seed argument. I am unable to find a solution for controlling data partitioning and task scheduling when users call rand/randn functions with a specific seed number. At least, we have to let users realize these results are not deterministic. Actually, I think this is also an issue in many of our test cases. Many test cases use seeds under the assumption that rand/randn(seed) can provide a deterministic result.

> DataFrame filter + randn + unionAll has bad interaction
> ---
>
> Key: SPARK-13333
> URL: https://issues.apache.org/jira/browse/SPARK-13333
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.4.2, 1.6.1, 2.0.0
> Reporter: Joseph K. Bradley
>
> Buggy workflow
> * Create a DataFrame df0
> * Filter df0
> * Add a randn column
> * Create a copy of the DataFrame
> * unionAll the two DataFrames
> This fails, where randn produces the same results on the original DataFrame
> and the copy before unionAll but fails to do so after unionAll. Removing the
> filter fixes the problem. 
> The bug can be reproduced on master: > {code} > import org.apache.spark.sql.functions.randn > val df0 = sqlContext.createDataFrame(Seq(0, 1).map(Tuple1(_))).toDF("id") > // Removing the following filter() call makes this give the expected result. > val df1 = df0.filter(col("id") === 0).withColumn("b", randn(12345)) > println("DF1") > df1.show() > val df2 = df1.select("id", "b") > println("DF2") > df2.show() // same as df1.show(), as expected > val df3 = df1.unionAll(df2) > println("DF3") > df3.show() // NOT two copies of df1, which is unexpected > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13333) DataFrame filter + randn + unionAll has bad interaction
[ https://issues.apache.org/jira/browse/SPARK-13333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15151746#comment-15151746 ] Xiao Li commented on SPARK-13333: -

Yeah, you are right. This part is an issue. That is why I did not remove the existing rand/randn function with a seed argument. I am unable to find a solution for controlling data partitioning and task scheduling when users call rand/randn functions with a specific seed number. At least, we have to let users realize these results are not deterministic. Actually, I think this is also an issue in many of our test cases. Many test cases use seeds under the assumption that rand/randn(seed) can provide a deterministic result.

> DataFrame filter + randn + unionAll has bad interaction
> ---
>
> Key: SPARK-13333
> URL: https://issues.apache.org/jira/browse/SPARK-13333
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.4.2, 1.6.1, 2.0.0
> Reporter: Joseph K. Bradley
>
> Buggy workflow
> * Create a DataFrame df0
> * Filter df0
> * Add a randn column
> * Create a copy of the DataFrame
> * unionAll the two DataFrames
> This fails, where randn produces the same results on the original DataFrame
> and the copy before unionAll but fails to do so after unionAll. Removing the
> filter fixes the problem.
> The bug can be reproduced on master:
> {code}
> import org.apache.spark.sql.functions.randn
> val df0 = sqlContext.createDataFrame(Seq(0, 1).map(Tuple1(_))).toDF("id")
> // Removing the following filter() call makes this give the expected result. 
> val df1 = df0.filter(col("id") === 0).withColumn("b", randn(12345)) > println("DF1") > df1.show() > val df2 = df1.select("id", "b") > println("DF2") > df2.show() // same as df1.show(), as expected > val df3 = df1.unionAll(df2) > println("DF3") > df3.show() // NOT two copies of df1, which is unexpected > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13370) Lexer not handling whitespaces properly
Cheng Lian created SPARK-13370: -- Summary: Lexer not handling whitespaces properly Key: SPARK-13370 URL: https://issues.apache.org/jira/browse/SPARK-13370 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Cheng Lian

I was experimenting with numeric suffixes and came across the following malformed query string, which should result in a parsing error:

{code}
// 1.0L is illegal here
sqlContext.sql("SELECT 1.0D + 1.0L").show()
{code}

However, it gives the following result:

{noformat}
+---+
|  L|
+---+
|2.0|
+---+
{noformat}

{{explain}} suggests that the {{L}} is recognized as an alias:

{noformat}
sqlContext.sql("SELECT 1.0D + 1.0L").explain(true)

== Parsed Logical Plan ==
'Project [unresolvedalias((1.0 + 1.0) AS L#12, None)]
+- OneRowRelation$
{noformat}

It seems that the lexer recognizes {{1.0}} and {{L}} as two tokens, as if whitespace separated them.

-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
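The behaviour described above is consistent with a maximal-munch lexer whose fractional-literal rule accepts a D suffix but not L: the number token stops at {{1.0}}, the trailing {{L}} matches as an identifier, and the parser then reads it as an alias. A toy Python lexer (illustrative token rules, not Spark's actual grammar) reproduces the effect:

```python
import re

# Toy rules: a fractional literal may take a D suffix, but not L.
TOKEN_RE = re.compile(r"""
    (?P<NUMBER>\d+\.\d+[Dd]?)   # 1.0 or 1.0D; a trailing L is left behind
  | (?P<IDENT>[A-Za-z_]\w*)
  | (?P<OP>[+\-*/])
  | (?P<WS>\s+)
""", re.VERBOSE)

def tokenize(sql):
    """Maximal-munch scan; whitespace tokens are dropped."""
    return [(m.lastgroup, m.group())
            for m in TOKEN_RE.finditer(sql)
            if m.lastgroup != "WS"]

# "1.0L" lexes as NUMBER '1.0' then IDENT 'L' -- as if whitespace
# separated them, so a parser can read L as a column alias.
tokens = tokenize("1.0D + 1.0L")
```

Fixing the JIRA would mean making the lexer reject an identifier glued directly onto a numeric literal rather than silently splitting it.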
[jira] [Created] (SPARK-13369) Number of consecutive fetch failures for a stage before the job is aborted should be configurable
Sital Kedia created SPARK-13369: --- Summary: Number of consecutive fetch failures for a stage before the job is aborted should be configurable Key: SPARK-13369 URL: https://issues.apache.org/jira/browse/SPARK-13369 Project: Spark Issue Type: Improvement Affects Versions: 1.6.0 Reporter: Sital Kedia

Currently it is hardcoded in the code. We need to make it configurable: for long-running jobs, the chance of fetch failures due to machine reboots is high, and we need a configuration parameter to bump up that number.

-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
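The change requested here is the standard hardcoded-constant-to-configuration move. A sketch of the pattern in Python; the config key name and the default value of 4 are illustrative assumptions, not a confirmed Spark setting:

```python
# Illustrative default; in the scenario above this limit is a hardcoded constant.
DEFAULT_MAX_CONSECUTIVE_FETCH_FAILURES = 4

def max_consecutive_fetch_failures(conf):
    """Read the abort threshold from a config mapping, with a default,
    in the spirit of SparkConf.getInt(key, default)."""
    return int(conf.get("spark.stage.maxConsecutiveFetchFailures",
                        DEFAULT_MAX_CONSECUTIVE_FETCH_FAILURES))

def should_abort_stage(consecutive_failures, conf):
    """Abort the job once a stage hits the configured failure count."""
    return consecutive_failures >= max_consecutive_fetch_failures(conf)
```

A long-running job on flaky hardware could then raise the key in its config instead of patching the constant.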
[jira] [Comment Edited] (SPARK-13333) DataFrame filter + randn + unionAll has bad interaction
[ https://issues.apache.org/jira/browse/SPARK-13333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15151720#comment-15151720 ] Liang-Chi Hsieh edited comment on SPARK-13333 at 2/18/16 4:13 AM: --

The problem is that when you have many data partitions, you will have the same random number sequence in each partition. E.g., you have a table with 1000 rows evenly distributed over 5 partitions. If you do something that involves a random number for each row, rows 201 ~ 400 will be processed with the same random numbers as the first 200 rows. Then your computation will not be as random as originally required; each partition will share the same pattern.

was (Author: viirya):
Yes. I agree that when a user provides a specific seed number, the result should be deterministic. The problem is that when you have many data partitions, you will have the same random number sequence in each partition. E.g., you have a table with 1000 rows evenly distributed over 5 partitions. If you do something that involves a random number for each row, rows 201 ~ 400 will be processed with the same random numbers as the first 200 rows. Then your computation will not be as random as originally required; each partition will share the same pattern.

> DataFrame filter + randn + unionAll has bad interaction
> ---
>
> Key: SPARK-13333
> URL: https://issues.apache.org/jira/browse/SPARK-13333
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.4.2, 1.6.1, 2.0.0
> Reporter: Joseph K. Bradley
>
> Buggy workflow
> * Create a DataFrame df0
> * Filter df0
> * Add a randn column
> * Create a copy of the DataFrame
> * unionAll the two DataFrames
> This fails, where randn produces the same results on the original DataFrame
> and the copy before unionAll but fails to do so after unionAll. Removing the
> filter fixes the problem. 
> The bug can be reproduced on master: > {code} > import org.apache.spark.sql.functions.randn > val df0 = sqlContext.createDataFrame(Seq(0, 1).map(Tuple1(_))).toDF("id") > // Removing the following filter() call makes this give the expected result. > val df1 = df0.filter(col("id") === 0).withColumn("b", randn(12345)) > println("DF1") > df1.show() > val df2 = df1.select("id", "b") > println("DF2") > df2.show() // same as df1.show(), as expected > val df3 = df1.unionAll(df2) > println("DF3") > df3.show() // NOT two copies of df1, which is unexpected > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13333) DataFrame filter + randn + unionAll has bad interaction
[ https://issues.apache.org/jira/browse/SPARK-13333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15151720#comment-15151720 ] Liang-Chi Hsieh commented on SPARK-13333: -

Yes. I agree that when a user provides a specific seed number, the result should be deterministic. The problem is that when you have many data partitions, you will have the same random number sequence in each partition. E.g., you have a table with 1000 rows evenly distributed over 5 partitions. If you do something that involves a random number for each row, rows 201 ~ 400 will be processed with the same random numbers as the first 200 rows. Then your computation will not be as random as originally required; each partition will share the same pattern.

> DataFrame filter + randn + unionAll has bad interaction
> ---
>
> Key: SPARK-13333
> URL: https://issues.apache.org/jira/browse/SPARK-13333
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.4.2, 1.6.1, 2.0.0
> Reporter: Joseph K. Bradley
>
> Buggy workflow
> * Create a DataFrame df0
> * Filter df0
> * Add a randn column
> * Create a copy of the DataFrame
> * unionAll the two DataFrames
> This fails, where randn produces the same results on the original DataFrame
> and the copy before unionAll but fails to do so after unionAll. Removing the
> filter fixes the problem.
> The bug can be reproduced on master:
> {code}
> import org.apache.spark.sql.functions.randn
> val df0 = sqlContext.createDataFrame(Seq(0, 1).map(Tuple1(_))).toDF("id")
> // Removing the following filter() call makes this give the expected result. 
> val df1 = df0.filter(col("id") === 0).withColumn("b", randn(12345)) > println("DF1") > df1.show() > val df2 = df1.select("id", "b") > println("DF2") > df2.show() // same as df1.show(), as expected > val df3 = df1.unionAll(df2) > println("DF3") > df3.show() // NOT two copies of df1, which is unexpected > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
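The per-partition repetition described in this thread can be simulated without Spark: seeding every partition's generator with the bare user seed repeats one sequence in each partition, while mixing the partition index into the seed stays reproducible without the repetition. A standalone Python sketch (the seed-mixing scheme here is illustrative, not Spark's implementation):

```python
import random

def partition_values(seed, partition_index, n, mix_partition=False):
    """Generate the n 'random' values one partition would produce."""
    if mix_partition:
        rng = random.Random(seed + partition_index)  # per-partition stream
    else:
        rng = random.Random(seed)  # naive: every partition repeats stream 0
    return [rng.random() for _ in range(n)]

# Naive seeding: partition 1 (rows 201-400) repeats partition 0 (rows 1-200).
repeats = partition_values(12345, 0, 200) == partition_values(12345, 1, 200)

# Mixed seeding: still deterministic run-to-run, but partitions differ.
differs = partition_values(12345, 0, 200, True) != partition_values(12345, 1, 200, True)
```

Note that even the mixed scheme is only deterministic if the same rows land in the same partitions, which is exactly the scheduling caveat raised in the comments above.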
[jira] [Commented] (SPARK-13364) history server application column not sorting properly
[ https://issues.apache.org/jira/browse/SPARK-13364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15151721#comment-15151721 ] Zhuo Liu commented on SPARK-13364: --

It is not sorting numerically, but lexicographically by application id, so application__4000 might come before application__900 in ascending order. We can add a special sorting class for application ids if needed.

> history server application column not sorting properly
> --
>
> Key: SPARK-13364
> URL: https://issues.apache.org/jira/browse/SPARK-13364
> Project: Spark
> Issue Type: Bug
> Components: Web UI
> Affects Versions: 2.0.0
> Reporter: Thomas Graves
>
> The new history server is using datatables, the application column isn't
> sorting them properly. Its not sorting the last _X part right. below is
> an example where the 30174 should be before 30149
> application_1453493359692_30149
> application_1453493359692_30174
> I'm guessing its sorting used the string rather then just the
> application id.
> href="/history/application_1453493359692_30029/1/jobs/">application_1453493359692_30029

-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
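The suggested fix, a dedicated sort key for application ids, can be sketched in Python: split the id on underscores and compare the numeric parts as integers rather than as strings (an illustrative sketch, not the DataTables sorting plugin the history server UI would actually need):

```python
def app_id_sort_key(app_id):
    """Turn 'application_1453493359692_4000' into a tuple whose numeric
    parts compare as integers, so _900 sorts before _4000."""
    return tuple(int(part) if part.isdigit() else part
                 for part in app_id.split("_"))

ids = ["application_1453493359692_4000", "application_1453493359692_900"]

lexicographic = sorted(ids)                 # '4' < '9', so _4000 sorts first (wrong)
numeric = sorted(ids, key=app_id_sort_key)  # 900 < 4000, so _900 sorts first
```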
[jira] [Commented] (SPARK-13368) PySpark JavaModel fails to extract params from Spark side automatically
[ https://issues.apache.org/jira/browse/SPARK-13368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15151710#comment-15151710 ] Xusen Yin commented on SPARK-13368: --- FYI [~mengxr] [~josephkb] > PySpark JavaModel fails to extract params from Spark side automatically > --- > > Key: SPARK-13368 > URL: https://issues.apache.org/jira/browse/SPARK-13368 > Project: Spark > Issue Type: Bug > Components: PySpark >Reporter: Xusen Yin >Priority: Minor > > JavaModel fails to extract params from Spark side automatically that causes > model.extractParamMap() is always empty. As shown in the example code below > copied from Spark Guide > https://spark.apache.org/docs/latest/ml-guide.html#example-estimator-transformer-and-param > {code} > # Prepare training data from a list of (label, features) tuples. > training = sqlContext.createDataFrame([ > (1.0, Vectors.dense([0.0, 1.1, 0.1])), > (0.0, Vectors.dense([2.0, 1.0, -1.0])), > (0.0, Vectors.dense([2.0, 1.3, 1.0])), > (1.0, Vectors.dense([0.0, 1.2, -0.5]))], ["label", "features"]) > # Create a LogisticRegression instance. This instance is an Estimator. > lr = LogisticRegression(maxIter=10, regParam=0.01) > # Print out the parameters, documentation, and any default values. > print "LogisticRegression parameters:\n" + lr.explainParams() + "\n" > # Learn a LogisticRegression model. This uses the parameters stored in lr. > model1 = lr.fit(training) > # Since model1 is a Model (i.e., a transformer produced by an Estimator), > # we can view the parameters it used during fit(). > # This prints the parameter (name: value) pairs, where names are unique > # IDs for this LogisticRegression instance. > print "Model 1 was fit using parameters: " > print model1.extractParamMap() > {code} > The result of model1.extractParamMap() is {}. > Question is, should we provide the feature or not? 
If yes, we need either let > Model share same params with Estimator or adds a parent in Model and points > to its Estimator; if not, we should remove those lines from example code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13368) PySpark JavaModel fails to extract params from Spark side automatically
[ https://issues.apache.org/jira/browse/SPARK-13368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xusen Yin updated SPARK-13368: -- Description: JavaModel fails to extract params from Spark side automatically that causes model.extractParamMap() is always empty. As shown in the example code below copied from Spark Guide https://spark.apache.org/docs/latest/ml-guide.html {code} # Prepare training data from a list of (label, features) tuples. training = sqlContext.createDataFrame([ (1.0, Vectors.dense([0.0, 1.1, 0.1])), (0.0, Vectors.dense([2.0, 1.0, -1.0])), (0.0, Vectors.dense([2.0, 1.3, 1.0])), (1.0, Vectors.dense([0.0, 1.2, -0.5]))], ["label", "features"]) # Create a LogisticRegression instance. This instance is an Estimator. lr = LogisticRegression(maxIter=10, regParam=0.01) # Print out the parameters, documentation, and any default values. print "LogisticRegression parameters:\n" + lr.explainParams() + "\n" # Learn a LogisticRegression model. This uses the parameters stored in lr. model1 = lr.fit(training) # Since model1 is a Model (i.e., a transformer produced by an Estimator), # we can view the parameters it used during fit(). # This prints the parameter (name: value) pairs, where names are unique # IDs for this LogisticRegression instance. print "Model 1 was fit using parameters: " print model1.extractParamMap() {code} The result of model1.extractParamMap() is {}. Question is, should we provide the feature or not? If yes, we need either let Model share same params with Estimator or adds a parent in Model and points to its Estimator; if not, we should remove those lines from example code. was: JavaModel fails to extract params from Spark side automatically that causes model.extractParamMap() is always empty. As shown in the example code below copied from Spark Guide https://spark.apache.org/docs/latest/ml-guide.html: {code} # Prepare training data from a list of (label, features) tuples. 
training = sqlContext.createDataFrame([ (1.0, Vectors.dense([0.0, 1.1, 0.1])), (0.0, Vectors.dense([2.0, 1.0, -1.0])), (0.0, Vectors.dense([2.0, 1.3, 1.0])), (1.0, Vectors.dense([0.0, 1.2, -0.5]))], ["label", "features"]) # Create a LogisticRegression instance. This instance is an Estimator. lr = LogisticRegression(maxIter=10, regParam=0.01) # Print out the parameters, documentation, and any default values. print "LogisticRegression parameters:\n" + lr.explainParams() + "\n" # Learn a LogisticRegression model. This uses the parameters stored in lr. model1 = lr.fit(training) # Since model1 is a Model (i.e., a transformer produced by an Estimator), # we can view the parameters it used during fit(). # This prints the parameter (name: value) pairs, where names are unique # IDs for this LogisticRegression instance. print "Model 1 was fit using parameters: " print model1.extractParamMap() {code} The result of model1.extractParamMap() is {}. Question is, should we provide the feature or not? If yes, we need either let Model share same params with Estimator or adds a parent in Model and points to its Estimator; if not, we should remove those lines from example code. > PySpark JavaModel fails to extract params from Spark side automatically > --- > > Key: SPARK-13368 > URL: https://issues.apache.org/jira/browse/SPARK-13368 > Project: Spark > Issue Type: Bug > Components: PySpark >Reporter: Xusen Yin >Priority: Minor > > JavaModel fails to extract params from Spark side automatically that causes > model.extractParamMap() is always empty. As shown in the example code below > copied from Spark Guide https://spark.apache.org/docs/latest/ml-guide.html > {code} > # Prepare training data from a list of (label, features) tuples. 
> training = sqlContext.createDataFrame([ > (1.0, Vectors.dense([0.0, 1.1, 0.1])), > (0.0, Vectors.dense([2.0, 1.0, -1.0])), > (0.0, Vectors.dense([2.0, 1.3, 1.0])), > (1.0, Vectors.dense([0.0, 1.2, -0.5]))], ["label", "features"]) > # Create a LogisticRegression instance. This instance is an Estimator. > lr = LogisticRegression(maxIter=10, regParam=0.01) > # Print out the parameters, documentation, and any default values. > print "LogisticRegression parameters:\n" + lr.explainParams() + "\n" > # Learn a LogisticRegression model. This uses the parameters stored in lr. > model1 = lr.fit(training) > # Since model1 is a Model (i.e., a transformer produced by an Estimator), > # we can view the parameters it used during fit(). > # This prints the parameter (name: value) pairs, where names are unique > # IDs for this
[jira] [Updated] (SPARK-13368) PySpark JavaModel fails to extract params from Spark side automatically
[ https://issues.apache.org/jira/browse/SPARK-13368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xusen Yin updated SPARK-13368: -- Description: JavaModel fails to extract params from Spark side automatically that causes model.extractParamMap() is always empty. As shown in the example code below copied from Spark Guide https://spark.apache.org/docs/latest/ml-guide.html#example-estimator-transformer-and-param {code} # Prepare training data from a list of (label, features) tuples. training = sqlContext.createDataFrame([ (1.0, Vectors.dense([0.0, 1.1, 0.1])), (0.0, Vectors.dense([2.0, 1.0, -1.0])), (0.0, Vectors.dense([2.0, 1.3, 1.0])), (1.0, Vectors.dense([0.0, 1.2, -0.5]))], ["label", "features"]) # Create a LogisticRegression instance. This instance is an Estimator. lr = LogisticRegression(maxIter=10, regParam=0.01) # Print out the parameters, documentation, and any default values. print "LogisticRegression parameters:\n" + lr.explainParams() + "\n" # Learn a LogisticRegression model. This uses the parameters stored in lr. model1 = lr.fit(training) # Since model1 is a Model (i.e., a transformer produced by an Estimator), # we can view the parameters it used during fit(). # This prints the parameter (name: value) pairs, where names are unique # IDs for this LogisticRegression instance. print "Model 1 was fit using parameters: " print model1.extractParamMap() {code} The result of model1.extractParamMap() is {}. Question is, should we provide the feature or not? If yes, we need either let Model share same params with Estimator or adds a parent in Model and points to its Estimator; if not, we should remove those lines from example code. was: JavaModel fails to extract params from Spark side automatically that causes model.extractParamMap() is always empty. As shown in the example code below copied from Spark Guide https://spark.apache.org/docs/latest/ml-guide.html {code} # Prepare training data from a list of (label, features) tuples. 
training = sqlContext.createDataFrame([ (1.0, Vectors.dense([0.0, 1.1, 0.1])), (0.0, Vectors.dense([2.0, 1.0, -1.0])), (0.0, Vectors.dense([2.0, 1.3, 1.0])), (1.0, Vectors.dense([0.0, 1.2, -0.5]))], ["label", "features"]) # Create a LogisticRegression instance. This instance is an Estimator. lr = LogisticRegression(maxIter=10, regParam=0.01) # Print out the parameters, documentation, and any default values. print "LogisticRegression parameters:\n" + lr.explainParams() + "\n" # Learn a LogisticRegression model. This uses the parameters stored in lr. model1 = lr.fit(training) # Since model1 is a Model (i.e., a transformer produced by an Estimator), # we can view the parameters it used during fit(). # This prints the parameter (name: value) pairs, where names are unique # IDs for this LogisticRegression instance. print "Model 1 was fit using parameters: " print model1.extractParamMap() {code} The result of model1.extractParamMap() is {}. Question is, should we provide the feature or not? If yes, we need either let Model share same params with Estimator or adds a parent in Model and points to its Estimator; if not, we should remove those lines from example code. > PySpark JavaModel fails to extract params from Spark side automatically > --- > > Key: SPARK-13368 > URL: https://issues.apache.org/jira/browse/SPARK-13368 > Project: Spark > Issue Type: Bug > Components: PySpark >Reporter: Xusen Yin >Priority: Minor > > JavaModel fails to extract params from Spark side automatically that causes > model.extractParamMap() is always empty. As shown in the example code below > copied from Spark Guide > https://spark.apache.org/docs/latest/ml-guide.html#example-estimator-transformer-and-param > {code} > # Prepare training data from a list of (label, features) tuples. 
> training = sqlContext.createDataFrame([ > (1.0, Vectors.dense([0.0, 1.1, 0.1])), > (0.0, Vectors.dense([2.0, 1.0, -1.0])), > (0.0, Vectors.dense([2.0, 1.3, 1.0])), > (1.0, Vectors.dense([0.0, 1.2, -0.5]))], ["label", "features"]) > # Create a LogisticRegression instance. This instance is an Estimator. > lr = LogisticRegression(maxIter=10, regParam=0.01) > # Print out the parameters, documentation, and any default values. > print "LogisticRegression parameters:\n" + lr.explainParams() + "\n" > # Learn a LogisticRegression model. This uses the parameters stored in lr. > model1 = lr.fit(training) > # Since model1 is a Model (i.e., a transformer produced by an Estimator), > # we can view the parameters it used during fit(). > # This prints the parameter (name: value) pairs, where names are unique > # IDs for this LogisticRegression instance. > print "Model 1 was fit using parameters: " > print model1.extractParamMap() > {code} > The result of model1.extractParamMap() is {}. > Question is, should we provide the feature or not? If yes, we need either let > Model share same params with Estimator or adds a parent in Model and points > to its Estimator; if not, we should remove those lines from example code.
[jira] [Created] (SPARK-13368) PySpark JavaModel fails to extract params from Spark side automatically
Xusen Yin created SPARK-13368: - Summary: PySpark JavaModel fails to extract params from Spark side automatically Key: SPARK-13368 URL: https://issues.apache.org/jira/browse/SPARK-13368 Project: Spark Issue Type: Bug Components: PySpark Reporter: Xusen Yin JavaModel fails to extract params from the Spark side automatically, which causes model.extractParamMap() to always be empty, as shown in the example code below copied from the Spark Guide https://spark.apache.org/docs/latest/ml-guide.html: {code} # Prepare training data from a list of (label, features) tuples. training = sqlContext.createDataFrame([ (1.0, Vectors.dense([0.0, 1.1, 0.1])), (0.0, Vectors.dense([2.0, 1.0, -1.0])), (0.0, Vectors.dense([2.0, 1.3, 1.0])), (1.0, Vectors.dense([0.0, 1.2, -0.5]))], ["label", "features"]) # Create a LogisticRegression instance. This instance is an Estimator. lr = LogisticRegression(maxIter=10, regParam=0.01) # Print out the parameters, documentation, and any default values. print "LogisticRegression parameters:\n" + lr.explainParams() + "\n" # Learn a LogisticRegression model. This uses the parameters stored in lr. model1 = lr.fit(training) # Since model1 is a Model (i.e., a transformer produced by an Estimator), # we can view the parameters it used during fit(). # This prints the parameter (name: value) pairs, where names are unique # IDs for this LogisticRegression instance. print "Model 1 was fit using parameters: " print model1.extractParamMap() {code} The result of model1.extractParamMap() is {}. The question is whether we should provide this feature or not. If yes, we need to either let Model share the same params with its Estimator, or add a parent to Model that points to its Estimator; if not, we should remove those lines from the example code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
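The second option above (a Model that keeps a copy of the params it was fit with, plus a parent pointing back at its Estimator) can be sketched in plain Python. This is a hypothetical stand-in for illustration, not PySpark's actual Params machinery — class and method names other than `extractParamMap` are invented:

```python
# Sketch of the proposed behavior: fit() copies the Estimator's params
# into the resulting Model, so extractParamMap() is no longer empty.
# These classes are illustrative only, not PySpark's API.

class Estimator:
    def __init__(self, **params):
        self.param_map = dict(params)

    def fit(self, training):
        # Hand the model a snapshot of the params used for this fit.
        return Model(parent=self, param_map=dict(self.param_map))

class Model:
    def __init__(self, parent, param_map):
        self.parent = parent        # back-reference to the Estimator
        self.param_map = param_map  # params the model was fit with

    def extractParamMap(self):
        return self.param_map

lr = Estimator(maxIter=10, regParam=0.01)
model1 = lr.fit(training=None)
print(model1.extractParamMap())  # non-empty, unlike the reported behavior
```

Either the shared-params or the parent-reference design would make `extractParamMap()` on the model return the fit-time values shown in the guide.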
[jira] [Updated] (SPARK-13368) PySpark JavaModel fails to extract params from Spark side automatically
[ https://issues.apache.org/jira/browse/SPARK-13368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xusen Yin updated SPARK-13368: -- Priority: Minor (was: Major) > PySpark JavaModel fails to extract params from Spark side automatically > --- > > Key: SPARK-13368 > URL: https://issues.apache.org/jira/browse/SPARK-13368 > Project: Spark > Issue Type: Bug > Components: PySpark >Reporter: Xusen Yin >Priority: Minor > > JavaModel fails to extract params from Spark side automatically that causes > model.extractParamMap() is always empty. As shown in the example code below > copied from Spark Guide https://spark.apache.org/docs/latest/ml-guide.html: > {code} > # Prepare training data from a list of (label, features) tuples. > training = sqlContext.createDataFrame([ > (1.0, Vectors.dense([0.0, 1.1, 0.1])), > (0.0, Vectors.dense([2.0, 1.0, -1.0])), > (0.0, Vectors.dense([2.0, 1.3, 1.0])), > (1.0, Vectors.dense([0.0, 1.2, -0.5]))], ["label", "features"]) > # Create a LogisticRegression instance. This instance is an Estimator. > lr = LogisticRegression(maxIter=10, regParam=0.01) > # Print out the parameters, documentation, and any default values. > print "LogisticRegression parameters:\n" + lr.explainParams() + "\n" > # Learn a LogisticRegression model. This uses the parameters stored in lr. > model1 = lr.fit(training) > # Since model1 is a Model (i.e., a transformer produced by an Estimator), > # we can view the parameters it used during fit(). > # This prints the parameter (name: value) pairs, where names are unique > # IDs for this LogisticRegression instance. > print "Model 1 was fit using parameters: " > print model1.extractParamMap() > {code} > The result of model1.extractParamMap() is {}. > Question is, should we provide the feature or not? If yes, we need either let > Model share same params with Estimator or adds a parent in Model and points > to its Estimator; if not, we should remove those lines from example code. 
[jira] [Resolved] (SPARK-13324) Update plugin, test, example dependencies for 2.x
[ https://issues.apache.org/jira/browse/SPARK-13324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-13324. Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 11206 [https://github.com/apache/spark/pull/11206] > Update plugin, test, example dependencies for 2.x > - > > Key: SPARK-13324 > URL: https://issues.apache.org/jira/browse/SPARK-13324 > Project: Spark > Issue Type: Improvement > Components: Build, Examples, Spark Core >Affects Versions: 2.0.0 >Reporter: Sean Owen >Assignee: Sean Owen >Priority: Minor > Fix For: 2.0.0 > > > I'd like to update dependencies for 2.x as much as we can. > To start, we can update the low-risk dependencies: build plugins, test > dependencies, third-party examples / integrations. > Later I'll try and propose more significant updates to core dependencies, > excepting those that break support or compatibility significantly. > Then we'll look at the tough updates separately. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13360) pyspark related environment variable is not propagated to driver in yarn-cluster mode
[ https://issues.apache.org/jira/browse/SPARK-13360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Zhang updated SPARK-13360: --- Description: Such as PYSPARK_DRIVER_PYTHON, PYSPARK_PYTHON, PYTHONHASHSEED. > pyspark related environment variable is not propagated to driver in > yarn-cluster mode > > > Key: SPARK-13360 > URL: https://issues.apache.org/jira/browse/SPARK-13360 > Project: Spark > Issue Type: Bug > Components: PySpark, YARN >Affects Versions: 1.6.0 >Reporter: Jeff Zhang > > Such as PYSPARK_DRIVER_PYTHON, PYSPARK_PYTHON, PYTHONHASHSEED.
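In yarn-cluster mode the driver runs inside the YARN application master, so environment variables set on the client are not visible to it unless forwarded explicitly. A sketch of the usual workaround, building `spark.yarn.appMasterEnv.*` entries (these configuration keys exist in Spark on YARN; the interpreter paths and values here are purely illustrative):

```python
# Forward PySpark-related client environment variables to the driver
# (application master) via spark.yarn.appMasterEnv.* configuration keys.
env_vars = ["PYSPARK_DRIVER_PYTHON", "PYSPARK_PYTHON", "PYTHONHASHSEED"]

# Illustrative client-side values (not real defaults):
client_env = {"PYSPARK_PYTHON": "/usr/bin/python2.7", "PYTHONHASHSEED": "0"}

conf = {
    "spark.yarn.appMasterEnv." + name: value
    for name, value in client_env.items()
    if name in env_vars
}

# Each entry would be passed to spark-submit as a --conf argument.
submit_args = ["--conf %s=%s" % kv for kv in sorted(conf.items())]
```

This is a manual workaround sketch; the JIRA asks Spark to propagate these variables automatically.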
[jira] [Updated] (SPARK-13360) pyspark related environment variable is not propagated to driver in yarn-cluster mode
[ https://issues.apache.org/jira/browse/SPARK-13360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Zhang updated SPARK-13360: --- Summary: pyspark related environment variable is not propagated to driver in yarn-cluster mode (was: PYSPARK_DRIVER_PYTHON and PYSPARK_PYTHON is not picked up in yarn-cluster mode) > pyspark related environment variable is not propagated to driver in > yarn-cluster mode > > > Key: SPARK-13360 > URL: https://issues.apache.org/jira/browse/SPARK-13360 > Project: Spark > Issue Type: Bug > Components: PySpark, YARN >Affects Versions: 1.6.0 >Reporter: Jeff Zhang >
[jira] [Updated] (SPARK-13363) Aggregator not working with DataFrame
[ https://issues.apache.org/jira/browse/SPARK-13363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-13363: - Priority: Blocker (was: Minor) > Aggregator not working with DataFrame > - > > Key: SPARK-13363 > URL: https://issues.apache.org/jira/browse/SPARK-13363 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: koert kuipers >Priority: Blocker > > The org.apache.spark.sql.expressions.Aggregator doc comment says: A base class > for user-defined aggregations, which can be used in [[DataFrame]] and > [[Dataset]]. > It works well with Dataset/GroupedDataset, but I am having no luck using it > with DataFrame/GroupedData. Does anyone have an example of how to use it with a > DataFrame? > In particular I would like to use it with this method in GroupedData: > {noformat} > def agg(expr: Column, exprs: Column*): DataFrame > {noformat} > Clearly it should be possible, since GroupedDataset uses that very same > method to do the work: > {noformat} > private def agg(exprs: Column*): DataFrame = > groupedData.agg(withEncoder(exprs.head), exprs.tail.map(withEncoder): _*) > {noformat} > The trick seems to be the wrapping in withEncoder, which is private. I tried > to do something like it myself, but I had no luck since it uses more private > stuff in TypedColumn. > Anyhow, my attempt at using it in a DataFrame: > {noformat} > val simpleSum = new SqlAggregator[Int, Int, Int] { > def zero: Int = 0 // The initial value. > def reduce(b: Int, a: Int) = b + a // Add an element to the running total. > def merge(b1: Int, b2: Int) = b1 + b2 // Merge intermediate values. > def finish(b: Int) = b // Return the final result. 
> }.toColumn > val df = sc.makeRDD(1 to 3).map(i => (i, i)).toDF("k", "v") > df.groupBy("k").agg(simpleSum).show > {noformat} > and the resulting error: > {noformat} > org.apache.spark.sql.AnalysisException: unresolved operator 'Aggregate > [k#104], [k#104,($anon$3(),mode=Complete,isDistinct=false) AS sum#106]; > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:46) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:241) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:50) > at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:122) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:50) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:46) > at > org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:34) > at org.apache.spark.sql.DataFrame.(DataFrame.scala:130) > at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:49) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
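For readers unfamiliar with the Aggregator contract that `simpleSum` implements, the zero/reduce/merge/finish lifecycle can be sketched in plain Python. This is a toy single-process stand-in for illustration only, not Spark's distributed execution or its API:

```python
def aggregate(pairs, zero, reduce_fn, merge_fn, finish_fn, partitions=2):
    """Toy grouped aggregation following the Aggregator contract:
    fold each partition's values with reduce, combine per-partition
    buffers with merge, then apply finish to each group's buffer."""
    partials = [{} for _ in range(partitions)]
    for i, (k, v) in enumerate(pairs):
        buf = partials[i % partitions]          # fake partitioning
        buf[k] = reduce_fn(buf.get(k, zero), v)
    merged = {}
    for buf in partials:
        for k, b in buf.items():
            merged[k] = merge_fn(merged[k], b) if k in merged else b
    return {k: finish_fn(b) for k, b in merged.items()}

# simpleSum from the report, expressed with the same four functions:
result = aggregate([(1, 1), (2, 2), (3, 3), (1, 4)],
                   zero=0,
                   reduce_fn=lambda b, a: b + a,
                   merge_fn=lambda b1, b2: b1 + b2,
                   finish_fn=lambda b: b)
```

The bug is not in these semantics (which work via Dataset/GroupedDataset) but in plugging such an aggregator into the untyped `GroupedData.agg` path without the private `withEncoder` wrapping.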
[jira] [Updated] (SPARK-13363) Aggregator not working with DataFrame
[ https://issues.apache.org/jira/browse/SPARK-13363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-13363: - Affects Version/s: (was: 2.0.0) 1.6.0 Target Version/s: 2.0.0 > Aggregator not working with DataFrame > - > > Key: SPARK-13363 > URL: https://issues.apache.org/jira/browse/SPARK-13363 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: koert kuipers >Priority: Minor > > org.apache.spark.sql.expressions.Aggregator doc/comments says: A base class > for user-defined aggregations, which can be used in [[DataFrame]] and > [[Dataset]] > it works well with Dataset/GroupedDataset, but i am having no luck using it > with DataFrame/GroupedData. does anyone have an example how to use it with a > DataFrame? > in particular i would like to use it with this method in GroupedData: > {noformat} > def agg(expr: Column, exprs: Column*): DataFrame > {noformat} > clearly it should be possible, since GroupedDataset uses that very same > method to do the work: > {noformat} > private def agg(exprs: Column*): DataFrame = > groupedData.agg(withEncoder(exprs.head), exprs.tail.map(withEncoder): _*) > {noformat} > the trick seems to be the wrapping in withEncoder, which is private. i tried > to do something like it myself, but i had no luck since it uses more private > stuff in TypedColumn. > anyhow, my attempt at using it in DataFrame: > {noformat} > val simpleSum = new SqlAggregator[Int, Int, Int] { > def zero: Int = 0 // The initial value. > def reduce(b: Int, a: Int) = b + a// Add an element to the running total > def merge(b1: Int, b2: Int) = b1 + b2 // Merge intermediate values. > def finish(b: Int) = b// Return the final result. 
> }.toColumn > val df = sc.makeRDD(1 to 3).map(i => (i, i)).toDF("k", "v") > df.groupBy("k").agg(simpleSum).show > {noformat} > and the resulting error: > {noformat} > org.apache.spark.sql.AnalysisException: unresolved operator 'Aggregate > [k#104], [k#104,($anon$3(),mode=Complete,isDistinct=false) AS sum#106]; > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:46) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:241) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:50) > at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:122) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:50) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:46) > at > org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:34) > at org.apache.spark.sql.DataFrame.(DataFrame.scala:130) > at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:49) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-13183) Bytebuffers occupy a large amount of heap memory
[ https://issues.apache.org/jira/browse/SPARK-13183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15151604#comment-15151604 ] dylanzhou edited comment on SPARK-13183 at 2/18/16 2:33 AM: [~srowen] I don't know whether this is a memory leak problem; I get the heap memory error java.lang.OutOfMemoryError: Java heap space. When I try to increase driver memory, the streaming program just runs a little longer. In my opinion the byte[] objects cannot be reclaimed by the GC; these cached objects are Spark SQL table rows. When I increase the amount of data flowing into Kafka, memory is consumed even faster. Can you give me some advice? Here is my question, thank you! http://apache-spark-user-list.1001560.n3.nabble.com/the-memory-leak-problem-of-use-sparkstreamimg-and-sparksql-with-kafka-in-spark-1-4-1-td26231.html was (Author: dylanzhou): @Sean Owen maybe this is a memory leak problem, and it finally runs out of heap memory with the error java.lang.OutOfMemoryError: Java heap space. When I try to increase driver memory, the streaming program just runs a little longer; in my opinion the byte[] objects cannot be reclaimed by the GC. When I increase the amount of data flowing into Kafka, memory is consumed even faster. Can you give me some advice? Here is my question, thank you! 
http://apache-spark-user-list.1001560.n3.nabble.com/the-memory-leak-problem-of-use-sparkstreamimg-and-sparksql-with-kafka-in-spark-1-4-1-td26231.html > Bytebuffers occupy a large amount of heap memory > > > Key: SPARK-13183 > URL: https://issues.apache.org/jira/browse/SPARK-13183 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.1 >Reporter: dylanzhou > > When I use Spark Streaming and Spark SQL and cache the table, I find that the old gen > increases very fast and full GC is very frequent; after running for a period of time it > goes out of memory. After analyzing the heap memory, I found a large number of > org.apache.spark.sql.columnar.ColumnBuilder[38] @ 0xd022a0b8 objects taking up 90% of > the space; looking at the source, it is HeapByteBuffer that occupies the memory. I don't > know why these objects are not released and have been waiting for the GC to recycle them. > If I do not cache the table, this problem does not occur, but I need to repeatedly query > this table.
[jira] [Issue Comment Deleted] (SPARK-13183) Bytebuffers occupy a large amount of heap memory
[ https://issues.apache.org/jira/browse/SPARK-13183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] dylanzhou updated SPARK-13183: -- Comment: was deleted (was: @Sean Owen maybe is a memory leak problem, and finally will run out of heap memory error java.lang.OutOfMemoryError:Java for heap space. When I try to increase driver memory, just streaming programs work a little longer, in my opinion byte[] objects cannot be reclaimed by the GC.When I increase the amount of data that flows into Kafka, memory consumption and faster。 Can you give me some advice? Here is my question, thank you! http://apache-spark-user-list.1001560.n3.nabble.com/the-memory-leak-problem-of-use-sparkstreamimg-and-sparksql-with-kafka-in-spark-1-4-1-td26231.html ) > Bytebuffers occupy a large amount of heap memory > > > Key: SPARK-13183 > URL: https://issues.apache.org/jira/browse/SPARK-13183 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.1 >Reporter: dylanzhou > > When I used sparkstreamimg and sparksql, i cache the table,found that old gen > increases very fast and full GC is very frequent, running for a period of > time will be out of memory, after analysis of heap memory, found that there > are a large number of org.apache.spark.sql.columnar.ColumnBuilder[38] @ > 0xd022a0b8, takes up 90% of the space, look at the source is HeapByteBuffer > occupy, don't know why these objects are not released, had been waiting for > GC to recycle;if i donot use cache table, there will be no this problem, but > I need to repeatedly query this table do -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13183) Bytebuffers occupy a large amount of heap memory
[ https://issues.apache.org/jira/browse/SPARK-13183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15151604#comment-15151604 ] dylanzhou commented on SPARK-13183: --- @Sean Owen maybe is a memory leak problem, and finally will run out of heap memory error java.lang.OutOfMemoryError:Java for heap space. When I try to increase driver memory, just streaming programs work a little longer, in my opinion byte[] objects cannot be reclaimed by the GC.When I increase the amount of data that flows into Kafka, memory consumption and faster。 Can you give me some advice? Here is my question, thank you! http://apache-spark-user-list.1001560.n3.nabble.com/the-memory-leak-problem-of-use-sparkstreamimg-and-sparksql-with-kafka-in-spark-1-4-1-td26231.html > Bytebuffers occupy a large amount of heap memory > > > Key: SPARK-13183 > URL: https://issues.apache.org/jira/browse/SPARK-13183 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.1 >Reporter: dylanzhou > > When I used sparkstreamimg and sparksql, i cache the table,found that old gen > increases very fast and full GC is very frequent, running for a period of > time will be out of memory, after analysis of heap memory, found that there > are a large number of org.apache.spark.sql.columnar.ColumnBuilder[38] @ > 0xd022a0b8, takes up 90% of the space, look at the source is HeapByteBuffer > occupy, don't know why these objects are not released, had been waiting for > GC to recycle;if i donot use cache table, there will be no this problem, but > I need to repeatedly query this table do -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10001) Allow Ctrl-C in spark-shell to kill running job
[ https://issues.apache.org/jira/browse/SPARK-10001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15151591#comment-15151591 ] Jon Maurer commented on SPARK-10001: I have a number of users who would find this feature to be useful. The related scenario is hitting ctrl-c by mistake as an attempt to copy output and accidentally killing the shell, even if a job has not yet been submitted. Is there other interest in this feature? > Allow Ctrl-C in spark-shell to kill running job > --- > > Key: SPARK-10001 > URL: https://issues.apache.org/jira/browse/SPARK-10001 > Project: Spark > Issue Type: Sub-task > Components: Spark Shell >Affects Versions: 1.4.1 >Reporter: Cheolsoo Park >Priority: Minor > > Hitting Ctrl-C in spark-sql (and other tools like presto) cancels any running > job and starts a new input line on the prompt. It would be nice if > spark-shell also can do that. Otherwise, in case a user submits a job, say he > made a mistake, and wants to cancel it, he needs to exit the shell and > re-login to continue his work. Re-login can be a pain especially in Spark on > yarn, since it takes a while to allocate AM container and initial executors. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
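The requested behavior — intercept Ctrl-C and cancel the running job instead of killing the shell — can be sketched as a SIGINT handler. This is an illustrative stand-in: `cancel_all_jobs` is a placeholder callback; in a real shell it would invoke `sc.cancelAllJobs()` on the SparkContext:

```python
import signal

def install_interrupt_handler(cancel_all_jobs):
    """Install a SIGINT handler that cancels running jobs rather than
    exiting, mirroring the spark-sql / presto behavior described above."""
    def handler(signum, frame):
        # In spark-shell this would call sc.cancelAllJobs() and then
        # return to the prompt instead of terminating the process.
        cancel_all_jobs()
    signal.signal(signal.SIGINT, handler)

# Hypothetical wiring for demonstration:
cancelled = []
install_interrupt_handler(lambda: cancelled.append(True))
```

A shell would additionally need to redraw its prompt after the handler runs; the sketch covers only the cancel-instead-of-exit part.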
[jira] [Commented] (SPARK-6263) Python MLlib API missing items: Utils
[ https://issues.apache.org/jira/browse/SPARK-6263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15151508#comment-15151508 ] Bruno Wu commented on SPARK-6263: - kFold function is still not available in util.py (as far as I can see). Can I work on this JIRA to add in kFold? > Python MLlib API missing items: Utils > - > > Key: SPARK-6263 > URL: https://issues.apache.org/jira/browse/SPARK-6263 > Project: Spark > Issue Type: Sub-task > Components: MLlib, PySpark >Affects Versions: 1.3.0 >Reporter: Joseph K. Bradley >Assignee: Kai Sasaki > Fix For: 1.5.0 > > > This JIRA lists items missing in the Python API for this sub-package of MLlib. > This list may be incomplete, so please check again when sending a PR to add > these features to the Python API. > Also, please check for major disparities between documentation; some parts of > the Python API are less well-documented than their Scala counterparts. Some > items may be listed in the umbrella JIRA linked to this task. > MLUtils > * appendBias > * kFold > * loadVectors -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
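For context, the missing kFold utility splits a dataset into k (training, validation) pairs where each element lands in exactly one validation fold. A plain-Python sketch of that contract over indices (illustrative only, not the eventual PySpark MLUtils API):

```python
import random

def k_fold(indices, k, seed=42):
    """Return k (training, validation) index splits, following the
    MLUtils.kFold contract: each element appears in exactly one
    validation fold, and training is everything else."""
    shuffled = list(indices)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]  # round-robin folds
    return [(sorted(set(shuffled) - set(fold)), sorted(fold))
            for fold in folds]

splits = k_fold(range(10), k=3)
```

A real PySpark version would operate on RDDs (e.g. via sampling with complements) rather than index lists, but the split semantics are the same.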
[jira] [Commented] (SPARK-13367) Refactor KinesisUtils to specify more KCL options
[ https://issues.apache.org/jira/browse/SPARK-13367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15151485#comment-15151485 ] Apache Spark commented on SPARK-13367: -- User 'addisonj' has created a pull request for this issue: https://github.com/apache/spark/pull/11245 > Refactor KinesisUtils to specify more KCL options > - > > Key: SPARK-13367 > URL: https://issues.apache.org/jira/browse/SPARK-13367 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.6.0 >Reporter: Addison Higham > > Currently, the KinesisUtils doesn't allow for configuring certain options, > such as the dynamoDB endpoint or cloudwatch options. > This is a useful feature for being able to do local integration testing with > a tool like https://github.com/mhart/kinesalite and DynamoDB-Local. > The code is also somewhat complicated as related configuration options are > passed independently and could be improved by a configuration object that > owns all the configuration concerns. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13367) Refactor KinesisUtils to specify more KCL options
[ https://issues.apache.org/jira/browse/SPARK-13367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13367: Assignee: Apache Spark > Refactor KinesisUtils to specify more KCL options > - > > Key: SPARK-13367 > URL: https://issues.apache.org/jira/browse/SPARK-13367 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.6.0 >Reporter: Addison Higham >Assignee: Apache Spark > > Currently, the KinesisUtils doesn't allow for configuring certain options, > such as the dynamoDB endpoint or cloudwatch options. > This is a useful feature for being able to do local integration testing with > a tool like https://github.com/mhart/kinesalite and DynamoDB-Local. > The code is also somewhat complicated as related configuration options are > passed independently and could be improved by a configuration object that > owns all the configuration concerns. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13367) Refactor KinesisUtils to specify more KCL options
[ https://issues.apache.org/jira/browse/SPARK-13367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13367: Assignee: (was: Apache Spark) > Refactor KinesisUtils to specify more KCL options > - > > Key: SPARK-13367 > URL: https://issues.apache.org/jira/browse/SPARK-13367 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.6.0 >Reporter: Addison Higham > > Currently, the KinesisUtils doesn't allow for configuring certain options, > such as the dynamoDB endpoint or cloudwatch options. > This is a useful feature for being able to do local integration testing with > a tool like https://github.com/mhart/kinesalite and DynamoDB-Local. > The code is also somewhat complicated as related configuration options are > passed independently and could be improved by a configuration object that > owns all the configuration concerns. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-13344) Tests have many "accumulator not found" exceptions
[ https://issues.apache.org/jira/browse/SPARK-13344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-13344. Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 11222 [https://github.com/apache/spark/pull/11222] > Tests have many "accumulator not found" exceptions > -- > > Key: SPARK-13344 > URL: https://issues.apache.org/jira/browse/SPARK-13344 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 2.0.0 >Reporter: Andrew Or >Assignee: Andrew Or > Fix For: 2.0.0 > > > This is because SparkFunSuite clears all accumulators after every single > test. This suite reuses a DF and all of its associated internal accumulators > across many tests. E.g. SaveLoadSuite, InnerJoinSuite, many others. > This is likely caused by SPARK-10620. > {code} > 10:52:38.967 WARN org.apache.spark.executor.TaskMetrics: encountered > unregistered accumulator 253 when reconstructing task metrics. > 10:52:38.967 ERROR org.apache.spark.scheduler.DAGScheduler: Failed to update > accumulators for task 0 > org.apache.spark.SparkException: attempted to access non-existent accumulator > 253 > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1099) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1091) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13367) Refactor KinesisUtils to specify more KCL options
Addison Higham created SPARK-13367: -- Summary: Refactor KinesisUtils to specify more KCL options Key: SPARK-13367 URL: https://issues.apache.org/jira/browse/SPARK-13367 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.6.0 Reporter: Addison Higham Currently, the KinesisUtils doesn't allow for configuring certain options, such as the dynamoDB endpoint or cloudwatch options. This is a useful feature for being able to do local integration testing with a tool like https://github.com/mhart/kinesalite and DynamoDB-Local. The code is also somewhat complicated as related configuration options are passed independently and could be improved by a configuration object that owns all the configuration concerns. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
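The suggested refactor — one configuration object that owns the related endpoint and CloudWatch concerns instead of independent arguments — could look like the following sketch. All names, fields, and defaults here are hypothetical, not a proposed Spark API:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class KinesisReceiverConfig:
    """Groups the KCL options that KinesisUtils currently takes (or
    omits) as independent arguments. Field names are illustrative."""
    stream_name: str
    kinesis_endpoint: str = "https://kinesis.us-east-1.amazonaws.com"
    dynamodb_endpoint: Optional[str] = None  # None: KCL default region endpoint
    cloudwatch_enabled: bool = True

# Local integration testing against kinesalite / DynamoDB-Local:
local = KinesisReceiverConfig(
    stream_name="test-stream",
    kinesis_endpoint="http://localhost:4567",
    dynamodb_endpoint="http://localhost:8000",
    cloudwatch_enabled=False,
)
```

Passing one such object keeps related options together and lets new KCL settings be added without growing every KinesisUtils overload.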
[jira] [Commented] (SPARK-2541) Standalone mode can't access secure HDFS anymore
[ https://issues.apache.org/jira/browse/SPARK-2541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15151406#comment-15151406 ] Henry Saputra commented on SPARK-2541: -- Based on discussion on https://github.com/apache/spark/pull/2320 Seemed like we should not close this as dup of https://issues.apache.org/jira/browse/SPARK-3438 This should cover case where a standalone cluster is used to access secure HDFS for single user scenario. > Standalone mode can't access secure HDFS anymore > > > Key: SPARK-2541 > URL: https://issues.apache.org/jira/browse/SPARK-2541 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 1.0.0, 1.0.1 >Reporter: Thomas Graves > Attachments: SPARK-2541-partial.patch > > > In spark 0.9.x you could access secure HDFS from Standalone deploy, that > doesn't work in 1.X anymore. > It looks like the issues is in SparkHadoopUtil.runAsSparkUser. Previously it > wouldn't do the doAs if the currentUser == user. Not sure how it affects > when the daemons run as a super user but SPARK_USER is set to someone else. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12953) RDDRelation write set mode will be better to avoid error "pair.parquet already exists"
[ https://issues.apache.org/jira/browse/SPARK-12953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-12953. Resolution: Fixed Fix Version/s: 2.0.0 Fixed by PR for 2.0.0. > RDDRelation write set mode will be better to avoid error "pair.parquet > already exists" > -- > > Key: SPARK-12953 > URL: https://issues.apache.org/jira/browse/SPARK-12953 > Project: Spark > Issue Type: Improvement > Components: Examples >Reporter: shijinkui >Assignee: shijinkui >Priority: Minor > Fix For: 2.0.0 > > > An error is raised if no write mode is set when executing the test case > `RDDRelation.main()`: > Exception in thread "main" org.apache.spark.sql.AnalysisException: path > file:/Users/sjk/pair.parquet already exists.; > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:76) > at > org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58) > at > org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56) > at > org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130) > at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55) > at > org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:256) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:148) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:139) > at > org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:329) > 
at net.pusuo.gs.sql.RDDRelation$.main(RDDRelation.scala:65) > at net.pusuo.gs.sql.RDDRelation.main(RDDRelation.scala) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
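The fix this issue asks for amounts to passing an explicit write mode. A minimal sketch, assuming a DataFrame `df` is already in scope (the surrounding SparkSession/SQLContext setup and the RDDRelation example's own data are omitted):

```scala
// Assumes `df` is an existing DataFrame; the point is only the .mode(...) call.
import org.apache.spark.sql.SaveMode

df.write
  .mode(SaveMode.Overwrite) // replace existing output instead of failing with "path ... already exists"
  .parquet("pair.parquet")
```

With `SaveMode.Overwrite` (or `Append`/`Ignore`), re-running `RDDRelation.main()` no longer trips over the output left by a previous run.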
[jira] [Updated] (SPARK-12953) RDDRelation write set mode will be better to avoid error "pair.parquet already exists"
[ https://issues.apache.org/jira/browse/SPARK-12953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-12953: --- Assignee: shijinkui > RDDRelation write set mode will be better to avoid error "pair.parquet > already exists" > -- > > Key: SPARK-12953 > URL: https://issues.apache.org/jira/browse/SPARK-12953 > Project: Spark > Issue Type: Improvement > Components: Examples >Reporter: shijinkui >Assignee: shijinkui >Priority: Minor > Fix For: 2.0.0 > > > It will be error if not set Write Mode when execute test case > `RDDRelation.main()` > Exception in thread "main" org.apache.spark.sql.AnalysisException: path > file:/Users/sjk/pair.parquet already exists.; > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:76) > at > org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58) > at > org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56) > at > org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130) > at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55) > at > org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:256) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:148) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:139) > at > org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:329) > at 
net.pusuo.gs.sql.RDDRelation$.main(RDDRelation.scala:65) > at net.pusuo.gs.sql.RDDRelation.main(RDDRelation.scala) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-12953) RDDRelation write set mode will be better to avoid error "pair.parquet already exists"
[ https://issues.apache.org/jira/browse/SPARK-12953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen reopened SPARK-12953: > RDDRelation write set mode will be better to avoid error "pair.parquet > already exists" > -- > > Key: SPARK-12953 > URL: https://issues.apache.org/jira/browse/SPARK-12953 > Project: Spark > Issue Type: Improvement > Components: Examples >Reporter: shijinkui >Assignee: shijinkui >Priority: Minor > Fix For: 2.0.0 > > > It will be error if not set Write Mode when execute test case > `RDDRelation.main()` > Exception in thread "main" org.apache.spark.sql.AnalysisException: path > file:/Users/sjk/pair.parquet already exists.; > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:76) > at > org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58) > at > org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56) > at > org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130) > at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55) > at > org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:256) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:148) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:139) > at > org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:329) > at net.pusuo.gs.sql.RDDRelation$.main(RDDRelation.scala:65) > 
at net.pusuo.gs.sql.RDDRelation.main(RDDRelation.scala) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-13109) SBT publishLocal failed to publish to local ivy repo
[ https://issues.apache.org/jira/browse/SPARK-13109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-13109. Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 11001 [https://github.com/apache/spark/pull/11001] > SBT publishLocal failed to publish to local ivy repo > > > Key: SPARK-13109 > URL: https://issues.apache.org/jira/browse/SPARK-13109 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 1.6.0 >Reporter: Saisai Shao > Fix For: 2.0.0 > > > Because of overriding the {{external-resolvers}} to Maven central and local > only in sbt, now {{sbt publishLocal}} is failed to publish to local ivy > repo, the detailed exception is showing in the dev mail list > (http://apache-spark-developers-list.1001551.n3.nabble.com/sbt-publish-local-fails-with-2-0-0-SNAPSHOT-td16168.html). > Possibly two solutions: > 1. Add ivy local repo to the {{external-resolvers}}. > 2. Do not publish to local ivy repo. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
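Option 1 above can be sketched as a build-definition fragment (untested; `externalResolvers` is the standard sbt key, though the exact resolver list Spark's build overrides may differ):

```scala
// Put the local ivy repo (~/.ivy2/local) back at the front of the resolver
// chain so `sbt publishLocal` can publish to and resolve against it.
externalResolvers := Resolver.defaultLocal +: externalResolvers.value
```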
[jira] [Updated] (SPARK-13109) SBT publishLocal failed to publish to local ivy repo
[ https://issues.apache.org/jira/browse/SPARK-13109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-13109: --- Assignee: Saisai Shao > SBT publishLocal failed to publish to local ivy repo > > > Key: SPARK-13109 > URL: https://issues.apache.org/jira/browse/SPARK-13109 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 1.6.0 >Reporter: Saisai Shao >Assignee: Saisai Shao > Fix For: 2.0.0 > > > Because of overriding the {{external-resolvers}} to Maven central and local > only in sbt, now {{sbt publishLocal}} is failed to publish to local ivy > repo, the detailed exception is showing in the dev mail list > (http://apache-spark-developers-list.1001551.n3.nabble.com/sbt-publish-local-fails-with-2-0-0-SNAPSHOT-td16168.html). > Possibly two solutions: > 1. Add ivy local repo to the {{external-resolvers}}. > 2. Do not publish to local ivy repo. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12449) Pushing down arbitrary logical plans to data sources
[ https://issues.apache.org/jira/browse/SPARK-12449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15151368#comment-15151368 ] Max Seiden commented on SPARK-12449: Very interested in checking out that PR! It would be prudent to have a holistic high-level design for any work here too, mostly to answer a few major questions. A random sample of such Qs: + Should there be a new trait for each new `sources.*` type, or a single trait that communicates capabilities to the planner (i.e. the CatalystSource design)? a) a new trait for each source could get unwieldy given the potential # of permutations b) a single, generic trait is powerful, but it puts a lot of burden on the implementer to cover more cases than they may want + Depending on the above, should source plans be a tree of operators or a list of operators to be applied in-order? a) the first option is more natural, but is smells a lot like catalyst -- not a bad thing if it's a separate, stable API though + the more that's pushed down via sources.Expressions, the more complex things may get for implementers a) for example, if Aliases are pushed down, there's a lot more opportunity for resolution bugs in the source impl b) a definitive stance would be needed for exprs like UDFs or those dealing with complex types c) without a way to signal capabilities (implicitly or explicitly) to the planner, there'd likely need to be a way to "bail out" > Pushing down arbitrary logical plans to data sources > > > Key: SPARK-12449 > URL: https://issues.apache.org/jira/browse/SPARK-12449 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Stephan Kessler > Attachments: pushingDownLogicalPlans.pdf > > > With the help of the DataSource API we can pull data from external sources > for processing. Implementing interfaces such as {{PrunedFilteredScan}} allows > to push down filters and projects pruning unnecessary fields and rows > directly in the data source. 
> However, data sources such as SQL Engines are capable of doing even more > preprocessing, e.g., evaluating aggregates. This is beneficial because it > would reduce the amount of data transferred from the source to Spark. The > existing interfaces do not allow such kind of processing in the source. > We would propose to add a new interface {{CatalystSource}} that allows to > defer the processing of arbitrary logical plans to the data source. We have > already shown the details at the Spark Summit 2015 Europe > [https://spark-summit.org/eu-2015/events/the-pushdown-of-everything/] > I will add a design document explaining details. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
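For readers skimming the thread, the kind of interface under discussion can be sketched roughly as follows (names and signatures are illustrative only, not an actual Spark API; `LogicalPlan`, `RDD`, and `Row` stand in for the existing Catalyst/Spark types):

```scala
// Hypothetical shape of a CatalystSource-style contract: the source declares
// which logical plans it can evaluate, and the planner falls back to normal
// Spark execution for anything the source rejects.
trait CatalystSource {
  // Capability check: lets the planner "bail out" when the source
  // cannot handle a given plan (cf. Max Seiden's point above).
  def supportsLogicalPlan(plan: LogicalPlan): Boolean

  // Defer execution of the whole (sub)plan to the data source.
  def logicalPlanToRDD(plan: LogicalPlan): RDD[Row]
}
```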
[jira] [Commented] (SPARK-12449) Pushing down arbitrary logical plans to data sources
[ https://issues.apache.org/jira/browse/SPARK-12449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15151345#comment-15151345 ] Evan Chan commented on SPARK-12449: --- [~stephank85] would you have any code to share? :D > Pushing down arbitrary logical plans to data sources > > > Key: SPARK-12449 > URL: https://issues.apache.org/jira/browse/SPARK-12449 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Stephan Kessler > Attachments: pushingDownLogicalPlans.pdf > > > With the help of the DataSource API we can pull data from external sources > for processing. Implementing interfaces such as {{PrunedFilteredScan}} allows > to push down filters and projects pruning unnecessary fields and rows > directly in the data source. > However, data sources such as SQL Engines are capable of doing even more > preprocessing, e.g., evaluating aggregates. This is beneficial because it > would reduce the amount of data transferred from the source to Spark. The > existing interfaces do not allow such kind of processing in the source. > We would propose to add a new interface {{CatalystSource}} that allows to > defer the processing of arbitrary logical plans to the data source. We have > already shown the details at the Spark Summit 2015 Europe > [https://spark-summit.org/eu-2015/events/the-pushdown-of-everything/] > I will add a design document explaining details. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12449) Pushing down arbitrary logical plans to data sources
[ https://issues.apache.org/jira/browse/SPARK-12449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15151339#comment-15151339 ] Stephan Kessler commented on SPARK-12449: - [~maxseiden] good idea! In order to simplify things even more - we could get rid (at least in the first shot) of the partitioned and holistic approach, since we aim for databases as datasources. What do you think on keeping the ability to kind of ask the datasource if it supports the pushdown of a well-defined operation? This would simplify the implementation of the datasource as well as the Strategy for the planner. [~velvia] i am currently working heavily on the pushdown of partial aggregates in combination with Tungsten, so i am happy to contribute in that direction. Should i try to formulate a new/simplified design doc that covers the gradual approach? I am very happy to help with the PR and the definitions of tasks as well. > Pushing down arbitrary logical plans to data sources > > > Key: SPARK-12449 > URL: https://issues.apache.org/jira/browse/SPARK-12449 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Stephan Kessler > Attachments: pushingDownLogicalPlans.pdf > > > With the help of the DataSource API we can pull data from external sources > for processing. Implementing interfaces such as {{PrunedFilteredScan}} allows > to push down filters and projects pruning unnecessary fields and rows > directly in the data source. > However, data sources such as SQL Engines are capable of doing even more > preprocessing, e.g., evaluating aggregates. This is beneficial because it > would reduce the amount of data transferred from the source to Spark. The > existing interfaces do not allow such kind of processing in the source. > We would propose to add a new interface {{CatalystSource}} that allows to > defer the processing of arbitrary logical plans to the data source. 
We have > already shown the details at the Spark Summit 2015 Europe > [https://spark-summit.org/eu-2015/events/the-pushdown-of-everything/] > I will add a design document explaining details. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13366) Support Cartesian join for Datasets
[ https://issues.apache.org/jira/browse/SPARK-13366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiu (Joe) Guo updated SPARK-13366: -- Description: Saw a comment from [~marmbrus] regarding Cartesian join for Datasets: "You will get a cartesian if you do a join/joinWith using lit(true) as the condition. We could consider adding an API for doing that more concisely." was: Saw a comment from [~marmbrus] about this: "You will get a cartesian if you do a join/joinWith using lit(true) as the condition. We could consider adding an API for doing that more concisely." > Support Cartesian join for Datasets > --- > > Key: SPARK-13366 > URL: https://issues.apache.org/jira/browse/SPARK-13366 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiu (Joe) Guo >Priority: Minor > > Saw a comment from [~marmbrus] regarding Cartesian join for Datasets: > "You will get a cartesian if you do a join/joinWith using lit(true) as the > condition. We could consider adding an API for doing that more concisely." -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13366) Support Cartesian join for Datasets
[ https://issues.apache.org/jira/browse/SPARK-13366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15151331#comment-15151331 ] Apache Spark commented on SPARK-13366: -- User 'xguo27' has created a pull request for this issue: https://github.com/apache/spark/pull/11244 > Support Cartesian join for Datasets > --- > > Key: SPARK-13366 > URL: https://issues.apache.org/jira/browse/SPARK-13366 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiu (Joe) Guo >Priority: Minor > > Saw a comment from [~marmbrus] about this: > "You will get a cartesian if you do a join/joinWith using lit(true) as the > condition. We could consider adding an API for doing that more concisely." -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13366) Support Cartesian join for Datasets
[ https://issues.apache.org/jira/browse/SPARK-13366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13366: Assignee: Apache Spark > Support Cartesian join for Datasets > --- > > Key: SPARK-13366 > URL: https://issues.apache.org/jira/browse/SPARK-13366 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiu (Joe) Guo >Assignee: Apache Spark >Priority: Minor > > Saw a comment from [~marmbrus] about this: > "You will get a cartesian if you do a join/joinWith using lit(true) as the > condition. We could consider adding an API for doing that more concisely." -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13366) Support Cartesian join for Datasets
[ https://issues.apache.org/jira/browse/SPARK-13366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13366: Assignee: (was: Apache Spark) > Support Cartesian join for Datasets > --- > > Key: SPARK-13366 > URL: https://issues.apache.org/jira/browse/SPARK-13366 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiu (Joe) Guo >Priority: Minor > > Saw a comment from [~marmbrus] about this: > "You will get a cartesian if you do a join/joinWith using lit(true) as the > condition. We could consider adding an API for doing that more concisely." -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13366) Support Cartesian join for Datasets
Xiu (Joe) Guo created SPARK-13366: - Summary: Support Cartesian join for Datasets Key: SPARK-13366 URL: https://issues.apache.org/jira/browse/SPARK-13366 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.0.0 Reporter: Xiu (Joe) Guo Priority: Minor Saw a comment from Michael about this: "You will get a cartesian if you do a join/joinWith using lit(true) as the condition. We could consider adding an API for doing that more concisely." -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
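The workaround quoted in the description looks like this in practice (a sketch; assumes a SparkSession with `spark.implicits._` imported so that `toDS()` is available):

```scala
import org.apache.spark.sql.functions.lit

val left  = Seq(1, 2).toDS()
val right = Seq("a", "b").toDS()

// A join condition that is literally `true` degenerates into a cartesian
// product: every row of `left` paired with every row of `right` (2 x 2 = 4 pairs).
val cartesian = left.joinWith(right, lit(true))
```

A dedicated API would make the cartesian intent explicit instead of hiding it behind `lit(true)`.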
[jira] [Commented] (SPARK-12224) R support for JDBC source
[ https://issues.apache.org/jira/browse/SPARK-12224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15151298#comment-15151298 ] Felix Cheung commented on SPARK-12224: -- [~shivaram] could you please review the PR comment https://github.com/apache/spark/pull/10480#discussion_r50348037 > R support for JDBC source > - > > Key: SPARK-12224 > URL: https://issues.apache.org/jira/browse/SPARK-12224 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 1.5.2 >Reporter: Felix Cheung >Assignee: Apache Spark >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13242) Moderately complex `when` expression causes code generation failure
[ https://issues.apache.org/jira/browse/SPARK-13242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15151291#comment-15151291 ] Apache Spark commented on SPARK-13242: -- User 'joehalliwell' has created a pull request for this issue: https://github.com/apache/spark/pull/11243 > Moderately complex `when` expression causes code generation failure > --- > > Key: SPARK-13242 > URL: https://issues.apache.org/jira/browse/SPARK-13242 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Joe Halliwell > > Moderately complex `when` expressions produce generated code that busts the > 64KB method limit. This causes code generation to fail. > Here's a test case exhibiting the problem: > https://github.com/joehalliwell/spark/commit/4dbdf6e15d1116b8e1eb44822fd29ead9b7d817d > I'm interested in working on a fix. I'm thinking it may be possible to split > the expressions along the lines of SPARK-8443, but any pointers would be > welcome! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13344) Tests have many "accumulator not found" exceptions
[ https://issues.apache.org/jira/browse/SPARK-13344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-13344: -- Summary: Tests have many "accumulator not found" exceptions (was: SaveLoadSuite has many accumulator exceptions) > Tests have many "accumulator not found" exceptions > -- > > Key: SPARK-13344 > URL: https://issues.apache.org/jira/browse/SPARK-13344 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 2.0.0 >Reporter: Andrew Or >Assignee: Andrew Or > > This is because SparkFunSuite clears all accumulators after every single > test. This suite reuses a DF and all of its associated internal accumulators > across many tests. > This is likely caused by SPARK-10620. > {code} > 10:52:38.967 WARN org.apache.spark.executor.TaskMetrics: encountered > unregistered accumulator 253 when reconstructing task metrics. > 10:52:38.967 ERROR org.apache.spark.scheduler.DAGScheduler: Failed to update > accumulators for task 0 > org.apache.spark.SparkException: attempted to access non-existent accumulator > 253 > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1099) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1091) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13344) Tests have many "accumulator not found" exceptions
[ https://issues.apache.org/jira/browse/SPARK-13344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-13344: -- Description: This is because SparkFunSuite clears all accumulators after every single test. This suite reuses a DF and all of its associated internal accumulators across many tests. E.g. SaveLoadSuite, InnerJoinSuite, many others. This is likely caused by SPARK-10620. {code} 10:52:38.967 WARN org.apache.spark.executor.TaskMetrics: encountered unregistered accumulator 253 when reconstructing task metrics. 10:52:38.967 ERROR org.apache.spark.scheduler.DAGScheduler: Failed to update accumulators for task 0 org.apache.spark.SparkException: attempted to access non-existent accumulator 253 at org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1099) at org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1091) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) {code} was: This is because SparkFunSuite clears all accumulators after every single test. This suite reuses a DF and all of its associated internal accumulators across many tests. This is likely caused by SPARK-10620. {code} 10:52:38.967 WARN org.apache.spark.executor.TaskMetrics: encountered unregistered accumulator 253 when reconstructing task metrics. 
10:52:38.967 ERROR org.apache.spark.scheduler.DAGScheduler: Failed to update accumulators for task 0 org.apache.spark.SparkException: attempted to access non-existent accumulator 253 at org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1099) at org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1091) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) {code} > Tests have many "accumulator not found" exceptions > -- > > Key: SPARK-13344 > URL: https://issues.apache.org/jira/browse/SPARK-13344 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 2.0.0 >Reporter: Andrew Or >Assignee: Andrew Or > > This is because SparkFunSuite clears all accumulators after every single > test. This suite reuses a DF and all of its associated internal accumulators > across many tests. E.g. SaveLoadSuite, InnerJoinSuite, many others. > This is likely caused by SPARK-10620. > {code} > 10:52:38.967 WARN org.apache.spark.executor.TaskMetrics: encountered > unregistered accumulator 253 when reconstructing task metrics. > 10:52:38.967 ERROR org.apache.spark.scheduler.DAGScheduler: Failed to update > accumulators for task 0 > org.apache.spark.SparkException: attempted to access non-existent accumulator > 253 > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1099) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1091) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13279) Scheduler does O(N^2) operation when adding a new task set (making it prohibitively slow for scheduling 200K tasks)
[ https://issues.apache.org/jira/browse/SPARK-13279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-13279: -- Fix Version/s: (was: 1.7) 2.0.0 > Scheduler does O(N^2) operation when adding a new task set (making it > prohibitively slow for scheduling 200K tasks) > --- > > Key: SPARK-13279 > URL: https://issues.apache.org/jira/browse/SPARK-13279 > Project: Spark > Issue Type: Improvement > Components: Scheduler, Spark Core >Affects Versions: 1.6.0 >Reporter: Sital Kedia >Assignee: Sital Kedia > Fix For: 1.6.1, 2.0.0 > > > For each task that the TaskSetManager adds, it iterates through the entire > list of existing tasks to check if it's there. As a result, scheduling a new > task set is O(N^2), which can be slow for large task sets. > This is a bug that was introduced by > https://github.com/apache/spark/commit/3535b91: that commit removed the > "!readding" condition from the if-statement, but since the re-adding > parameter defaulted to false, that commit should have removed the condition > check in the if-statement altogether. > - > We discovered this bug while running a large pipeline with 200k tasks, when > we found that the executors were not able to register with the driver because > the driver was stuck holding a global lock in TaskSchedulerImpl.submitTasks > function for a long time (it wasn't deadlocked -- just taking a long time). > jstack of the driver - http://pastebin.com/m8CP6VMv > executor log - http://pastebin.com/2NPS1mXC > From the jstack I see that the thread handling the resource offer from > executors (dispatcher-event-loop-9) is blocked on a lock held by the thread > "dag-scheduler-event-loop", which is iterating over an entire ArrayBuffer > when adding pending tasks. So when we have 200k pending tasks, because of > this O(N^2) operation, the driver is just hung for more than 5 minutes. > Solution - In addPendingTask function, we don't really need a duplicate > check. 
It's okay if we add a task to the same queue twice because > dequeueTaskFromList will skip already-running tasks. > Please note that this is a regression from Spark 1.5. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
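The solution described above can be illustrated with a simplified, self-contained model (hypothetical names, not the actual TaskSetManager code): append without any duplicate scan, and let the dequeue side skip tasks that are already running.

```scala
import scala.collection.mutable.ArrayBuffer

// Simplified model of the fix: tasks are plain Ints, and `running` stands in
// for the set of task indices that have already been scheduled.
object PendingTasks {
  def addPendingTask(pending: ArrayBuffer[Int], task: Int): Unit = {
    // No `pending.contains(task)` scan (that scan is what made adding a task
    // set O(N^2)). Duplicates are tolerated because dequeueTask skips tasks
    // that are already running.
    pending += task
  }

  def dequeueTask(pending: ArrayBuffer[Int], running: Set[Int]): Option[Int] = {
    while (pending.nonEmpty) {
      val t = pending.remove(pending.length - 1) // pop from the end
      if (!running.contains(t)) return Some(t)   // stale duplicates fall through
    }
    None
  }
}
```

Each `addPendingTask` is now O(1), so building a 200K-task set is linear instead of quadratic, at the cost of an occasional stale entry that the dequeue path discards.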
[jira] [Commented] (SPARK-12449) Pushing down arbitrary logical plans to data sources
[ https://issues.apache.org/jira/browse/SPARK-12449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15151268#comment-15151268 ] Evan Chan commented on SPARK-12449: --- I agree with [~maxseiden] on a gradual approach to push more down into the data sources API.Since I was going to explore a path like this anyways, I'd be willing to submit a PR to explore a `sources.Expression` kind of pushdown. There is also some stuff in 2.0 that might interact with this, such as vectorization and the whole query code gen, that we need to be aware of. > Pushing down arbitrary logical plans to data sources > > > Key: SPARK-12449 > URL: https://issues.apache.org/jira/browse/SPARK-12449 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Stephan Kessler > Attachments: pushingDownLogicalPlans.pdf > > > With the help of the DataSource API we can pull data from external sources > for processing. Implementing interfaces such as {{PrunedFilteredScan}} allows > to push down filters and projects pruning unnecessary fields and rows > directly in the data source. > However, data sources such as SQL Engines are capable of doing even more > preprocessing, e.g., evaluating aggregates. This is beneficial because it > would reduce the amount of data transferred from the source to Spark. The > existing interfaces do not allow such kind of processing in the source. > We would propose to add a new interface {{CatalystSource}} that allows to > defer the processing of arbitrary logical plans to the data source. We have > already shown the details at the Spark Summit 2015 Europe > [https://spark-summit.org/eu-2015/events/the-pushdown-of-everything/] > I will add a design document explaining details. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13333) DataFrame filter + randn + unionAll has bad interaction
[ https://issues.apache.org/jira/browse/SPARK-13333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15151265#comment-15151265 ] Xiao Li commented on SPARK-13333: - Another example is MS SQL Server Rand() https://msdn.microsoft.com/en-us/library/ms177610.aspx {code} seed Is an integer expression (tinyint, smallint, or int) that gives the seed value. If seed is not specified, the SQL Server Database Engine assigns a seed value at random. For a specified seed value, the result returned is always the same. {code} > DataFrame filter + randn + unionAll has bad interaction > --- > > Key: SPARK-13333 > URL: https://issues.apache.org/jira/browse/SPARK-13333 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.2, 1.6.1, 2.0.0 >Reporter: Joseph K. Bradley > > Buggy workflow > * Create a DataFrame df0 > * Filter df0 > * Add a randn column > * Create a copy of the DataFrame > * unionAll the two DataFrames > This fails, where randn produces the same results on the original DataFrame > and the copy before unionAll but fails to do so after unionAll. Removing the > filter fixes the problem. > The bug can be reproduced on master: > {code} > import org.apache.spark.sql.functions.randn > val df0 = sqlContext.createDataFrame(Seq(0, 1).map(Tuple1(_))).toDF("id") > // Removing the following filter() call makes this give the expected result. > val df1 = df0.filter(col("id") === 0).withColumn("b", randn(12345)) > println("DF1") > df1.show() > val df2 = df1.select("id", "b") > println("DF2") > df2.show() // same as df1.show(), as expected > val df3 = df1.unionAll(df2) > println("DF3") > df3.show() // NOT two copies of df1, which is unexpected > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13365) should coalesce do anything if coalescing to same number of partitions without shuffle
Thomas Graves created SPARK-13365: - Summary: should coalesce do anything if coalescing to same number of partitions without shuffle Key: SPARK-13365 URL: https://issues.apache.org/jira/browse/SPARK-13365 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.6.0 Reporter: Thomas Graves Currently, if a user does a coalesce to the same number of partitions as already exist, it spends a bunch of time doing work when it seems like it shouldn't do anything. For instance, if I have an RDD with 100 partitions and I run coalesce(100), it seems like it should skip any computation since it already has 100 partitions. One case where I've seen this is actually when users do coalesce(1000) without the shuffle, which really turns into a coalesce(100). I'm presenting this as a question, as I'm not sure if there are use cases I haven't thought of where this would break.
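The proposed short-circuit can be sketched as follows. This is a minimal, hypothetical model on plain Python lists, not Spark's actual `coalesce` implementation: when no shuffle is requested and the target count is at least the current partition count, the input is returned unchanged instead of rebuilding the partitioning.

```python
# Illustrative sketch (not Spark code): a no-op guard for coalesce,
# modeled on an "RDD" represented as a list of partitions.

def coalesce(partitions, num_partitions, shuffle=False):
    """Return the input unchanged when coalescing without a shuffle to the
    same (or a larger) partition count; otherwise merge round-robin."""
    if not shuffle and num_partitions >= len(partitions):
        # Proposed short-circuit: nothing to do, skip all the work.
        # This also covers coalesce(1000) on 100 partitions without shuffle,
        # which can only cap out at the existing 100 partitions anyway.
        return partitions
    merged = [[] for _ in range(num_partitions)]
    for i, part in enumerate(partitions):
        merged[i % num_partitions].extend(part)
    return merged

rdd = [[1, 2], [3, 4], [5, 6]]          # an "RDD" with 3 partitions
assert coalesce(rdd, 3) is rdd          # same count, no shuffle: identity
assert len(coalesce(rdd, 2)) == 2       # fewer partitions: real work happens
```

The open question in the issue is exactly whether this identity case can ever matter to a downstream consumer; the sketch assumes it cannot.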
[jira] [Resolved] (SPARK-13279) Scheduler does O(N^2) operation when adding a new task set (making it prohibitively slow for scheduling 200K tasks)
[ https://issues.apache.org/jira/browse/SPARK-13279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout resolved SPARK-13279. Resolution: Fixed Fix Version/s: 1.6.1 1.7 > Scheduler does O(N^2) operation when adding a new task set (making it > prohibitively slow for scheduling 200K tasks) > --- > > Key: SPARK-13279 > URL: https://issues.apache.org/jira/browse/SPARK-13279 > Project: Spark > Issue Type: Improvement > Components: Scheduler, Spark Core >Affects Versions: 1.6.0 >Reporter: Sital Kedia >Assignee: Sital Kedia > Fix For: 1.7, 1.6.1 > > > For each task that the TaskSetManager adds, it iterates through the entire > list of existing tasks to check if it's there. As a result, scheduling a new > task set is O(N^2), which can be slow for large task sets. > This is a bug that was introduced by > https://github.com/apache/spark/commit/3535b91: that commit removed the > "!readding" condition from the if-statement, but since the re-adding > parameter defaulted to false, that commit should have removed the condition > check in the if-statement altogether. > - > We discovered this bug while running a large pipeline with 200k tasks, when > we found that the executors were not able to register with the driver because > the driver was stuck holding a global lock in the TaskSchedulerImpl.submitTasks > function for a long time (it wasn't deadlocked -- just taking a long time). > jstack of the driver - http://pastebin.com/m8CP6VMv > executor log - http://pastebin.com/2NPS1mXC > From the jstack I see that the thread handling the resource offer from > executors (dispatcher-event-loop-9) is blocked on a lock held by the thread > "dag-scheduler-event-loop", which is iterating over an entire ArrayBuffer > when adding pending tasks. So when we have 200k pending tasks, because of > this O(N^2) operation, the driver is just hung for more than 5 minutes. > Solution - In the addPendingTask function, we don't really need a duplicate > check. 
It's okay if we add a task to the same queue twice because > dequeueTaskFromList will skip already-running tasks. > Please note that this is a regression from Spark 1.5. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
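The complexity problem and its fix can be sketched abstractly. This is a hypothetical model in plain Python, not the `TaskSetManager` code: a membership check against a plain list costs O(N) per add, so adding N tasks is O(N^2) overall, while appending unconditionally (and skipping stale entries at dequeue time, as the fix does) is O(N).

```python
# Toy model of the scheduling hot path described above (assumed names).

def add_pending_with_check(pending, task):
    """The buggy pattern: O(N) linear scan on every add -> O(N^2) total."""
    if task not in pending:
        pending.append(task)

def add_pending(pending, task):
    """The fixed pattern: O(1) append; duplicates are tolerated because
    already-running tasks are skipped when dequeued."""
    pending.append(task)

pending = []
for t in range(1000):
    add_pending(pending, t)
assert len(pending) == 1000
```

With 200k tasks, the difference is roughly 200k scans of a 200k-element buffer versus 200k constant-time appends, which matches the minutes-long hang reported in the jstack.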
[jira] [Comment Edited] (SPARK-13275) With dynamic allocation, executors appear to be added before job starts
[ https://issues.apache.org/jira/browse/SPARK-13275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15151052#comment-15151052 ] Stephanie Bodoff edited comment on SPARK-13275 at 2/17/16 9:24 PM: --- It's a UI problem (see the screenshot webui.png). The left edge of the executor boxes should be to the right of the left edge of the job box so that they look like they are started after the job. Alternatively, the left edge of the job box could be drawn more to the left. was (Author: sbodoff): It's a UI problem. The left edge of the executor boxes should be to the right of the left edge of the job box so that they look like they are started after the job. Alternatively, the left edge of the job box could be drawn more to the left. > With dynamic allocation, executors appear to be added before job starts > --- > > Key: SPARK-13275 > URL: https://issues.apache.org/jira/browse/SPARK-13275 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.5.0 >Reporter: Stephanie Bodoff >Priority: Minor > Attachments: webui.png > > > When I look at the timeline in the Spark Web UI I see the job starting and > then executors being added. The blue lines and dots hitting the timeline show > that the executors were added after the job started. But the way the Executor > box is rendered it looks like the executors started before the job. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9926) Parallelize file listing for partitioned Hive table
[ https://issues.apache.org/jira/browse/SPARK-9926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15151211#comment-15151211 ] Ryan Blue commented on SPARK-9926: -- I've just posted [PR #11242|https://github.com/apache/spark/pull/11242] that fixes this issue using the UnionRDD fix, but doesn't include the code for SPARK-10340, which is addressed by HADOOP-12810. > Parallelize file listing for partitioned Hive table > --- > > Key: SPARK-9926 > URL: https://issues.apache.org/jira/browse/SPARK-9926 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.4.1, 1.5.0 >Reporter: Cheolsoo Park >Assignee: Cheolsoo Park > > In Spark SQL, short queries like {{select * from table limit 10}} run very > slowly against partitioned Hive tables because of file listing. In > particular, if a large number of partitions are scanned on storage like S3, > the queries run extremely slowly. Here are some example benchmarks in my > environment- > * Parquet-backed Hive table > * Partitioned by dateint and hour > * Stored on S3 > ||\# of partitions||\# of files||runtime||query|| > |1|972|30 secs|select * from nccp_log where dateint=20150601 and hour=0 limit > 10;| > |24|13646|6 mins|select * from nccp_log where dateint=20150601 limit 10;| > |240|136222|1 hour|select * from nccp_log where dateint>=20150601 and > dateint<=20150610 limit 10;| > The problem is that {{TableReader}} constructs a separate HadoopRDD per Hive > partition path and group them into a UnionRDD. Then, all the input files are > listed sequentially. In other tools such as Hive and Pig, this can be solved > by setting > [mapreduce.input.fileinputformat.list-status.num-threads|https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml] > high. But in Spark, since each HadoopRDD lists only one partition path, > setting this property doesn't help. 
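The idea behind the fix can be sketched as follows. This is an assumed, simplified illustration (not the PR's code): list many partition directories concurrently with a thread pool, analogous to what `mapreduce.input.fileinputformat.list-status.num-threads` enables when a single input format sees all paths at once.

```python
# Sketch: parallel file listing across partition directories.
import os
from concurrent.futures import ThreadPoolExecutor

def list_partitions(partition_dirs, num_threads=8):
    """Return all file paths under the given partition directories,
    issuing the per-directory listings concurrently."""
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        listings = pool.map(os.listdir, partition_dirs)
    return [os.path.join(d, f)
            for d, files in zip(partition_dirs, listings)
            for f in files]
```

On high-latency storage such as S3, where each listing call is a round trip, this turns 240 sequential listings into ceil(240 / num_threads) waves, which is where the benchmark's hour-long runtime goes.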
[jira] [Commented] (SPARK-9926) Parallelize file listing for partitioned Hive table
[ https://issues.apache.org/jira/browse/SPARK-9926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15151208#comment-15151208 ] Apache Spark commented on SPARK-9926: - User 'rdblue' has created a pull request for this issue: https://github.com/apache/spark/pull/11242 > Parallelize file listing for partitioned Hive table > --- > > Key: SPARK-9926 > URL: https://issues.apache.org/jira/browse/SPARK-9926 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.4.1, 1.5.0 >Reporter: Cheolsoo Park >Assignee: Cheolsoo Park > > In Spark SQL, short queries like {{select * from table limit 10}} run very > slowly against partitioned Hive tables because of file listing. In > particular, if a large number of partitions are scanned on storage like S3, > the queries run extremely slowly. Here are some example benchmarks in my > environment- > * Parquet-backed Hive table > * Partitioned by dateint and hour > * Stored on S3 > ||\# of partitions||\# of files||runtime||query|| > |1|972|30 secs|select * from nccp_log where dateint=20150601 and hour=0 limit > 10;| > |24|13646|6 mins|select * from nccp_log where dateint=20150601 limit 10;| > |240|136222|1 hour|select * from nccp_log where dateint>=20150601 and > dateint<=20150610 limit 10;| > The problem is that {{TableReader}} constructs a separate HadoopRDD per Hive > partition path and group them into a UnionRDD. Then, all the input files are > listed sequentially. In other tools such as Hive and Pig, this can be solved > by setting > [mapreduce.input.fileinputformat.list-status.num-threads|https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml] > high. But in Spark, since each HadoopRDD lists only one partition path, > setting this property doesn't help. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13364) history server application column not sorting properly
Thomas Graves created SPARK-13364: - Summary: history server application column not sorting properly Key: SPARK-13364 URL: https://issues.apache.org/jira/browse/SPARK-13364 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 2.0.0 Reporter: Thomas Graves The new history server is using DataTables, and the application column isn't sorting entries properly. It's not sorting the last _X part right. Below is an example where the 30174 should be before 30149: application_1453493359692_30149 application_1453493359692_30174 I'm guessing it's sorting using the string rather than just the application id. application_1453493359692_30029
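A fix along the lines this issue suggests can be sketched with a sort key (hypothetical, not the history server's code): split the application ID on its underscores and compare the trailing sequence number numerically rather than as part of one string.

```python
# Sketch: numeric sort key for Spark application IDs of the form
# application_<clusterTimestamp>_<sequence>.

def app_id_key(app_id):
    # "application_1453493359692_30149" -> ("application", 1453493359692, 30149)
    prefix, cluster_ts, seq = app_id.rsplit("_", 2)
    return (prefix, int(cluster_ts), int(seq))

ids = ["application_1453493359692_30174",
       "application_1453493359692_30149",
       "application_1453493359692_30029"]
assert sorted(ids, key=app_id_key) == [
    "application_1453493359692_30029",
    "application_1453493359692_30149",
    "application_1453493359692_30174",
]
```

String sorting and numeric sorting diverge as soon as the sequence numbers have different digit counts (e.g. `_9999` sorts after `_30029` as a string but before it as a number), which is the class of misordering a per-column comparator avoids.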
[jira] [Commented] (SPARK-12449) Pushing down arbitrary logical plans to data sources
[ https://issues.apache.org/jira/browse/SPARK-12449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15151116#comment-15151116 ] Max Seiden commented on SPARK-12449: [~rxin] Given that predicate pushdown via `sources.Filter` is (afaik) a stable API, conceivably that model could be extended to support ever richer operations (i.e. sources.Expression, sources.Limit, sources.Join, sources.Aggregation). In this case, the stable APIs remain a derivative of the Catalyst plans and all that needs to change between releases is the compilation from Catalyst => Sources. cc [~marmbrus] since we talked briefly about this idea in person at Spark Summit > Pushing down arbitrary logical plans to data sources > > > Key: SPARK-12449 > URL: https://issues.apache.org/jira/browse/SPARK-12449 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Stephan Kessler > Attachments: pushingDownLogicalPlans.pdf > > > With the help of the DataSource API we can pull data from external sources > for processing. Implementing interfaces such as {{PrunedFilteredScan}} allows > to push down filters and projects pruning unnecessary fields and rows > directly in the data source. > However, data sources such as SQL Engines are capable of doing even more > preprocessing, e.g., evaluating aggregates. This is beneficial because it > would reduce the amount of data transferred from the source to Spark. The > existing interfaces do not allow such kind of processing in the source. > We would propose to add a new interface {{CatalystSource}} that allows to > defer the processing of arbitrary logical plans to the data source. We have > already shown the details at the Spark Summit 2015 Europe > [https://spark-summit.org/eu-2015/events/the-pushdown-of-everything/] > I will add a design document explaining details. 
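The comment's point about a stable source-side API can be sketched in miniature. Everything here is hypothetical illustration (the class and function names are invented, and real `sources.Filter` pushdown works on Catalyst plans, not strings): the source sees only small, stable filter objects, and only the compilation layer from the engine's plan into those objects needs to track engine releases.

```python
# Sketch: a tiny stable filter representation plus a compiler that turns
# it into a WHERE clause for a JDBC-like source (numeric literals only).

class EqualTo:
    def __init__(self, attr, value):
        self.attr, self.value = attr, value

class GreaterThan:
    def __init__(self, attr, value):
        self.attr, self.value = attr, value

def compile_filters(filters):
    """Compile stable filter objects into SQL pushed to the source."""
    parts = []
    for f in filters:
        if isinstance(f, EqualTo):
            parts.append(f"{f.attr} = {f.value}")
        elif isinstance(f, GreaterThan):
            parts.append(f"{f.attr} > {f.value}")
    return "WHERE " + " AND ".join(parts)

assert compile_filters([EqualTo("dateint", 20150601), GreaterThan("hour", 5)]) \
    == "WHERE dateint = 20150601 AND hour > 5"
```

Extending the same pattern with `Limit`, `Join`, or `Aggregation` objects, as the comment proposes, grows the stable vocabulary without ever exposing the engine's internal plan nodes to data sources.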
[jira] [Created] (SPARK-13363) Aggregator not working with DataFrame
koert kuipers created SPARK-13363: - Summary: Aggregator not working with DataFrame Key: SPARK-13363 URL: https://issues.apache.org/jira/browse/SPARK-13363 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: koert kuipers Priority: Minor The org.apache.spark.sql.expressions.Aggregator doc/comments say: A base class for user-defined aggregations, which can be used in [[DataFrame]] and [[Dataset]] It works well with Dataset/GroupedDataset, but I am having no luck using it with DataFrame/GroupedData. Does anyone have an example of how to use it with a DataFrame? In particular I would like to use it with this method in GroupedData: {noformat} def agg(expr: Column, exprs: Column*): DataFrame {noformat} Clearly it should be possible, since GroupedDataset uses that very same method to do the work: {noformat} private def agg(exprs: Column*): DataFrame = groupedData.agg(withEncoder(exprs.head), exprs.tail.map(withEncoder): _*) {noformat} The trick seems to be the wrapping in withEncoder, which is private. I tried to do something like it myself, but I had no luck since it uses more private stuff in TypedColumn. Anyhow, my attempt at using it in a DataFrame: {noformat} val simpleSum = new SqlAggregator[Int, Int, Int] { def zero: Int = 0 // The initial value. def reduce(b: Int, a: Int) = b + a // Add an element to the running total def merge(b1: Int, b2: Int) = b1 + b2 // Merge intermediate values. def finish(b: Int) = b // Return the final result. 
}.toColumn val df = sc.makeRDD(1 to 3).map(i => (i, i)).toDF("k", "v") df.groupBy("k").agg(simpleSum).show {noformat} and the resulting error: {noformat} org.apache.spark.sql.AnalysisException: unresolved operator 'Aggregate [k#104], [k#104,($anon$3(),mode=Complete,isDistinct=false) AS sum#106]; at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38) at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:46) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:241) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:50) at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:122) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:50) at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:46) at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:34) at org.apache.spark.sql.DataFrame.(DataFrame.scala:130) at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:49) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-13350) Configuration documentation incorrectly states that PYSPARK_PYTHON's default is "python"
[ https://issues.apache.org/jira/browse/SPARK-13350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-13350. Resolution: Fixed Fix Version/s: 1.6.1 2.0.0 Issue resolved by pull request 11239 [https://github.com/apache/spark/pull/11239] > Configuration documentation incorrectly states that PYSPARK_PYTHON's default > is "python" > > > Key: SPARK-13350 > URL: https://issues.apache.org/jira/browse/SPARK-13350 > Project: Spark > Issue Type: Documentation > Components: Documentation >Reporter: Christopher Aycock >Assignee: Christopher Aycock >Priority: Trivial > Labels: newbie > Fix For: 2.0.0, 1.6.1 > > Original Estimate: 24h > Remaining Estimate: 24h > > The configuration documentation states that the environment variable > PYSPARK_PYTHON has a default value of {{python}}: > http://spark.apache.org/docs/latest/configuration.html > In fact, the default is {{python2.7}}: > https://github.com/apache/spark/blob/4f60651cbec1b4c9cc2e6d832ace77e89a233f3a/bin/pyspark#L39-L45 > The change that introduced this was discussed here: > https://github.com/apache/spark/pull/2651 > Would it be possible to highlight this in the documentation? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13350) Configuration documentation incorrectly states that PYSPARK_PYTHON's default is "python"
[ https://issues.apache.org/jira/browse/SPARK-13350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-13350: --- Assignee: Christopher Aycock > Configuration documentation incorrectly states that PYSPARK_PYTHON's default > is "python" > > > Key: SPARK-13350 > URL: https://issues.apache.org/jira/browse/SPARK-13350 > Project: Spark > Issue Type: Documentation > Components: Documentation >Reporter: Christopher Aycock >Assignee: Christopher Aycock >Priority: Trivial > Labels: newbie > Original Estimate: 24h > Remaining Estimate: 24h > > The configuration documentation states that the environment variable > PYSPARK_PYTHON has a default value of {{python}}: > http://spark.apache.org/docs/latest/configuration.html > In fact, the default is {{python2.7}}: > https://github.com/apache/spark/blob/4f60651cbec1b4c9cc2e6d832ace77e89a233f3a/bin/pyspark#L39-L45 > The change that introduced this was discussed here: > https://github.com/apache/spark/pull/2651 > Would it be possible to highlight this in the documentation? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9273) Add Convolutional Neural network to Spark MLlib
[ https://issues.apache.org/jira/browse/SPARK-9273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15151015#comment-15151015 ] Sean Owen commented on SPARK-9273: -- No, I mean that I expect it will start life as an external package. In any event, I want to at least convene any related discussion in 1 JIRA, not N. CNN is a type of ANN so I don't see a value in discussing it separately. > Add Convolutional Neural network to Spark MLlib > --- > > Key: SPARK-9273 > URL: https://issues.apache.org/jira/browse/SPARK-9273 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: yuhao yang >Assignee: yuhao yang > > Add Convolutional Neural network to Spark MLlib -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9273) Add Convolutional Neural network to Spark MLlib
[ https://issues.apache.org/jira/browse/SPARK-9273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15150960#comment-15150960 ] Alexander Ulanov commented on SPARK-9273: - [~srowen] Do you mean that CNN will never be merged into Spark ML? > Add Convolutional Neural network to Spark MLlib > --- > > Key: SPARK-9273 > URL: https://issues.apache.org/jira/browse/SPARK-9273 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: yuhao yang >Assignee: yuhao yang > > Add Convolutional Neural network to Spark MLlib -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9844) File appender race condition during SparkWorker shutdown
[ https://issues.apache.org/jira/browse/SPARK-9844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15150913#comment-15150913 ] Bryan Cutler commented on SPARK-9844: - This error is benign for the most part, once it gets here, the worker is already being shut down. So it is probably something else that is causing your worker to shut down. > File appender race condition during SparkWorker shutdown > > > Key: SPARK-9844 > URL: https://issues.apache.org/jira/browse/SPARK-9844 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.3.0, 1.4.0 >Reporter: Alex Liu >Assignee: Bryan Cutler > Fix For: 1.6.1, 2.0.0 > > > We find this issue still exists in 1.3.1 > {code} > ERROR [Thread-6] 2015-07-28 22:49:57,653 SparkWorker-0 ExternalLogger.java:96 > - Error writing stream to file > /var/lib/spark/worker/worker-0/app-20150728224954-0003/0/stderr > ERROR [Thread-6] 2015-07-28 22:49:57,653 SparkWorker-0 ExternalLogger.java:96 > - java.io.IOException: Stream closed > ERROR [Thread-6] 2015-07-28 22:49:57,654 SparkWorker-0 ExternalLogger.java:96 > - at > java.io.BufferedInputStream.getBufIfOpen(BufferedInputStream.java:170) > ~[na:1.8.0_40] > ERROR [Thread-6] 2015-07-28 22:49:57,654 SparkWorker-0 ExternalLogger.java:96 > - at java.io.BufferedInputStream.read1(BufferedInputStream.java:283) > ~[na:1.8.0_40] > ERROR [Thread-6] 2015-07-28 22:49:57,654 SparkWorker-0 ExternalLogger.java:96 > - at java.io.BufferedInputStream.read(BufferedInputStream.java:345) > ~[na:1.8.0_40] > ERROR [Thread-6] 2015-07-28 22:49:57,654 SparkWorker-0 ExternalLogger.java:96 > - at java.io.FilterInputStream.read(FilterInputStream.java:107) > ~[na:1.8.0_40] > ERROR [Thread-6] 2015-07-28 22:49:57,655 SparkWorker-0 ExternalLogger.java:96 > - at > org.apache.spark.util.logging.FileAppender.appendStreamToFile(FileAppender.scala:70) > ~[spark-core_2.10-1.3.1.1.jar:1.3.1.1] > ERROR [Thread-6] 2015-07-28 22:49:57,655 SparkWorker-0 ExternalLogger.java:96 > - at > 
org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply$mcV$sp(FileAppender.scala:39) > [spark-core_2.10-1.3.1.1.jar:1.3.1.1] > ERROR [Thread-6] 2015-07-28 22:49:57,655 SparkWorker-0 ExternalLogger.java:96 > - at > org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39) > [spark-core_2.10-1.3.1.1.jar:1.3.1.1] > ERROR [Thread-6] 2015-07-28 22:49:57,655 SparkWorker-0 ExternalLogger.java:96 > - at > org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39) > [spark-core_2.10-1.3.1.1.jar:1.3.1.1] > ERROR [Thread-6] 2015-07-28 22:49:57,655 SparkWorker-0 ExternalLogger.java:96 > - at > org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1618) > [spark-core_2.10-1.3.1.1.jar:1.3.1.1] > ERROR [Thread-6] 2015-07-28 22:49:57,656 SparkWorker-0 ExternalLogger.java:96 > - at > org.apache.spark.util.logging.FileAppender$$anon$1.run(FileAppender.scala:38) > [spark-core_2.10-1.3.1.1.jar:1.3.1.1] > {code} > at > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/worker/ExecutorRunner.scala#L159 > The process auto shuts down, but the log appenders are still running, which > causes the error log messages. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13349) adding a split and union to a streaming application cause big performance hit
[ https://issues.apache.org/jira/browse/SPARK-13349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15150893#comment-15150893 ] krishna ramachandran commented on SPARK-13349: -- I have a simple synthetic example below. I created 2 "raw streams", and job 1 is materialized when stream1 is output (some action like print/save). In job1: val stream1 = ssc.union(rawStreams).filter(_.contains("Stream:first")) save.stream1 ... .. In job2, create another split using rawStreams and union it with stream1: val stream2 = ssc.union(rawStreams).filter(_.contains("Batch:second")) val stream3 = stream1.union(stream2) .. save.stream3 job2 is materialized and executed. This pattern is executed for every batch. Looking at the visual DAG I see that job1 executes the first graph and job2 computes both "stream1" and "stream2". Caching DStream stream1 (the result from job1) makes job2 go almost twice as fast. In our real app, we have 7 such jobs per batch and typically we union the output of job5 with job1. That is, we union the output of job1 with the stream generated during job5. Caching and reusing the output of job1 (stream1) is very efficient (per-batch execution is 2.5 times faster) - but we start seeing out of memory errors. I would like to be able to "unpersist" stream1 after the union (for that batch) > adding a split and union to a streaming application cause big performance hit > - > > Key: SPARK-13349 > URL: https://issues.apache.org/jira/browse/SPARK-13349 > Project: Spark > Issue Type: Improvement >Affects Versions: 1.4.1 >Reporter: krishna ramachandran >Priority: Critical > Fix For: 1.4.2 > > > We have a streaming application containing approximately 12 jobs every batch, > running in streaming mode (4 sec batches). Each job writes output to cassandra > each job can contain several stages. > job 1 > ---> receive Stream A --> map --> filter -> (union with another stream B) --> > map --> groupbykey --> transform --> reducebykey --> map > we go thro' few more jobs of transforms and save to database. 
> Around stage 5, we union the output of Dstream from job 1 (in red) with > another stream (generated by split during job 2) and save that state > It appears the whole execution thus far is repeated which is redundant (I can > see this in execution graph & also performance -> processing time). > Processing time per batch nearly doubles or triples. > This additional & redundant processing cause each batch to run as much as 2.5 > times slower compared to runs without the union - union for most batches does > not alter the original DStream (union with an empty set). If I cache the > DStream from job 1(red block output), performance improves substantially but > hit out of memory errors within few hours. > What is the recommended way to cache/unpersist in such a scenario? there is > no dstream level "unpersist" > setting "spark.streaming.unpersist" to true and > streamingContext.remember("duration") did not help. Still seeing out of > memory errors -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
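The caching trade-off described in this issue can be shown with a toy model (plain Python, not Spark; all names are invented for illustration): a stream consumed by two jobs is recomputed per consumer unless cached, while caching computes it once per batch but holds memory until an "unpersist" releases it.

```python
# Toy model of per-batch recomputation vs. caching for a shared stream.

compute_count = 0

def compute_stream1(batch):
    """Stand-in for the union/filter work of job 1."""
    global compute_count
    compute_count += 1
    return [x for x in batch if x % 2 == 0]

cache = {}

def cached_stream1(batch_id, batch):
    if batch_id not in cache:          # persist() analogue
        cache[batch_id] = compute_stream1(batch)
    return cache[batch_id]

batch = list(range(10))
job1 = cached_stream1(0, batch)        # job 1 materializes stream1
job2 = job1 + cached_stream1(0, batch) # job 2 reuses it: no recompute
cache.pop(0, None)                     # "unpersist" once the batch is done
assert compute_count == 1              # computed once, not twice
```

The out-of-memory behavior reported here corresponds to never executing the `cache.pop` step: each batch's cached result accumulates, which is why a per-batch, DStream-level unpersist is what the reporter is asking for.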
[jira] [Reopened] (SPARK-12675) Executor dies because of ClassCastException and causes timeout
[ https://issues.apache.org/jira/browse/SPARK-12675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexandru Rosianu reopened SPARK-12675: --- Reopening because other users are still reporting this. > Executor dies because of ClassCastException and causes timeout > -- > > Key: SPARK-12675 > URL: https://issues.apache.org/jira/browse/SPARK-12675 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.0, 2.0.0 > Environment: 64-bit Linux Ubuntu 15.10, 16GB RAM, 8 cores 3ghz >Reporter: Alexandru Rosianu >Priority: Minor > > I'm trying to fit a Spark ML pipeline but my executor dies. Here's the script > which doesn't work (a bit simplified): > {code:title=Script.scala} > // Prepare data sets > logInfo("Getting datasets") > val emoTrainingData = > sqlc.read.parquet("/tw/sentiment/emo/parsed/data.parquet") > val trainingData = emoTrainingData > // Configure the pipeline > val pipeline = new Pipeline().setStages(Array( > new > FeatureReducer().setInputCol("raw_text").setOutputCol("reduced_text"), > new StringSanitizer().setInputCol("reduced_text").setOutputCol("text"), > new Tokenizer().setInputCol("text").setOutputCol("raw_words"), > new StopWordsRemover().setInputCol("raw_words").setOutputCol("words"), > new HashingTF().setInputCol("words").setOutputCol("features"), > new NaiveBayes().setSmoothing(0.5).setFeaturesCol("features"), > new ColumnDropper().setDropColumns("raw_text", "reduced_text", "text", > "raw_words", "words", "features") > )) > // Fit the pipeline > logInfo(s"Training model on ${trainingData.count()} rows") > val model = pipeline.fit(trainingData) > {code} > It executes up to the last line. It prints "Training model on xx rows", then > it starts fitting, the executor dies, the drivers doesn't receive heartbeats > from the executor and it times out, then the script exits. It doesn't get > past that line. 
> This is the exception that kills the executor: > {code} > java.io.IOException: java.lang.ClassCastException: cannot assign instance > of scala.collection.immutable.HashMap$SerializationProxy to field > org.apache.spark.executor.TaskMetrics._accumulatorUpdates of type > scala.collection.immutable.Map in instance of > org.apache.spark.executor.TaskMetrics > at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1207) > at > org.apache.spark.executor.TaskMetrics.readObject(TaskMetrics.scala:219) > at sun.reflect.GeneratedMethodAccessor15.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:497) > at > java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1058) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1900) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351) > at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371) > at org.apache.spark.util.Utils$.deserialize(Utils.scala:92) > at > org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$reportHeartBeat$1$$anonfun$apply$6.apply(Executor.scala:436) > at > org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$reportHeartBeat$1$$anonfun$apply$6.apply(Executor.scala:426) > at scala.Option.foreach(Option.scala:257) > at > org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$reportHeartBeat$1.apply(Executor.scala:426) > at > org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$reportHeartBeat$1.apply(Executor.scala:424) > at scala.collection.Iterator$class.foreach(Iterator.scala:742) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1194) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at 
scala.collection.AbstractIterable.foreach(Iterable.scala:54) > at > org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:424) > at > org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:468) > at > org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:468) > at > org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:468) > at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1741) > at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:468) > at >
[jira] [Commented] (SPARK-12675) Executor dies because of ClassCastException and causes timeout
[ https://issues.apache.org/jira/browse/SPARK-12675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15150872#comment-15150872 ] Sven Krasser commented on SPARK-12675: -- More findings (Spark 1.6.0): For our initial 200 partition use case, reducing it to 2 partitions temporarily suppressed the problems. After adding a couple of additional joins into the dataflow, we now even see this issue with just 2 partitions. My suspicion is that stages with empty tasks contribute to this condition occurring. [~aluxian], can you reopen? As reporter you should have permissions to do that. Thanks everyone! > Executor dies because of ClassCastException and causes timeout > -- > > Key: SPARK-12675 > URL: https://issues.apache.org/jira/browse/SPARK-12675 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.0, 2.0.0 > Environment: 64-bit Linux Ubuntu 15.10, 16GB RAM, 8 cores 3ghz >Reporter: Alexandru Rosianu >Priority: Minor > > I'm trying to fit a Spark ML pipeline but my executor dies. 
Here's the script > which doesn't work (a bit simplified): > {code:title=Script.scala} > // Prepare data sets > logInfo("Getting datasets") > val emoTrainingData = > sqlc.read.parquet("/tw/sentiment/emo/parsed/data.parquet") > val trainingData = emoTrainingData > // Configure the pipeline > val pipeline = new Pipeline().setStages(Array( > new > FeatureReducer().setInputCol("raw_text").setOutputCol("reduced_text"), > new StringSanitizer().setInputCol("reduced_text").setOutputCol("text"), > new Tokenizer().setInputCol("text").setOutputCol("raw_words"), > new StopWordsRemover().setInputCol("raw_words").setOutputCol("words"), > new HashingTF().setInputCol("words").setOutputCol("features"), > new NaiveBayes().setSmoothing(0.5).setFeaturesCol("features"), > new ColumnDropper().setDropColumns("raw_text", "reduced_text", "text", > "raw_words", "words", "features") > )) > // Fit the pipeline > logInfo(s"Training model on ${trainingData.count()} rows") > val model = pipeline.fit(trainingData) > {code} > It executes up to the last line. It prints "Training model on xx rows", then > it starts fitting, the executor dies, the driver doesn't receive heartbeats > from the executor and it times out, then the script exits. It doesn't get > past that line. 
> This is the exception that kills the executor: > {code} > java.io.IOException: java.lang.ClassCastException: cannot assign instance > of scala.collection.immutable.HashMap$SerializationProxy to field > org.apache.spark.executor.TaskMetrics._accumulatorUpdates of type > scala.collection.immutable.Map in instance of > org.apache.spark.executor.TaskMetrics > at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1207) > at > org.apache.spark.executor.TaskMetrics.readObject(TaskMetrics.scala:219) > at sun.reflect.GeneratedMethodAccessor15.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:497) > at > java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1058) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1900) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351) > at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371) > at org.apache.spark.util.Utils$.deserialize(Utils.scala:92) > at > org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$reportHeartBeat$1$$anonfun$apply$6.apply(Executor.scala:436) > at > org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$reportHeartBeat$1$$anonfun$apply$6.apply(Executor.scala:426) > at scala.Option.foreach(Option.scala:257) > at > org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$reportHeartBeat$1.apply(Executor.scala:426) > at > org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$reportHeartBeat$1.apply(Executor.scala:424) > at scala.collection.Iterator$class.foreach(Iterator.scala:742) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1194) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at 
scala.collection.AbstractIterable.foreach(Iterable.scala:54) > at > org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:424) > at > org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:468) > at
[jira] [Commented] (SPARK-13328) Possible poor read performance for broadcast variables with dynamic resource allocation
[ https://issues.apache.org/jira/browse/SPARK-13328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15150863#comment-15150863 ] Apache Spark commented on SPARK-13328: -- User 'nezihyigitbasi' has created a pull request for this issue: https://github.com/apache/spark/pull/11241 > Possible poor read performance for broadcast variables with dynamic resource > allocation > --- > > Key: SPARK-13328 > URL: https://issues.apache.org/jira/browse/SPARK-13328 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.5.2 >Reporter: Nezih Yigitbasi > > When dynamic resource allocation is enabled, fetching broadcast variables from > removed executors was causing job failures, and SPARK-9591 fixed this problem > by trying all locations of a block before giving up. However, the locations > of a block are retrieved only once from the driver in this process, and the > locations in this list can be stale due to dynamic resource allocation. This > situation gets worse when running on a large cluster, as this location list > can be on the order of several hundred entries, of which there may > be tens of stale ones. What we have observed is that with the default settings > of 3 max retries and 5s between retries (that's 15s per location), the time it > takes to read a broadcast variable can be as high as ~17m (the log below shows > the failed 70th block fetch attempt, where each attempt takes 15s) > {code} > ... > 16/02/13 01:02:27 WARN storage.BlockManager: Failed to fetch remote block > broadcast_18_piece0 from BlockManagerId(8, ip-10-178-77-38.ec2.internal, > 60675) (failed attempt 70) > ... > 16/02/13 01:02:27 INFO broadcast.TorrentBroadcast: Reading broadcast variable > 18 took 1051049 ms > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
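The fix direction can be sketched in plain Scala. This is a hypothetical simplification, not Spark's actual `BlockManager` / `TorrentBroadcast` code: `getLocations` and `fetchFrom` are stand-ins for the driver RPC and the remote block fetch, and the only point illustrated is that refreshing the location list between retry rounds lets executors removed by dynamic allocation drop out of the list instead of each consuming the full retry budget.

```scala
// Hypothetical sketch: retry block fetches, but re-ask the driver for the
// current location list between rounds so stale executors disappear.
object BroadcastFetchSketch {
  type Location = String

  def fetchWithRefresh(
      getLocations: () => Seq[Location],          // stand-in for driver RPC
      fetchFrom: Location => Option[Array[Byte]], // stand-in for remote fetch
      maxRounds: Int = 3): Option[Array[Byte]] = {
    // Lazily walk up to maxRounds refreshed location lists and stop at the
    // first successful fetch.
    val attempts = (1 to maxRounds).iterator.flatMap { _ =>
      getLocations().iterator.flatMap(loc => fetchFrom(loc).iterator)
    }
    if (attempts.hasNext) Some(attempts.next()) else None
  }
}
```

With this shape, a location that dynamic allocation has removed simply stops appearing in `getLocations()` on the next round, rather than being retried at 5s apiece dozens of times.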
[jira] [Assigned] (SPARK-13328) Possible poor read performance for broadcast variables with dynamic resource allocation
[ https://issues.apache.org/jira/browse/SPARK-13328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13328: Assignee: Apache Spark > Possible poor read performance for broadcast variables with dynamic resource > allocation > --- > > Key: SPARK-13328 > URL: https://issues.apache.org/jira/browse/SPARK-13328 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.5.2 >Reporter: Nezih Yigitbasi >Assignee: Apache Spark
[jira] [Assigned] (SPARK-13328) Possible poor read performance for broadcast variables with dynamic resource allocation
[ https://issues.apache.org/jira/browse/SPARK-13328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13328: Assignee: (was: Apache Spark) > Possible poor read performance for broadcast variables with dynamic resource > allocation > --- > > Key: SPARK-13328 > URL: https://issues.apache.org/jira/browse/SPARK-13328 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.5.2 >Reporter: Nezih Yigitbasi
[jira] [Commented] (SPARK-9844) File appender race condition during SparkWorker shutdown
[ https://issues.apache.org/jira/browse/SPARK-9844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15150862#comment-15150862 ] Marcelo Balloni Gomes commented on SPARK-9844: -- Is there any way of avoiding this error in 1.4.1? All my workers are being shut down a few times a day. I was thinking about turning off the log appenders, however I'm not sure that it would be of any help. > File appender race condition during SparkWorker shutdown > > > Key: SPARK-9844 > URL: https://issues.apache.org/jira/browse/SPARK-9844 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.3.0, 1.4.0 >Reporter: Alex Liu >Assignee: Bryan Cutler > Fix For: 1.6.1, 2.0.0 > > > We find this issue still exists in 1.3.1 > {code} > ERROR [Thread-6] 2015-07-28 22:49:57,653 SparkWorker-0 ExternalLogger.java:96 > - Error writing stream to file > /var/lib/spark/worker/worker-0/app-20150728224954-0003/0/stderr > ERROR [Thread-6] 2015-07-28 22:49:57,653 SparkWorker-0 ExternalLogger.java:96 > - java.io.IOException: Stream closed > ERROR [Thread-6] 2015-07-28 22:49:57,654 SparkWorker-0 ExternalLogger.java:96 > - at > java.io.BufferedInputStream.getBufIfOpen(BufferedInputStream.java:170) > ~[na:1.8.0_40] > ERROR [Thread-6] 2015-07-28 22:49:57,654 SparkWorker-0 ExternalLogger.java:96 > - at java.io.BufferedInputStream.read1(BufferedInputStream.java:283) > ~[na:1.8.0_40] > ERROR [Thread-6] 2015-07-28 22:49:57,654 SparkWorker-0 ExternalLogger.java:96 > - at java.io.BufferedInputStream.read(BufferedInputStream.java:345) > ~[na:1.8.0_40] > ERROR [Thread-6] 2015-07-28 22:49:57,654 SparkWorker-0 ExternalLogger.java:96 > - at java.io.FilterInputStream.read(FilterInputStream.java:107) > ~[na:1.8.0_40] > ERROR [Thread-6] 2015-07-28 22:49:57,655 SparkWorker-0 ExternalLogger.java:96 > - at > org.apache.spark.util.logging.FileAppender.appendStreamToFile(FileAppender.scala:70) > ~[spark-core_2.10-1.3.1.1.jar:1.3.1.1] > ERROR [Thread-6] 2015-07-28 22:49:57,655 
SparkWorker-0 ExternalLogger.java:96 > - at > org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply$mcV$sp(FileAppender.scala:39) > [spark-core_2.10-1.3.1.1.jar:1.3.1.1] > ERROR [Thread-6] 2015-07-28 22:49:57,655 SparkWorker-0 ExternalLogger.java:96 > - at > org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39) > [spark-core_2.10-1.3.1.1.jar:1.3.1.1] > ERROR [Thread-6] 2015-07-28 22:49:57,655 SparkWorker-0 ExternalLogger.java:96 > - at > org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39) > [spark-core_2.10-1.3.1.1.jar:1.3.1.1] > ERROR [Thread-6] 2015-07-28 22:49:57,655 SparkWorker-0 ExternalLogger.java:96 > - at > org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1618) > [spark-core_2.10-1.3.1.1.jar:1.3.1.1] > ERROR [Thread-6] 2015-07-28 22:49:57,656 SparkWorker-0 ExternalLogger.java:96 > - at > org.apache.spark.util.logging.FileAppender$$anon$1.run(FileAppender.scala:38) > [spark-core_2.10-1.3.1.1.jar:1.3.1.1] > {code} > at > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/worker/ExecutorRunner.scala#L159 > The process auto shuts down, but the log appenders are still running, which > causes the error log messages. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
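The shutdown race can be illustrated with a small plain-Scala sketch. This is a simplification of the direction the SPARK-9844 fix took, not the actual `FileAppender` code: the appender thread keeps pumping the child process's stderr after shutdown closes the stream, so an `IOException` raised while the appender has already been asked to stop should be treated as a normal stop rather than logged as an error.

```scala
import java.io.IOException

// Simplified model of the FileAppender reading thread. `read` stands in for
// the process output stream, `write` for the log file.
object AppenderSketch {
  @volatile var markedForStop = false // set by the worker before destroying the process

  // Returns true for a clean finish, false for a genuine error.
  def pump(read: () => Int, write: Int => Unit): Boolean =
    try {
      var b = read()
      while (b != -1) { write(b); b = read() }
      true // clean EOF
    } catch {
      // "Stream closed" during an intentional stop is expected, not an error.
      case _: IOException if markedForStop => true
      case _: IOException => false
    }
}
```

The key design point is the stop flag checked in the exception handler: the error path only fires when the stream dies while the appender was still supposed to be running.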
[jira] [Commented] (SPARK-10340) Use S3 bulk listing for S3-backed Hive tables
[ https://issues.apache.org/jira/browse/SPARK-10340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15150799#comment-15150799 ] Ryan Blue commented on SPARK-10340: --- From discussion on the pull request, it looks like the solution is to add parallelism to UnionRDD rather than use the Qubole approach of listing a common prefix and filtering. I'm also linking to HADOOP-12810, which fixes a problem in FileSystem that was causing a FileStatus to be fetched for each file. The performance numbers show that both fixes address the performance problem without adding an alternate code path that avoids delegating to the InputFormat for split calculations on S3. I think the way forward is to get the UnionRDD patch committed and backport the fix for FileSystem. > Use S3 bulk listing for S3-backed Hive tables > - > > Key: SPARK-10340 > URL: https://issues.apache.org/jira/browse/SPARK-10340 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.4.1, 1.5.0 >Reporter: Cheolsoo Park >Assignee: Cheolsoo Park > > AWS S3 provides a bulk listing API. It takes the common prefix of all input > paths as a parameter and returns all the objects whose prefixes start with > the common prefix in blocks of 1000. > Since SPARK-9926 allows us to list multiple partitions all together, we can > significantly speed up input split calculation using S3 bulk listing. This > optimization is particularly useful for queries like {{select * from > partitioned_table limit 10}}. > This is a common optimization for S3. For example, here is a [blog > post|http://www.qubole.com/blog/product/optimizing-hadoop-for-s3-part-1/] > from Qubole on this topic. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13322) AFTSurvivalRegression should support feature standardization
[ https://issues.apache.org/jira/browse/SPARK-13322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-13322: -- Target Version/s: 2.0.0 > AFTSurvivalRegression should support feature standardization > > > Key: SPARK-13322 > URL: https://issues.apache.org/jira/browse/SPARK-13322 > Project: Spark > Issue Type: Bug > Components: ML >Reporter: Yanbo Liang >Assignee: Yanbo Liang > > This bug was reported by Stuti Awasthi. > https://www.mail-archive.com/user@spark.apache.org/msg45643.html > The lossSum can become infinite because we do not standardize the > features before fitting the model; we should support feature standardization. > Another benefit is that standardization will improve the convergence rate. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
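The proposed standardization step can be sketched in plain Scala (illustrative only; the real change lives inside AFTSurvivalRegression's cost function): scale each feature column to zero mean and unit variance before optimization, which bounds the terms entering the loss and improves conditioning.

```scala
// Column-wise standardization sketch: z = (x - mean) / stddev.
object StandardizeSketch {
  // Per-column (mean, stddev) over rows of equal length.
  def stats(rows: Seq[Array[Double]]): (Array[Double], Array[Double]) = {
    val n = rows.length.toDouble
    val dim = rows.head.length
    val mean = Array.tabulate(dim)(j => rows.map(_(j)).sum / n)
    val std = Array.tabulate(dim) { j =>
      math.sqrt(rows.map(r => math.pow(r(j) - mean(j), 2)).sum / n)
    }
    (mean, std)
  }

  // Constant columns (stddev 0) are mapped to 0 instead of dividing by zero.
  def standardize(rows: Seq[Array[Double]]): Seq[Array[Double]] = {
    val (mean, std) = stats(rows)
    rows.map { r =>
      Array.tabulate(r.length) { j =>
        if (std(j) == 0.0) 0.0 else (r(j) - mean(j)) / std(j)
      }
    }
  }
}
```

Coefficients learned on standardized features then need to be scaled back by 1/stddev (with the intercept adjusted) so the reported model stays on the original feature scale.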
[jira] [Updated] (SPARK-13322) AFTSurvivalRegression should support feature standardization
[ https://issues.apache.org/jira/browse/SPARK-13322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-13322: -- Assignee: Yanbo Liang > AFTSurvivalRegression should support feature standardization > > > Key: SPARK-13322 > URL: https://issues.apache.org/jira/browse/SPARK-13322 > Project: Spark > Issue Type: Bug > Components: ML >Reporter: Yanbo Liang >Assignee: Yanbo Liang
[jira] [Commented] (SPARK-13362) Build Error: java.lang.OutOfMemoryError: PermGen space
[ https://issues.apache.org/jira/browse/SPARK-13362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15150683#comment-15150683 ] Mohit Garg commented on SPARK-13362: thanks. > Build Error: java.lang.OutOfMemoryError: PermGen space > -- > > Key: SPARK-13362 > URL: https://issues.apache.org/jira/browse/SPARK-13362 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 1.5.2 > Environment: Processor: Intel® Core™ i5-5300U CPU @ 2.30GHz × 4 > OS Type: 64-bit > OS: Ubuntu 15.04 > JVM: OpenJDK 64-Bit Server VM (24.79-b02, mixed mode) > Java: version 1.7.0_79, vendor Oracle Corporation >Reporter: Mohit Garg >Priority: Trivial > Labels: build > Attachments: Error.png > > Original Estimate: 1h > Remaining Estimate: 1h > > While building spark from source: > > git clone https://github.com/apache/spark.git > > cd spark > > mvn -DskipTests clean package -e > ERROR: > [ERROR] PermGen space -> [Help 1] > java.lang.OutOfMemoryError: PermGen space > at java.lang.ClassLoader.defineClass1(Native Method) > at java.lang.ClassLoader.defineClass(ClassLoader.java:800) > at > java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142) > at java.net.URLClassLoader.defineClass(URLClassLoader.java:449) > at java.net.URLClassLoader.access$100(URLClassLoader.java:71) > at java.net.URLClassLoader$1.run(URLClassLoader.java:361) > at java.net.URLClassLoader$1.run(URLClassLoader.java:355) > at java.security.AccessController.doPrivileged(Native Method) > at java.net.URLClassLoader.findClass(URLClassLoader.java:354) > at java.lang.ClassLoader.loadClass(ClassLoader.java:425) > at java.lang.ClassLoader.loadClass(ClassLoader.java:358) > at java.lang.ClassLoader.defineClass1(Native Method) > at java.lang.ClassLoader.defineClass(ClassLoader.java:800) > at > java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142) > at java.net.URLClassLoader.defineClass(URLClassLoader.java:449) > at 
java.net.URLClassLoader.access$100(URLClassLoader.java:71) > at java.net.URLClassLoader$1.run(URLClassLoader.java:361) > at java.net.URLClassLoader$1.run(URLClassLoader.java:355) > at java.security.AccessController.doPrivileged(Native Method) > at java.net.URLClassLoader.findClass(URLClassLoader.java:354) > at java.lang.ClassLoader.loadClass(ClassLoader.java:425) > at java.lang.ClassLoader.loadClass(ClassLoader.java:358) > at java.lang.Class.getDeclaredMethods0(Native Method) > at java.lang.Class.privateGetDeclaredMethods(Class.java:2615) > at java.lang.Class.getDeclaredMethod(Class.java:2007) > at xsbt.CachedCompiler0$Compiler.superCall(CompilerInterface.scala:235) > at > xsbt.CachedCompiler0$Compiler.superComputePhaseDescriptors(CompilerInterface.scala:230) > at > xsbt.CachedCompiler0$Compiler.phaseDescriptors$lzycompute(CompilerInterface.scala:227) > at > xsbt.CachedCompiler0$Compiler.phaseDescriptors(CompilerInterface.scala:222) > at scala.tools.nsc.Global$Run.(Global.scala:1237) > at xsbt.CachedCompiler0$$anon$2.(CompilerInterface.scala:113) > at xsbt.CachedCompiler0.run(CompilerInterface.scala:113) > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13362) Build Error: java.lang.OutOfMemoryError: PermGen space
[ https://issues.apache.org/jira/browse/SPARK-13362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mohit Garg updated SPARK-13362: --- Attachment: Error.png VisualVM snapshot > Build Error: java.lang.OutOfMemoryError: PermGen space > -- > > Key: SPARK-13362 > URL: https://issues.apache.org/jira/browse/SPARK-13362 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 1.5.2 > Environment: Processor: Intel® Core™ i5-5300U CPU @ 2.30GHz × 4 > OS Type: 64-bit > OS: Ubuntu 15.04 > JVM: OpenJDK 64-Bit Server VM (24.79-b02, mixed mode) > Java: version 1.7.0_79, vendor Oracle Corporation >Reporter: Mohit Garg >Priority: Trivial > Labels: build > Attachments: Error.png
[jira] [Resolved] (SPARK-13362) Build Error: java.lang.OutOfMemoryError: PermGen space
[ https://issues.apache.org/jira/browse/SPARK-13362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-13362. --- Resolution: Not A Problem Fix Version/s: (was: 1.5.2) Please read the build docs. You didn't set your MAVEN_OPTS. Please also read https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark before opening a JIRA > Build Error: java.lang.OutOfMemoryError: PermGen space > -- > > Key: SPARK-13362 > URL: https://issues.apache.org/jira/browse/SPARK-13362 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 1.5.2 > Environment: Processor: Intel® Core™ i5-5300U CPU @ 2.30GHz × 4 > OS Type: 64-bit > OS: Ubuntu 15.04 > JVM: OpenJDK 64-Bit Server VM (24.79-b02, mixed mode) > Java: version 1.7.0_79, vendor Oracle Corporation >Reporter: Mohit Garg >Priority: Trivial > Labels: build
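For reference, the MAVEN_OPTS fix Sean points at looks roughly like the following. The exact values are an assumption taken from the Spark build documentation of that era (Java 7, where class metadata lives in PermGen); treat them as a starting point:

```shell
# Give Maven a bigger heap and PermGen before building Spark on Java 7.
export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512m -XX:ReservedCodeCacheSize=512m"
# then re-run the build:
# mvn -DskipTests clean package
```

On Java 8 and later the PermGen flag is ignored (class metadata moved to Metaspace), so only the heap and code-cache settings matter there.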
[jira] [Updated] (SPARK-13362) Build Error: java.lang.OutOfMemoryError: PermGen space
[ https://issues.apache.org/jira/browse/SPARK-13362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mohit Garg updated SPARK-13362: --- Issue Type: Bug (was: Improvement) > Build Error: java.lang.OutOfMemoryError: PermGen space > -- > > Key: SPARK-13362 > URL: https://issues.apache.org/jira/browse/SPARK-13362 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 1.5.2 > Environment: Processor: Intel® Core™ i5-5300U CPU @ 2.30GHz × 4 > OS Type: 64-bit > OS: Ubuntu 15.04 > JVM: OpenJDK 64-Bit Server VM (24.79-b02, mixed mode) > Java: version 1.7.0_79, vendor Oracle Corporation >Reporter: Mohit Garg >Priority: Trivial > Labels: build > Fix For: 1.5.2
[jira] [Created] (SPARK-13362) Build Error: java.lang.OutOfMemoryError: PermGen space
Mohit Garg created SPARK-13362:
--

             Summary: Build Error: java.lang.OutOfMemoryError: PermGen space
                 Key: SPARK-13362
                 URL: https://issues.apache.org/jira/browse/SPARK-13362
             Project: Spark
          Issue Type: Improvement
          Components: Build
    Affects Versions: 1.5.2
         Environment: Processor: Intel® Core™ i5-5300U CPU @ 2.30GHz × 4
OS Type: 64-bit
OS: Ubuntu 15.04
JVM: OpenJDK 64-Bit Server VM (24.79-b02, mixed mode)
Java: version 1.7.0_79, vendor Oracle Corporation
            Reporter: Mohit Garg
            Priority: Trivial
             Fix For: 1.5.2

While building Spark from source:

> git clone https://github.com/apache/spark.git
> cd spark
> mvn -DskipTests clean package -e

ERROR: [ERROR] PermGen space -> [Help 1]
java.lang.OutOfMemoryError: PermGen space
	at java.lang.ClassLoader.defineClass1(Native Method)
	at java.lang.ClassLoader.defineClass(ClassLoader.java:800)
	at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
	at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
	at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
	at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
	at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
	at java.security.AccessController.doPrivileged(Native Method)
	at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
	at java.lang.ClassLoader.defineClass1(Native Method)
	at java.lang.ClassLoader.defineClass(ClassLoader.java:800)
	at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
	at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
	at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
	at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
	at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
	at java.security.AccessController.doPrivileged(Native Method)
	at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
	at java.lang.Class.getDeclaredMethods0(Native Method)
	at java.lang.Class.privateGetDeclaredMethods(Class.java:2615)
	at java.lang.Class.getDeclaredMethod(Class.java:2007)
	at xsbt.CachedCompiler0$Compiler.superCall(CompilerInterface.scala:235)
	at xsbt.CachedCompiler0$Compiler.superComputePhaseDescriptors(CompilerInterface.scala:230)
	at xsbt.CachedCompiler0$Compiler.phaseDescriptors$lzycompute(CompilerInterface.scala:227)
	at xsbt.CachedCompiler0$Compiler.phaseDescriptors(CompilerInterface.scala:222)
	at scala.tools.nsc.Global$Run.<init>(Global.scala:1237)
	at xsbt.CachedCompiler0$$anon$2.<init>(CompilerInterface.scala:113)
	at xsbt.CachedCompiler0.run(CompilerInterface.scala:113)

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
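Not part of the original report, but for context: on Java 7 and earlier this error is typically worked around by enlarging the Maven JVM's PermGen space before building. A minimal sketch, assuming a Bourne-compatible shell; the sizes below follow what the Spark 1.x build documentation recommended and may need tuning for your machine:

```shell
# Raise the Maven JVM's heap and (on Java 7 and earlier) PermGen limits
# before building Spark. -XX:MaxPermSize has no effect on Java 8+, where
# PermGen was replaced by Metaspace.
export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"

# Then rerun the build, e.g.:
# mvn -DskipTests clean package
```

Putting the export in your shell profile avoids having to reapply it in every new session.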
[jira] [Commented] (SPARK-10759) Missing Python code example in ML Programming guide
[ https://issues.apache.org/jira/browse/SPARK-10759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15150654#comment-15150654 ]

Apache Spark commented on SPARK-10759:
--

User 'JeremyNixon' has created a pull request for this issue:
https://github.com/apache/spark/pull/11240

> Missing Python code example in ML Programming guide
> ---
>
>                 Key: SPARK-10759
>                 URL: https://issues.apache.org/jira/browse/SPARK-10759
>             Project: Spark
>          Issue Type: Improvement
>          Components: Documentation
>    Affects Versions: 1.5.0
>            Reporter: Raela Wang
>            Assignee: Apache Spark
>            Priority: Minor
>              Labels: starter
>
> http://spark.apache.org/docs/latest/ml-guide.html#example-model-selection-via-cross-validation
> http://spark.apache.org/docs/latest/ml-guide.html#example-model-selection-via-train-validation-split
[jira] [Resolved] (SPARK-9273) Add Convolutional Neural network to Spark MLlib
[ https://issues.apache.org/jira/browse/SPARK-9273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen resolved SPARK-9273.
--
    Resolution: Duplicate

[~asimjalis] it's not going to happen (directly) in Spark anyway, but this is not different enough from the JIRA I linked to reopen this. Please don't. See other exact dupes that are linked to that issue.

> Add Convolutional Neural network to Spark MLlib
> ---
>
>                 Key: SPARK-9273
>                 URL: https://issues.apache.org/jira/browse/SPARK-9273
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: yuhao yang
>            Assignee: yuhao yang
>
> Add Convolutional Neural network to Spark MLlib
[jira] [Closed] (SPARK-9273) Add Convolutional Neural network to Spark MLlib
[ https://issues.apache.org/jira/browse/SPARK-9273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen closed SPARK-9273.
--

> Add Convolutional Neural network to Spark MLlib
> ---
>
>                 Key: SPARK-9273
>                 URL: https://issues.apache.org/jira/browse/SPARK-9273
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: yuhao yang
>            Assignee: yuhao yang
>
> Add Convolutional Neural network to Spark MLlib