[jira] [Commented] (SPARK-13331) Spark network encryption optimization
[ https://issues.apache.org/jira/browse/SPARK-13331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15151892#comment-15151892 ]

Dong Chen commented on SPARK-13331:
-----------------------------------

Sorry for the confusion; below is the change this would entail in Spark. Please let me know if anything is still unclear.

SPARK-6229 added SASL encryption to the network library. That encryption supports 3DES, DES, and RC4. This JIRA intends to add AES to the supported algorithms for better performance. The change in Spark would involve:
* Adding code in {{SaslClientBootstrap.doBootstrap()}} and {{SaslRpcHandler.receive()}} to negotiate AES encryption.
* Then updating {{SparkSaslClient}} and {{SparkSaslServer}} to {{wrap}} / {{unwrap}} messages with AES.

SPARK-10771 and this JIRA are similar in that both use JCE or a library to implement AES, but they have different focuses in Spark: SPARK-10771 is about encrypting shuffle data when writing to / reading from disk, while this JIRA is about encrypting data transferred over the wire.

> Spark network encryption optimization
> -------------------------------------
>
> Key: SPARK-13331
> URL: https://issues.apache.org/jira/browse/SPARK-13331
> Project: Spark
> Issue Type: Improvement
> Components: Deploy
> Reporter: Dong Chen
> Priority: Minor
>
> In network/common, SASL with DIGEST-MD5 authentication is used for negotiating a secure communication channel. When the SASL operation mode is "auth-conf", the data transferred on the network is encrypted. The DIGEST-MD5 mechanism supports the following encryption algorithms: 3DES, DES, and RC4. The negotiation procedure will select one of them to encrypt / decrypt the data on the channel.
> However, 3DES and RC4 are relatively slow. We could add code to the negotiation to support AES for better security and performance.
> The proposed solution is: when "auth-conf" is enabled, at the end of the original negotiation, authentication succeeds and a secure channel is built. We could then add one more negotiation step: the client and server negotiate whether they both support AES. If yes, the key and IV used by AES are generated by the server and sent to the client through the already-secure channel. The encryption / decryption handlers are then switched to AES on both the client and server side, and all following data transfer uses AES instead of the originally negotiated algorithm.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
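For reference, a minimal Java/JCE sketch of the kind of {{wrap}} / {{unwrap}} the updated handlers would perform after the key and IV have been exchanged. The class name, the CTR cipher mode, and the 128-bit key size are assumptions for illustration, not Spark's actual implementation:

```java
import java.nio.charset.StandardCharsets;
import java.security.SecureRandom;
import java.util.Arrays;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.IvParameterSpec;

public class AesWrapSketch {
    // Server side: generate the AES key and IV once the SASL "auth-conf"
    // channel is established; (key, iv) would then be sent to the client
    // over that already-secure channel.
    static final SecretKey KEY;
    static final byte[] IV = new byte[16];
    static {
        try {
            KeyGenerator keyGen = KeyGenerator.getInstance("AES");
            keyGen.init(128);
            KEY = keyGen.generateKey();
            new SecureRandom().nextBytes(IV);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    // wrap(): encrypt an outgoing message with AES instead of 3DES/DES/RC4.
    static byte[] wrap(byte[] plaintext) {
        try {
            Cipher c = Cipher.getInstance("AES/CTR/NoPadding");
            c.init(Cipher.ENCRYPT_MODE, KEY, new IvParameterSpec(IV));
            return c.doFinal(plaintext);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    // unwrap(): decrypt on the receiving side with the same key and IV.
    static byte[] unwrap(byte[] ciphertext) {
        try {
            Cipher c = Cipher.getInstance("AES/CTR/NoPadding");
            c.init(Cipher.DECRYPT_MODE, KEY, new IvParameterSpec(IV));
            return c.doFinal(ciphertext);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        byte[] msg = "rpc payload".getBytes(StandardCharsets.UTF_8);
        // Round-trips back to the original bytes.
        System.out.println(Arrays.equals(msg, unwrap(wrap(msg)))); // true
    }
}
```

A real implementation would also need to agree on the cipher mode and handle per-message IVs; this only shows the shape of the handler change.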
[jira] [Commented] (SPARK-12449) Pushing down arbitrary logical plans to data sources
[ https://issues.apache.org/jira/browse/SPARK-12449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15151880#comment-15151880 ]

Max Seiden commented on SPARK-12449:
------------------------------------

Yea, that seems to be the case. There's code in the DataSourceStrategy that specifically resolves aliases, but the filtered-scan case is pretty narrow relative to an expression tree. +1 for a generic way to avoid double execution of operations. On the flip side, a boolean check would drop a neat property of "unhandledFilters", which is that it can accept a subset of what the planner tries to push down.

> Pushing down arbitrary logical plans to data sources
> ----------------------------------------------------
>
> Key: SPARK-12449
> URL: https://issues.apache.org/jira/browse/SPARK-12449
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Reporter: Stephan Kessler
> Attachments: pushingDownLogicalPlans.pdf
>
> With the help of the DataSource API we can pull data from external sources for processing. Implementing interfaces such as {{PrunedFilteredScan}} allows filters and projections to be pushed down, pruning unnecessary fields and rows directly in the data source.
> However, data sources such as SQL engines are capable of doing even more preprocessing, e.g., evaluating aggregates. This is beneficial because it would reduce the amount of data transferred from the source to Spark. The existing interfaces do not allow this kind of processing in the source.
> We propose to add a new interface {{CatalystSource}} that allows deferring the processing of arbitrary logical plans to the data source. We have already presented the details at Spark Summit 2015 Europe [https://spark-summit.org/eu-2015/events/the-pushdown-of-everything/]
> I will add a design document explaining the details.
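The subset property being discussed can be sketched with a toy stand-in (the {{unhandled}} method below is hypothetical, not Spark's actual API): a source reporting its unhandled filters can accept any subset of what is pushed, whereas a boolean capability check is all-or-nothing.

```java
import java.util.List;
import java.util.stream.Collectors;

public class UnhandledFiltersSketch {
    // Hypothetical stand-in for a data source's filter handshake: return the
    // subset of the pushed filters the source can NOT evaluate itself.
    // The planner re-applies exactly these on top of the scan, and only
    // these, so the handled predicate is not executed twice.
    static List<String> unhandled(List<String> pushed) {
        // Toy rule: this source only handles equality predicates on column 'a'.
        return pushed.stream()
                     .filter(f -> !f.startsWith("a ="))
                     .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // "a = 1" is evaluated at the source; a residual Filter keeps "b > 10".
        System.out.println(unhandled(List.of("a = 1", "b > 10"))); // [b > 10]
    }
}
```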
[jira] [Created] (SPARK-13373) Generate code for sort merge join
Davies Liu created SPARK-13373:
-------------------------------

Summary: Generate code for sort merge join
Key: SPARK-13373
URL: https://issues.apache.org/jira/browse/SPARK-13373
Project: Spark
Issue Type: New Feature
Components: SQL
Reporter: Davies Liu
Assignee: Davies Liu
[jira] [Closed] (SPARK-13354) Push filter throughout outer join when the condition can filter out empty row
[ https://issues.apache.org/jira/browse/SPARK-13354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Davies Liu closed SPARK-13354.
------------------------------
Resolution: Duplicate

> Push filter throughout outer join when the condition can filter out empty row
> ------------------------------------------------------------------------------
>
> Key: SPARK-13354
> URL: https://issues.apache.org/jira/browse/SPARK-13354
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Reporter: Davies Liu
> Assignee: Davies Liu
>
> For the query
> {code}
> select * from a left outer join b on a.a = b.a where b.b > 10
> {code}
> the condition `b.b > 10` filters out every row whose b side is empty (null-extended).
> In this case, we should use an inner join instead, and push the filter down into b.
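The equivalence behind this rewrite can be checked on toy data. A hedged sketch (the tables and join simulation below are illustrative, not Catalyst code): a null-rejecting filter on top of a left outer join produces the same rows as an inner join with the filter pushed into the right side.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class OuterJoinFilterSketch {
    // Toy tables for the query in the issue: a(a) and b(a, b).
    static final List<Integer> A = List.of(1, 2, 3);
    static final Map<Integer, Integer> B = Map.of(1, 5, 3, 20); // b.a -> b.b

    // Plan 1: left outer join on a.a = b.a, then Filter(b.b > 10) on top.
    static List<Integer[]> leftJoinThenFilter() {
        List<Integer[]> joined = new ArrayList<>();
        for (int k : A) joined.add(new Integer[] { k, B.get(k) }); // null = no match on the b side
        List<Integer[]> out = new ArrayList<>();
        for (Integer[] row : joined) {
            // b.b > 10 can never hold when b.b is null, so every
            // null-extended row the outer join produced is dropped here.
            if (row[1] != null && row[1] > 10) out.add(row);
        }
        return out;
    }

    // Plan 2: the rewrite, an inner join with the filter pushed into b's scan.
    static List<Integer[]> innerJoinPushedFilter() {
        List<Integer[]> out = new ArrayList<>();
        for (int k : A) {
            Integer bb = B.get(k);
            if (bb == null) continue;                      // inner join: unmatched rows never appear
            if (bb > 10) out.add(new Integer[] { k, bb }); // filter evaluated at the scan
        }
        return out;
    }

    public static void main(String[] args) {
        // Both plans keep only the row (3, 20), so the rewrite is safe
        // and the inner-join plan scans strictly less data.
        System.out.println(leftJoinThenFilter().size());   // 1
        System.out.println(innerJoinPushedFilter().size()); // 1
    }
}
```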
[jira] [Comment Edited] (SPARK-13370) Lexer not handling whitespaces properly
[ https://issues.apache.org/jira/browse/SPARK-13370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15151822#comment-15151822 ]

Herman van Hovell edited comment on SPARK-13370 at 2/18/16 7:35 AM:
--------------------------------------------------------------------

Whitespace is optional. This may sound funny, but the following expression is perfectly legal: {{select 1+1}}. In 99% of cases, input without proper whitespace gets caught by either eager lexer rules or parse rules. -I'll leave it open for discussion, but I think this is a won't fix.-

was (Author: hvanhovell):
Whitespace is optional. This may sound funny, but the following expression is perfectly legal: {{select 1+1}}. In 99% of the time cases without proper whitespace get caught by either eager lexer rules or parse rules. I'll leave it open for discussion, but I think this is a won't fix.

> Lexer not handling whitespaces properly
> ---------------------------------------
>
> Key: SPARK-13370
> URL: https://issues.apache.org/jira/browse/SPARK-13370
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.0
> Reporter: Cheng Lian
>
> I was experimenting with numeric suffixes and came up with the following wrong query string, which should result in a parsing error:
> {code}
> // 1.0L is illegal here
> sqlContext.sql("SELECT 1.0D + 1.0L").show()
> {code}
> However, it gives the following result:
> {noformat}
> +---+
> |  L|
> +---+
> |2.0|
> +---+
> {noformat}
> {{explain}} suggests that the {{L}} is recognized as an alias:
> {noformat}
> sqlContext.sql("SELECT 1.0D + 1.0L").explain(true)
> == Parsed Logical Plan ==
> 'Project [unresolvedalias((1.0 + 1.0) AS L#12, None)]
> +- OneRowRelation$
> {noformat}
> It seems that the lexer recognizes {{1.0}} and {{L}} as two tokens, as if there were a whitespace there.
[jira] [Commented] (SPARK-13370) Lexer not handling whitespaces properly
[ https://issues.apache.org/jira/browse/SPARK-13370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15151870#comment-15151870 ]

Herman van Hovell commented on SPARK-13370:
-------------------------------------------

Thought about this a bit more, and realized we can fix this by explicitly adding a whitespace token to the alias section of the namedExpression rule. I'll submit a PR at the end of the day.

> Lexer not handling whitespaces properly
> ---------------------------------------
>
> Key: SPARK-13370
> URL: https://issues.apache.org/jira/browse/SPARK-13370
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.0
> Reporter: Cheng Lian
>
> I was experimenting with numeric suffixes and came up with the following wrong query string, which should result in a parsing error:
> {code}
> // 1.0L is illegal here
> sqlContext.sql("SELECT 1.0D + 1.0L").show()
> {code}
> However, it gives the following result:
> {noformat}
> +---+
> |  L|
> +---+
> |2.0|
> +---+
> {noformat}
> {{explain}} suggests that the {{L}} is recognized as an alias:
> {noformat}
> sqlContext.sql("SELECT 1.0D + 1.0L").explain(true)
> == Parsed Logical Plan ==
> 'Project [unresolvedalias((1.0 + 1.0) AS L#12, None)]
> +- OneRowRelation$
> {noformat}
> It seems that the lexer recognizes {{1.0}} and {{L}} as two tokens, as if there were a whitespace there.
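The mis-tokenization described above is easy to reproduce with a toy maximal-munch lexer (this is an illustrative sketch, not Spark's actual grammar): double literals may carry a D suffix but not L, so with no whitespace requirement between tokens, "1.0L" silently lexes as a number followed by an identifier instead of raising an error.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SuffixLexerSketch {
    // Toy token set: double literal (optional D suffix), identifier, '+', whitespace.
    static final Pattern TOKEN =
        Pattern.compile("\\d+\\.\\d+D?|[A-Za-z_][A-Za-z0-9_]*|[+]|\\s+");

    static List<String> lex(String input) {
        Matcher m = TOKEN.matcher(input);
        List<String> tokens = new ArrayList<>();
        while (m.find()) {
            String t = m.group();
            if (!t.trim().isEmpty()) tokens.add(t); // drop whitespace tokens
        }
        return tokens;
    }

    public static void main(String[] args) {
        // "1.0L" splits into NUMBER(1.0) then IDENT(L), so the parser later
        // treats L as an alias, which is exactly the reported behavior.
        System.out.println(lex("1.0D + 1.0L")); // [1.0D, +, 1.0, L]
    }
}
```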
[jira] [Updated] (SPARK-13371) TaskSetManager.dequeueSpeculativeTask compares Option[String] and String directly.
[ https://issues.apache.org/jira/browse/SPARK-13371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Guoqiang Li updated SPARK-13371:
--------------------------------
Summary: TaskSetManager.dequeueSpeculativeTask compares Option[String] and String directly. (was: Compare Option[String] and String directly in )

> TaskSetManager.dequeueSpeculativeTask compares Option[String] and String directly.
> ----------------------------------------------------------------------------------
>
> Key: SPARK-13371
> URL: https://issues.apache.org/jira/browse/SPARK-13371
> Project: Spark
> Issue Type: Bug
> Components: Scheduler
> Affects Versions: 1.5.2, 1.6.0
> Reporter: Guoqiang Li
>
> {noformat}
> TaskSetManager.dequeueSpeculativeTask compares Option[String] and String directly.
> {noformat}
> The code:
> https://github.com/apache/spark/blob/87abcf7df921a5937fdb2bae8bfb30bfabc4970a/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L344
> {code}
> if (TaskLocality.isAllowed(locality, TaskLocality.RACK_LOCAL)) {
>   for (rack <- sched.getRackForHost(host)) {
>     for (index <- speculatableTasks if canRunOnHost(index)) {
>       val racks = tasks(index).preferredLocations.map(_.host).map(sched.getRackForHost)
>       // racks: Seq[Option[String]] and rack: String
>       if (racks.contains(rack)) {
>         speculatableTasks -= index
>         return Some((index, TaskLocality.RACK_LOCAL))
>       }
>     }
>   }
> }
> {code}
[jira] [Updated] (SPARK-13371) Compare Option[String] and String directly in
[ https://issues.apache.org/jira/browse/SPARK-13371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Guoqiang Li updated SPARK-13371:
--------------------------------
Summary: Compare Option[String] and String directly in (was: Compare Option[String] and String directly)

> Compare Option[String] and String directly in
> ----------------------------------------------
>
> Key: SPARK-13371
> URL: https://issues.apache.org/jira/browse/SPARK-13371
> Project: Spark
> Issue Type: Bug
> Components: Scheduler
> Affects Versions: 1.5.2, 1.6.0
> Reporter: Guoqiang Li
>
> {noformat}
> TaskSetManager.dequeueSpeculativeTask compares Option[String] and String directly.
> {noformat}
> The code:
> https://github.com/apache/spark/blob/87abcf7df921a5937fdb2bae8bfb30bfabc4970a/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L344
> {code}
> if (TaskLocality.isAllowed(locality, TaskLocality.RACK_LOCAL)) {
>   for (rack <- sched.getRackForHost(host)) {
>     for (index <- speculatableTasks if canRunOnHost(index)) {
>       val racks = tasks(index).preferredLocations.map(_.host).map(sched.getRackForHost)
>       // racks: Seq[Option[String]] and rack: String
>       if (racks.contains(rack)) {
>         speculatableTasks -= index
>         return Some((index, TaskLocality.RACK_LOCAL))
>       }
>     }
>   }
> }
> {code}
[jira] [Updated] (SPARK-13371) Compare Option[String] and String directly
[ https://issues.apache.org/jira/browse/SPARK-13371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Guoqiang Li updated SPARK-13371:
--------------------------------
Description:
{noformat}
TaskSetManager.dequeueSpeculativeTask compares Option[String] and String directly.
{noformat}
The code:
https://github.com/apache/spark/blob/87abcf7df921a5937fdb2bae8bfb30bfabc4970a/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L344
{code}
if (TaskLocality.isAllowed(locality, TaskLocality.RACK_LOCAL)) {
  for (rack <- sched.getRackForHost(host)) {
    for (index <- speculatableTasks if canRunOnHost(index)) {
      val racks = tasks(index).preferredLocations.map(_.host).map(sched.getRackForHost)
      // racks: Seq[Option[String]] and rack: String
      if (racks.contains(rack)) {
        speculatableTasks -= index
        return Some((index, TaskLocality.RACK_LOCAL))
      }
    }
  }
}
{code}

was:
{noformat}
TaskSetManager.dequeueSpeculativeTask compares Option[String] and String directly.
{noformat}
Ths code:
https://github.com/apache/spark/blob/87abcf7df921a5937fdb2bae8bfb30bfabc4970a/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L344
{code}
if (TaskLocality.isAllowed(locality, TaskLocality.RACK_LOCAL)) {
  for (rack <- sched.getRackForHost(host)) {
    for (index <- speculatableTasks if canRunOnHost(index)) {
      val racks = tasks(index).preferredLocations.map(_.host).map(sched.getRackForHost)
      // racks: Seq[Option[String]] and rack: String
      if (racks.contains(rack)) {
        speculatableTasks -= index
        return Some((index, TaskLocality.RACK_LOCAL))
      }
    }
  }
}
{code}

> Compare Option[String] and String directly
> -------------------------------------------
>
> Key: SPARK-13371
> URL: https://issues.apache.org/jira/browse/SPARK-13371
> Project: Spark
> Issue Type: Bug
> Components: Scheduler
> Affects Versions: 1.5.2, 1.6.0
> Reporter: Guoqiang Li
>
> {noformat}
> TaskSetManager.dequeueSpeculativeTask compares Option[String] and String directly.
> {noformat}
> The code:
> https://github.com/apache/spark/blob/87abcf7df921a5937fdb2bae8bfb30bfabc4970a/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L344
> {code}
> if (TaskLocality.isAllowed(locality, TaskLocality.RACK_LOCAL)) {
>   for (rack <- sched.getRackForHost(host)) {
>     for (index <- speculatableTasks if canRunOnHost(index)) {
>       val racks = tasks(index).preferredLocations.map(_.host).map(sched.getRackForHost)
>       // racks: Seq[Option[String]] and rack: String
>       if (racks.contains(rack)) {
>         speculatableTasks -= index
>         return Some((index, TaskLocality.RACK_LOCAL))
>       }
>     }
>   }
> }
> {code}
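The pitfall in the Scala snippet above is that {{Seq.contains}} accepts any type, so comparing a {{Seq[Option[String]]}} against a {{String}} compiles but is always false. A hedged Java analogue, with {{Optional}} standing in for {{Option[String]}} (the rack names are made up):

```java
import java.util.List;
import java.util.Optional;

public class OptionContainsSketch {
    public static void main(String[] args) {
        // Analogue of racks: Seq[Option[String]] built via sched.getRackForHost.
        List<Optional<String>> racks = List.of(Optional.of("rack-1"), Optional.empty());

        // Compiles without any warning, because contains takes Object,
        // but an Optional<String> never equals a plain String: always false,
        // so the RACK_LOCAL branch above can never fire.
        System.out.println(racks.contains("rack-1")); // false

        // The fix is to compare like with like.
        System.out.println(racks.contains(Optional.of("rack-1"))); // true
    }
}
```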
[jira] [Updated] (SPARK-13331) Spark network encryption optimization
[ https://issues.apache.org/jira/browse/SPARK-13331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dong Chen updated SPARK-13331:
------------------------------
Description:
In network/common, SASL with DIGEST-MD5 authentication is used for negotiating a secure communication channel. When the SASL operation mode is "auth-conf", the data transferred on the network is encrypted. The DIGEST-MD5 mechanism supports the following encryption algorithms: 3DES, DES, and RC4. The negotiation procedure will select one of them to encrypt / decrypt the data on the channel.
However, 3DES and RC4 are relatively slow. We could add code to the negotiation to support AES for better security and performance.
The proposed solution is: when "auth-conf" is enabled, at the end of the original negotiation, authentication succeeds and a secure channel is built. We could then add one more negotiation step: the client and server negotiate whether they both support AES. If yes, the key and IV used by AES are generated by the server and sent to the client through the already-secure channel. The encryption / decryption handlers are then switched to AES on both the client and server side, and all following data transfer uses AES instead of the originally negotiated algorithm.

was:
In network/common, SASL with DIGEST-MD5 authentication is used for negotiating a secure communication channel. When SASL operation mode is "auth-conf", the data transferred on the network is encrypted. DIGEST-MD5 mechanism supports following encryption: 3DES, DES, and RC4. The negotiation procedure will select one of them to encrypt / decrypt the data on the channel. However, 3des and rc4 are slow relatively. We could add code in the negotiation to make it support AES for more secure and performance. The proposal is: When "auth-conf" is enabled, at the end of original negotiation, the authentication succeeds and a secure channel is built. We could add one more negotiation step: Client and server negotiate whether they both support AES. If yes, the Key and IV used by AES will be generated by server and sent to client through the already secure channel. Then update the encryption / decryption handler to AES at both client and server side. Following data transfer will use AES instead of original encryption algorithm.

> Spark network encryption optimization
> -------------------------------------
>
> Key: SPARK-13331
> URL: https://issues.apache.org/jira/browse/SPARK-13331
> Project: Spark
> Issue Type: Improvement
> Components: Deploy
> Reporter: Dong Chen
> Priority: Minor
>
> In network/common, SASL with DIGEST-MD5 authentication is used for negotiating a secure communication channel. When the SASL operation mode is "auth-conf", the data transferred on the network is encrypted. The DIGEST-MD5 mechanism supports the following encryption algorithms: 3DES, DES, and RC4. The negotiation procedure will select one of them to encrypt / decrypt the data on the channel.
> However, 3DES and RC4 are relatively slow. We could add code to the negotiation to support AES for better security and performance.
> The proposed solution is: when "auth-conf" is enabled, at the end of the original negotiation, authentication succeeds and a secure channel is built. We could then add one more negotiation step: the client and server negotiate whether they both support AES. If yes, the key and IV used by AES are generated by the server and sent to the client through the already-secure channel. The encryption / decryption handlers are then switched to AES on both the client and server side, and all following data transfer uses AES instead of the originally negotiated algorithm.
[jira] [Commented] (SPARK-12449) Pushing down arbitrary logical plans to data sources
[ https://issues.apache.org/jira/browse/SPARK-12449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15151823#comment-15151823 ]

Evan Chan commented on SPARK-12449:
-----------------------------------

I think in the case of sources.Expressions, by the time they are pushed down all aliases etc. should already have been resolved, so that should not be an issue, right?

Agree that capabilities would be important. If that didn't exist, then the default would be to not compute the expressions and let Spark's default aggregators do it, which means it would be like filtering today, where there is double filtering.

> Pushing down arbitrary logical plans to data sources
> ----------------------------------------------------
>
> Key: SPARK-12449
> URL: https://issues.apache.org/jira/browse/SPARK-12449
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Reporter: Stephan Kessler
> Attachments: pushingDownLogicalPlans.pdf
>
> With the help of the DataSource API we can pull data from external sources for processing. Implementing interfaces such as {{PrunedFilteredScan}} allows filters and projections to be pushed down, pruning unnecessary fields and rows directly in the data source.
> However, data sources such as SQL engines are capable of doing even more preprocessing, e.g., evaluating aggregates. This is beneficial because it would reduce the amount of data transferred from the source to Spark. The existing interfaces do not allow this kind of processing in the source.
> We propose to add a new interface {{CatalystSource}} that allows deferring the processing of arbitrary logical plans to the data source. We have already presented the details at Spark Summit 2015 Europe [https://spark-summit.org/eu-2015/events/the-pushdown-of-everything/]
> I will add a design document explaining the details.
[jira] [Commented] (SPARK-13372) ML LogisticRegression behaves incorrectly when standardization = false && regParam = 0.0
[ https://issues.apache.org/jira/browse/SPARK-13372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15151824#comment-15151824 ]

Apache Spark commented on SPARK-13372:
--------------------------------------

User 'yanboliang' has created a pull request for this issue:
https://github.com/apache/spark/pull/11247

> ML LogisticRegression behaves incorrectly when standardization = false && regParam = 0.0
> -----------------------------------------------------------------------------------------
>
> Key: SPARK-13372
> URL: https://issues.apache.org/jira/browse/SPARK-13372
> Project: Spark
> Issue Type: Bug
> Components: ML
> Reporter: Yanbo Liang
>
> ML LogisticRegression behaves incorrectly when standardization = false && regParam = 0.0
[jira] [Assigned] (SPARK-13372) ML LogisticRegression behaves incorrectly when standardization = false && regParam = 0.0
[ https://issues.apache.org/jira/browse/SPARK-13372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-13372:
------------------------------------
Assignee: (was: Apache Spark)

> ML LogisticRegression behaves incorrectly when standardization = false && regParam = 0.0
> -----------------------------------------------------------------------------------------
>
> Key: SPARK-13372
> URL: https://issues.apache.org/jira/browse/SPARK-13372
> Project: Spark
> Issue Type: Bug
> Components: ML
> Reporter: Yanbo Liang
>
> ML LogisticRegression behaves incorrectly when standardization = false && regParam = 0.0
[jira] [Assigned] (SPARK-13372) ML LogisticRegression behaves incorrectly when standardization = false && regParam = 0.0
[ https://issues.apache.org/jira/browse/SPARK-13372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-13372:
------------------------------------
Assignee: Apache Spark

> ML LogisticRegression behaves incorrectly when standardization = false && regParam = 0.0
> -----------------------------------------------------------------------------------------
>
> Key: SPARK-13372
> URL: https://issues.apache.org/jira/browse/SPARK-13372
> Project: Spark
> Issue Type: Bug
> Components: ML
> Reporter: Yanbo Liang
> Assignee: Apache Spark
>
> ML LogisticRegression behaves incorrectly when standardization = false && regParam = 0.0
[jira] [Comment Edited] (SPARK-13370) Lexer not handling whitespaces properly
[ https://issues.apache.org/jira/browse/SPARK-13370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15151822#comment-15151822 ]

Herman van Hovell edited comment on SPARK-13370 at 2/18/16 6:58 AM:
--------------------------------------------------------------------

Whitespace is optional. This may sound funny, but the following expression is perfectly legal: {{select 1+1}}. In 99% of cases, input without proper whitespace gets caught by either eager lexer rules or parse rules. I'll leave it open for discussion, but I think this is a won't fix.

was (Author: hvanhovell):
Whitespace is optional. This may sound funny, but the following expression is perfectly legal: {{select 1+1}}. In 99% of the time cases without whitespace get caught by either eager lexer rules or parse rules. I'll leave it open for discussion, but I think this is a won't fix.

> Lexer not handling whitespaces properly
> ---------------------------------------
>
> Key: SPARK-13370
> URL: https://issues.apache.org/jira/browse/SPARK-13370
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.0
> Reporter: Cheng Lian
>
> I was experimenting with numeric suffixes and came up with the following wrong query string, which should result in a parsing error:
> {code}
> // 1.0L is illegal here
> sqlContext.sql("SELECT 1.0D + 1.0L").show()
> {code}
> However, it gives the following result:
> {noformat}
> +---+
> |  L|
> +---+
> |2.0|
> +---+
> {noformat}
> {{explain}} suggests that the {{L}} is recognized as an alias:
> {noformat}
> sqlContext.sql("SELECT 1.0D + 1.0L").explain(true)
> == Parsed Logical Plan ==
> 'Project [unresolvedalias((1.0 + 1.0) AS L#12, None)]
> +- OneRowRelation$
> {noformat}
> It seems that the lexer recognizes {{1.0}} and {{L}} as two tokens, as if there were a whitespace there.
[jira] [Commented] (SPARK-13370) Lexer not handling whitespaces properly
[ https://issues.apache.org/jira/browse/SPARK-13370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15151822#comment-15151822 ]

Herman van Hovell commented on SPARK-13370:
-------------------------------------------

Whitespace is optional. This may sound funny, but the following expression is perfectly legal: {{select 1+1}}. In 99% of cases, input without whitespace gets caught by either eager lexer rules or parse rules. I'll leave it open for discussion, but I think this is a won't fix.

> Lexer not handling whitespaces properly
> ---------------------------------------
>
> Key: SPARK-13370
> URL: https://issues.apache.org/jira/browse/SPARK-13370
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.0
> Reporter: Cheng Lian
>
> I was experimenting with numeric suffixes and came up with the following wrong query string, which should result in a parsing error:
> {code}
> // 1.0L is illegal here
> sqlContext.sql("SELECT 1.0D + 1.0L").show()
> {code}
> However, it gives the following result:
> {noformat}
> +---+
> |  L|
> +---+
> |2.0|
> +---+
> {noformat}
> {{explain}} suggests that the {{L}} is recognized as an alias:
> {noformat}
> sqlContext.sql("SELECT 1.0D + 1.0L").explain(true)
> == Parsed Logical Plan ==
> 'Project [unresolvedalias((1.0 + 1.0) AS L#12, None)]
> +- OneRowRelation$
> {noformat}
> It seems that the lexer recognizes {{1.0}} and {{L}} as two tokens, as if there were a whitespace there.
[jira] [Comment Edited] (SPARK-2090) spark-shell input text entry not showing on REPL
[ https://issues.apache.org/jira/browse/SPARK-2090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15151802#comment-15151802 ]

Lantao Jin edited comment on SPARK-2090 at 2/18/16 6:36 AM:
------------------------------------------------------------

Richard is right; this is a permissions problem on the home directory. In many cases the user does not have full permissions on the home directory. For example, we use an LDAP system to log in, but creating any per-user directory is forbidden. This causes the Spark REPL to show none of the input text.

The root cause in the source code is below: Scala loads "user.home" from the system properties to create a default history file:

{code:title=org.apache.spark.repl.SparkJLineReader.scala|borderStyle=solid}
/** Changes the default history file to not collide with the scala repl's. */
private[repl] class SparkJLineHistory extends JLineFileHistory {
  import Properties.userHome

  def defaultFileName = ".spark_history"
  override protected lazy val historyFile = File(Path(userHome) / defaultFileName)
}
{code}

And "userHome" is defined in scala.util.PropertiesTrait:

{code:title=scala.util.PropertiesTrait.scala|borderStyle=solid}
def userHome = propOrEmpty("user.home")
{code}

It is loaded from the library.properties built into scala-library.jar, so in most cases the system default value is used. If the login user has no write permission on that directory, the ".spark_history" file cannot be created and spark-shell cannot show any input text on the REPL.

So I took two steps to resolve this problem (our company uses LDAP accounts to log in to machines and forbids creating LDAP user directories):
{panel}
1. Add export SPARK_HISTORY_OPTS="-Dspark.history.fs.logDirectory=/tmp/$USER" to spark-env.sh
2. Add the VM argument -Duser.home=/tmp/$USER to execution scripts like spark-submit or spark-shell
{panel}
Then it works. Next I will review the Spark trunk code to find out how to resolve this gracefully.

was (Author: cltlfcjin):
Richard is right, this is the permissions problem on home directory. In many cases, the user don't have full permissions with user home. For example, we use LDAP system to login, but it is forbidden to create any user directory. It will lead the Spark REPL show nothing about input text. The source code root cause is below: Scala load "use.home" from system properties to create a default history file, code is:
{code:title=org.apache.spark.repl.SparkJLineReader.scala|borderStyle=solid}
/** Changes the default history file to not collide with the scala repl's. */
private[repl] class SparkJLineHistory extends JLineFileHistory {
  import Properties.userHome

  def defaultFileName = ".spark_history"
  override protected lazy val historyFile = File(Path(userHome) / defaultFileName)
}
{code}
And "userHome" is defined in scala.util.PropertiesTrait
{code:title=scala.util.PropertiesTrait.scala|borderStyle=solid}
def userHome = propOrEmpty("user.home")
{code}
It will load from library.properties build-in scala-library.jar. In most cases, it will use the system default value. If the default value directory has no write permission for the login user. The file ".spark_history" can not be create and spark-shell cat not show any input text on REPL. So, I do two step to resolve this problem(Our company use LDAP account to login a machine and forbid to create any LDAP user directory)
1. add export SPARK_HISTORY_OPTS="-Dspark.history.fs.logDirectory=/tmp/$USER" to spark-env.sh
2. add VM argument -Duser.home=/tmp/$USER to execution script like spark-submit or spark-shell
then it will works. Next I will review the spark trunk code to find how to resolve it gracefully.

> spark-shell input text entry not showing on REPL
> -------------------------------------------------
>
> Key: SPARK-2090
> URL: https://issues.apache.org/jira/browse/SPARK-2090
> Project: Spark
> Issue Type: Bug
> Components: Input/Output, Spark Core
> Affects Versions: 1.0.0
> Environment: Ubuntu 14.04; Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_60)
> Reporter: Richard Conway
> Priority: Critical
> Labels: easyfix, patch
> Fix For: 1.0.0
>
> Original Estimate: 4h
> Remaining Estimate: 4h
>
> spark-shell doesn't allow text to be displayed on input.
> Failed to created SparkJLineReader: java.io.IOException: Permission denied
> Falling back to SimpleReader.
> The driver has 2 workers on 2 virtual machines and is error-free apart from the above line, so I think it may have something to do with the introduction of the new SecurityManager.
> The upshot is that when you type, nothing is displayed on the screen. For example, type "test" at the scala prompt and you won't see the input but the output will show.
> scala> :11: error: package test is not a value
> test
>
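The failure mode in the comment above can be sketched in a few lines of Java. The fallback to java.io.tmpdir below is illustrative only (it mirrors what the -Duser.home=/tmp/$USER workaround achieves by hand); the actual reader just fails with "Permission denied" and Spark drops to SimpleReader:

```java
import java.io.File;

public class HistoryFileSketch {
    // Mirrors the Scala snippet: resolve ".spark_history" under user.home,
    // but fall back to a writable temp dir when home is not writable.
    static File resolveHistoryFile() {
        File home = new File(System.getProperty("user.home", ""));
        File base = home.canWrite() ? home : new File(System.getProperty("java.io.tmpdir"));
        return new File(base, ".spark_history");
    }

    public static void main(String[] args) {
        // Launching with -Duser.home=/tmp/$USER makes user.home itself point
        // at a writable location, so the JLine history file can be created.
        System.out.println(resolveHistoryFile());
    }
}
```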
[jira] [Comment Edited] (SPARK-2090) spark-shell input text entry not showing on REPL
[ https://issues.apache.org/jira/browse/SPARK-2090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15151802#comment-15151802 ] Lantao Jin edited comment on SPARK-2090 at 2/18/16 6:35 AM:

Richard is right, this is a permissions problem on the home directory. In many cases the user doesn't have full permissions on the home directory. For example, we use an LDAP system to log in, but creating any user directory is forbidden. This makes the Spark REPL show none of the input text. The root cause in the source code is as follows: Scala loads "user.home" from the system properties to create a default history file:

{code:title=org.apache.spark.repl.SparkJLineReader.scala|borderStyle=solid}
/** Changes the default history file to not collide with the scala repl's. */
private[repl] class SparkJLineHistory extends JLineFileHistory {
  import Properties.userHome

  def defaultFileName = ".spark_history"
  override protected lazy val historyFile = File(Path(userHome) / defaultFileName)
}
{code}

And "userHome" is defined in scala.util.PropertiesTrait:

{code:title=scala.util.PropertiesTrait.scala|borderStyle=solid}
def userHome = propOrEmpty("user.home")
{code}

It is loaded from the library.properties built into scala-library.jar, so in most cases it takes the system default value. If that default directory is not writable by the login user, the ".spark_history" file cannot be created and spark-shell cannot show any input text on the REPL. So I took two steps to resolve this problem (our company uses LDAP accounts to log in to machines and forbids creating any LDAP user directory):
1. add export SPARK_HISTORY_OPTS="-Dspark.history.fs.logDirectory=/tmp/$USER" to spark-env.sh
2. add the VM argument -Duser.home=/tmp/$USER to an execution script such as spark-submit or spark-shell
Then it works. Next I will review the Spark trunk code to find out how to resolve this gracefully.

was (Author: cltlfcjin):
Richard is right, this is a permissions problem on the home directory. 
In many cases the user doesn't have full permissions on the home directory. For example, we use an LDAP system to log in, but creating any user directory is forbidden. This makes the Spark REPL show none of the input text. The root cause in the source code is as follows: Scala loads "user.home" from the system properties to create a default history file:

{code:title=org.apache.spark.repl.SparkJLineReader.scala|borderStyle=solid}
/** Changes the default history file to not collide with the scala repl's. */
private[repl] class SparkJLineHistory extends JLineFileHistory {
  import Properties.userHome

  def defaultFileName = ".spark_history"
  override protected lazy val historyFile = File(Path(userHome) / defaultFileName)
}
{code}

And "userHome" is defined in scala.util.PropertiesTrait:

{code:title=scala.util.PropertiesTrait.scala|borderStyle=solid}
def userHome = propOrEmpty("user.home")
{code}

It is loaded from the library.properties built into scala-library.jar, so in most cases it takes the system default value. If that default directory is not writable by the login user, the ".spark_history" file cannot be created and spark-shell cannot show any input text on the REPL. So I took two steps to resolve this problem (our company uses LDAP accounts to log in to machines and forbids creating any LDAP user directory):
1. add export SPARK_HISTORY_OPTS="-Dspark.history.fs.logDirectory=/tmp/$USER" to spark-env.sh
2. add the VM argument -Duser.home=/tmp/$USER to an execution script such as spark-submit or spark-shell
Then it works. Next I will review the Spark trunk code to find out how to resolve this gracefully. 
> spark-shell input text entry not showing on REPL > > > Key: SPARK-2090 > URL: https://issues.apache.org/jira/browse/SPARK-2090 > Project: Spark > Issue Type: Bug > Components: Input/Output, Spark Core >Affects Versions: 1.0.0 > Environment: Ubuntu 14.04; Using Scala version 2.10.4 (Java > HotSpot(TM) 64-Bit Server VM, Java 1.7.0_60) >Reporter: Richard Conway >Priority: Critical > Labels: easyfix, patch > Fix For: 1.0.0 > > Original Estimate: 4h > Remaining Estimate: 4h > > spark-shell doesn't allow text to be displayed on input > Failed to created SparkJLineReader: java.io.IOException: Permission denied > Falling back to SimpleReader. > The driver has 2 workers on 2 virtual machines and error free apart from the > above line so I think it may have something to do with the introduction of > the new SecurityManager. > The upshot is that when you type nothing is displayed on the screen. For > example, type "test" at the scala prompt and you won't see the input but the > output will show. >
[jira] [Created] (SPARK-13372) ML LogisticRegression behaves incorrectly when standardization = false && regParam = 0.0
Yanbo Liang created SPARK-13372: --- Summary: ML LogisticRegression behaves incorrectly when standardization = false && regParam = 0.0 Key: SPARK-13372 URL: https://issues.apache.org/jira/browse/SPARK-13372 Project: Spark Issue Type: Bug Components: ML Reporter: Yanbo Liang ML LogisticRegression behaves incorrectly when standardization = false && regParam = 0.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2090) spark-shell input text entry not showing on REPL
[ https://issues.apache.org/jira/browse/SPARK-2090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15151802#comment-15151802 ] Lantao Jin commented on SPARK-2090: ---

Richard is right, this is a permissions problem on the home directory. In many cases the user doesn't have full permissions on the home directory. For example, we use an LDAP system to log in, but creating any user directory is forbidden. This makes the Spark REPL show none of the input text. The root cause in the source code is as follows: Scala loads "user.home" from the system properties to create a default history file:

{code:title=org.apache.spark.repl.SparkJLineReader.scala|borderStyle=solid}
/** Changes the default history file to not collide with the scala repl's. */
private[repl] class SparkJLineHistory extends JLineFileHistory {
  import Properties.userHome

  def defaultFileName = ".spark_history"
  override protected lazy val historyFile = File(Path(userHome) / defaultFileName)
}
{code}

And "userHome" is defined in scala.util.PropertiesTrait:

{code:title=scala.util.PropertiesTrait.scala|borderStyle=solid}
def userHome = propOrEmpty("user.home")
{code}

It is loaded from the library.properties built into scala-library.jar, so in most cases it takes the system default value. If that default directory is not writable by the login user, the ".spark_history" file cannot be created and spark-shell cannot show any input text on the REPL. So I took two steps to resolve this problem (our company uses LDAP accounts to log in to machines and forbids creating any LDAP user directory):
1. add export SPARK_HISTORY_OPTS="-Dspark.history.fs.logDirectory=/tmp/$USER" to spark-env.sh
2. add the VM argument -Duser.home=/tmp/$USER to an execution script such as spark-submit or spark-shell
Then it works. Next I will review the Spark trunk code to find out how to resolve this gracefully. 
> spark-shell input text entry not showing on REPL > > > Key: SPARK-2090 > URL: https://issues.apache.org/jira/browse/SPARK-2090 > Project: Spark > Issue Type: Bug > Components: Input/Output, Spark Core >Affects Versions: 1.0.0 > Environment: Ubuntu 14.04; Using Scala version 2.10.4 (Java > HotSpot(TM) 64-Bit Server VM, Java 1.7.0_60) >Reporter: Richard Conway >Priority: Critical > Labels: easyfix, patch > Fix For: 1.0.0 > > Original Estimate: 4h > Remaining Estimate: 4h > > spark-shell doesn't allow text to be displayed on input > Failed to created SparkJLineReader: java.io.IOException: Permission denied > Falling back to SimpleReader. > The driver has 2 workers on 2 virtual machines and error free apart from the > above line so I think it may have something to do with the introduction of > the new SecurityManager. > The upshot is that when you type nothing is displayed on the screen. For > example, type "test" at the scala prompt and you won't see the input but the > output will show. > scala> :11: error: package test is not a value > test > ^ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
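The two-step workaround above boils down to one decision: use the home directory for the JLine history file only when it is writable, and otherwise fall back to a per-user directory under /tmp. A minimal Python sketch of that fallback check (the function name and fallback path are illustrative, not Spark's actual code):

```python
import os
import getpass

def history_dir(home=None, fallback_base="/tmp"):
    """Return a writable directory for a REPL history file.

    Prefer the user's home directory; fall back to /tmp/<user> when the
    home directory does not exist or is not writable by the login user.
    """
    home = home or os.path.expanduser("~")
    if os.path.isdir(home) and os.access(home, os.W_OK):
        return home
    return os.path.join(fallback_base, getpass.getuser())

# An unwritable (here: nonexistent) home directory triggers the fallback.
fallback = history_dir(home="/nonexistent-ldap-home")
```

A graceful in-Spark fix would apply the same check before creating ".spark_history", instead of requiring users to override user.home by hand.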
[jira] [Created] (SPARK-13371) Compare Option[String] and String directly
Guoqiang Li created SPARK-13371: --- Summary: Compare Option[String] and String directly Key: SPARK-13371 URL: https://issues.apache.org/jira/browse/SPARK-13371 Project: Spark Issue Type: Bug Components: Scheduler Affects Versions: 1.6.0, 1.5.2 Reporter: Guoqiang Li

{noformat}
TaskSetManager.dequeueSpeculativeTask compares Option[String] and String directly.
{noformat}

The code: https://github.com/apache/spark/blob/87abcf7df921a5937fdb2bae8bfb30bfabc4970a/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L344

{code}
if (TaskLocality.isAllowed(locality, TaskLocality.RACK_LOCAL)) {
  for (rack <- sched.getRackForHost(host)) {
    for (index <- speculatableTasks if canRunOnHost(index)) {
      val racks = tasks(index).preferredLocations.map(_.host).map(sched.getRackForHost)
      // racks: Seq[Option[String]] and rack: String
      if (racks.contains(rack)) {
        speculatableTasks -= index
        return Some((index, TaskLocality.RACK_LOCAL))
      }
    }
  }
}
{code}

-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-13333) DataFrame filter + randn + unionAll has bad interaction
[ https://issues.apache.org/jira/browse/SPARK-13333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15151746#comment-15151746 ] Xiao Li edited comment on SPARK-13333 at 2/18/16 5:15 AM: --

Yeah, you are right. This part is an issue. That is why I did not remove the existing rand/randn function with a seed argument. I am unable to find a proper solution for controlling data partitioning and task scheduling when users call rand/randn functions with a specific seed number. At least, we have to let users realize these results are not deterministic. Actually, I think this is also an issue in many of our test cases. Many test cases use seeds under the assumption that rand/randn(seed) can provide a deterministic result.

was (Author: smilegator):
Yeah, you are right. This part is an issue. That is why I did not remove the existing rand/randn function with a seed argument. I am unable to find a solution for controlling data partitioning and task scheduling when users call rand/randn functions with a specific seed number. At least, we have to let users realize these results are not deterministic. Actually, I think this is also an issue in many of our test cases. Many test cases use seeds under the assumption that rand/randn(seed) can provide a deterministic result.

> DataFrame filter + randn + unionAll has bad interaction
> ---
>
> Key: SPARK-13333
> URL: https://issues.apache.org/jira/browse/SPARK-13333
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.4.2, 1.6.1, 2.0.0
> Reporter: Joseph K. Bradley
>
> Buggy workflow
> * Create a DataFrame df0
> * Filter df0
> * Add a randn column
> * Create a copy of the DataFrame
> * unionAll the two DataFrames
> This fails, where randn produces the same results on the original DataFrame
> and the copy before unionAll but fails to do so after unionAll. Removing the
> filter fixes the problem. 
> The bug can be reproduced on master: > {code} > import org.apache.spark.sql.functions.randn > val df0 = sqlContext.createDataFrame(Seq(0, 1).map(Tuple1(_))).toDF("id") > // Removing the following filter() call makes this give the expected result. > val df1 = df0.filter(col("id") === 0).withColumn("b", randn(12345)) > println("DF1") > df1.show() > val df2 = df1.select("id", "b") > println("DF2") > df2.show() // same as df1.show(), as expected > val df3 = df1.unionAll(df2) > println("DF3") > df3.show() // NOT two copies of df1, which is unexpected > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13333) DataFrame filter + randn + unionAll has bad interaction
[ https://issues.apache.org/jira/browse/SPARK-13333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15151746#comment-15151746 ] Xiao Li commented on SPARK-13333: -

Yeah, you are right. This part is an issue. That is why I did not remove the existing rand/randn function with a seed argument. I am unable to find a solution for controlling data partitioning and task scheduling when users call rand/randn functions with a specific seed number. At least, we have to let users realize these results are not deterministic. Actually, I think this is also an issue in many of our test cases. Many test cases use seeds under the assumption that rand/randn(seed) can provide a deterministic result.

> DataFrame filter + randn + unionAll has bad interaction
> ---
>
> Key: SPARK-13333
> URL: https://issues.apache.org/jira/browse/SPARK-13333
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.4.2, 1.6.1, 2.0.0
> Reporter: Joseph K. Bradley
>
> Buggy workflow
> * Create a DataFrame df0
> * Filter df0
> * Add a randn column
> * Create a copy of the DataFrame
> * unionAll the two DataFrames
> This fails, where randn produces the same results on the original DataFrame
> and the copy before unionAll but fails to do so after unionAll. Removing the
> filter fixes the problem.
> The bug can be reproduced on master:
> {code}
> import org.apache.spark.sql.functions.randn
> val df0 = sqlContext.createDataFrame(Seq(0, 1).map(Tuple1(_))).toDF("id")
> // Removing the following filter() call makes this give the expected result. 
> val df1 = df0.filter(col("id") === 0).withColumn("b", randn(12345)) > println("DF1") > df1.show() > val df2 = df1.select("id", "b") > println("DF2") > df2.show() // same as df1.show(), as expected > val df3 = df1.unionAll(df2) > println("DF3") > df3.show() // NOT two copies of df1, which is unexpected > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13370) Lexer not handling whitespaces properly
Cheng Lian created SPARK-13370: -- Summary: Lexer not handling whitespaces properly Key: SPARK-13370 URL: https://issues.apache.org/jira/browse/SPARK-13370 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Cheng Lian

I was experimenting with numeric suffixes and came across the following malformed query string, which should result in a parsing error:

{code}
// 1.0L is illegal here
sqlContext.sql("SELECT 1.0D + 1.0L").show()
{code}

However, it gives the following result:

{noformat}
+---+
|  L|
+---+
|2.0|
+---+
{noformat}

{{explain}} suggests that the {{L}} is recognized as an alias:

{noformat}
sqlContext.sql("SELECT 1.0D + 1.0L").explain(true)

== Parsed Logical Plan ==
'Project [unresolvedalias((1.0 + 1.0) AS L#12, None)]
+- OneRowRelation$
{noformat}

It seems that the lexer recognizes {{1.0}} and {{L}} as two tokens, as if whitespace separated them.

-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
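The behaviour described above is consistent with a maximal-munch lexer whose fractional-literal rule accepts a D suffix but not L: the number token stops at {{1.0}}, the trailing {{L}} matches as an identifier, and the parser then reads it as an alias. A toy Python lexer (illustrative token rules, not Spark's actual grammar) reproduces the effect:

```python
import re

# Toy rules: a fractional literal may take a D suffix, but not L.
TOKEN_RE = re.compile(r"""
    (?P<NUMBER>\d+\.\d+[Dd]?)   # 1.0 or 1.0D; a trailing L is left behind
  | (?P<IDENT>[A-Za-z_]\w*)
  | (?P<OP>[+\-*/])
  | (?P<WS>\s+)
""", re.VERBOSE)

def tokenize(sql):
    """Maximal-munch scan; whitespace tokens are dropped."""
    return [(m.lastgroup, m.group())
            for m in TOKEN_RE.finditer(sql)
            if m.lastgroup != "WS"]

# "1.0L" lexes as NUMBER '1.0' then IDENT 'L' -- as if whitespace
# separated them, so a parser can read L as a column alias.
tokens = tokenize("1.0D + 1.0L")
```

Fixing the JIRA would mean making the lexer reject an identifier glued directly onto a numeric literal rather than silently splitting it.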
[jira] [Created] (SPARK-13369) Number of consecutive fetch failures for a stage before the job is aborted should be configurable
Sital Kedia created SPARK-13369: --- Summary: Number of consecutive fetch failures for a stage before the job is aborted should be configurable Key: SPARK-13369 URL: https://issues.apache.org/jira/browse/SPARK-13369 Project: Spark Issue Type: Improvement Affects Versions: 1.6.0 Reporter: Sital Kedia

Currently it is hardcoded in the code. We need to make it configurable: for long-running jobs, the chance of fetch failures due to machine reboots is high, and we need a configuration parameter to bump up that number.

-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
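The change requested here is the standard hardcoded-constant-to-configuration move. A sketch of the pattern in Python; the config key name and the default value of 4 are illustrative assumptions, not a confirmed Spark setting:

```python
# Illustrative default; in the scenario above this limit is a hardcoded constant.
DEFAULT_MAX_CONSECUTIVE_FETCH_FAILURES = 4

def max_consecutive_fetch_failures(conf):
    """Read the abort threshold from a config mapping, with a default,
    in the spirit of SparkConf.getInt(key, default)."""
    return int(conf.get("spark.stage.maxConsecutiveFetchFailures",
                        DEFAULT_MAX_CONSECUTIVE_FETCH_FAILURES))

def should_abort_stage(consecutive_failures, conf):
    """Abort the job once a stage hits the configured failure count."""
    return consecutive_failures >= max_consecutive_fetch_failures(conf)
```

A long-running job on flaky hardware could then raise the key in its config instead of patching the constant.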
[jira] [Comment Edited] (SPARK-13333) DataFrame filter + randn + unionAll has bad interaction
[ https://issues.apache.org/jira/browse/SPARK-13333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15151720#comment-15151720 ] Liang-Chi Hsieh edited comment on SPARK-13333 at 2/18/16 4:13 AM: --

The problem is that when you have many data partitions, you will have the same random number sequence in each partition. E.g., you have a table with 1000 rows evenly distributed over 5 partitions. If you do something that involves a random number for each row, rows 201 ~ 400 will be processed with the same random numbers as the first 200 rows. Then your computation will not be as random as originally required; each partition will share the same pattern.

was (Author: viirya):
Yes. I agree that when a user provides a specific seed number, the result should be deterministic. The problem is that when you have many data partitions, you will have the same random number sequence in each partition. E.g., you have a table with 1000 rows evenly distributed over 5 partitions. If you do something that involves a random number for each row, rows 201 ~ 400 will be processed with the same random numbers as the first 200 rows. Then your computation will not be as random as originally required; each partition will share the same pattern.

> DataFrame filter + randn + unionAll has bad interaction
> ---
>
> Key: SPARK-13333
> URL: https://issues.apache.org/jira/browse/SPARK-13333
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.4.2, 1.6.1, 2.0.0
> Reporter: Joseph K. Bradley
>
> Buggy workflow
> * Create a DataFrame df0
> * Filter df0
> * Add a randn column
> * Create a copy of the DataFrame
> * unionAll the two DataFrames
> This fails, where randn produces the same results on the original DataFrame
> and the copy before unionAll but fails to do so after unionAll. Removing the
> filter fixes the problem. 
> The bug can be reproduced on master: > {code} > import org.apache.spark.sql.functions.randn > val df0 = sqlContext.createDataFrame(Seq(0, 1).map(Tuple1(_))).toDF("id") > // Removing the following filter() call makes this give the expected result. > val df1 = df0.filter(col("id") === 0).withColumn("b", randn(12345)) > println("DF1") > df1.show() > val df2 = df1.select("id", "b") > println("DF2") > df2.show() // same as df1.show(), as expected > val df3 = df1.unionAll(df2) > println("DF3") > df3.show() // NOT two copies of df1, which is unexpected > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13333) DataFrame filter + randn + unionAll has bad interaction
[ https://issues.apache.org/jira/browse/SPARK-13333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15151720#comment-15151720 ] Liang-Chi Hsieh commented on SPARK-13333: -

Yes. I agree that when a user provides a specific seed number, the result should be deterministic. The problem is that when you have many data partitions, you will have the same random number sequence in each partition. E.g., you have a table with 1000 rows evenly distributed over 5 partitions. If you do something that involves a random number for each row, rows 201 ~ 400 will be processed with the same random numbers as the first 200 rows. Then your computation will not be as random as originally required; each partition will share the same pattern.

> DataFrame filter + randn + unionAll has bad interaction
> ---
>
> Key: SPARK-13333
> URL: https://issues.apache.org/jira/browse/SPARK-13333
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.4.2, 1.6.1, 2.0.0
> Reporter: Joseph K. Bradley
>
> Buggy workflow
> * Create a DataFrame df0
> * Filter df0
> * Add a randn column
> * Create a copy of the DataFrame
> * unionAll the two DataFrames
> This fails, where randn produces the same results on the original DataFrame
> and the copy before unionAll but fails to do so after unionAll. Removing the
> filter fixes the problem.
> The bug can be reproduced on master:
> {code}
> import org.apache.spark.sql.functions.randn
> val df0 = sqlContext.createDataFrame(Seq(0, 1).map(Tuple1(_))).toDF("id")
> // Removing the following filter() call makes this give the expected result. 
> val df1 = df0.filter(col("id") === 0).withColumn("b", randn(12345)) > println("DF1") > df1.show() > val df2 = df1.select("id", "b") > println("DF2") > df2.show() // same as df1.show(), as expected > val df3 = df1.unionAll(df2) > println("DF3") > df3.show() // NOT two copies of df1, which is unexpected > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
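The per-partition repetition described in this thread can be simulated without Spark: seeding every partition's generator with the bare user seed repeats one sequence in each partition, while mixing the partition index into the seed stays reproducible without the repetition. A standalone Python sketch (the seed-mixing scheme here is illustrative, not Spark's implementation):

```python
import random

def partition_values(seed, partition_index, n, mix_partition=False):
    """Generate the n 'random' values one partition would produce."""
    if mix_partition:
        rng = random.Random(seed + partition_index)  # per-partition stream
    else:
        rng = random.Random(seed)  # naive: every partition repeats stream 0
    return [rng.random() for _ in range(n)]

# Naive seeding: partition 1 (rows 201-400) repeats partition 0 (rows 1-200).
repeats = partition_values(12345, 0, 200) == partition_values(12345, 1, 200)

# Mixed seeding: still deterministic run-to-run, but partitions differ.
differs = partition_values(12345, 0, 200, True) != partition_values(12345, 1, 200, True)
```

Note that even the mixed scheme is only deterministic if the same rows land in the same partitions, which is exactly the scheduling caveat raised in the comments above.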
[jira] [Commented] (SPARK-13364) history server application column not sorting properly
[ https://issues.apache.org/jira/browse/SPARK-13364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15151721#comment-15151721 ] Zhuo Liu commented on SPARK-13364: --

It is not sorting numerically, but lexicographically by application id, so application__4000 might come before application__900 in ascending order. We can add a special sorting class for application ids if needed.

> history server application column not sorting properly
> --
>
> Key: SPARK-13364
> URL: https://issues.apache.org/jira/browse/SPARK-13364
> Project: Spark
> Issue Type: Bug
> Components: Web UI
> Affects Versions: 2.0.0
> Reporter: Thomas Graves
>
> The new history server is using datatables, the application column isn't
> sorting them properly. Its not sorting the last _X part right. below is
> an example where the 30174 should be before 30149
> application_1453493359692_30149
> application_1453493359692_30174
> I'm guessing its sorting used the string rather then just the
> application id.
> href="/history/application_1453493359692_30029/1/jobs/">application_1453493359692_30029

-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
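The suggested fix, a dedicated sort key for application ids, can be sketched in Python: split the id on underscores and compare the numeric parts as integers rather than as strings (an illustrative sketch, not the DataTables sorting plugin the history server UI would actually need):

```python
def app_id_sort_key(app_id):
    """Turn 'application_1453493359692_4000' into a tuple whose numeric
    parts compare as integers, so _900 sorts before _4000."""
    return tuple(int(part) if part.isdigit() else part
                 for part in app_id.split("_"))

ids = ["application_1453493359692_4000", "application_1453493359692_900"]

lexicographic = sorted(ids)                 # '4' < '9', so _4000 sorts first (wrong)
numeric = sorted(ids, key=app_id_sort_key)  # 900 < 4000, so _900 sorts first
```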
[jira] [Commented] (SPARK-13368) PySpark JavaModel fails to extract params from Spark side automatically
[ https://issues.apache.org/jira/browse/SPARK-13368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15151710#comment-15151710 ] Xusen Yin commented on SPARK-13368: --- FYI [~mengxr] [~josephkb] > PySpark JavaModel fails to extract params from Spark side automatically > --- > > Key: SPARK-13368 > URL: https://issues.apache.org/jira/browse/SPARK-13368 > Project: Spark > Issue Type: Bug > Components: PySpark >Reporter: Xusen Yin >Priority: Minor > > JavaModel fails to extract params from Spark side automatically that causes > model.extractParamMap() is always empty. As shown in the example code below > copied from Spark Guide > https://spark.apache.org/docs/latest/ml-guide.html#example-estimator-transformer-and-param > {code} > # Prepare training data from a list of (label, features) tuples. > training = sqlContext.createDataFrame([ > (1.0, Vectors.dense([0.0, 1.1, 0.1])), > (0.0, Vectors.dense([2.0, 1.0, -1.0])), > (0.0, Vectors.dense([2.0, 1.3, 1.0])), > (1.0, Vectors.dense([0.0, 1.2, -0.5]))], ["label", "features"]) > # Create a LogisticRegression instance. This instance is an Estimator. > lr = LogisticRegression(maxIter=10, regParam=0.01) > # Print out the parameters, documentation, and any default values. > print "LogisticRegression parameters:\n" + lr.explainParams() + "\n" > # Learn a LogisticRegression model. This uses the parameters stored in lr. > model1 = lr.fit(training) > # Since model1 is a Model (i.e., a transformer produced by an Estimator), > # we can view the parameters it used during fit(). > # This prints the parameter (name: value) pairs, where names are unique > # IDs for this LogisticRegression instance. > print "Model 1 was fit using parameters: " > print model1.extractParamMap() > {code} > The result of model1.extractParamMap() is {}. > Question is, should we provide the feature or not? 
If yes, we need either let > Model share same params with Estimator or adds a parent in Model and points > to its Estimator; if not, we should remove those lines from example code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13368) PySpark JavaModel fails to extract params from Spark side automatically
[ https://issues.apache.org/jira/browse/SPARK-13368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xusen Yin updated SPARK-13368: -- Description: JavaModel fails to extract params from Spark side automatically that causes model.extractParamMap() is always empty. As shown in the example code below copied from Spark Guide https://spark.apache.org/docs/latest/ml-guide.html {code} # Prepare training data from a list of (label, features) tuples. training = sqlContext.createDataFrame([ (1.0, Vectors.dense([0.0, 1.1, 0.1])), (0.0, Vectors.dense([2.0, 1.0, -1.0])), (0.0, Vectors.dense([2.0, 1.3, 1.0])), (1.0, Vectors.dense([0.0, 1.2, -0.5]))], ["label", "features"]) # Create a LogisticRegression instance. This instance is an Estimator. lr = LogisticRegression(maxIter=10, regParam=0.01) # Print out the parameters, documentation, and any default values. print "LogisticRegression parameters:\n" + lr.explainParams() + "\n" # Learn a LogisticRegression model. This uses the parameters stored in lr. model1 = lr.fit(training) # Since model1 is a Model (i.e., a transformer produced by an Estimator), # we can view the parameters it used during fit(). # This prints the parameter (name: value) pairs, where names are unique # IDs for this LogisticRegression instance. print "Model 1 was fit using parameters: " print model1.extractParamMap() {code} The result of model1.extractParamMap() is {}. Question is, should we provide the feature or not? If yes, we need either let Model share same params with Estimator or adds a parent in Model and points to its Estimator; if not, we should remove those lines from example code. was: JavaModel fails to extract params from Spark side automatically that causes model.extractParamMap() is always empty. As shown in the example code below copied from Spark Guide https://spark.apache.org/docs/latest/ml-guide.html: {code} # Prepare training data from a list of (label, features) tuples. 
training = sqlContext.createDataFrame([ (1.0, Vectors.dense([0.0, 1.1, 0.1])), (0.0, Vectors.dense([2.0, 1.0, -1.0])), (0.0, Vectors.dense([2.0, 1.3, 1.0])), (1.0, Vectors.dense([0.0, 1.2, -0.5]))], ["label", "features"]) # Create a LogisticRegression instance. This instance is an Estimator. lr = LogisticRegression(maxIter=10, regParam=0.01) # Print out the parameters, documentation, and any default values. print "LogisticRegression parameters:\n" + lr.explainParams() + "\n" # Learn a LogisticRegression model. This uses the parameters stored in lr. model1 = lr.fit(training) # Since model1 is a Model (i.e., a transformer produced by an Estimator), # we can view the parameters it used during fit(). # This prints the parameter (name: value) pairs, where names are unique # IDs for this LogisticRegression instance. print "Model 1 was fit using parameters: " print model1.extractParamMap() {code} The result of model1.extractParamMap() is {}. Question is, should we provide the feature or not? If yes, we need either let Model share same params with Estimator or adds a parent in Model and points to its Estimator; if not, we should remove those lines from example code. > PySpark JavaModel fails to extract params from Spark side automatically > --- > > Key: SPARK-13368 > URL: https://issues.apache.org/jira/browse/SPARK-13368 > Project: Spark > Issue Type: Bug > Components: PySpark >Reporter: Xusen Yin >Priority: Minor > > JavaModel fails to extract params from Spark side automatically that causes > model.extractParamMap() is always empty. As shown in the example code below > copied from Spark Guide https://spark.apache.org/docs/latest/ml-guide.html > {code} > # Prepare training data from a list of (label, features) tuples. 
> training = sqlContext.createDataFrame([ > (1.0, Vectors.dense([0.0, 1.1, 0.1])), > (0.0, Vectors.dense([2.0, 1.0, -1.0])), > (0.0, Vectors.dense([2.0, 1.3, 1.0])), > (1.0, Vectors.dense([0.0, 1.2, -0.5]))], ["label", "features"]) > # Create a LogisticRegression instance. This instance is an Estimator. > lr = LogisticRegression(maxIter=10, regParam=0.01) > # Print out the parameters, documentation, and any default values. > print "LogisticRegression parameters:\n" + lr.explainParams() + "\n" > # Learn a LogisticRegression model. This uses the parameters stored in lr. > model1 = lr.fit(training) > # Since model1 is a Model (i.e., a transformer produced by an Estimator), > # we can view the parameters it used during fit(). > # This prints the parameter (name: value) pairs, where names are unique > # IDs for this
[jira] [Updated] (SPARK-13368) PySpark JavaModel fails to extract params from Spark side automatically
[ https://issues.apache.org/jira/browse/SPARK-13368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xusen Yin updated SPARK-13368: -- Description: JavaModel fails to extract params from Spark side automatically that causes model.extractParamMap() is always empty. As shown in the example code below copied from Spark Guide https://spark.apache.org/docs/latest/ml-guide.html#example-estimator-transformer-and-param {code} # Prepare training data from a list of (label, features) tuples. training = sqlContext.createDataFrame([ (1.0, Vectors.dense([0.0, 1.1, 0.1])), (0.0, Vectors.dense([2.0, 1.0, -1.0])), (0.0, Vectors.dense([2.0, 1.3, 1.0])), (1.0, Vectors.dense([0.0, 1.2, -0.5]))], ["label", "features"]) # Create a LogisticRegression instance. This instance is an Estimator. lr = LogisticRegression(maxIter=10, regParam=0.01) # Print out the parameters, documentation, and any default values. print "LogisticRegression parameters:\n" + lr.explainParams() + "\n" # Learn a LogisticRegression model. This uses the parameters stored in lr. model1 = lr.fit(training) # Since model1 is a Model (i.e., a transformer produced by an Estimator), # we can view the parameters it used during fit(). # This prints the parameter (name: value) pairs, where names are unique # IDs for this LogisticRegression instance. print "Model 1 was fit using parameters: " print model1.extractParamMap() {code} The result of model1.extractParamMap() is {}. Question is, should we provide the feature or not? If yes, we need either let Model share same params with Estimator or adds a parent in Model and points to its Estimator; if not, we should remove those lines from example code. was: JavaModel fails to extract params from Spark side automatically that causes model.extractParamMap() is always empty. As shown in the example code below copied from Spark Guide https://spark.apache.org/docs/latest/ml-guide.html {code} # Prepare training data from a list of (label, features) tuples. 
training = sqlContext.createDataFrame([ (1.0, Vectors.dense([0.0, 1.1, 0.1])), (0.0, Vectors.dense([2.0, 1.0, -1.0])), (0.0, Vectors.dense([2.0, 1.3, 1.0])), (1.0, Vectors.dense([0.0, 1.2, -0.5]))], ["label", "features"]) # Create a LogisticRegression instance. This instance is an Estimator. lr = LogisticRegression(maxIter=10, regParam=0.01) # Print out the parameters, documentation, and any default values. print "LogisticRegression parameters:\n" + lr.explainParams() + "\n" # Learn a LogisticRegression model. This uses the parameters stored in lr. model1 = lr.fit(training) # Since model1 is a Model (i.e., a transformer produced by an Estimator), # we can view the parameters it used during fit(). # This prints the parameter (name: value) pairs, where names are unique # IDs for this LogisticRegression instance. print "Model 1 was fit using parameters: " print model1.extractParamMap() {code} The result of model1.extractParamMap() is {}. Question is, should we provide the feature or not? If yes, we need either let Model share same params with Estimator or adds a parent in Model and points to its Estimator; if not, we should remove those lines from example code. > PySpark JavaModel fails to extract params from Spark side automatically > --- > > Key: SPARK-13368 > URL: https://issues.apache.org/jira/browse/SPARK-13368 > Project: Spark > Issue Type: Bug > Components: PySpark >Reporter: Xusen Yin >Priority: Minor > > JavaModel fails to extract params from Spark side automatically that causes > model.extractParamMap() is always empty. As shown in the example code below > copied from Spark Guide > https://spark.apache.org/docs/latest/ml-guide.html#example-estimator-transformer-and-param > {code} > # Prepare training data from a list of (label, features) tuples. 
> training = sqlContext.createDataFrame([ > (1.0, Vectors.dense([0.0, 1.1, 0.1])), > (0.0, Vectors.dense([2.0, 1.0, -1.0])), > (0.0, Vectors.dense([2.0, 1.3, 1.0])), > (1.0, Vectors.dense([0.0, 1.2, -0.5]))], ["label", "features"]) > # Create a LogisticRegression instance. This instance is an Estimator. > lr = LogisticRegression(maxIter=10, regParam=0.01) > # Print out the parameters, documentation, and any default values. > print "LogisticRegression parameters:\n" + lr.explainParams() + "\n" > # Learn a LogisticRegression model. This uses the parameters stored in lr. > model1 = lr.fit(training) > # Since model1 is a Model (i.e., a transformer produced by an Estimator), > # we can view the parameters it used during fit(). > # This prints the parameter (name: value) pairs, where names are unique > # IDs for this LogisticRegression instance. > print "Model 1 was fit using parameters: " > print model1.extractParamMap() > {code} > The result of model1.extractParamMap() is {}. > Question is, should we provide the feature or not? If yes, we need either let > Model share same params with Estimator or adds a parent in Model and points > to its Estimator; if not, we should remove those lines from example code.
[jira] [Created] (SPARK-13368) PySpark JavaModel fails to extract params from Spark side automatically
Xusen Yin created SPARK-13368: - Summary: PySpark JavaModel fails to extract params from Spark side automatically Key: SPARK-13368 URL: https://issues.apache.org/jira/browse/SPARK-13368 Project: Spark Issue Type: Bug Components: PySpark Reporter: Xusen Yin JavaModel fails to extract params from the Spark side automatically, which causes model.extractParamMap() to always be empty, as shown in the example code below copied from the Spark Guide https://spark.apache.org/docs/latest/ml-guide.html: {code} # Prepare training data from a list of (label, features) tuples. training = sqlContext.createDataFrame([ (1.0, Vectors.dense([0.0, 1.1, 0.1])), (0.0, Vectors.dense([2.0, 1.0, -1.0])), (0.0, Vectors.dense([2.0, 1.3, 1.0])), (1.0, Vectors.dense([0.0, 1.2, -0.5]))], ["label", "features"]) # Create a LogisticRegression instance. This instance is an Estimator. lr = LogisticRegression(maxIter=10, regParam=0.01) # Print out the parameters, documentation, and any default values. print "LogisticRegression parameters:\n" + lr.explainParams() + "\n" # Learn a LogisticRegression model. This uses the parameters stored in lr. model1 = lr.fit(training) # Since model1 is a Model (i.e., a transformer produced by an Estimator), # we can view the parameters it used during fit(). # This prints the parameter (name: value) pairs, where names are unique # IDs for this LogisticRegression instance. print "Model 1 was fit using parameters: " print model1.extractParamMap() {code} The result of model1.extractParamMap() is {}. The question is whether we should provide this feature or not. If yes, we need to either let Model share the same params with its Estimator, or add a parent to Model that points to its Estimator; if not, we should remove those lines from the example code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
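The second option above (a Model that keeps a copy of the params it was fit with, plus a parent pointing back at its Estimator) can be sketched in plain Python. This is a hypothetical stand-in for illustration, not PySpark's actual Params machinery — class and method names other than `extractParamMap` are invented:

```python
# Sketch of the proposed behavior: fit() copies the Estimator's params
# into the resulting Model, so extractParamMap() is no longer empty.
# These classes are illustrative only, not PySpark's API.

class Estimator:
    def __init__(self, **params):
        self.param_map = dict(params)

    def fit(self, training):
        # Hand the model a snapshot of the params used for this fit.
        return Model(parent=self, param_map=dict(self.param_map))

class Model:
    def __init__(self, parent, param_map):
        self.parent = parent        # back-reference to the Estimator
        self.param_map = param_map  # params the model was fit with

    def extractParamMap(self):
        return self.param_map

lr = Estimator(maxIter=10, regParam=0.01)
model1 = lr.fit(training=None)
print(model1.extractParamMap())  # non-empty, unlike the reported behavior
```

Either the shared-params or the parent-reference design would make `extractParamMap()` on the model return the fit-time values shown in the guide.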
[jira] [Updated] (SPARK-13368) PySpark JavaModel fails to extract params from Spark side automatically
[ https://issues.apache.org/jira/browse/SPARK-13368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xusen Yin updated SPARK-13368: -- Priority: Minor (was: Major) > PySpark JavaModel fails to extract params from Spark side automatically > --- > > Key: SPARK-13368 > URL: https://issues.apache.org/jira/browse/SPARK-13368 > Project: Spark > Issue Type: Bug > Components: PySpark >Reporter: Xusen Yin >Priority: Minor > > JavaModel fails to extract params from Spark side automatically that causes > model.extractParamMap() is always empty. As shown in the example code below > copied from Spark Guide https://spark.apache.org/docs/latest/ml-guide.html: > {code} > # Prepare training data from a list of (label, features) tuples. > training = sqlContext.createDataFrame([ > (1.0, Vectors.dense([0.0, 1.1, 0.1])), > (0.0, Vectors.dense([2.0, 1.0, -1.0])), > (0.0, Vectors.dense([2.0, 1.3, 1.0])), > (1.0, Vectors.dense([0.0, 1.2, -0.5]))], ["label", "features"]) > # Create a LogisticRegression instance. This instance is an Estimator. > lr = LogisticRegression(maxIter=10, regParam=0.01) > # Print out the parameters, documentation, and any default values. > print "LogisticRegression parameters:\n" + lr.explainParams() + "\n" > # Learn a LogisticRegression model. This uses the parameters stored in lr. > model1 = lr.fit(training) > # Since model1 is a Model (i.e., a transformer produced by an Estimator), > # we can view the parameters it used during fit(). > # This prints the parameter (name: value) pairs, where names are unique > # IDs for this LogisticRegression instance. > print "Model 1 was fit using parameters: " > print model1.extractParamMap() > {code} > The result of model1.extractParamMap() is {}. > Question is, should we provide the feature or not? If yes, we need either let > Model share same params with Estimator or adds a parent in Model and points > to its Estimator; if not, we should remove those lines from example code. 
[jira] [Resolved] (SPARK-13324) Update plugin, test, example dependencies for 2.x
[ https://issues.apache.org/jira/browse/SPARK-13324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-13324. Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 11206 [https://github.com/apache/spark/pull/11206] > Update plugin, test, example dependencies for 2.x > - > > Key: SPARK-13324 > URL: https://issues.apache.org/jira/browse/SPARK-13324 > Project: Spark > Issue Type: Improvement > Components: Build, Examples, Spark Core >Affects Versions: 2.0.0 >Reporter: Sean Owen >Assignee: Sean Owen >Priority: Minor > Fix For: 2.0.0 > > > I'd like to update dependencies for 2.x as much as we can. > To start, we can update the low-risk dependencies: build plugins, test > dependencies, third-party examples / integrations. > Later I'll try and propose more significant updates to core dependencies, > excepting those that break support or compatibility significantly. > Then we'll look at the tough updates separately. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13360) pyspark related environment variable is not propagated to driver in yarn-cluster mode
[ https://issues.apache.org/jira/browse/SPARK-13360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Zhang updated SPARK-13360: --- Description: Such as PYSPARK_DRIVER_PYTHON, PYSPARK_PYTHON, PYTHONHASHSEED. > pyspark related environment variable is not propagated to driver in > yarn-cluster mode > > > Key: SPARK-13360 > URL: https://issues.apache.org/jira/browse/SPARK-13360 > Project: Spark > Issue Type: Bug > Components: PySpark, YARN >Affects Versions: 1.6.0 >Reporter: Jeff Zhang > > Such as PYSPARK_DRIVER_PYTHON, PYSPARK_PYTHON, PYTHONHASHSEED.
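In yarn-cluster mode the driver runs inside the YARN application master, so environment variables set on the client are not visible to it unless forwarded explicitly. A sketch of the usual workaround, building `spark.yarn.appMasterEnv.*` entries (these configuration keys exist in Spark on YARN; the interpreter paths and values here are purely illustrative):

```python
# Forward PySpark-related client environment variables to the driver
# (application master) via spark.yarn.appMasterEnv.* configuration keys.
env_vars = ["PYSPARK_DRIVER_PYTHON", "PYSPARK_PYTHON", "PYTHONHASHSEED"]

# Illustrative client-side values (not real defaults):
client_env = {"PYSPARK_PYTHON": "/usr/bin/python2.7", "PYTHONHASHSEED": "0"}

conf = {
    "spark.yarn.appMasterEnv." + name: value
    for name, value in client_env.items()
    if name in env_vars
}

# Each entry would be passed to spark-submit as a --conf argument.
submit_args = ["--conf %s=%s" % kv for kv in sorted(conf.items())]
```

This is a manual workaround sketch; the JIRA asks Spark to propagate these variables automatically.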
[jira] [Updated] (SPARK-13360) pyspark related environment variable is not propagated to driver in yarn-cluster mode
[ https://issues.apache.org/jira/browse/SPARK-13360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Zhang updated SPARK-13360: --- Summary: pyspark related environment variable is not propagated to driver in yarn-cluster mode (was: PYSPARK_DRIVER_PYTHON and PYSPARK_PYTHON is not picked up in yarn-cluster mode) > pyspark related environment variable is not propagated to driver in > yarn-cluster mode > > > Key: SPARK-13360 > URL: https://issues.apache.org/jira/browse/SPARK-13360 > Project: Spark > Issue Type: Bug > Components: PySpark, YARN >Affects Versions: 1.6.0 >Reporter: Jeff Zhang >
[jira] [Updated] (SPARK-13363) Aggregator not working with DataFrame
[ https://issues.apache.org/jira/browse/SPARK-13363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-13363: - Priority: Blocker (was: Minor) > Aggregator not working with DataFrame > - > > Key: SPARK-13363 > URL: https://issues.apache.org/jira/browse/SPARK-13363 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: koert kuipers >Priority: Blocker > > The org.apache.spark.sql.expressions.Aggregator doc comment says: A base class > for user-defined aggregations, which can be used in [[DataFrame]] and > [[Dataset]]. > It works well with Dataset/GroupedDataset, but I am having no luck using it > with DataFrame/GroupedData. Does anyone have an example of how to use it with a > DataFrame? > In particular I would like to use it with this method in GroupedData: > {noformat} > def agg(expr: Column, exprs: Column*): DataFrame > {noformat} > Clearly it should be possible, since GroupedDataset uses that very same > method to do the work: > {noformat} > private def agg(exprs: Column*): DataFrame = > groupedData.agg(withEncoder(exprs.head), exprs.tail.map(withEncoder): _*) > {noformat} > The trick seems to be the wrapping in withEncoder, which is private. I tried > to do something like it myself, but I had no luck since it uses more private > stuff in TypedColumn. > Anyhow, my attempt at using it in a DataFrame: > {noformat} > val simpleSum = new SqlAggregator[Int, Int, Int] { > def zero: Int = 0 // The initial value. > def reduce(b: Int, a: Int) = b + a // Add an element to the running total. > def merge(b1: Int, b2: Int) = b1 + b2 // Merge intermediate values. > def finish(b: Int) = b // Return the final result. 
> }.toColumn > val df = sc.makeRDD(1 to 3).map(i => (i, i)).toDF("k", "v") > df.groupBy("k").agg(simpleSum).show > {noformat} > and the resulting error: > {noformat} > org.apache.spark.sql.AnalysisException: unresolved operator 'Aggregate > [k#104], [k#104,($anon$3(),mode=Complete,isDistinct=false) AS sum#106]; > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:46) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:241) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:50) > at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:122) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:50) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:46) > at > org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:34) > at org.apache.spark.sql.DataFrame.(DataFrame.scala:130) > at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:49) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
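For readers unfamiliar with the Aggregator contract that `simpleSum` implements, the zero/reduce/merge/finish lifecycle can be sketched in plain Python. This is a toy single-process stand-in for illustration only, not Spark's distributed execution or its API:

```python
def aggregate(pairs, zero, reduce_fn, merge_fn, finish_fn, partitions=2):
    """Toy grouped aggregation following the Aggregator contract:
    fold each partition's values with reduce, combine per-partition
    buffers with merge, then apply finish to each group's buffer."""
    partials = [{} for _ in range(partitions)]
    for i, (k, v) in enumerate(pairs):
        buf = partials[i % partitions]          # fake partitioning
        buf[k] = reduce_fn(buf.get(k, zero), v)
    merged = {}
    for buf in partials:
        for k, b in buf.items():
            merged[k] = merge_fn(merged[k], b) if k in merged else b
    return {k: finish_fn(b) for k, b in merged.items()}

# simpleSum from the report, expressed with the same four functions:
result = aggregate([(1, 1), (2, 2), (3, 3), (1, 4)],
                   zero=0,
                   reduce_fn=lambda b, a: b + a,
                   merge_fn=lambda b1, b2: b1 + b2,
                   finish_fn=lambda b: b)
```

The bug is not in these semantics (which work via Dataset/GroupedDataset) but in plugging such an aggregator into the untyped `GroupedData.agg` path without the private `withEncoder` wrapping.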
[jira] [Updated] (SPARK-13363) Aggregator not working with DataFrame
[ https://issues.apache.org/jira/browse/SPARK-13363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-13363: - Affects Version/s: (was: 2.0.0) 1.6.0 Target Version/s: 2.0.0 > Aggregator not working with DataFrame > - > > Key: SPARK-13363 > URL: https://issues.apache.org/jira/browse/SPARK-13363 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: koert kuipers >Priority: Minor > > org.apache.spark.sql.expressions.Aggregator doc/comments says: A base class > for user-defined aggregations, which can be used in [[DataFrame]] and > [[Dataset]] > it works well with Dataset/GroupedDataset, but i am having no luck using it > with DataFrame/GroupedData. does anyone have an example how to use it with a > DataFrame? > in particular i would like to use it with this method in GroupedData: > {noformat} > def agg(expr: Column, exprs: Column*): DataFrame > {noformat} > clearly it should be possible, since GroupedDataset uses that very same > method to do the work: > {noformat} > private def agg(exprs: Column*): DataFrame = > groupedData.agg(withEncoder(exprs.head), exprs.tail.map(withEncoder): _*) > {noformat} > the trick seems to be the wrapping in withEncoder, which is private. i tried > to do something like it myself, but i had no luck since it uses more private > stuff in TypedColumn. > anyhow, my attempt at using it in DataFrame: > {noformat} > val simpleSum = new SqlAggregator[Int, Int, Int] { > def zero: Int = 0 // The initial value. > def reduce(b: Int, a: Int) = b + a// Add an element to the running total > def merge(b1: Int, b2: Int) = b1 + b2 // Merge intermediate values. > def finish(b: Int) = b// Return the final result. 
> }.toColumn > val df = sc.makeRDD(1 to 3).map(i => (i, i)).toDF("k", "v") > df.groupBy("k").agg(simpleSum).show > {noformat} > and the resulting error: > {noformat} > org.apache.spark.sql.AnalysisException: unresolved operator 'Aggregate > [k#104], [k#104,($anon$3(),mode=Complete,isDistinct=false) AS sum#106]; > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:46) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:241) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:50) > at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:122) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:50) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:46) > at > org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:34) > at org.apache.spark.sql.DataFrame.(DataFrame.scala:130) > at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:49) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-13183) Bytebuffers occupy a large amount of heap memory
[ https://issues.apache.org/jira/browse/SPARK-13183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15151604#comment-15151604 ] dylanzhou edited comment on SPARK-13183 at 2/18/16 2:33 AM: [~srowen] I don't know whether this is a memory leak problem; I get the heap memory error java.lang.OutOfMemoryError: Java heap space. When I try to increase driver memory, the streaming program just runs a little longer. In my opinion the byte[] objects cannot be reclaimed by the GC; these cached objects are Spark SQL table rows. When I increase the amount of data flowing into Kafka, memory is consumed even faster. Can you give me some advice? Here is my question, thank you! http://apache-spark-user-list.1001560.n3.nabble.com/the-memory-leak-problem-of-use-sparkstreamimg-and-sparksql-with-kafka-in-spark-1-4-1-td26231.html was (Author: dylanzhou): @Sean Owen maybe this is a memory leak problem, and it finally runs out of heap memory with the error java.lang.OutOfMemoryError: Java heap space. When I try to increase driver memory, the streaming program just runs a little longer; in my opinion the byte[] objects cannot be reclaimed by the GC. When I increase the amount of data flowing into Kafka, memory is consumed even faster. Can you give me some advice? Here is my question, thank you! 
http://apache-spark-user-list.1001560.n3.nabble.com/the-memory-leak-problem-of-use-sparkstreamimg-and-sparksql-with-kafka-in-spark-1-4-1-td26231.html > Bytebuffers occupy a large amount of heap memory > > > Key: SPARK-13183 > URL: https://issues.apache.org/jira/browse/SPARK-13183 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.1 >Reporter: dylanzhou > > When I use Spark Streaming and Spark SQL and cache the table, I find that the old gen > increases very fast and full GC is very frequent; after running for a period of time it > goes out of memory. After analyzing the heap memory, I found a large number of > org.apache.spark.sql.columnar.ColumnBuilder[38] @ 0xd022a0b8 objects taking up 90% of > the space; looking at the source, it is HeapByteBuffer that occupies the memory. I don't > know why these objects are not released and have been waiting for the GC to recycle them. > If I do not cache the table, this problem does not occur, but I need to repeatedly query > this table.
[jira] [Issue Comment Deleted] (SPARK-13183) Bytebuffers occupy a large amount of heap memory
[ https://issues.apache.org/jira/browse/SPARK-13183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] dylanzhou updated SPARK-13183: -- Comment: was deleted (was: @Sean Owen maybe is a memory leak problem, and finally will run out of heap memory error java.lang.OutOfMemoryError:Java for heap space. When I try to increase driver memory, just streaming programs work a little longer, in my opinion byte[] objects cannot be reclaimed by the GC.When I increase the amount of data that flows into Kafka, memory consumption and faster。 Can you give me some advice? Here is my question, thank you! http://apache-spark-user-list.1001560.n3.nabble.com/the-memory-leak-problem-of-use-sparkstreamimg-and-sparksql-with-kafka-in-spark-1-4-1-td26231.html ) > Bytebuffers occupy a large amount of heap memory > > > Key: SPARK-13183 > URL: https://issues.apache.org/jira/browse/SPARK-13183 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.1 >Reporter: dylanzhou > > When I used sparkstreamimg and sparksql, i cache the table,found that old gen > increases very fast and full GC is very frequent, running for a period of > time will be out of memory, after analysis of heap memory, found that there > are a large number of org.apache.spark.sql.columnar.ColumnBuilder[38] @ > 0xd022a0b8, takes up 90% of the space, look at the source is HeapByteBuffer > occupy, don't know why these objects are not released, had been waiting for > GC to recycle;if i donot use cache table, there will be no this problem, but > I need to repeatedly query this table do -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13183) Bytebuffers occupy a large amount of heap memory
[ https://issues.apache.org/jira/browse/SPARK-13183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15151604#comment-15151604 ] dylanzhou commented on SPARK-13183: --- @Sean Owen maybe is a memory leak problem, and finally will run out of heap memory error java.lang.OutOfMemoryError:Java for heap space. When I try to increase driver memory, just streaming programs work a little longer, in my opinion byte[] objects cannot be reclaimed by the GC.When I increase the amount of data that flows into Kafka, memory consumption and faster。 Can you give me some advice? Here is my question, thank you! http://apache-spark-user-list.1001560.n3.nabble.com/the-memory-leak-problem-of-use-sparkstreamimg-and-sparksql-with-kafka-in-spark-1-4-1-td26231.html > Bytebuffers occupy a large amount of heap memory > > > Key: SPARK-13183 > URL: https://issues.apache.org/jira/browse/SPARK-13183 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.1 >Reporter: dylanzhou > > When I used sparkstreamimg and sparksql, i cache the table,found that old gen > increases very fast and full GC is very frequent, running for a period of > time will be out of memory, after analysis of heap memory, found that there > are a large number of org.apache.spark.sql.columnar.ColumnBuilder[38] @ > 0xd022a0b8, takes up 90% of the space, look at the source is HeapByteBuffer > occupy, don't know why these objects are not released, had been waiting for > GC to recycle;if i donot use cache table, there will be no this problem, but > I need to repeatedly query this table do -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10001) Allow Ctrl-C in spark-shell to kill running job
[ https://issues.apache.org/jira/browse/SPARK-10001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15151591#comment-15151591 ] Jon Maurer commented on SPARK-10001: I have a number of users who would find this feature to be useful. The related scenario is hitting ctrl-c by mistake as an attempt to copy output and accidentally killing the shell, even if a job has not yet been submitted. Is there other interest in this feature? > Allow Ctrl-C in spark-shell to kill running job > --- > > Key: SPARK-10001 > URL: https://issues.apache.org/jira/browse/SPARK-10001 > Project: Spark > Issue Type: Sub-task > Components: Spark Shell >Affects Versions: 1.4.1 >Reporter: Cheolsoo Park >Priority: Minor > > Hitting Ctrl-C in spark-sql (and other tools like presto) cancels any running > job and starts a new input line on the prompt. It would be nice if > spark-shell also can do that. Otherwise, in case a user submits a job, say he > made a mistake, and wants to cancel it, he needs to exit the shell and > re-login to continue his work. Re-login can be a pain especially in Spark on > yarn, since it takes a while to allocate AM container and initial executors. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
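The requested behavior — intercept Ctrl-C and cancel the running job instead of killing the shell — can be sketched as a SIGINT handler. This is an illustrative stand-in: `cancel_all_jobs` is a placeholder callback; in a real shell it would invoke `sc.cancelAllJobs()` on the SparkContext:

```python
import signal

def install_interrupt_handler(cancel_all_jobs):
    """Install a SIGINT handler that cancels running jobs rather than
    exiting, mirroring the spark-sql / presto behavior described above."""
    def handler(signum, frame):
        # In spark-shell this would call sc.cancelAllJobs() and then
        # return to the prompt instead of terminating the process.
        cancel_all_jobs()
    signal.signal(signal.SIGINT, handler)

# Hypothetical wiring for demonstration:
cancelled = []
install_interrupt_handler(lambda: cancelled.append(True))
```

A shell would additionally need to redraw its prompt after the handler runs; the sketch covers only the cancel-instead-of-exit part.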
[jira] [Commented] (SPARK-6263) Python MLlib API missing items: Utils
[ https://issues.apache.org/jira/browse/SPARK-6263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15151508#comment-15151508 ] Bruno Wu commented on SPARK-6263: - kFold function is still not available in util.py (as far as I can see). Can I work on this JIRA to add in kFold? > Python MLlib API missing items: Utils > - > > Key: SPARK-6263 > URL: https://issues.apache.org/jira/browse/SPARK-6263 > Project: Spark > Issue Type: Sub-task > Components: MLlib, PySpark >Affects Versions: 1.3.0 >Reporter: Joseph K. Bradley >Assignee: Kai Sasaki > Fix For: 1.5.0 > > > This JIRA lists items missing in the Python API for this sub-package of MLlib. > This list may be incomplete, so please check again when sending a PR to add > these features to the Python API. > Also, please check for major disparities between documentation; some parts of > the Python API are less well-documented than their Scala counterparts. Some > items may be listed in the umbrella JIRA linked to this task. > MLUtils > * appendBias > * kFold > * loadVectors -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
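For context, the missing kFold utility splits a dataset into k (training, validation) pairs where each element lands in exactly one validation fold. A plain-Python sketch of that contract over indices (illustrative only, not the eventual PySpark MLUtils API):

```python
import random

def k_fold(indices, k, seed=42):
    """Return k (training, validation) index splits, following the
    MLUtils.kFold contract: each element appears in exactly one
    validation fold, and training is everything else."""
    shuffled = list(indices)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]  # round-robin folds
    return [(sorted(set(shuffled) - set(fold)), sorted(fold))
            for fold in folds]

splits = k_fold(range(10), k=3)
```

A real PySpark version would operate on RDDs (e.g. via sampling with complements) rather than index lists, but the split semantics are the same.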
[jira] [Commented] (SPARK-13367) Refactor KinesisUtils to specify more KCL options
[ https://issues.apache.org/jira/browse/SPARK-13367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15151485#comment-15151485 ] Apache Spark commented on SPARK-13367: -- User 'addisonj' has created a pull request for this issue: https://github.com/apache/spark/pull/11245 > Refactor KinesisUtils to specify more KCL options > - > > Key: SPARK-13367 > URL: https://issues.apache.org/jira/browse/SPARK-13367 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.6.0 >Reporter: Addison Higham > > Currently, the KinesisUtils doesn't allow for configuring certain options, > such as the dynamoDB endpoint or cloudwatch options. > This is a useful feature for being able to do local integration testing with > a tool like https://github.com/mhart/kinesalite and DynamoDB-Local. > The code is also somewhat complicated as related configuration options are > passed independently and could be improved by a configuration object that > owns all the configuration concerns. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13367) Refactor KinesisUtils to specify more KCL options
[ https://issues.apache.org/jira/browse/SPARK-13367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13367: Assignee: Apache Spark > Refactor KinesisUtils to specify more KCL options > - > > Key: SPARK-13367 > URL: https://issues.apache.org/jira/browse/SPARK-13367 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.6.0 >Reporter: Addison Higham >Assignee: Apache Spark > > Currently, the KinesisUtils doesn't allow for configuring certain options, > such as the dynamoDB endpoint or cloudwatch options. > This is a useful feature for being able to do local integration testing with > a tool like https://github.com/mhart/kinesalite and DynamoDB-Local. > The code is also somewhat complicated as related configuration options are > passed independently and could be improved by a configuration object that > owns all the configuration concerns. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13367) Refactor KinesisUtils to specify more KCL options
[ https://issues.apache.org/jira/browse/SPARK-13367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13367: Assignee: (was: Apache Spark) > Refactor KinesisUtils to specify more KCL options > - > > Key: SPARK-13367 > URL: https://issues.apache.org/jira/browse/SPARK-13367 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.6.0 >Reporter: Addison Higham > > Currently, the KinesisUtils doesn't allow for configuring certain options, > such as the dynamoDB endpoint or cloudwatch options. > This is a useful feature for being able to do local integration testing with > a tool like https://github.com/mhart/kinesalite and DynamoDB-Local. > The code is also somewhat complicated as related configuration options are > passed independently and could be improved by a configuration object that > owns all the configuration concerns. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-13344) Tests have many "accumulator not found" exceptions
[ https://issues.apache.org/jira/browse/SPARK-13344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-13344. Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 11222 [https://github.com/apache/spark/pull/11222] > Tests have many "accumulator not found" exceptions > -- > > Key: SPARK-13344 > URL: https://issues.apache.org/jira/browse/SPARK-13344 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 2.0.0 >Reporter: Andrew Or >Assignee: Andrew Or > Fix For: 2.0.0 > > > This is because SparkFunSuite clears all accumulators after every single > test. This suite reuses a DF and all of its associated internal accumulators > across many tests. E.g. SaveLoadSuite, InnerJoinSuite, many others. > This is likely caused by SPARK-10620. > {code} > 10:52:38.967 WARN org.apache.spark.executor.TaskMetrics: encountered > unregistered accumulator 253 when reconstructing task metrics. > 10:52:38.967 ERROR org.apache.spark.scheduler.DAGScheduler: Failed to update > accumulators for task 0 > org.apache.spark.SparkException: attempted to access non-existent accumulator > 253 > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1099) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1091) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13367) Refactor KinesisUtils to specify more KCL options
Addison Higham created SPARK-13367: -- Summary: Refactor KinesisUtils to specify more KCL options Key: SPARK-13367 URL: https://issues.apache.org/jira/browse/SPARK-13367 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.6.0 Reporter: Addison Higham Currently, the KinesisUtils doesn't allow for configuring certain options, such as the dynamoDB endpoint or cloudwatch options. This is a useful feature for being able to do local integration testing with a tool like https://github.com/mhart/kinesalite and DynamoDB-Local. The code is also somewhat complicated as related configuration options are passed independently and could be improved by a configuration object that owns all the configuration concerns. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
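The suggested refactor — one configuration object that owns the related endpoint and CloudWatch concerns instead of independent arguments — could look like the following sketch. All names, fields, and defaults here are hypothetical, not a proposed Spark API:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class KinesisReceiverConfig:
    """Groups the KCL options that KinesisUtils currently takes (or
    omits) as independent arguments. Field names are illustrative."""
    stream_name: str
    kinesis_endpoint: str = "https://kinesis.us-east-1.amazonaws.com"
    dynamodb_endpoint: Optional[str] = None  # None: KCL default region endpoint
    cloudwatch_enabled: bool = True

# Local integration testing against kinesalite / DynamoDB-Local:
local = KinesisReceiverConfig(
    stream_name="test-stream",
    kinesis_endpoint="http://localhost:4567",
    dynamodb_endpoint="http://localhost:8000",
    cloudwatch_enabled=False,
)
```

Passing one such object keeps related options together and lets new KCL settings be added without growing every KinesisUtils overload.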
[jira] [Commented] (SPARK-2541) Standalone mode can't access secure HDFS anymore
[ https://issues.apache.org/jira/browse/SPARK-2541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15151406#comment-15151406 ] Henry Saputra commented on SPARK-2541: -- Based on discussion on https://github.com/apache/spark/pull/2320 Seemed like we should not close this as dup of https://issues.apache.org/jira/browse/SPARK-3438 This should cover case where a standalone cluster is used to access secure HDFS for single user scenario. > Standalone mode can't access secure HDFS anymore > > > Key: SPARK-2541 > URL: https://issues.apache.org/jira/browse/SPARK-2541 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 1.0.0, 1.0.1 >Reporter: Thomas Graves > Attachments: SPARK-2541-partial.patch > > > In spark 0.9.x you could access secure HDFS from Standalone deploy, that > doesn't work in 1.X anymore. > It looks like the issues is in SparkHadoopUtil.runAsSparkUser. Previously it > wouldn't do the doAs if the currentUser == user. Not sure how it affects > when the daemons run as a super user but SPARK_USER is set to someone else. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12953) RDDRelation write set mode will be better to avoid error "pair.parquet already exists"
[ https://issues.apache.org/jira/browse/SPARK-12953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-12953. Resolution: Fixed Fix Version/s: 2.0.0 Fixed by PR for 2.0.0. > RDDRelation write set mode will be better to avoid error "pair.parquet > already exists" > -- > > Key: SPARK-12953 > URL: https://issues.apache.org/jira/browse/SPARK-12953 > Project: Spark > Issue Type: Improvement > Components: Examples >Reporter: shijinkui >Assignee: shijinkui >Priority: Minor > Fix For: 2.0.0 > > > An error is raised if no write mode is set when executing the test case > `RDDRelation.main()`: > Exception in thread "main" org.apache.spark.sql.AnalysisException: path > file:/Users/sjk/pair.parquet already exists.; > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:76) > at > org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58) > at > org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56) > at > org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130) > at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55) > at > org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:256) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:148) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:139) > at > org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:329) > 
at net.pusuo.gs.sql.RDDRelation$.main(RDDRelation.scala:65) > at net.pusuo.gs.sql.RDDRelation.main(RDDRelation.scala) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
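The fix this issue asks for amounts to passing an explicit write mode. A minimal sketch, assuming a DataFrame `df` is already in scope (the surrounding SparkSession/SQLContext setup and the RDDRelation example's own data are omitted):

```scala
// Assumes `df` is an existing DataFrame; the point is only the .mode(...) call.
import org.apache.spark.sql.SaveMode

df.write
  .mode(SaveMode.Overwrite) // replace existing output instead of failing with "path ... already exists"
  .parquet("pair.parquet")
```

With `SaveMode.Overwrite` (or `Append`/`Ignore`), re-running `RDDRelation.main()` no longer trips over the output left by a previous run.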
[jira] [Updated] (SPARK-12953) RDDRelation write set mode will be better to avoid error "pair.parquet already exists"
[ https://issues.apache.org/jira/browse/SPARK-12953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-12953: --- Assignee: shijinkui > RDDRelation write set mode will be better to avoid error "pair.parquet > already exists" > -- > > Key: SPARK-12953 > URL: https://issues.apache.org/jira/browse/SPARK-12953 > Project: Spark > Issue Type: Improvement > Components: Examples >Reporter: shijinkui >Assignee: shijinkui >Priority: Minor > Fix For: 2.0.0 > > > It will be error if not set Write Mode when execute test case > `RDDRelation.main()` > Exception in thread "main" org.apache.spark.sql.AnalysisException: path > file:/Users/sjk/pair.parquet already exists.; > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:76) > at > org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58) > at > org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56) > at > org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130) > at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55) > at > org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:256) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:148) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:139) > at > org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:329) > at 
net.pusuo.gs.sql.RDDRelation$.main(RDDRelation.scala:65) > at net.pusuo.gs.sql.RDDRelation.main(RDDRelation.scala) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-12953) RDDRelation write set mode will be better to avoid error "pair.parquet already exists"
[ https://issues.apache.org/jira/browse/SPARK-12953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen reopened SPARK-12953: > RDDRelation write set mode will be better to avoid error "pair.parquet > already exists" > -- > > Key: SPARK-12953 > URL: https://issues.apache.org/jira/browse/SPARK-12953 > Project: Spark > Issue Type: Improvement > Components: Examples >Reporter: shijinkui >Assignee: shijinkui >Priority: Minor > Fix For: 2.0.0 > > > It will be error if not set Write Mode when execute test case > `RDDRelation.main()` > Exception in thread "main" org.apache.spark.sql.AnalysisException: path > file:/Users/sjk/pair.parquet already exists.; > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:76) > at > org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58) > at > org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56) > at > org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130) > at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55) > at > org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:256) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:148) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:139) > at > org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:329) > at net.pusuo.gs.sql.RDDRelation$.main(RDDRelation.scala:65) > 
at net.pusuo.gs.sql.RDDRelation.main(RDDRelation.scala) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-13109) SBT publishLocal failed to publish to local ivy repo
[ https://issues.apache.org/jira/browse/SPARK-13109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-13109. Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 11001 [https://github.com/apache/spark/pull/11001] > SBT publishLocal failed to publish to local ivy repo > > > Key: SPARK-13109 > URL: https://issues.apache.org/jira/browse/SPARK-13109 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 1.6.0 >Reporter: Saisai Shao > Fix For: 2.0.0 > > > Because of overriding the {{external-resolvers}} to Maven central and local > only in sbt, now {{sbt publishLocal}} is failed to publish to local ivy > repo, the detailed exception is showing in the dev mail list > (http://apache-spark-developers-list.1001551.n3.nabble.com/sbt-publish-local-fails-with-2-0-0-SNAPSHOT-td16168.html). > Possibly two solutions: > 1. Add ivy local repo to the {{external-resolvers}}. > 2. Do not publish to local ivy repo. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
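Option 1 above can be sketched as a build-definition fragment (untested; `externalResolvers` is the standard sbt key, though the exact resolver list Spark's build overrides may differ):

```scala
// Put the local ivy repo (~/.ivy2/local) back at the front of the resolver
// chain so `sbt publishLocal` can publish to and resolve against it.
externalResolvers := Resolver.defaultLocal +: externalResolvers.value
```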
[jira] [Updated] (SPARK-13109) SBT publishLocal failed to publish to local ivy repo
[ https://issues.apache.org/jira/browse/SPARK-13109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-13109: --- Assignee: Saisai Shao > SBT publishLocal failed to publish to local ivy repo > > > Key: SPARK-13109 > URL: https://issues.apache.org/jira/browse/SPARK-13109 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 1.6.0 >Reporter: Saisai Shao >Assignee: Saisai Shao > Fix For: 2.0.0 > > > Because of overriding the {{external-resolvers}} to Maven central and local > only in sbt, now {{sbt publishLocal}} is failed to publish to local ivy > repo, the detailed exception is showing in the dev mail list > (http://apache-spark-developers-list.1001551.n3.nabble.com/sbt-publish-local-fails-with-2-0-0-SNAPSHOT-td16168.html). > Possibly two solutions: > 1. Add ivy local repo to the {{external-resolvers}}. > 2. Do not publish to local ivy repo. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12449) Pushing down arbitrary logical plans to data sources
[ https://issues.apache.org/jira/browse/SPARK-12449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15151368#comment-15151368 ] Max Seiden commented on SPARK-12449: Very interested in checking out that PR! It would be prudent to have a holistic high-level design for any work here too, mostly to answer a few major questions. A random sample of such Qs: + Should there be a new trait for each new `sources.*` type, or a single trait that communicates capabilities to the planner (i.e. the CatalystSource design)? a) a new trait for each source could get unwieldy given the potential # of permutations b) a single, generic trait is powerful, but it puts a lot of burden on the implementer to cover more cases than they may want + Depending on the above, should source plans be a tree of operators or a list of operators to be applied in-order? a) the first option is more natural, but is smells a lot like catalyst -- not a bad thing if it's a separate, stable API though + the more that's pushed down via sources.Expressions, the more complex things may get for implementers a) for example, if Aliases are pushed down, there's a lot more opportunity for resolution bugs in the source impl b) a definitive stance would be needed for exprs like UDFs or those dealing with complex types c) without a way to signal capabilities (implicitly or explicitly) to the planner, there'd likely need to be a way to "bail out" > Pushing down arbitrary logical plans to data sources > > > Key: SPARK-12449 > URL: https://issues.apache.org/jira/browse/SPARK-12449 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Stephan Kessler > Attachments: pushingDownLogicalPlans.pdf > > > With the help of the DataSource API we can pull data from external sources > for processing. Implementing interfaces such as {{PrunedFilteredScan}} allows > to push down filters and projects pruning unnecessary fields and rows > directly in the data source. 
> However, data sources such as SQL Engines are capable of doing even more > preprocessing, e.g., evaluating aggregates. This is beneficial because it > would reduce the amount of data transferred from the source to Spark. The > existing interfaces do not allow such kind of processing in the source. > We would propose to add a new interface {{CatalystSource}} that allows to > defer the processing of arbitrary logical plans to the data source. We have > already shown the details at the Spark Summit 2015 Europe > [https://spark-summit.org/eu-2015/events/the-pushdown-of-everything/] > I will add a design document explaining details. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
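For readers skimming the thread, the kind of interface under discussion can be sketched roughly as follows (names and signatures are illustrative only, not an actual Spark API; `LogicalPlan`, `RDD`, and `Row` stand in for the existing Catalyst/Spark types):

```scala
// Hypothetical shape of a CatalystSource-style contract: the source declares
// which logical plans it can evaluate, and the planner falls back to normal
// Spark execution for anything the source rejects.
trait CatalystSource {
  // Capability check: lets the planner "bail out" when the source
  // cannot handle a given plan (cf. Max Seiden's point above).
  def supportsLogicalPlan(plan: LogicalPlan): Boolean

  // Defer execution of the whole (sub)plan to the data source.
  def logicalPlanToRDD(plan: LogicalPlan): RDD[Row]
}
```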
[jira] [Commented] (SPARK-12449) Pushing down arbitrary logical plans to data sources
[ https://issues.apache.org/jira/browse/SPARK-12449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15151345#comment-15151345 ] Evan Chan commented on SPARK-12449: --- [~stephank85] would you have any code to share? :D > Pushing down arbitrary logical plans to data sources > > > Key: SPARK-12449 > URL: https://issues.apache.org/jira/browse/SPARK-12449 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Stephan Kessler > Attachments: pushingDownLogicalPlans.pdf > > > With the help of the DataSource API we can pull data from external sources > for processing. Implementing interfaces such as {{PrunedFilteredScan}} allows > to push down filters and projects pruning unnecessary fields and rows > directly in the data source. > However, data sources such as SQL Engines are capable of doing even more > preprocessing, e.g., evaluating aggregates. This is beneficial because it > would reduce the amount of data transferred from the source to Spark. The > existing interfaces do not allow such kind of processing in the source. > We would propose to add a new interface {{CatalystSource}} that allows to > defer the processing of arbitrary logical plans to the data source. We have > already shown the details at the Spark Summit 2015 Europe > [https://spark-summit.org/eu-2015/events/the-pushdown-of-everything/] > I will add a design document explaining details. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12449) Pushing down arbitrary logical plans to data sources
[ https://issues.apache.org/jira/browse/SPARK-12449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15151339#comment-15151339 ] Stephan Kessler commented on SPARK-12449: - [~maxseiden] good idea! In order to simplify things even more - we could get rid (at least in the first shot) of the partitioned and holistic approach, since we aim for databases as datasources. What do you think on keeping the ability to kind of ask the datasource if it supports the pushdown of a well-defined operation? This would simplify the implementation of the datasource as well as the Strategy for the planner. [~velvia] i am currently working heavily on the pushdown of partial aggregates in combination with Tungsten, so i am happy to contribute in that direction. Should i try to formulate a new/simplified design doc that covers the gradual approach? I am very happy to help with the PR and the definitions of tasks as well. > Pushing down arbitrary logical plans to data sources > > > Key: SPARK-12449 > URL: https://issues.apache.org/jira/browse/SPARK-12449 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Stephan Kessler > Attachments: pushingDownLogicalPlans.pdf > > > With the help of the DataSource API we can pull data from external sources > for processing. Implementing interfaces such as {{PrunedFilteredScan}} allows > to push down filters and projects pruning unnecessary fields and rows > directly in the data source. > However, data sources such as SQL Engines are capable of doing even more > preprocessing, e.g., evaluating aggregates. This is beneficial because it > would reduce the amount of data transferred from the source to Spark. The > existing interfaces do not allow such kind of processing in the source. > We would propose to add a new interface {{CatalystSource}} that allows to > defer the processing of arbitrary logical plans to the data source. 
We have > already shown the details at the Spark Summit 2015 Europe > [https://spark-summit.org/eu-2015/events/the-pushdown-of-everything/] > I will add a design document explaining details. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13366) Support Cartesian join for Datasets
[ https://issues.apache.org/jira/browse/SPARK-13366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiu (Joe) Guo updated SPARK-13366: -- Description: Saw a comment from [~marmbrus] regarding Cartesian join for Datasets: "You will get a cartesian if you do a join/joinWith using lit(true) as the condition. We could consider adding an API for doing that more concisely." was: Saw a comment from [~marmbrus] about this: "You will get a cartesian if you do a join/joinWith using lit(true) as the condition. We could consider adding an API for doing that more concisely." > Support Cartesian join for Datasets > --- > > Key: SPARK-13366 > URL: https://issues.apache.org/jira/browse/SPARK-13366 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiu (Joe) Guo >Priority: Minor > > Saw a comment from [~marmbrus] regarding Cartesian join for Datasets: > "You will get a cartesian if you do a join/joinWith using lit(true) as the > condition. We could consider adding an API for doing that more concisely." -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13366) Support Cartesian join for Datasets
[ https://issues.apache.org/jira/browse/SPARK-13366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15151331#comment-15151331 ] Apache Spark commented on SPARK-13366: -- User 'xguo27' has created a pull request for this issue: https://github.com/apache/spark/pull/11244 > Support Cartesian join for Datasets > --- > > Key: SPARK-13366 > URL: https://issues.apache.org/jira/browse/SPARK-13366 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiu (Joe) Guo >Priority: Minor > > Saw a comment from [~marmbrus] about this: > "You will get a cartesian if you do a join/joinWith using lit(true) as the > condition. We could consider adding an API for doing that more concisely." -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13366) Support Cartesian join for Datasets
[ https://issues.apache.org/jira/browse/SPARK-13366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13366: Assignee: Apache Spark > Support Cartesian join for Datasets > --- > > Key: SPARK-13366 > URL: https://issues.apache.org/jira/browse/SPARK-13366 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiu (Joe) Guo >Assignee: Apache Spark >Priority: Minor > > Saw a comment from [~marmbrus] about this: > "You will get a cartesian if you do a join/joinWith using lit(true) as the > condition. We could consider adding an API for doing that more concisely." -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13366) Support Cartesian join for Datasets
[ https://issues.apache.org/jira/browse/SPARK-13366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13366: Assignee: (was: Apache Spark) > Support Cartesian join for Datasets > --- > > Key: SPARK-13366 > URL: https://issues.apache.org/jira/browse/SPARK-13366 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiu (Joe) Guo >Priority: Minor > > Saw a comment from [~marmbrus] about this: > "You will get a cartesian if you do a join/joinWith using lit(true) as the > condition. We could consider adding an API for doing that more concisely." -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13366) Support Cartesian join for Datasets
Xiu (Joe) Guo created SPARK-13366: - Summary: Support Cartesian join for Datasets Key: SPARK-13366 URL: https://issues.apache.org/jira/browse/SPARK-13366 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.0.0 Reporter: Xiu (Joe) Guo Priority: Minor Saw a comment from Michael about this: "You will get a cartesian if you do a join/joinWith using lit(true) as the condition. We could consider adding an API for doing that more concisely." -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
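The workaround quoted in the description looks like this in practice (a sketch; assumes a SparkSession with `spark.implicits._` imported so that `toDS()` is available):

```scala
import org.apache.spark.sql.functions.lit

val left  = Seq(1, 2).toDS()
val right = Seq("a", "b").toDS()

// A join condition that is literally `true` degenerates into a cartesian
// product: every row of `left` paired with every row of `right` (2 x 2 = 4 pairs).
val cartesian = left.joinWith(right, lit(true))
```

A dedicated API would make the cartesian intent explicit instead of hiding it behind `lit(true)`.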
[jira] [Commented] (SPARK-12224) R support for JDBC source
[ https://issues.apache.org/jira/browse/SPARK-12224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15151298#comment-15151298 ] Felix Cheung commented on SPARK-12224: -- [~shivaram] could you please review the PR comment https://github.com/apache/spark/pull/10480#discussion_r50348037 > R support for JDBC source > - > > Key: SPARK-12224 > URL: https://issues.apache.org/jira/browse/SPARK-12224 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 1.5.2 >Reporter: Felix Cheung >Assignee: Apache Spark >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13242) Moderately complex `when` expression causes code generation failure
[ https://issues.apache.org/jira/browse/SPARK-13242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15151291#comment-15151291 ] Apache Spark commented on SPARK-13242: -- User 'joehalliwell' has created a pull request for this issue: https://github.com/apache/spark/pull/11243 > Moderately complex `when` expression causes code generation failure > --- > > Key: SPARK-13242 > URL: https://issues.apache.org/jira/browse/SPARK-13242 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Joe Halliwell > > Moderately complex `when` expressions produce generated code that busts the > 64KB method limit. This causes code generation to fail. > Here's a test case exhibiting the problem: > https://github.com/joehalliwell/spark/commit/4dbdf6e15d1116b8e1eb44822fd29ead9b7d817d > I'm interested in working on a fix. I'm thinking it may be possible to split > the expressions along the lines of SPARK-8443, but any pointers would be > welcome! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13344) Tests have many "accumulator not found" exceptions
[ https://issues.apache.org/jira/browse/SPARK-13344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-13344: -- Summary: Tests have many "accumulator not found" exceptions (was: SaveLoadSuite has many accumulator exceptions) > Tests have many "accumulator not found" exceptions > -- > > Key: SPARK-13344 > URL: https://issues.apache.org/jira/browse/SPARK-13344 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 2.0.0 >Reporter: Andrew Or >Assignee: Andrew Or > > This is because SparkFunSuite clears all accumulators after every single > test. This suite reuses a DF and all of its associated internal accumulators > across many tests. > This is likely caused by SPARK-10620. > {code} > 10:52:38.967 WARN org.apache.spark.executor.TaskMetrics: encountered > unregistered accumulator 253 when reconstructing task metrics. > 10:52:38.967 ERROR org.apache.spark.scheduler.DAGScheduler: Failed to update > accumulators for task 0 > org.apache.spark.SparkException: attempted to access non-existent accumulator > 253 > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1099) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1091) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13344) Tests have many "accumulator not found" exceptions
[ https://issues.apache.org/jira/browse/SPARK-13344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-13344: -- Description: This is because SparkFunSuite clears all accumulators after every single test. This suite reuses a DF and all of its associated internal accumulators across many tests. E.g. SaveLoadSuite, InnerJoinSuite, many others. This is likely caused by SPARK-10620. {code} 10:52:38.967 WARN org.apache.spark.executor.TaskMetrics: encountered unregistered accumulator 253 when reconstructing task metrics. 10:52:38.967 ERROR org.apache.spark.scheduler.DAGScheduler: Failed to update accumulators for task 0 org.apache.spark.SparkException: attempted to access non-existent accumulator 253 at org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1099) at org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1091) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) {code} was: This is because SparkFunSuite clears all accumulators after every single test. This suite reuses a DF and all of its associated internal accumulators across many tests. This is likely caused by SPARK-10620. {code} 10:52:38.967 WARN org.apache.spark.executor.TaskMetrics: encountered unregistered accumulator 253 when reconstructing task metrics. 
10:52:38.967 ERROR org.apache.spark.scheduler.DAGScheduler: Failed to update accumulators for task 0 org.apache.spark.SparkException: attempted to access non-existent accumulator 253 at org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1099) at org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1091) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) {code} > Tests have many "accumulator not found" exceptions > -- > > Key: SPARK-13344 > URL: https://issues.apache.org/jira/browse/SPARK-13344 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 2.0.0 >Reporter: Andrew Or >Assignee: Andrew Or > > This is because SparkFunSuite clears all accumulators after every single > test. This suite reuses a DF and all of its associated internal accumulators > across many tests. E.g. SaveLoadSuite, InnerJoinSuite, many others. > This is likely caused by SPARK-10620. > {code} > 10:52:38.967 WARN org.apache.spark.executor.TaskMetrics: encountered > unregistered accumulator 253 when reconstructing task metrics. > 10:52:38.967 ERROR org.apache.spark.scheduler.DAGScheduler: Failed to update > accumulators for task 0 > org.apache.spark.SparkException: attempted to access non-existent accumulator > 253 > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1099) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1091) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13279) Scheduler does O(N^2) operation when adding a new task set (making it prohibitively slow for scheduling 200K tasks)
[ https://issues.apache.org/jira/browse/SPARK-13279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-13279: -- Fix Version/s: (was: 1.7) 2.0.0 > Scheduler does O(N^2) operation when adding a new task set (making it > prohibitively slow for scheduling 200K tasks) > --- > > Key: SPARK-13279 > URL: https://issues.apache.org/jira/browse/SPARK-13279 > Project: Spark > Issue Type: Improvement > Components: Scheduler, Spark Core >Affects Versions: 1.6.0 >Reporter: Sital Kedia >Assignee: Sital Kedia > Fix For: 1.6.1, 2.0.0 > > > For each task that the TaskSetManager adds, it iterates through the entire > list of existing tasks to check if it's there. As a result, scheduling a new > task set is O(N^2), which can be slow for large task sets. > This is a bug that was introduced by > https://github.com/apache/spark/commit/3535b91: that commit removed the > "!readding" condition from the if-statement, but since the re-adding > parameter defaulted to false, that commit should have removed the condition > check in the if-statement altogether. > - > We discovered this bug while running a large pipeline with 200k tasks, when > we found that the executors were not able to register with the driver because > the driver was stuck holding a global lock in TaskSchedulerImpl.submitTasks > function for a long time (it wasn't deadlocked -- just taking a long time). > jstack of the driver - http://pastebin.com/m8CP6VMv > executor log - http://pastebin.com/2NPS1mXC > From the jstack I see that the thread handling the resource offer from > executors (dispatcher-event-loop-9) is blocked on a lock held by the thread > "dag-scheduler-event-loop", which is iterating over an entire ArrayBuffer > when adding pending tasks. So when we have 200k pending tasks, because of > this O(N^2) operation, the driver is just hung for more than 5 minutes. > Solution - In addPendingTask function, we don't really need a duplicate > check. 
It's okay if we add a task to the same queue twice because > dequeueTaskFromList will skip already-running tasks. > Please note that this is a regression from Spark 1.5. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
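The solution described above can be illustrated with a simplified, self-contained model (hypothetical names, not the actual TaskSetManager code): append without any duplicate scan, and let the dequeue side skip tasks that are already running.

```scala
import scala.collection.mutable.ArrayBuffer

// Simplified model of the fix: tasks are plain Ints, and `running` stands in
// for the set of task indices that have already been scheduled.
object PendingTasks {
  def addPendingTask(pending: ArrayBuffer[Int], task: Int): Unit = {
    // No `pending.contains(task)` scan (that scan is what made adding a task
    // set O(N^2)). Duplicates are tolerated because dequeueTask skips tasks
    // that are already running.
    pending += task
  }

  def dequeueTask(pending: ArrayBuffer[Int], running: Set[Int]): Option[Int] = {
    while (pending.nonEmpty) {
      val t = pending.remove(pending.length - 1) // pop from the end
      if (!running.contains(t)) return Some(t)   // stale duplicates fall through
    }
    None
  }
}
```

Each `addPendingTask` is now O(1), so building a 200K-task set is linear instead of quadratic, at the cost of an occasional stale entry that the dequeue path discards.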
[jira] [Commented] (SPARK-12449) Pushing down arbitrary logical plans to data sources
[ https://issues.apache.org/jira/browse/SPARK-12449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15151268#comment-15151268 ] Evan Chan commented on SPARK-12449: --- I agree with [~maxseiden] on a gradual approach to push more down into the data sources API.Since I was going to explore a path like this anyways, I'd be willing to submit a PR to explore a `sources.Expression` kind of pushdown. There is also some stuff in 2.0 that might interact with this, such as vectorization and the whole query code gen, that we need to be aware of. > Pushing down arbitrary logical plans to data sources > > > Key: SPARK-12449 > URL: https://issues.apache.org/jira/browse/SPARK-12449 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Stephan Kessler > Attachments: pushingDownLogicalPlans.pdf > > > With the help of the DataSource API we can pull data from external sources > for processing. Implementing interfaces such as {{PrunedFilteredScan}} allows > to push down filters and projects pruning unnecessary fields and rows > directly in the data source. > However, data sources such as SQL Engines are capable of doing even more > preprocessing, e.g., evaluating aggregates. This is beneficial because it > would reduce the amount of data transferred from the source to Spark. The > existing interfaces do not allow such kind of processing in the source. > We would propose to add a new interface {{CatalystSource}} that allows to > defer the processing of arbitrary logical plans to the data source. We have > already shown the details at the Spark Summit 2015 Europe > [https://spark-summit.org/eu-2015/events/the-pushdown-of-everything/] > I will add a design document explaining details. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13333) DataFrame filter + randn + unionAll has bad interaction
[ https://issues.apache.org/jira/browse/SPARK-13333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15151265#comment-15151265 ] Xiao Li commented on SPARK-13333: - Another example is MS SQL Server Rand() https://msdn.microsoft.com/en-us/library/ms177610.aspx {code} seed Is an integer expression (tinyint, smallint, or int) that gives the seed value. If seed is not specified, the SQL Server Database Engine assigns a seed value at random. For a specified seed value, the result returned is always the same. {code} > DataFrame filter + randn + unionAll has bad interaction > --- > > Key: SPARK-13333 > URL: https://issues.apache.org/jira/browse/SPARK-13333 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.2, 1.6.1, 2.0.0 >Reporter: Joseph K. Bradley > > Buggy workflow > * Create a DataFrame df0 > * Filter df0 > * Add a randn column > * Create a copy of the DataFrame > * unionAll the two DataFrames > This fails, where randn produces the same results on the original DataFrame > and the copy before unionAll but fails to do so after unionAll. Removing the > filter fixes the problem. > The bug can be reproduced on master: > {code} > import org.apache.spark.sql.functions.randn > val df0 = sqlContext.createDataFrame(Seq(0, 1).map(Tuple1(_))).toDF("id") > // Removing the following filter() call makes this give the expected result. > val df1 = df0.filter(col("id") === 0).withColumn("b", randn(12345)) > println("DF1") > df1.show() > val df2 = df1.select("id", "b") > println("DF2") > df2.show() // same as df1.show(), as expected > val df3 = df1.unionAll(df2) > println("DF3") > df3.show() // NOT two copies of df1, which is unexpected > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13365) should coalesce do anything if coalescing to same number of partitions without shuffle
Thomas Graves created SPARK-13365: - Summary: should coalesce do anything if coalescing to same number of partitions without shuffle Key: SPARK-13365 URL: https://issues.apache.org/jira/browse/SPARK-13365 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.6.0 Reporter: Thomas Graves Currently, if a user does a coalesce to the same number of partitions as already exist, it spends a bunch of time doing work when it seems like it shouldn't do anything. For instance, if I have an RDD with 100 partitions and I run coalesce(100), it seems like it should skip any computation since it already has 100 partitions. One case where I've seen this is actually when users do coalesce(1000) without the shuffle, which really turns into a coalesce(100). I'm presenting this as a question, as I'm not sure if there are use cases I haven't thought of where this would break.
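The proposed short-circuit can be sketched as follows. This is a minimal, hypothetical model on plain Python lists, not Spark's actual `coalesce` implementation: when no shuffle is requested and the target count is at least the current partition count, the input is returned unchanged instead of rebuilding the partitioning.

```python
# Illustrative sketch (not Spark code): a no-op guard for coalesce,
# modeled on an "RDD" represented as a list of partitions.

def coalesce(partitions, num_partitions, shuffle=False):
    """Return the input unchanged when coalescing without a shuffle to the
    same (or a larger) partition count; otherwise merge round-robin."""
    if not shuffle and num_partitions >= len(partitions):
        # Proposed short-circuit: nothing to do, skip all the work.
        # This also covers coalesce(1000) on 100 partitions without shuffle,
        # which can only cap out at the existing 100 partitions anyway.
        return partitions
    merged = [[] for _ in range(num_partitions)]
    for i, part in enumerate(partitions):
        merged[i % num_partitions].extend(part)
    return merged

rdd = [[1, 2], [3, 4], [5, 6]]          # an "RDD" with 3 partitions
assert coalesce(rdd, 3) is rdd          # same count, no shuffle: identity
assert len(coalesce(rdd, 2)) == 2       # fewer partitions: real work happens
```

The open question in the issue is exactly whether this identity case can ever matter to a downstream consumer; the sketch assumes it cannot.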
[jira] [Resolved] (SPARK-13279) Scheduler does O(N^2) operation when adding a new task set (making it prohibitively slow for scheduling 200K tasks)
[ https://issues.apache.org/jira/browse/SPARK-13279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout resolved SPARK-13279. Resolution: Fixed Fix Version/s: 1.6.1 1.7 > Scheduler does O(N^2) operation when adding a new task set (making it > prohibitively slow for scheduling 200K tasks) > --- > > Key: SPARK-13279 > URL: https://issues.apache.org/jira/browse/SPARK-13279 > Project: Spark > Issue Type: Improvement > Components: Scheduler, Spark Core >Affects Versions: 1.6.0 >Reporter: Sital Kedia >Assignee: Sital Kedia > Fix For: 1.7, 1.6.1 > > > For each task that the TaskSetManager adds, it iterates through the entire > list of existing tasks to check if it's there. As a result, scheduling a new > task set is O(N^2), which can be slow for large task sets. > This is a bug that was introduced by > https://github.com/apache/spark/commit/3535b91: that commit removed the > "!readding" condition from the if-statement, but since the re-adding > parameter defaulted to false, that commit should have removed the condition > check in the if-statement altogether. > - > We discovered this bug while running a large pipeline with 200k tasks, when > we found that the executors were not able to register with the driver because > the driver was stuck holding a global lock in the TaskSchedulerImpl.submitTasks > function for a long time (it wasn't deadlocked -- just taking a long time). > jstack of the driver - http://pastebin.com/m8CP6VMv > executor log - http://pastebin.com/2NPS1mXC > From the jstack I see that the thread handling the resource offer from > executors (dispatcher-event-loop-9) is blocked on a lock held by the thread > "dag-scheduler-event-loop", which is iterating over an entire ArrayBuffer > when adding pending tasks. So when we have 200k pending tasks, because of > this O(N^2) operation, the driver is just hung for more than 5 minutes. > Solution - In the addPendingTask function, we don't really need a duplicate > check. 
It's okay if we add a task to the same queue twice because > dequeueTaskFromList will skip already-running tasks. > Please note that this is a regression from Spark 1.5. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
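The complexity problem and its fix can be sketched abstractly. This is a hypothetical model in plain Python, not the `TaskSetManager` code: a membership check against a plain list costs O(N) per add, so adding N tasks is O(N^2) overall, while appending unconditionally (and skipping stale entries at dequeue time, as the fix does) is O(N).

```python
# Toy model of the scheduling hot path described above (assumed names).

def add_pending_with_check(pending, task):
    """The buggy pattern: O(N) linear scan on every add -> O(N^2) total."""
    if task not in pending:
        pending.append(task)

def add_pending(pending, task):
    """The fixed pattern: O(1) append; duplicates are tolerated because
    already-running tasks are skipped when dequeued."""
    pending.append(task)

pending = []
for t in range(1000):
    add_pending(pending, t)
assert len(pending) == 1000
```

With 200k tasks, the difference is roughly 200k scans of a 200k-element buffer versus 200k constant-time appends, which matches the minutes-long hang reported in the jstack.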
[jira] [Comment Edited] (SPARK-13275) With dynamic allocation, executors appear to be added before job starts
[ https://issues.apache.org/jira/browse/SPARK-13275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15151052#comment-15151052 ] Stephanie Bodoff edited comment on SPARK-13275 at 2/17/16 9:24 PM: --- It's a UI problem (see the screenshot webui.png). The left edge of the executor boxes should be to the right of the left edge of the job box so that they look like they are started after the job. Alternatively, the left edge of the job box could be drawn more to the left. was (Author: sbodoff): It's a UI problem. The left edge of the executor boxes should be to the right of the left edge of the job box so that they look like they are started after the job. Alternatively, the left edge of the job box could be drawn more to the left. > With dynamic allocation, executors appear to be added before job starts > --- > > Key: SPARK-13275 > URL: https://issues.apache.org/jira/browse/SPARK-13275 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.5.0 >Reporter: Stephanie Bodoff >Priority: Minor > Attachments: webui.png > > > When I look at the timeline in the Spark Web UI I see the job starting and > then executors being added. The blue lines and dots hitting the timeline show > that the executors were added after the job started. But the way the Executor > box is rendered it looks like the executors started before the job. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9926) Parallelize file listing for partitioned Hive table
[ https://issues.apache.org/jira/browse/SPARK-9926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15151211#comment-15151211 ] Ryan Blue commented on SPARK-9926: -- I've just posted [PR #11242|https://github.com/apache/spark/pull/11242] that fixes this issue using the UnionRDD fix, but doesn't include the code for SPARK-10340, which is addressed by HADOOP-12810. > Parallelize file listing for partitioned Hive table > --- > > Key: SPARK-9926 > URL: https://issues.apache.org/jira/browse/SPARK-9926 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.4.1, 1.5.0 >Reporter: Cheolsoo Park >Assignee: Cheolsoo Park > > In Spark SQL, short queries like {{select * from table limit 10}} run very > slowly against partitioned Hive tables because of file listing. In > particular, if a large number of partitions are scanned on storage like S3, > the queries run extremely slowly. Here are some example benchmarks in my > environment- > * Parquet-backed Hive table > * Partitioned by dateint and hour > * Stored on S3 > ||\# of partitions||\# of files||runtime||query|| > |1|972|30 secs|select * from nccp_log where dateint=20150601 and hour=0 limit > 10;| > |24|13646|6 mins|select * from nccp_log where dateint=20150601 limit 10;| > |240|136222|1 hour|select * from nccp_log where dateint>=20150601 and > dateint<=20150610 limit 10;| > The problem is that {{TableReader}} constructs a separate HadoopRDD per Hive > partition path and group them into a UnionRDD. Then, all the input files are > listed sequentially. In other tools such as Hive and Pig, this can be solved > by setting > [mapreduce.input.fileinputformat.list-status.num-threads|https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml] > high. But in Spark, since each HadoopRDD lists only one partition path, > setting this property doesn't help. 
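The idea behind the fix can be sketched as follows. This is an assumed, simplified illustration (not the PR's code): list many partition directories concurrently with a thread pool, analogous to what `mapreduce.input.fileinputformat.list-status.num-threads` enables when a single input format sees all paths at once.

```python
# Sketch: parallel file listing across partition directories.
import os
from concurrent.futures import ThreadPoolExecutor

def list_partitions(partition_dirs, num_threads=8):
    """Return all file paths under the given partition directories,
    issuing the per-directory listings concurrently."""
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        listings = pool.map(os.listdir, partition_dirs)
    return [os.path.join(d, f)
            for d, files in zip(partition_dirs, listings)
            for f in files]
```

On high-latency storage such as S3, where each listing call is a round trip, this turns 240 sequential listings into ceil(240 / num_threads) waves, which is where the benchmark's hour-long runtime goes.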
[jira] [Commented] (SPARK-9926) Parallelize file listing for partitioned Hive table
[ https://issues.apache.org/jira/browse/SPARK-9926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15151208#comment-15151208 ] Apache Spark commented on SPARK-9926: - User 'rdblue' has created a pull request for this issue: https://github.com/apache/spark/pull/11242 > Parallelize file listing for partitioned Hive table > --- > > Key: SPARK-9926 > URL: https://issues.apache.org/jira/browse/SPARK-9926 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.4.1, 1.5.0 >Reporter: Cheolsoo Park >Assignee: Cheolsoo Park > > In Spark SQL, short queries like {{select * from table limit 10}} run very > slowly against partitioned Hive tables because of file listing. In > particular, if a large number of partitions are scanned on storage like S3, > the queries run extremely slowly. Here are some example benchmarks in my > environment- > * Parquet-backed Hive table > * Partitioned by dateint and hour > * Stored on S3 > ||\# of partitions||\# of files||runtime||query|| > |1|972|30 secs|select * from nccp_log where dateint=20150601 and hour=0 limit > 10;| > |24|13646|6 mins|select * from nccp_log where dateint=20150601 limit 10;| > |240|136222|1 hour|select * from nccp_log where dateint>=20150601 and > dateint<=20150610 limit 10;| > The problem is that {{TableReader}} constructs a separate HadoopRDD per Hive > partition path and group them into a UnionRDD. Then, all the input files are > listed sequentially. In other tools such as Hive and Pig, this can be solved > by setting > [mapreduce.input.fileinputformat.list-status.num-threads|https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml] > high. But in Spark, since each HadoopRDD lists only one partition path, > setting this property doesn't help. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13364) history server application column not sorting properly
Thomas Graves created SPARK-13364: - Summary: history server application column not sorting properly Key: SPARK-13364 URL: https://issues.apache.org/jira/browse/SPARK-13364 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 2.0.0 Reporter: Thomas Graves The new history server is using DataTables, and the application column isn't sorting entries properly. It's not sorting the last _X part right. Below is an example where the 30174 should be before 30149: application_1453493359692_30149 application_1453493359692_30174 I'm guessing it's sorting using the string rather than just the application id. application_1453493359692_30029
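A fix along the lines this issue suggests can be sketched with a sort key (hypothetical, not the history server's code): split the application ID on its underscores and compare the trailing sequence number numerically rather than as part of one string.

```python
# Sketch: numeric sort key for Spark application IDs of the form
# application_<clusterTimestamp>_<sequence>.

def app_id_key(app_id):
    # "application_1453493359692_30149" -> ("application", 1453493359692, 30149)
    prefix, cluster_ts, seq = app_id.rsplit("_", 2)
    return (prefix, int(cluster_ts), int(seq))

ids = ["application_1453493359692_30174",
       "application_1453493359692_30149",
       "application_1453493359692_30029"]
assert sorted(ids, key=app_id_key) == [
    "application_1453493359692_30029",
    "application_1453493359692_30149",
    "application_1453493359692_30174",
]
```

String sorting and numeric sorting diverge as soon as the sequence numbers have different digit counts (e.g. `_9999` sorts after `_30029` as a string but before it as a number), which is the class of misordering a per-column comparator avoids.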
[jira] [Commented] (SPARK-12449) Pushing down arbitrary logical plans to data sources
[ https://issues.apache.org/jira/browse/SPARK-12449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15151116#comment-15151116 ] Max Seiden commented on SPARK-12449: [~rxin] Given that predicate pushdown via `sources.Filter` is (afaik) a stable API, conceivably that model could be extended to support ever richer operations (i.e. sources.Expression, sources.Limit, sources.Join, sources.Aggregation). In this case, the stable APIs remain a derivative of the Catalyst plans and all that needs to change between releases is the compilation from Catalyst => Sources. cc [~marmbrus] since we talked briefly about this idea in person at Spark Summit > Pushing down arbitrary logical plans to data sources > > > Key: SPARK-12449 > URL: https://issues.apache.org/jira/browse/SPARK-12449 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Stephan Kessler > Attachments: pushingDownLogicalPlans.pdf > > > With the help of the DataSource API we can pull data from external sources > for processing. Implementing interfaces such as {{PrunedFilteredScan}} allows > to push down filters and projects pruning unnecessary fields and rows > directly in the data source. > However, data sources such as SQL Engines are capable of doing even more > preprocessing, e.g., evaluating aggregates. This is beneficial because it > would reduce the amount of data transferred from the source to Spark. The > existing interfaces do not allow such kind of processing in the source. > We would propose to add a new interface {{CatalystSource}} that allows to > defer the processing of arbitrary logical plans to the data source. We have > already shown the details at the Spark Summit 2015 Europe > [https://spark-summit.org/eu-2015/events/the-pushdown-of-everything/] > I will add a design document explaining details. 
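The comment's point about a stable source-side API can be sketched in miniature. Everything here is hypothetical illustration (the class and function names are invented, and real `sources.Filter` pushdown works on Catalyst plans, not strings): the source sees only small, stable filter objects, and only the compilation layer from the engine's plan into those objects needs to track engine releases.

```python
# Sketch: a tiny stable filter representation plus a compiler that turns
# it into a WHERE clause for a JDBC-like source (numeric literals only).

class EqualTo:
    def __init__(self, attr, value):
        self.attr, self.value = attr, value

class GreaterThan:
    def __init__(self, attr, value):
        self.attr, self.value = attr, value

def compile_filters(filters):
    """Compile stable filter objects into SQL pushed to the source."""
    parts = []
    for f in filters:
        if isinstance(f, EqualTo):
            parts.append(f"{f.attr} = {f.value}")
        elif isinstance(f, GreaterThan):
            parts.append(f"{f.attr} > {f.value}")
    return "WHERE " + " AND ".join(parts)

assert compile_filters([EqualTo("dateint", 20150601), GreaterThan("hour", 5)]) \
    == "WHERE dateint = 20150601 AND hour > 5"
```

Extending the same pattern with `Limit`, `Join`, or `Aggregation` objects, as the comment proposes, grows the stable vocabulary without ever exposing the engine's internal plan nodes to data sources.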
[jira] [Created] (SPARK-13363) Aggregator not working with DataFrame
koert kuipers created SPARK-13363: - Summary: Aggregator not working with DataFrame Key: SPARK-13363 URL: https://issues.apache.org/jira/browse/SPARK-13363 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: koert kuipers Priority: Minor The org.apache.spark.sql.expressions.Aggregator doc/comments say: A base class for user-defined aggregations, which can be used in [[DataFrame]] and [[Dataset]] It works well with Dataset/GroupedDataset, but I am having no luck using it with DataFrame/GroupedData. Does anyone have an example of how to use it with a DataFrame? In particular I would like to use it with this method in GroupedData: {noformat} def agg(expr: Column, exprs: Column*): DataFrame {noformat} Clearly it should be possible, since GroupedDataset uses that very same method to do the work: {noformat} private def agg(exprs: Column*): DataFrame = groupedData.agg(withEncoder(exprs.head), exprs.tail.map(withEncoder): _*) {noformat} The trick seems to be the wrapping in withEncoder, which is private. I tried to do something like it myself, but I had no luck since it uses more private stuff in TypedColumn. Anyhow, my attempt at using it in a DataFrame: {noformat} val simpleSum = new SqlAggregator[Int, Int, Int] { def zero: Int = 0 // The initial value. def reduce(b: Int, a: Int) = b + a // Add an element to the running total def merge(b1: Int, b2: Int) = b1 + b2 // Merge intermediate values. def finish(b: Int) = b // Return the final result. 
}.toColumn val df = sc.makeRDD(1 to 3).map(i => (i, i)).toDF("k", "v") df.groupBy("k").agg(simpleSum).show {noformat} and the resulting error: {noformat} org.apache.spark.sql.AnalysisException: unresolved operator 'Aggregate [k#104], [k#104,($anon$3(),mode=Complete,isDistinct=false) AS sum#106]; at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38) at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:46) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:241) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:50) at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:122) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:50) at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:46) at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:34) at org.apache.spark.sql.DataFrame.(DataFrame.scala:130) at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:49) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-13350) Configuration documentation incorrectly states that PYSPARK_PYTHON's default is "python"
[ https://issues.apache.org/jira/browse/SPARK-13350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-13350. Resolution: Fixed Fix Version/s: 1.6.1 2.0.0 Issue resolved by pull request 11239 [https://github.com/apache/spark/pull/11239] > Configuration documentation incorrectly states that PYSPARK_PYTHON's default > is "python" > > > Key: SPARK-13350 > URL: https://issues.apache.org/jira/browse/SPARK-13350 > Project: Spark > Issue Type: Documentation > Components: Documentation >Reporter: Christopher Aycock >Assignee: Christopher Aycock >Priority: Trivial > Labels: newbie > Fix For: 2.0.0, 1.6.1 > > Original Estimate: 24h > Remaining Estimate: 24h > > The configuration documentation states that the environment variable > PYSPARK_PYTHON has a default value of {{python}}: > http://spark.apache.org/docs/latest/configuration.html > In fact, the default is {{python2.7}}: > https://github.com/apache/spark/blob/4f60651cbec1b4c9cc2e6d832ace77e89a233f3a/bin/pyspark#L39-L45 > The change that introduced this was discussed here: > https://github.com/apache/spark/pull/2651 > Would it be possible to highlight this in the documentation? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13350) Configuration documentation incorrectly states that PYSPARK_PYTHON's default is "python"
[ https://issues.apache.org/jira/browse/SPARK-13350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-13350: --- Assignee: Christopher Aycock > Configuration documentation incorrectly states that PYSPARK_PYTHON's default > is "python" > > > Key: SPARK-13350 > URL: https://issues.apache.org/jira/browse/SPARK-13350 > Project: Spark > Issue Type: Documentation > Components: Documentation >Reporter: Christopher Aycock >Assignee: Christopher Aycock >Priority: Trivial > Labels: newbie > Original Estimate: 24h > Remaining Estimate: 24h > > The configuration documentation states that the environment variable > PYSPARK_PYTHON has a default value of {{python}}: > http://spark.apache.org/docs/latest/configuration.html > In fact, the default is {{python2.7}}: > https://github.com/apache/spark/blob/4f60651cbec1b4c9cc2e6d832ace77e89a233f3a/bin/pyspark#L39-L45 > The change that introduced this was discussed here: > https://github.com/apache/spark/pull/2651 > Would it be possible to highlight this in the documentation? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9273) Add Convolutional Neural network to Spark MLlib
[ https://issues.apache.org/jira/browse/SPARK-9273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15151015#comment-15151015 ] Sean Owen commented on SPARK-9273: -- No, I mean that I expect it will start life as an external package. In any event, I want to at least convene any related discussion in 1 JIRA, not N. CNN is a type of ANN so I don't see a value in discussing it separately. > Add Convolutional Neural network to Spark MLlib > --- > > Key: SPARK-9273 > URL: https://issues.apache.org/jira/browse/SPARK-9273 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: yuhao yang >Assignee: yuhao yang > > Add Convolutional Neural network to Spark MLlib -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9273) Add Convolutional Neural network to Spark MLlib
[ https://issues.apache.org/jira/browse/SPARK-9273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15150960#comment-15150960 ] Alexander Ulanov commented on SPARK-9273: - [~srowen] Do you mean that CNN will never be merged into Spark ML? > Add Convolutional Neural network to Spark MLlib > --- > > Key: SPARK-9273 > URL: https://issues.apache.org/jira/browse/SPARK-9273 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: yuhao yang >Assignee: yuhao yang > > Add Convolutional Neural network to Spark MLlib -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9844) File appender race condition during SparkWorker shutdown
[ https://issues.apache.org/jira/browse/SPARK-9844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15150913#comment-15150913 ] Bryan Cutler commented on SPARK-9844: - This error is benign for the most part, once it gets here, the worker is already being shut down. So it is probably something else that is causing your worker to shut down. > File appender race condition during SparkWorker shutdown > > > Key: SPARK-9844 > URL: https://issues.apache.org/jira/browse/SPARK-9844 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.3.0, 1.4.0 >Reporter: Alex Liu >Assignee: Bryan Cutler > Fix For: 1.6.1, 2.0.0 > > > We find this issue still exists in 1.3.1 > {code} > ERROR [Thread-6] 2015-07-28 22:49:57,653 SparkWorker-0 ExternalLogger.java:96 > - Error writing stream to file > /var/lib/spark/worker/worker-0/app-20150728224954-0003/0/stderr > ERROR [Thread-6] 2015-07-28 22:49:57,653 SparkWorker-0 ExternalLogger.java:96 > - java.io.IOException: Stream closed > ERROR [Thread-6] 2015-07-28 22:49:57,654 SparkWorker-0 ExternalLogger.java:96 > - at > java.io.BufferedInputStream.getBufIfOpen(BufferedInputStream.java:170) > ~[na:1.8.0_40] > ERROR [Thread-6] 2015-07-28 22:49:57,654 SparkWorker-0 ExternalLogger.java:96 > - at java.io.BufferedInputStream.read1(BufferedInputStream.java:283) > ~[na:1.8.0_40] > ERROR [Thread-6] 2015-07-28 22:49:57,654 SparkWorker-0 ExternalLogger.java:96 > - at java.io.BufferedInputStream.read(BufferedInputStream.java:345) > ~[na:1.8.0_40] > ERROR [Thread-6] 2015-07-28 22:49:57,654 SparkWorker-0 ExternalLogger.java:96 > - at java.io.FilterInputStream.read(FilterInputStream.java:107) > ~[na:1.8.0_40] > ERROR [Thread-6] 2015-07-28 22:49:57,655 SparkWorker-0 ExternalLogger.java:96 > - at > org.apache.spark.util.logging.FileAppender.appendStreamToFile(FileAppender.scala:70) > ~[spark-core_2.10-1.3.1.1.jar:1.3.1.1] > ERROR [Thread-6] 2015-07-28 22:49:57,655 SparkWorker-0 ExternalLogger.java:96 > - at > 
org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply$mcV$sp(FileAppender.scala:39) > [spark-core_2.10-1.3.1.1.jar:1.3.1.1] > ERROR [Thread-6] 2015-07-28 22:49:57,655 SparkWorker-0 ExternalLogger.java:96 > - at > org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39) > [spark-core_2.10-1.3.1.1.jar:1.3.1.1] > ERROR [Thread-6] 2015-07-28 22:49:57,655 SparkWorker-0 ExternalLogger.java:96 > - at > org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39) > [spark-core_2.10-1.3.1.1.jar:1.3.1.1] > ERROR [Thread-6] 2015-07-28 22:49:57,655 SparkWorker-0 ExternalLogger.java:96 > - at > org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1618) > [spark-core_2.10-1.3.1.1.jar:1.3.1.1] > ERROR [Thread-6] 2015-07-28 22:49:57,656 SparkWorker-0 ExternalLogger.java:96 > - at > org.apache.spark.util.logging.FileAppender$$anon$1.run(FileAppender.scala:38) > [spark-core_2.10-1.3.1.1.jar:1.3.1.1] > {code} > at > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/worker/ExecutorRunner.scala#L159 > The process auto shuts down, but the log appenders are still running, which > causes the error log messages. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13349) adding a split and union to a streaming application cause big performance hit
[ https://issues.apache.org/jira/browse/SPARK-13349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15150893#comment-15150893 ] krishna ramachandran commented on SPARK-13349: -- I have a simple synthetic example below. I created 2 "raw streams", and job 1 is materialized when stream1 is output (some action like print/save). In job1: val stream1 = ssc.union(rawStreams).filter(_.contains("Stream:first")) save.stream1 ... .. In job2, create another split using rawStreams and union it with stream1: val stream2 = ssc.union(rawStreams).filter(_.contains("Batch:second")) val stream3 = stream1.union(stream2) .. save.stream3 job2 is materialized and executed. This pattern is executed for every batch. Looking at the visual DAG I see that job1 executes the first graph and job2 computes both "stream1" and "stream2". Caching DStream stream1 (the result from job1) makes job2 go almost twice as fast. In our real app, we have 7 such jobs per batch and typically we union the output of job5 with job1. That is, we union the output of job1 with the stream generated during job5. Caching and reusing the output of job1 (stream1) is very efficient (per-batch execution is 2.5 times faster) - but we start seeing out of memory errors. I would like to be able to "unpersist" stream1 after the union (for that batch) > adding a split and union to a streaming application cause big performance hit > - > > Key: SPARK-13349 > URL: https://issues.apache.org/jira/browse/SPARK-13349 > Project: Spark > Issue Type: Improvement >Affects Versions: 1.4.1 >Reporter: krishna ramachandran >Priority: Critical > Fix For: 1.4.2 > > > We have a streaming application containing approximately 12 jobs every batch, > running in streaming mode (4 sec batches). Each job writes output to cassandra > each job can contain several stages. > job 1 > ---> receive Stream A --> map --> filter -> (union with another stream B) --> > map --> groupbykey --> transform --> reducebykey --> map > we go thro' few more jobs of transforms and save to database. 
> Around stage 5, we union the output of Dstream from job 1 (in red) with > another stream (generated by split during job 2) and save that state > It appears the whole execution thus far is repeated which is redundant (I can > see this in execution graph & also performance -> processing time). > Processing time per batch nearly doubles or triples. > This additional & redundant processing cause each batch to run as much as 2.5 > times slower compared to runs without the union - union for most batches does > not alter the original DStream (union with an empty set). If I cache the > DStream from job 1(red block output), performance improves substantially but > hit out of memory errors within few hours. > What is the recommended way to cache/unpersist in such a scenario? there is > no dstream level "unpersist" > setting "spark.streaming.unpersist" to true and > streamingContext.remember("duration") did not help. Still seeing out of > memory errors -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
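The caching trade-off described in this issue can be shown with a toy model (plain Python, not Spark; all names are invented for illustration): a stream consumed by two jobs is recomputed per consumer unless cached, while caching computes it once per batch but holds memory until an "unpersist" releases it.

```python
# Toy model of per-batch recomputation vs. caching for a shared stream.

compute_count = 0

def compute_stream1(batch):
    """Stand-in for the union/filter work of job 1."""
    global compute_count
    compute_count += 1
    return [x for x in batch if x % 2 == 0]

cache = {}

def cached_stream1(batch_id, batch):
    if batch_id not in cache:          # persist() analogue
        cache[batch_id] = compute_stream1(batch)
    return cache[batch_id]

batch = list(range(10))
job1 = cached_stream1(0, batch)        # job 1 materializes stream1
job2 = job1 + cached_stream1(0, batch) # job 2 reuses it: no recompute
cache.pop(0, None)                     # "unpersist" once the batch is done
assert compute_count == 1              # computed once, not twice
```

The out-of-memory behavior reported here corresponds to never executing the `cache.pop` step: each batch's cached result accumulates, which is why a per-batch, DStream-level unpersist is what the reporter is asking for.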
[jira] [Reopened] (SPARK-12675) Executor dies because of ClassCastException and causes timeout
[ https://issues.apache.org/jira/browse/SPARK-12675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexandru Rosianu reopened SPARK-12675: --- Reopening because other users are still reporting this. > Executor dies because of ClassCastException and causes timeout > -- > > Key: SPARK-12675 > URL: https://issues.apache.org/jira/browse/SPARK-12675 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.0, 2.0.0 > Environment: 64-bit Linux Ubuntu 15.10, 16GB RAM, 8 cores 3ghz >Reporter: Alexandru Rosianu >Priority: Minor > > I'm trying to fit a Spark ML pipeline but my executor dies. Here's the script > which doesn't work (a bit simplified): > {code:title=Script.scala} > // Prepare data sets > logInfo("Getting datasets") > val emoTrainingData = > sqlc.read.parquet("/tw/sentiment/emo/parsed/data.parquet") > val trainingData = emoTrainingData > // Configure the pipeline > val pipeline = new Pipeline().setStages(Array( > new > FeatureReducer().setInputCol("raw_text").setOutputCol("reduced_text"), > new StringSanitizer().setInputCol("reduced_text").setOutputCol("text"), > new Tokenizer().setInputCol("text").setOutputCol("raw_words"), > new StopWordsRemover().setInputCol("raw_words").setOutputCol("words"), > new HashingTF().setInputCol("words").setOutputCol("features"), > new NaiveBayes().setSmoothing(0.5).setFeaturesCol("features"), > new ColumnDropper().setDropColumns("raw_text", "reduced_text", "text", > "raw_words", "words", "features") > )) > // Fit the pipeline > logInfo(s"Training model on ${trainingData.count()} rows") > val model = pipeline.fit(trainingData) > {code} > It executes up to the last line. It prints "Training model on xx rows", then > it starts fitting, the executor dies, the drivers doesn't receive heartbeats > from the executor and it times out, then the script exits. It doesn't get > past that line. 
> This is the exception that kills the executor: > {code} > java.io.IOException: java.lang.ClassCastException: cannot assign instance > of scala.collection.immutable.HashMap$SerializationProxy to field > org.apache.spark.executor.TaskMetrics._accumulatorUpdates of type > scala.collection.immutable.Map in instance of > org.apache.spark.executor.TaskMetrics > at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1207) > at > org.apache.spark.executor.TaskMetrics.readObject(TaskMetrics.scala:219) > at sun.reflect.GeneratedMethodAccessor15.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:497) > at > java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1058) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1900) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351) > at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371) > at org.apache.spark.util.Utils$.deserialize(Utils.scala:92) > at > org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$reportHeartBeat$1$$anonfun$apply$6.apply(Executor.scala:436) > at > org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$reportHeartBeat$1$$anonfun$apply$6.apply(Executor.scala:426) > at scala.Option.foreach(Option.scala:257) > at > org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$reportHeartBeat$1.apply(Executor.scala:426) > at > org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$reportHeartBeat$1.apply(Executor.scala:424) > at scala.collection.Iterator$class.foreach(Iterator.scala:742) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1194) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at 
scala.collection.AbstractIterable.foreach(Iterable.scala:54) > at > org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:424) > at > org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:468) > at > org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:468) > at > org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:468) > at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1741) > at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:468) > at >
[jira] [Commented] (SPARK-12675) Executor dies because of ClassCastException and causes timeout
[ https://issues.apache.org/jira/browse/SPARK-12675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15150872#comment-15150872 ] Sven Krasser commented on SPARK-12675: -- More findings (Spark 1.6.0): For our initial 200 partition use case, reducing it to 2 partitions temporarily suppressed the problems. After adding a couple of additional joins into the dataflow, we now even see this issue with just 2 partitions. My suspicion is that stages with empty tasks contribute to this condition occurring. [~aluxian], can you reopen? As reporter you should have permissions to do that. Thanks everyone! > Executor dies because of ClassCastException and causes timeout > -- > > Key: SPARK-12675 > URL: https://issues.apache.org/jira/browse/SPARK-12675 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.0, 2.0.0 > Environment: 64-bit Linux Ubuntu 15.10, 16GB RAM, 8 cores 3ghz >Reporter: Alexandru Rosianu >Priority: Minor > > I'm trying to fit a Spark ML pipeline but my executor dies. 
Here's the script > which doesn't work (a bit simplified): > {code:title=Script.scala} > // Prepare data sets > logInfo("Getting datasets") > val emoTrainingData = > sqlc.read.parquet("/tw/sentiment/emo/parsed/data.parquet") > val trainingData = emoTrainingData > // Configure the pipeline > val pipeline = new Pipeline().setStages(Array( > new > FeatureReducer().setInputCol("raw_text").setOutputCol("reduced_text"), > new StringSanitizer().setInputCol("reduced_text").setOutputCol("text"), > new Tokenizer().setInputCol("text").setOutputCol("raw_words"), > new StopWordsRemover().setInputCol("raw_words").setOutputCol("words"), > new HashingTF().setInputCol("words").setOutputCol("features"), > new NaiveBayes().setSmoothing(0.5).setFeaturesCol("features"), > new ColumnDropper().setDropColumns("raw_text", "reduced_text", "text", > "raw_words", "words", "features") > )) > // Fit the pipeline > logInfo(s"Training model on ${trainingData.count()} rows") > val model = pipeline.fit(trainingData) > {code} > It executes up to the last line. It prints "Training model on xx rows", then > it starts fitting, the executor dies, the driver doesn't receive heartbeats > from the executor and it times out, then the script exits. It doesn't get > past that line. 
> This is the exception that kills the executor: > {code} > java.io.IOException: java.lang.ClassCastException: cannot assign instance > of scala.collection.immutable.HashMap$SerializationProxy to field > org.apache.spark.executor.TaskMetrics._accumulatorUpdates of type > scala.collection.immutable.Map in instance of > org.apache.spark.executor.TaskMetrics > at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1207) > at > org.apache.spark.executor.TaskMetrics.readObject(TaskMetrics.scala:219) > at sun.reflect.GeneratedMethodAccessor15.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:497) > at > java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1058) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1900) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351) > at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371) > at org.apache.spark.util.Utils$.deserialize(Utils.scala:92) > at > org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$reportHeartBeat$1$$anonfun$apply$6.apply(Executor.scala:436) > at > org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$reportHeartBeat$1$$anonfun$apply$6.apply(Executor.scala:426) > at scala.Option.foreach(Option.scala:257) > at > org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$reportHeartBeat$1.apply(Executor.scala:426) > at > org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$reportHeartBeat$1.apply(Executor.scala:424) > at scala.collection.Iterator$class.foreach(Iterator.scala:742) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1194) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at 
scala.collection.AbstractIterable.foreach(Iterable.scala:54) > at > org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:424) > at > org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:468) > at
[jira] [Commented] (SPARK-13328) Possible poor read performance for broadcast variables with dynamic resource allocation
[ https://issues.apache.org/jira/browse/SPARK-13328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15150863#comment-15150863 ] Apache Spark commented on SPARK-13328: -- User 'nezihyigitbasi' has created a pull request for this issue: https://github.com/apache/spark/pull/11241 > Possible poor read performance for broadcast variables with dynamic resource > allocation > --- > > Key: SPARK-13328 > URL: https://issues.apache.org/jira/browse/SPARK-13328 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.5.2 >Reporter: Nezih Yigitbasi > > When dynamic resource allocation is enabled, fetching broadcast variables from > removed executors was causing job failures, and SPARK-9591 fixed this problem > by trying all locations of a block before giving up. However, the locations > of a block are retrieved only once from the driver in this process, and the > locations in this list can be stale due to dynamic resource allocation. This > situation gets worse when running on a large cluster, as this location list > can be on the order of several hundred entries, of which there may > be tens of stale ones. What we have observed is that with the default settings > of 3 max retries and 5s between retries (that's 15s per location), the time it > takes to read a broadcast variable can be as high as ~17m (the log below shows > the failed 70th block fetch attempt, where each attempt takes 15s) > {code} > ... > 16/02/13 01:02:27 WARN storage.BlockManager: Failed to fetch remote block > broadcast_18_piece0 from BlockManagerId(8, ip-10-178-77-38.ec2.internal, > 60675) (failed attempt 70) > ... > 16/02/13 01:02:27 INFO broadcast.TorrentBroadcast: Reading broadcast variable > 18 took 1051049 ms > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
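The fix direction can be sketched in plain Scala. This is a hypothetical simplification, not Spark's actual `BlockManager` / `TorrentBroadcast` code: `getLocations` and `fetchFrom` are stand-ins for the driver RPC and the remote block fetch, and the only point illustrated is that refreshing the location list between retry rounds lets executors removed by dynamic allocation drop out of the list instead of each consuming the full retry budget.

```scala
// Hypothetical sketch: retry block fetches, but re-ask the driver for the
// current location list between rounds so stale executors disappear.
object BroadcastFetchSketch {
  type Location = String

  def fetchWithRefresh(
      getLocations: () => Seq[Location],          // stand-in for driver RPC
      fetchFrom: Location => Option[Array[Byte]], // stand-in for remote fetch
      maxRounds: Int = 3): Option[Array[Byte]] = {
    // Lazily walk up to maxRounds refreshed location lists and stop at the
    // first successful fetch.
    val attempts = (1 to maxRounds).iterator.flatMap { _ =>
      getLocations().iterator.flatMap(loc => fetchFrom(loc).iterator)
    }
    if (attempts.hasNext) Some(attempts.next()) else None
  }
}
```

With this shape, a location that dynamic allocation has removed simply stops appearing in `getLocations()` on the next round, rather than being retried at 5s apiece dozens of times.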
[jira] [Assigned] (SPARK-13328) Possible poor read performance for broadcast variables with dynamic resource allocation
[ https://issues.apache.org/jira/browse/SPARK-13328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13328: Assignee: Apache Spark > Possible poor read performance for broadcast variables with dynamic resource > allocation > --- > > Key: SPARK-13328 > URL: https://issues.apache.org/jira/browse/SPARK-13328 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.5.2 >Reporter: Nezih Yigitbasi >Assignee: Apache Spark
[jira] [Assigned] (SPARK-13328) Possible poor read performance for broadcast variables with dynamic resource allocation
[ https://issues.apache.org/jira/browse/SPARK-13328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13328: Assignee: (was: Apache Spark) > Possible poor read performance for broadcast variables with dynamic resource > allocation > --- > > Key: SPARK-13328 > URL: https://issues.apache.org/jira/browse/SPARK-13328 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.5.2 >Reporter: Nezih Yigitbasi
[jira] [Commented] (SPARK-9844) File appender race condition during SparkWorker shutdown
[ https://issues.apache.org/jira/browse/SPARK-9844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15150862#comment-15150862 ] Marcelo Balloni Gomes commented on SPARK-9844: -- Is there any way of avoiding this error in 1.4.1? All my workers are being shut down a few times a day. I was thinking about turning off the log appenders, however I'm not sure that it would be of any help. > File appender race condition during SparkWorker shutdown > > > Key: SPARK-9844 > URL: https://issues.apache.org/jira/browse/SPARK-9844 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.3.0, 1.4.0 >Reporter: Alex Liu >Assignee: Bryan Cutler > Fix For: 1.6.1, 2.0.0 > > > We find this issue still exists in 1.3.1 > {code} > ERROR [Thread-6] 2015-07-28 22:49:57,653 SparkWorker-0 ExternalLogger.java:96 > - Error writing stream to file > /var/lib/spark/worker/worker-0/app-20150728224954-0003/0/stderr > ERROR [Thread-6] 2015-07-28 22:49:57,653 SparkWorker-0 ExternalLogger.java:96 > - java.io.IOException: Stream closed > ERROR [Thread-6] 2015-07-28 22:49:57,654 SparkWorker-0 ExternalLogger.java:96 > - at > java.io.BufferedInputStream.getBufIfOpen(BufferedInputStream.java:170) > ~[na:1.8.0_40] > ERROR [Thread-6] 2015-07-28 22:49:57,654 SparkWorker-0 ExternalLogger.java:96 > - at java.io.BufferedInputStream.read1(BufferedInputStream.java:283) > ~[na:1.8.0_40] > ERROR [Thread-6] 2015-07-28 22:49:57,654 SparkWorker-0 ExternalLogger.java:96 > - at java.io.BufferedInputStream.read(BufferedInputStream.java:345) > ~[na:1.8.0_40] > ERROR [Thread-6] 2015-07-28 22:49:57,654 SparkWorker-0 ExternalLogger.java:96 > - at java.io.FilterInputStream.read(FilterInputStream.java:107) > ~[na:1.8.0_40] > ERROR [Thread-6] 2015-07-28 22:49:57,655 SparkWorker-0 ExternalLogger.java:96 > - at > org.apache.spark.util.logging.FileAppender.appendStreamToFile(FileAppender.scala:70) > ~[spark-core_2.10-1.3.1.1.jar:1.3.1.1] > ERROR [Thread-6] 2015-07-28 22:49:57,655 
SparkWorker-0 ExternalLogger.java:96 > - at > org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply$mcV$sp(FileAppender.scala:39) > [spark-core_2.10-1.3.1.1.jar:1.3.1.1] > ERROR [Thread-6] 2015-07-28 22:49:57,655 SparkWorker-0 ExternalLogger.java:96 > - at > org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39) > [spark-core_2.10-1.3.1.1.jar:1.3.1.1] > ERROR [Thread-6] 2015-07-28 22:49:57,655 SparkWorker-0 ExternalLogger.java:96 > - at > org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39) > [spark-core_2.10-1.3.1.1.jar:1.3.1.1] > ERROR [Thread-6] 2015-07-28 22:49:57,655 SparkWorker-0 ExternalLogger.java:96 > - at > org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1618) > [spark-core_2.10-1.3.1.1.jar:1.3.1.1] > ERROR [Thread-6] 2015-07-28 22:49:57,656 SparkWorker-0 ExternalLogger.java:96 > - at > org.apache.spark.util.logging.FileAppender$$anon$1.run(FileAppender.scala:38) > [spark-core_2.10-1.3.1.1.jar:1.3.1.1] > {code} > at > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/worker/ExecutorRunner.scala#L159 > The process auto shuts down, but the log appenders are still running, which > causes the error log messages. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
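The shutdown race can be illustrated with a small plain-Scala sketch. This is a simplification of the direction the SPARK-9844 fix took, not the actual `FileAppender` code: the appender thread keeps pumping the child process's stderr after shutdown closes the stream, so an `IOException` raised while the appender has already been asked to stop should be treated as a normal stop rather than logged as an error.

```scala
import java.io.IOException

// Simplified model of the FileAppender reading thread. `read` stands in for
// the process output stream, `write` for the log file.
object AppenderSketch {
  @volatile var markedForStop = false // set by the worker before destroying the process

  // Returns true for a clean finish, false for a genuine error.
  def pump(read: () => Int, write: Int => Unit): Boolean =
    try {
      var b = read()
      while (b != -1) { write(b); b = read() }
      true // clean EOF
    } catch {
      // "Stream closed" during an intentional stop is expected, not an error.
      case _: IOException if markedForStop => true
      case _: IOException => false
    }
}
```

The key design point is the stop flag checked in the exception handler: the error path only fires when the stream dies while the appender was still supposed to be running.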
[jira] [Commented] (SPARK-10340) Use S3 bulk listing for S3-backed Hive tables
[ https://issues.apache.org/jira/browse/SPARK-10340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15150799#comment-15150799 ] Ryan Blue commented on SPARK-10340: --- From discussion on the pull request, it looks like the solution is to add parallelism to UnionRDD rather than use the Qubole approach of listing a common prefix and filtering. I'm also linking to HADOOP-12810, which fixes a problem in FileSystem that was causing a FileStatus to be fetched for each file. The performance numbers show that both fixes address the performance problem without adding an alternate code path that avoids delegating to the InputFormat for split calculations on S3. I think the way forward is to get the UnionRDD patch committed and backport the fix for FileSystem. > Use S3 bulk listing for S3-backed Hive tables > - > > Key: SPARK-10340 > URL: https://issues.apache.org/jira/browse/SPARK-10340 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.4.1, 1.5.0 >Reporter: Cheolsoo Park >Assignee: Cheolsoo Park > > AWS S3 provides a bulk listing API. It takes the common prefix of all input > paths as a parameter and returns all the objects whose prefixes start with > the common prefix in blocks of 1000. > Since SPARK-9926 allows us to list multiple partitions all together, we can > significantly speed up input split calculation using S3 bulk listing. This > optimization is particularly useful for queries like {{select * from > partitioned_table limit 10}}. > This is a common optimization for S3. For example, here is a [blog > post|http://www.qubole.com/blog/product/optimizing-hadoop-for-s3-part-1/] > from Qubole on this topic. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13322) AFTSurvivalRegression should support feature standardization
[ https://issues.apache.org/jira/browse/SPARK-13322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-13322: -- Target Version/s: 2.0.0 > AFTSurvivalRegression should support feature standardization > > > Key: SPARK-13322 > URL: https://issues.apache.org/jira/browse/SPARK-13322 > Project: Spark > Issue Type: Bug > Components: ML >Reporter: Yanbo Liang >Assignee: Yanbo Liang > > This bug was reported by Stuti Awasthi. > https://www.mail-archive.com/user@spark.apache.org/msg45643.html > The lossSum can become infinite because we do not standardize the > features before fitting the model; we should support feature standardization. > Another benefit is that standardization will improve the convergence rate. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
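The proposed standardization step can be sketched in plain Scala (illustrative only; the real change lives inside AFTSurvivalRegression's cost function): scale each feature column to zero mean and unit variance before optimization, which bounds the terms entering the loss and improves conditioning.

```scala
// Column-wise standardization sketch: z = (x - mean) / stddev.
object StandardizeSketch {
  // Per-column (mean, stddev) over rows of equal length.
  def stats(rows: Seq[Array[Double]]): (Array[Double], Array[Double]) = {
    val n = rows.length.toDouble
    val dim = rows.head.length
    val mean = Array.tabulate(dim)(j => rows.map(_(j)).sum / n)
    val std = Array.tabulate(dim) { j =>
      math.sqrt(rows.map(r => math.pow(r(j) - mean(j), 2)).sum / n)
    }
    (mean, std)
  }

  // Constant columns (stddev 0) are mapped to 0 instead of dividing by zero.
  def standardize(rows: Seq[Array[Double]]): Seq[Array[Double]] = {
    val (mean, std) = stats(rows)
    rows.map { r =>
      Array.tabulate(r.length) { j =>
        if (std(j) == 0.0) 0.0 else (r(j) - mean(j)) / std(j)
      }
    }
  }
}
```

Coefficients learned on standardized features then need to be scaled back by 1/stddev (with the intercept adjusted) so the reported model stays on the original feature scale.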
[jira] [Updated] (SPARK-13322) AFTSurvivalRegression should support feature standardization
[ https://issues.apache.org/jira/browse/SPARK-13322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-13322: -- Assignee: Yanbo Liang > AFTSurvivalRegression should support feature standardization > > > Key: SPARK-13322 > URL: https://issues.apache.org/jira/browse/SPARK-13322 > Project: Spark > Issue Type: Bug > Components: ML >Reporter: Yanbo Liang >Assignee: Yanbo Liang
[jira] [Commented] (SPARK-13362) Build Error: java.lang.OutOfMemoryError: PermGen space
[ https://issues.apache.org/jira/browse/SPARK-13362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15150683#comment-15150683 ] Mohit Garg commented on SPARK-13362: thanks. > Build Error: java.lang.OutOfMemoryError: PermGen space > -- > > Key: SPARK-13362 > URL: https://issues.apache.org/jira/browse/SPARK-13362 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 1.5.2 > Environment: Processor: Intel® Core™ i5-5300U CPU @ 2.30GHz × 4 > OS Type: 64-bit > OS: Ubuntu 15.04 > JVM: OpenJDK 64-Bit Server VM (24.79-b02, mixed mode) > Java: version 1.7.0_79, vendor Oracle Corporation >Reporter: Mohit Garg >Priority: Trivial > Labels: build > Attachments: Error.png > > Original Estimate: 1h > Remaining Estimate: 1h > > While building spark from source: > > git clone https://github.com/apache/spark.git > > cd spark > > mvn -DskipTests clean package -e > ERROR: > [ERROR] PermGen space -> [Help 1] > java.lang.OutOfMemoryError: PermGen space > at java.lang.ClassLoader.defineClass1(Native Method) > at java.lang.ClassLoader.defineClass(ClassLoader.java:800) > at > java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142) > at java.net.URLClassLoader.defineClass(URLClassLoader.java:449) > at java.net.URLClassLoader.access$100(URLClassLoader.java:71) > at java.net.URLClassLoader$1.run(URLClassLoader.java:361) > at java.net.URLClassLoader$1.run(URLClassLoader.java:355) > at java.security.AccessController.doPrivileged(Native Method) > at java.net.URLClassLoader.findClass(URLClassLoader.java:354) > at java.lang.ClassLoader.loadClass(ClassLoader.java:425) > at java.lang.ClassLoader.loadClass(ClassLoader.java:358) > at java.lang.ClassLoader.defineClass1(Native Method) > at java.lang.ClassLoader.defineClass(ClassLoader.java:800) > at > java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142) > at java.net.URLClassLoader.defineClass(URLClassLoader.java:449) > at 
java.net.URLClassLoader.access$100(URLClassLoader.java:71) > at java.net.URLClassLoader$1.run(URLClassLoader.java:361) > at java.net.URLClassLoader$1.run(URLClassLoader.java:355) > at java.security.AccessController.doPrivileged(Native Method) > at java.net.URLClassLoader.findClass(URLClassLoader.java:354) > at java.lang.ClassLoader.loadClass(ClassLoader.java:425) > at java.lang.ClassLoader.loadClass(ClassLoader.java:358) > at java.lang.Class.getDeclaredMethods0(Native Method) > at java.lang.Class.privateGetDeclaredMethods(Class.java:2615) > at java.lang.Class.getDeclaredMethod(Class.java:2007) > at xsbt.CachedCompiler0$Compiler.superCall(CompilerInterface.scala:235) > at > xsbt.CachedCompiler0$Compiler.superComputePhaseDescriptors(CompilerInterface.scala:230) > at > xsbt.CachedCompiler0$Compiler.phaseDescriptors$lzycompute(CompilerInterface.scala:227) > at > xsbt.CachedCompiler0$Compiler.phaseDescriptors(CompilerInterface.scala:222) > at scala.tools.nsc.Global$Run.(Global.scala:1237) > at xsbt.CachedCompiler0$$anon$2.(CompilerInterface.scala:113) > at xsbt.CachedCompiler0.run(CompilerInterface.scala:113) > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13362) Build Error: java.lang.OutOfMemoryError: PermGen space
[ https://issues.apache.org/jira/browse/SPARK-13362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mohit Garg updated SPARK-13362: --- Attachment: Error.png VisualVM snapshot > Build Error: java.lang.OutOfMemoryError: PermGen space > -- > > Key: SPARK-13362 > URL: https://issues.apache.org/jira/browse/SPARK-13362 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 1.5.2 > Environment: Processor: Intel® Core™ i5-5300U CPU @ 2.30GHz × 4 > OS Type: 64-bit > OS: Ubuntu 15.04 > JVM: OpenJDK 64-Bit Server VM (24.79-b02, mixed mode) > Java: version 1.7.0_79, vendor Oracle Corporation >Reporter: Mohit Garg >Priority: Trivial > Labels: build > Attachments: Error.png
[jira] [Resolved] (SPARK-13362) Build Error: java.lang.OutOfMemoryError: PermGen space
[ https://issues.apache.org/jira/browse/SPARK-13362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-13362. --- Resolution: Not A Problem Fix Version/s: (was: 1.5.2) Please read the build docs. You didn't set your MAVEN_OPTS. Please also read https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark before opening a JIRA > Build Error: java.lang.OutOfMemoryError: PermGen space > -- > > Key: SPARK-13362 > URL: https://issues.apache.org/jira/browse/SPARK-13362 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 1.5.2 > Environment: Processor: Intel® Core™ i5-5300U CPU @ 2.30GHz × 4 > OS Type: 64-bit > OS: Ubuntu 15.04 > JVM: OpenJDK 64-Bit Server VM (24.79-b02, mixed mode) > Java: version 1.7.0_79, vendor Oracle Corporation >Reporter: Mohit Garg >Priority: Trivial > Labels: build
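For reference, the MAVEN_OPTS fix Sean points at looks roughly like the following. The exact values are an assumption taken from the Spark build documentation of that era (Java 7, where class metadata lives in PermGen); treat them as a starting point:

```shell
# Give Maven a bigger heap and PermGen before building Spark on Java 7.
export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512m -XX:ReservedCodeCacheSize=512m"
# then re-run the build:
# mvn -DskipTests clean package
```

On Java 8 and later the PermGen flag is ignored (class metadata moved to Metaspace), so only the heap and code-cache settings matter there.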
[jira] [Updated] (SPARK-13362) Build Error: java.lang.OutOfMemoryError: PermGen space
[ https://issues.apache.org/jira/browse/SPARK-13362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mohit Garg updated SPARK-13362: --- Issue Type: Bug (was: Improvement) > Build Error: java.lang.OutOfMemoryError: PermGen space > -- > > Key: SPARK-13362 > URL: https://issues.apache.org/jira/browse/SPARK-13362 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 1.5.2 > Environment: Processor: Intel® Core™ i5-5300U CPU @ 2.30GHz × 4 > OS Type: 64-bit > OS: Ubuntu 15.04 > JVM: OpenJDK 64-Bit Server VM (24.79-b02, mixed mode) > Java: version 1.7.0_79, vendor Oracle Corporation >Reporter: Mohit Garg >Priority: Trivial > Labels: build > Fix For: 1.5.2
[jira] [Created] (SPARK-13362) Build Error: java.lang.OutOfMemoryError: PermGen space
Mohit Garg created SPARK-13362:
--

             Summary: Build Error: java.lang.OutOfMemoryError: PermGen space
                 Key: SPARK-13362
                 URL: https://issues.apache.org/jira/browse/SPARK-13362
             Project: Spark
          Issue Type: Improvement
          Components: Build
    Affects Versions: 1.5.2
         Environment: Processor: Intel® Core™ i5-5300U CPU @ 2.30GHz × 4
OS Type: 64-bit
OS: Ubuntu 15.04
JVM: OpenJDK 64-Bit Server VM (24.79-b02, mixed mode)
Java: version 1.7.0_79, vendor Oracle Corporation
            Reporter: Mohit Garg
            Priority: Trivial
             Fix For: 1.5.2

While building Spark from source:

> git clone https://github.com/apache/spark.git
> cd spark
> mvn -DskipTests clean package -e

ERROR: [ERROR] PermGen space -> [Help 1]
java.lang.OutOfMemoryError: PermGen space
	at java.lang.ClassLoader.defineClass1(Native Method)
	at java.lang.ClassLoader.defineClass(ClassLoader.java:800)
	at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
	at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
	at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
	at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
	at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
	at java.security.AccessController.doPrivileged(Native Method)
	at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
	at java.lang.ClassLoader.defineClass1(Native Method)
	at java.lang.ClassLoader.defineClass(ClassLoader.java:800)
	at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
	at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
	at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
	at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
	at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
	at java.security.AccessController.doPrivileged(Native Method)
	at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
	at java.lang.Class.getDeclaredMethods0(Native Method)
	at java.lang.Class.privateGetDeclaredMethods(Class.java:2615)
	at java.lang.Class.getDeclaredMethod(Class.java:2007)
	at xsbt.CachedCompiler0$Compiler.superCall(CompilerInterface.scala:235)
	at xsbt.CachedCompiler0$Compiler.superComputePhaseDescriptors(CompilerInterface.scala:230)
	at xsbt.CachedCompiler0$Compiler.phaseDescriptors$lzycompute(CompilerInterface.scala:227)
	at xsbt.CachedCompiler0$Compiler.phaseDescriptors(CompilerInterface.scala:222)
	at scala.tools.nsc.Global$Run.<init>(Global.scala:1237)
	at xsbt.CachedCompiler0$$anon$2.<init>(CompilerInterface.scala:113)
	at xsbt.CachedCompiler0.run(CompilerInterface.scala:113)

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
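Not part of the original report, but for context: on Java 7 and earlier this error is typically worked around by enlarging the Maven JVM's PermGen space before building. A minimal sketch, assuming a Bourne-compatible shell; the sizes below follow what the Spark 1.x build documentation recommended and may need tuning for your machine:

```shell
# Raise the Maven JVM's heap and (on Java 7 and earlier) PermGen limits
# before building Spark. -XX:MaxPermSize has no effect on Java 8+, where
# PermGen was replaced by Metaspace.
export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"

# Then rerun the build, e.g.:
# mvn -DskipTests clean package
```

Putting the export in your shell profile avoids having to reapply it in every new session.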
[jira] [Commented] (SPARK-10759) Missing Python code example in ML Programming guide
[ https://issues.apache.org/jira/browse/SPARK-10759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15150654#comment-15150654 ]

Apache Spark commented on SPARK-10759:
--

User 'JeremyNixon' has created a pull request for this issue:
https://github.com/apache/spark/pull/11240

> Missing Python code example in ML Programming guide
> ---
>
>                 Key: SPARK-10759
>                 URL: https://issues.apache.org/jira/browse/SPARK-10759
>             Project: Spark
>          Issue Type: Improvement
>          Components: Documentation
>    Affects Versions: 1.5.0
>            Reporter: Raela Wang
>            Assignee: Apache Spark
>            Priority: Minor
>              Labels: starter
>
> http://spark.apache.org/docs/latest/ml-guide.html#example-model-selection-via-cross-validation
> http://spark.apache.org/docs/latest/ml-guide.html#example-model-selection-via-train-validation-split
[jira] [Resolved] (SPARK-9273) Add Convolutional Neural network to Spark MLlib
[ https://issues.apache.org/jira/browse/SPARK-9273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen resolved SPARK-9273.
--
    Resolution: Duplicate

[~asimjalis] it's not going to happen (directly) in Spark anyway, but this is not different enough from the JIRA I linked to reopen this. Please don't. See other exact dupes that are linked to that issue.

> Add Convolutional Neural network to Spark MLlib
> ---
>
>                 Key: SPARK-9273
>                 URL: https://issues.apache.org/jira/browse/SPARK-9273
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: yuhao yang
>            Assignee: yuhao yang
>
> Add Convolutional Neural network to Spark MLlib
[jira] [Closed] (SPARK-9273) Add Convolutional Neural network to Spark MLlib
[ https://issues.apache.org/jira/browse/SPARK-9273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen closed SPARK-9273.
--

> Add Convolutional Neural network to Spark MLlib
> ---
>
>                 Key: SPARK-9273
>                 URL: https://issues.apache.org/jira/browse/SPARK-9273
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: yuhao yang
>            Assignee: yuhao yang
>
> Add Convolutional Neural network to Spark MLlib