[jira] [Assigned] (SPARK-17072) generate table level stats:stats generation/storing/loading
[ https://issues.apache.org/jira/browse/SPARK-17072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17072: Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-17072) generate table level stats:stats generation/storing/loading
[ https://issues.apache.org/jira/browse/SPARK-17072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15427692#comment-15427692 ] Apache Spark commented on SPARK-17072: User 'wzhfy' has created a pull request for this issue: https://github.com/apache/spark/pull/14712 > generate table level stats:stats generation/storing/loading > Key: SPARK-17072 > URL: https://issues.apache.org/jira/browse/SPARK-17072 > Project: Spark > Issue Type: Sub-task > Components: Optimizer > Affects Versions: 2.0.0 > Reporter: Ron Hu > > We need to generate, store, and load statistics information into/from the metastore.
[jira] [Assigned] (SPARK-17072) generate table level stats:stats generation/storing/loading
[ https://issues.apache.org/jira/browse/SPARK-17072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17072: Assignee: Apache Spark
[jira] [Commented] (SPARK-16581) Making JVM backend calling functions public
[ https://issues.apache.org/jira/browse/SPARK-16581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15427687#comment-15427687 ] Felix Cheung commented on SPARK-16581: I think JVM<->R is closely related to RBackend? We are not trying to build a library that works generically with any JVM from R (like py4j), only with the JVM that Spark is running in, via a custom socket protocol. There might come a time when we want to operate from an R shell while working with multiple JVM backends (or remote backends), or want more control over recycling the backend process, not completely dissimilar to cleanup.jobj, etc. In addition to connecting to a remote JVM, we might want to expose the JVM-side RBackend API to allow re-using an existing Spark JVM process (several Spark JIRAs in the past) for cases like Spark Job Server (persisted Spark session) and Apache Toree (incubating) / Livy (cross-language support) (e.g. https://issues.cloudera.org/projects/LIVY/issues/LIVY-194). Some of these could change how callJMethod/invokeJava work, what parameters are required, and so on. Of course, all of this could be very far off :) > Making JVM backend calling functions public > Key: SPARK-16581 > URL: https://issues.apache.org/jira/browse/SPARK-16581 > Project: Spark > Issue Type: Sub-task > Components: SparkR > Reporter: Shivaram Venkataraman > > As described in the design doc in SPARK-15799, to help packages that need to call into the JVM, it would be good to expose some of the R -> JVM functions we have. > As part of this we could also rename and reformat the functions to make them more user friendly.
[jira] [Comment Edited] (SPARK-15816) SQL server based on Postgres protocol
[ https://issues.apache.org/jira/browse/SPARK-15816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15427671#comment-15427671 ] Takeshi Yamamuro edited comment on SPARK-15816 at 8/19/16 6:07 AM: [~sarutak][~dobashim] I just posted the design doc; it is currently under review by saruta-san and dobashi-san. was (Author: maropu): [~sarutak] I just posted the design doc; it is currently under review by saruta-san. > SQL server based on Postgres protocol > Key: SPARK-15816 > URL: https://issues.apache.org/jira/browse/SPARK-15816 > Project: Spark > Issue Type: Improvement > Components: SQL > Reporter: Reynold Xin > Attachments: New_SQL_Server_for_Spark.pdf > > At Spark Summit today this idea came up in a discussion: it would be great to investigate the possibility of implementing a new SQL server using Postgres' protocol, in lieu of Hive ThriftServer 2. I'm creating this ticket to track the idea, in case others have feedback. > This server could have a simpler architecture, and would allow users to leverage the wide range of tools that are already available for Postgres (and many commercial database systems based on Postgres). > Some of the problems we'd need to figure out are: > 1. What is the Postgres protocol? Is there official documentation for it? > 2. How difficult would it be to implement that protocol in Spark (on the JVM in particular)? > 3. How does data type mapping work? > 4. How do system commands work? Would Spark need to support all of Postgres' commands? > 5. Any restrictions in supporting nested data?
[jira] [Commented] (SPARK-15816) SQL server based on Postgres protocol
[ https://issues.apache.org/jira/browse/SPARK-15816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15427671#comment-15427671 ] Takeshi Yamamuro commented on SPARK-15816: [~sarutak] I just posted the design doc; it is currently under review by saruta-san.
[jira] [Updated] (SPARK-15816) SQL server based on Postgres protocol
[ https://issues.apache.org/jira/browse/SPARK-15816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-15816: Attachment: New_SQL_Server_for_Spark.pdf
[jira] [Commented] (SPARK-17140) Add initial model to MultinomialLogisticRegression
[ https://issues.apache.org/jira/browse/SPARK-17140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15427664#comment-15427664 ] Seth Hendrickson commented on SPARK-17140: I can take this one. > Add initial model to MultinomialLogisticRegression > Key: SPARK-17140 > URL: https://issues.apache.org/jira/browse/SPARK-17140 > Project: Spark > Issue Type: Sub-task > Components: ML > Reporter: Seth Hendrickson > > We should add initial model support to Multinomial logistic regression.
[jira] [Commented] (SPARK-16822) Support latex in scaladoc with MathJax
[ https://issues.apache.org/jira/browse/SPARK-16822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15427661#comment-15427661 ] Apache Spark commented on SPARK-16822: User 'jagadeesanas2' has created a pull request for this issue: https://github.com/apache/spark/pull/14711 > Support latex in scaladoc with MathJax > Key: SPARK-16822 > URL: https://issues.apache.org/jira/browse/SPARK-16822 > Project: Spark > Issue Type: Improvement > Components: Documentation > Reporter: Shuai Lin > Assignee: Shuai Lin > Priority: Minor > Fix For: 2.1.0 > > The scaladoc of some classes (mainly ml/mllib classes) includes math formulas, but they currently render very poorly, e.g. [the doc of the LogisticGradient class|https://spark.apache.org/docs/2.0.0-preview/api/scala/index.html#org.apache.spark.mllib.optimization.LogisticGradient]. We can improve this by including the MathJax JavaScript in the scaladoc pages, much like what we do for the markdown docs.
[jira] [Created] (SPARK-17151) Decide how to handle inferring number of classes in Multinomial logistic regression
Seth Hendrickson created SPARK-17151: Summary: Decide how to handle inferring number of classes in Multinomial logistic regression Key: SPARK-17151 URL: https://issues.apache.org/jira/browse/SPARK-17151 Project: Spark Issue Type: Sub-task Reporter: Seth Hendrickson Priority: Minor This JIRA is to discuss how the number of label classes should be inferred in multinomial logistic regression. Currently, MLOR checks the dataframe metadata and if the number of classes is not specified then it uses the maximum value seen in the label column. If the labels are not properly indexed, then this can cause a large number of zero coefficients and potentially produce instabilities in model training.
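For context, a minimal sketch (not part of the proposal; it assumes a DataFrame {{df}} with a raw {{label}} column) of how properly indexed labels carry class-count metadata that MLOR could read instead of guessing from the maximum label value:
{code}
import org.apache.spark.ml.feature.StringIndexer

// StringIndexer produces a dense, 0-based label index and attaches nominal-attribute
// metadata (including the number of distinct labels) to the output column.
val indexer = new StringIndexer()
  .setInputCol("label")          // assumed raw label column
  .setOutputCol("indexedLabel")
// val indexed = indexer.fit(df).transform(df)  // df is assumed to exist
{code}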
[jira] [Updated] (SPARK-16216) CSV data source does not write date and timestamp correctly
[ https://issues.apache.org/jira/browse/SPARK-16216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-16216: Target Version/s: 2.0.1, 2.1.0 Priority: Blocker (was: Major) > CSV data source does not write date and timestamp correctly > Key: SPARK-16216 > URL: https://issues.apache.org/jira/browse/SPARK-16216 > Project: Spark > Issue Type: Improvement > Components: SQL > Affects Versions: 2.0.0 > Reporter: Hyukjin Kwon > Priority: Blocker > Labels: releasenotes > > Currently, the CSV data source writes {{DateType}} and {{TimestampType}} as below:
{code}
++
|date|
++
|14406372|
|14144598|
|14540400|
++
{code}
> It would be nicer if it wrote dates and timestamps as formatted strings, just like the JSON data source does. > Also, the CSV data source currently supports a {{dateFormat}} option to read dates and timestamps in a custom format. It would be better if this option could be applied when writing as well.
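A hedged sketch of the requested behavior (the write-side {{dateFormat}} option is what this ticket proposes, not an existing API at the time; {{df}} is an assumed DataFrame with a date column and the output path is hypothetical):
{code}
// Desired: dates/timestamps written as formatted strings, mirroring the read-side option.
df.write
  .option("dateFormat", "yyyy-MM-dd HH:mm:ss")  // proposed write-side counterpart
  .option("header", "true")
  .csv("/tmp/dates_out")
{code}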
[jira] [Commented] (SPARK-16533) Spark application not handling preemption messages
[ https://issues.apache.org/jira/browse/SPARK-16533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15427625#comment-15427625 ] Apache Spark commented on SPARK-16533: User 'angolon' has created a pull request for this issue: https://github.com/apache/spark/pull/14710 > Spark application not handling preemption messages > Key: SPARK-16533 > URL: https://issues.apache.org/jira/browse/SPARK-16533 > Project: Spark > Issue Type: Bug > Components: EC2, Input/Output, Optimizer, Scheduler, Spark Submit, YARN > Affects Versions: 1.6.0 > Environment: Yarn version: Hadoop 2.7.1-amzn-0; AWS EMR cluster running 1 x r3.8xlarge (Master) and 52 x r3.8xlarge (Core); Spark version: 1.6.0; Scala version: 2.10.5; Java version: 1.8.0_51; Input size: ~10 TB; Input coming from S3. Queue configuration: Dynamic allocation: enabled; Preemption: enabled; Q1: 70% capacity with max of 100%; Q2: 30% capacity with max of 100%. Job configuration: Driver memory = 10g; Executor cores = 6; Executor memory = 10g; Deploy mode = cluster; Master = yarn; maxResultSize = 4g; Shuffle manager = hash > Reporter: Lucas Winkelmann > > Here is the scenario: > I launch job 1 into Q1 and allow it to grow to 100% cluster utilization. > I wait between 15-30 mins (for this job to complete with 100% of the cluster available takes about 1 hr, so job 1 is between 25-50% complete). Note that if I wait less time the issue sometimes does not occur; it appears to happen only after job 1 is at least 25% complete. > I launch job 2 into Q2 and preemption occurs on Q1, shrinking job 1 to 70% of cluster utilization. > At this point job 1 basically halts progress while job 2 continues to execute as normal and finishes. Job 1 then either: > - Fails its attempt and restarts. By the time this attempt fails, the other job is already complete, meaning the second attempt has full cluster availability and finishes. > - Remains at its current progress and simply does not finish (I have waited ~6 hrs before finally killing the application). > Looking into the error log there is this constant error message: > WARN NettyRpcEndpointRef: Error sending message [message = RemoveExecutor(454,Container container_1468422920649_0001_01_000594 on host: ip-NUMBERS.ec2.internal was preempted.)] in X attempts > My observations have led me to believe that the application master does not know about this container being killed and continuously asks the container to remove the executor, until it eventually fails the attempt or just keeps trying to remove the executor indefinitely. > I have done much digging online for anyone else experiencing this issue but have come up with nothing.
[jira] [Assigned] (SPARK-16533) Spark application not handling preemption messages
[ https://issues.apache.org/jira/browse/SPARK-16533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16533: Assignee: Apache Spark
[jira] [Assigned] (SPARK-16533) Spark application not handling preemption messages
[ https://issues.apache.org/jira/browse/SPARK-16533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16533: Assignee: (was: Apache Spark)
[jira] [Assigned] (SPARK-17150) Support SQL generation for inline tables
[ https://issues.apache.org/jira/browse/SPARK-17150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17150: Assignee: (was: Apache Spark)
[jira] [Assigned] (SPARK-17150) Support SQL generation for inline tables
[ https://issues.apache.org/jira/browse/SPARK-17150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17150: Assignee: Apache Spark
[jira] [Commented] (SPARK-17150) Support SQL generation for inline tables
[ https://issues.apache.org/jira/browse/SPARK-17150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15427589#comment-15427589 ] Apache Spark commented on SPARK-17150: User 'petermaxlee' has created a pull request for this issue: https://github.com/apache/spark/pull/14709 > Support SQL generation for inline tables > Key: SPARK-17150 > URL: https://issues.apache.org/jira/browse/SPARK-17150 > Project: Spark > Issue Type: New Feature > Components: SQL > Reporter: Peter Lee > > Inline tables currently do not support SQL generation, and as a result a view that depends on inline tables would fail.
[jira] [Created] (SPARK-17150) Support SQL generation for inline tables
Peter Lee created SPARK-17150: Summary: Support SQL generation for inline tables Key: SPARK-17150 URL: https://issues.apache.org/jira/browse/SPARK-17150 Project: Spark Issue Type: New Feature Components: SQL Reporter: Peter Lee Inline tables currently do not support SQL generation, and as a result a view that depends on inline tables would fail.
[jira] [Commented] (SPARK-17145) Object with many fields causes Seq Serialization Bug
[ https://issues.apache.org/jira/browse/SPARK-17145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15427565#comment-15427565 ] Liwei Lin commented on SPARK-17145: Hi [~abdulla16], can you try https://github.com/apache/spark/pull/14698 out and see if it solves your problem? Thanks! > Object with many fields causes Seq Serialization Bug > Key: SPARK-17145 > URL: https://issues.apache.org/jira/browse/SPARK-17145 > Project: Spark > Issue Type: Bug > Affects Versions: 2.0.0 > Environment: OS: OSX El Capitan 10.11.6 > Reporter: Abdulla Al-Qawasmeh > > The unit test here (https://gist.github.com/abdulla16/433faf7df59fce11a5fff284bac0d945) describes the problem. > It looks like Spark is having problems serializing a Scala Seq when it's part of an object with many fields (I'm not 100% sure it's a serialization problem). The deserialized Seq ends up with as many items as the original Seq; however, all the items are copies of the last item in the original Seq. > The object that I used in my unit test (as an example) is a Tuple5. However, I've seen this behavior in other types of objects. > Reducing MyClass5 to only two fields (field34 and field35) causes the unit test to pass.
[jira] [Commented] (SPARK-17137) Add compressed support for multinomial logistic regression coefficients
[ https://issues.apache.org/jira/browse/SPARK-17137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15427563#comment-15427563 ] Yanbo Liang commented on SPARK-17137: I think we should provide a transparent interface to users rather than exposing a param to control whether to output dense or sparse coefficients. Spark MLlib {{Vector.compressed}} returns a vector in either dense or sparse format, whichever uses less storage. I would like to do the performance tests for this issue. Thanks! > Add compressed support for multinomial logistic regression coefficients > Key: SPARK-17137 > URL: https://issues.apache.org/jira/browse/SPARK-17137 > Project: Spark > Issue Type: Sub-task > Components: ML > Reporter: Seth Hendrickson > Priority: Minor > > For sparse coefficients in MLOR, such as when using high L1 regularization, it may be more efficient to store coefficients in a compressed format. We can add this option to MLOR and perhaps do some performance tests to verify improvements.
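A minimal illustration of the {{Vector.compressed}} behavior referenced above (standalone sketch using the MLlib linalg types, not MLOR code itself):
{code}
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// A mostly-zero coefficient vector, as strong L1 regularization tends to produce.
val dense: Vector = Vectors.dense(0.0, 0.0, 3.5, 0.0, 0.0, 0.0, 1.2, 0.0)

// compressed picks whichever representation (dense or sparse) uses less storage,
// so callers never need a flag to choose the format explicitly.
val compact: Vector = dense.compressed
println(compact)  // sparse here, since only 2 of 8 entries are non-zero
{code}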
[jira] [Commented] (SPARK-17149) array.sql for testing array related functions
[ https://issues.apache.org/jira/browse/SPARK-17149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15427557#comment-15427557 ] Apache Spark commented on SPARK-17149: User 'petermaxlee' has created a pull request for this issue: https://github.com/apache/spark/pull/14708 > array.sql for testing array related functions > Key: SPARK-17149 > URL: https://issues.apache.org/jira/browse/SPARK-17149 > Project: Spark > Issue Type: Sub-task > Components: SQL > Reporter: Peter Lee
[jira] [Assigned] (SPARK-17149) array.sql for testing array related functions
[ https://issues.apache.org/jira/browse/SPARK-17149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17149: Assignee: (was: Apache Spark)
[jira] [Assigned] (SPARK-17149) array.sql for testing array related functions
[ https://issues.apache.org/jira/browse/SPARK-17149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17149: Assignee: Apache Spark
[jira] [Commented] (SPARK-16914) NodeManager crash when Spark is registering executor information into leveldb
[ https://issues.apache.org/jira/browse/SPARK-16914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15427556#comment-15427556 ] cen yuhai commented on SPARK-16914: [~jerryshao] Hi saisai, I think SPARK-14963 is useless here because getRecoveryPath will choose the first directory in "yarn.nodemanager.local-dirs"; it should pick a directory at random. > NodeManager crash when Spark is registering executor information into leveldb > Key: SPARK-16914 > URL: https://issues.apache.org/jira/browse/SPARK-16914 > Project: Spark > Issue Type: Bug > Components: Shuffle > Affects Versions: 1.6.2 > Reporter: cen yuhai > >
{noformat}
Stack: [0x7fb5b53de000,0x7fb5b54df000], sp=0x7fb5b54dcba8, free space=1018k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
C [libc.so.6+0x896b1] memcpy+0x11
Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
j org.fusesource.leveldbjni.internal.NativeDB$DBJNI.Put(JLorg/fusesource/leveldbjni/internal/NativeWriteOptions;Lorg/fusesource/leveldbjni/internal/NativeSlice;Lorg/fusesource/leveldbjni/internal/NativeSlice;)J+0
j org.fusesource.leveldbjni.internal.NativeDB.put(Lorg/fusesource/leveldbjni/internal/NativeWriteOptions;Lorg/fusesource/leveldbjni/internal/NativeSlice;Lorg/fusesource/leveldbjni/internal/NativeSlice;)V+11
j org.fusesource.leveldbjni.internal.NativeDB.put(Lorg/fusesource/leveldbjni/internal/NativeWriteOptions;Lorg/fusesource/leveldbjni/internal/NativeBuffer;Lorg/fusesource/leveldbjni/internal/NativeBuffer;)V+18
j org.fusesource.leveldbjni.internal.NativeDB.put(Lorg/fusesource/leveldbjni/internal/NativeWriteOptions;[B[B)V+36
j org.fusesource.leveldbjni.internal.JniDB.put([B[BLorg/iq80/leveldb/WriteOptions;)Lorg/iq80/leveldb/Snapshot;+28
j org.fusesource.leveldbjni.internal.JniDB.put([B[B)V+10
j org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.registerExecutor(Ljava/lang/String;Ljava/lang/String;Lorg/apache/spark/network/shuffle/protocol/ExecutorShuffleInfo;)V+61
J 8429 C2 org.apache.spark.network.server.TransportRequestHandler.handle(Lorg/apache/spark/network/protocol/RequestMessage;)V (100 bytes) @ 0x7fb5f27ff6cc [0x7fb5f27fdde0+0x18ec]
J 8371 C2 org.apache.spark.network.server.TransportChannelHandler.channelRead0(Lio/netty/channel/ChannelHandlerContext;Ljava/lang/Object;)V (10 bytes) @ 0x7fb5f242df20 [0x7fb5f242de80+0xa0]
J 6853 C2 io.netty.channel.SimpleChannelInboundHandler.channelRead(Lio/netty/channel/ChannelHandlerContext;Ljava/lang/Object;)V (74 bytes) @ 0x7fb5f215587c [0x7fb5f21557e0+0x9c]
J 5872 C2 io.netty.handler.timeout.IdleStateHandler.channelRead(Lio/netty/channel/ChannelHandlerContext;Ljava/lang/Object;)V (42 bytes) @ 0x7fb5f2183268 [0x7fb5f2183100+0x168]
J 5849 C2 io.netty.handler.codec.MessageToMessageDecoder.channelRead(Lio/netty/channel/ChannelHandlerContext;Ljava/lang/Object;)V (158 bytes) @ 0x7fb5f2191524 [0x7fb5f218f5a0+0x1f84]
J 5941 C2 org.apache.spark.network.util.TransportFrameDecoder.channelRead(Lio/netty/channel/ChannelHandlerContext;Ljava/lang/Object;)V (170 bytes) @ 0x7fb5f220a230 [0x7fb5f2209fc0+0x270]
J 7747 C2 io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read()V (363 bytes) @ 0x7fb5f264465c [0x7fb5f2644140+0x51c]
J 8008% C2 io.netty.channel.nio.NioEventLoop.run()V (162 bytes) @ 0x7fb5f26f6764 [0x7fb5f26f63c0+0x3a4]
j io.netty.util.concurrent.SingleThreadEventExecutor$2.run()V+13
j java.lang.Thread.run()V+11
v ~StubRoutines::call_stub
{noformat}
> The target code in Spark is in ExternalShuffleBlockResolver:
{code}
  /** Registers a new Executor with all the configuration we need to find its shuffle files. */
  public void registerExecutor(
      String appId,
      String execId,
      ExecutorShuffleInfo executorInfo) {
    AppExecId fullId = new AppExecId(appId, execId);
    logger.info("Registered executor {} with {}", fullId, executorInfo);
    try {
      if (db != null) {
        byte[] key = dbAppExecKey(fullId);
        byte[] value = mapper.writeValueAsString(executorInfo).getBytes(Charsets.UTF_8);
        db.put(key, value);
      }
    } catch (Exception e) {
      logger.error("Error saving registered executors", e);
    }
    executors.put(fullId, executorInfo);
  }
{code}
> There is a problem with disk1
[jira] [Created] (SPARK-17149) array.sql for testing array related functions
Peter Lee created SPARK-17149: Summary: array.sql for testing array related functions Key: SPARK-17149 URL: https://issues.apache.org/jira/browse/SPARK-17149 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Peter Lee
[jira] [Created] (SPARK-17148) NodeManager exit because of exception “Executor is not registered”
cen yuhai created SPARK-17148: Summary: NodeManager exit because of exception “Executor is not registered” Key: SPARK-17148 URL: https://issues.apache.org/jira/browse/SPARK-17148 Project: Spark Issue Type: Bug Components: Shuffle Affects Versions: 1.6.2 Environment: hadoop 2.7.2, spark 1.6.2 Reporter: cen yuhai
{noformat}
java.lang.RuntimeException: Executor is not registered (appId=application_1467288504738_1341061, execId=423)
    at org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.getBlockData(ExternalShuffleBlockResolver.java:183)
    at org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.handleMessage(ExternalShuffleBlockHandler.java:85)
    at org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.receive(ExternalShuffleBlockHandler.java:72)
    at org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:149)
    at org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:102)
    at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:104)
    at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51)
    at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
    at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
    at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
    at org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:86)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
    at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846)
    at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
    at java.lang.Thread.run(Thread.java:745)
{noformat}
[jira] [Commented] (SPARK-17136) Design optimizer interface for ML algorithms
[ https://issues.apache.org/jira/browse/SPARK-17136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15427539#comment-15427539 ] Yanbo Liang commented on SPARK-17136: I would like to know whether users' own optimizers would follow some standard API similar to breeze {{LBFGS}}, or something else? > Design optimizer interface for ML algorithms > Key: SPARK-17136 > URL: https://issues.apache.org/jira/browse/SPARK-17136 > Project: Spark > Issue Type: Sub-task > Components: ML > Reporter: Seth Hendrickson > > We should consider designing an interface that allows users to use their own optimizers in some of the ML algorithms, similar to MLlib.
[jira] [Comment Edited] (SPARK-17139) Add model summary for MultinomialLogisticRegression
[ https://issues.apache.org/jira/browse/SPARK-17139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15427519#comment-15427519 ] Weichen Xu edited comment on SPARK-17139 at 8/19/16 3:05 AM: I will work on it and create a PR when the dependent algorithm is merged, thanks. was (Author: weichenxu123): I will work on it and create a PR soon, thanks. > Add model summary for MultinomialLogisticRegression > Key: SPARK-17139 > URL: https://issues.apache.org/jira/browse/SPARK-17139 > Project: Spark > Issue Type: Sub-task > Components: ML > Reporter: Seth Hendrickson > > Add a model summary to multinomial logistic regression using the same interface as in other ML models.
[jira] [Comment Edited] (SPARK-17138) Python API for multinomial logistic regression
[ https://issues.apache.org/jira/browse/SPARK-17138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15427518#comment-15427518 ] Weichen Xu edited comment on SPARK-17138 at 8/19/16 3:06 AM: I will work on it and create a PR when the dependent algorithm is merged, thanks. was (Author: weichenxu123): I will work on it and create a PR soon, thanks. > Python API for multinomial logistic regression > Key: SPARK-17138 > URL: https://issues.apache.org/jira/browse/SPARK-17138 > Project: Spark > Issue Type: Sub-task > Components: ML > Reporter: Seth Hendrickson > > Once [SPARK-7159|https://issues.apache.org/jira/browse/SPARK-7159] is merged, we should add a Python API for it.
[jira] [Comment Edited] (SPARK-17134) Use level 2 BLAS operations in LogisticAggregator
[ https://issues.apache.org/jira/browse/SPARK-17134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15427529#comment-15427529 ] Yanbo Liang edited comment on SPARK-17134 at 8/19/16 3:04 AM: This is interesting. We are also trying to use BLAS to accelerate linear algebra operations in other algorithms such as {{KMeans/ALS}}, and I have some basic performance test results. I would like to contribute to this task after SPARK-7159 is finished. Thanks! was (Author: yanboliang): This is interesting. We are also trying to use BLAS to accelerate linear algebra operations in other algorithms such as {{KMeans/ALS}}, and I have some basic performance test results. I would like to contribute to this task. Thanks! > Use level 2 BLAS operations in LogisticAggregator > Key: SPARK-17134 > URL: https://issues.apache.org/jira/browse/SPARK-17134 > Project: Spark > Issue Type: Sub-task > Components: ML > Reporter: Seth Hendrickson > > Multinomial logistic regression uses the LogisticAggregator class for gradient updates. We should look into refactoring MLOR to use level 2 BLAS operations for the updates. Performance testing should be done to show improvements.
[jira] [Commented] (SPARK-17134) Use level 2 BLAS operations in LogisticAggregator
[ https://issues.apache.org/jira/browse/SPARK-17134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15427529#comment-15427529 ] Yanbo Liang commented on SPARK-17134: This is interesting. We are also trying to use BLAS to accelerate linear algebra operations in other algorithms such as {{KMeans/ALS}}, and I have some basic performance test results. I would like to contribute to this task. Thanks! > Use level 2 BLAS operations in LogisticAggregator > Key: SPARK-17134 > URL: https://issues.apache.org/jira/browse/SPARK-17134 > Project: Spark > Issue Type: Sub-task > Components: ML > Reporter: Seth Hendrickson > > Multinomial logistic regression uses the LogisticAggregator class for gradient updates. We should look into refactoring MLOR to use level 2 BLAS operations for the updates. Performance testing should be done to show improvements.
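A small sketch of what "level 2 BLAS operations" means for this aggregator, using breeze (a dependency Spark already has); this is illustrative only and not the actual LogisticAggregator code, and the sizes and values are made up:
{code}
import breeze.linalg.{DenseMatrix, DenseVector}

val numClasses = 3
val numFeatures = 4

// Per-instance quantities: class-wise multipliers (residuals) and the feature vector.
val multipliers = DenseVector(0.1, -0.3, 0.2)
val x = DenseVector(1.0, 2.0, 0.5, -1.0)

// Element-wise (level-1 style) gradient update: one scaled addition per class/feature pair.
val gradLoop = DenseMatrix.zeros[Double](numClasses, numFeatures)
for (k <- 0 until numClasses; j <- 0 until numFeatures) {
  gradLoop(k, j) += multipliers(k) * x(j)
}

// The same rank-1 update expressed as a single outer product, which maps onto one
// level-2 BLAS call (ger) instead of many scalar updates.
val gradBlas: DenseMatrix[Double] = multipliers * x.t

assert((gradLoop - gradBlas).data.forall(v => math.abs(v) < 1e-12))
{code}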
[jira] [Commented] (SPARK-17139) Add model summary for MultinomialLogisticRegression
[ https://issues.apache.org/jira/browse/SPARK-17139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15427519#comment-15427519 ] Weichen Xu commented on SPARK-17139: I will work on it and create a PR soon, thanks.
[jira] [Commented] (SPARK-17138) Python API for multinomial logistic regression
[ https://issues.apache.org/jira/browse/SPARK-17138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15427518#comment-15427518 ] Weichen Xu commented on SPARK-17138: I will work on it and create a PR soon, thanks.
[jira] [Updated] (SPARK-16947) Support type coercion and foldable expression for inline tables
[ https://issues.apache.org/jira/browse/SPARK-16947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-16947: Fix Version/s: 2.0.1 > Support type coercion and foldable expression for inline tables > Key: SPARK-16947 > URL: https://issues.apache.org/jira/browse/SPARK-16947 > Project: Spark > Issue Type: Improvement > Components: SQL > Affects Versions: 2.1.0 > Reporter: Herman van Hovell > Assignee: Peter Lee > Fix For: 2.0.1, 2.1.0 > > Inline tables were added to Spark SQL in 2.0, e.g. {{select * from values (1, 'A'), (2, 'B') as tbl(a, b)}}. > This is currently implemented using a {{LocalRelation}}, and this relation is created during parsing. This has several weaknesses: you can only use simple expressions in such a plan, and type coercion is based on the first row in the relation, with all subsequent values cast to its type. The latter violates the principle of least surprise. > I would like to rewrite this into a union of projects; each of these projects would contain a single table row. We apply better type coercion rules to a union, and we should be able to rewrite this into a local relation during optimization.
[jira] [Commented] (SPARK-17069) Expose spark.range() as table-valued function in SQL
[ https://issues.apache.org/jira/browse/SPARK-17069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15427469#comment-15427469 ] Reynold Xin commented on SPARK-17069: I've also backported this into branch-2.0 since it is a small testing util. > Expose spark.range() as table-valued function in SQL > Key: SPARK-17069 > URL: https://issues.apache.org/jira/browse/SPARK-17069 > Project: Spark > Issue Type: New Feature > Components: SQL > Reporter: Eric Liang > Assignee: Eric Liang > Priority: Minor > Fix For: 2.0.1, 2.1.0 > > The idea here is to create the spark.range( x ) equivalent in SQL, so we can do something like
{noformat}
select count(*) from range(1)
{noformat}
> This would be useful for sql-only testing and benchmarks.
[jira] [Updated] (SPARK-17069) Expose spark.range() as table-valued function in SQL
[ https://issues.apache.org/jira/browse/SPARK-17069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-17069: Fix Version/s: 2.0.1
[jira] [Created] (SPARK-17147) Spark Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets
Robert Conrad created SPARK-17147: Summary: Spark Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets Key: SPARK-17147 URL: https://issues.apache.org/jira/browse/SPARK-17147 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 2.0.0 Reporter: Robert Conrad When Kafka does log compaction, offsets often end up with gaps, meaning the next requested offset will frequently not be offset+1. The logic in KafkaRDD & CachedKafkaConsumer has a baked-in assumption that the next offset will always be exactly 1 above the previous offset. I have worked around this problem by changing CachedKafkaConsumer to use the returned record's offset, from {{nextOffset = offset + 1}} to {{nextOffset = record.offset + 1}}, and changing KafkaRDD from {{requestOffset += 1}} to {{requestOffset = r.offset() + 1}} (I also had to change some assert logic in CachedKafkaConsumer). There's a strong possibility that I have misconstrued how to use the streaming Kafka consumer, and I'm happy to close this out if that's the case. If, however, it is supposed to support non-consecutive offsets (e.g. due to log compaction), I am also happy to contribute a PR.
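For illustration, a simplified sketch of the idea using the plain Kafka 0.10 consumer API rather than Spark's internal classes ({{readRange}}, the topic-partition, and the offset bounds are hypothetical, and empty-poll/timeout handling is omitted):
{code}
import java.util.Collections
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition

def readRange(consumer: KafkaConsumer[String, String],
              tp: TopicPartition, from: Long, until: Long): Vector[String] = {
  consumer.assign(Collections.singletonList(tp))
  consumer.seek(tp, from)
  var nextOffset = from
  val out = Vector.newBuilder[String]
  while (nextOffset < until) {
    val it = consumer.poll(1000L).records(tp).iterator()
    while (it.hasNext && nextOffset < until) {
      val r = it.next()
      out += r.value()
      // Advance from the record actually returned, not nextOffset + 1:
      // compacted topics legitimately skip offsets.
      nextOffset = r.offset() + 1
    }
  }
  out.result()
}
{code}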
[jira] [Resolved] (SPARK-16947) Support type coercion and foldable expression for inline tables
[ https://issues.apache.org/jira/browse/SPARK-16947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-16947. Resolution: Fixed
[jira] [Updated] (SPARK-16947) Support type coercion and foldable expression for inline tables
[ https://issues.apache.org/jira/browse/SPARK-16947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-16947: Fix Version/s: 2.1.0
[jira] [Commented] (SPARK-16922) Query with Broadcast Hash join fails due to executor OOM in Spark 2.0
[ https://issues.apache.org/jira/browse/SPARK-16922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15427429#comment-15427429 ] Sital Kedia commented on SPARK-16922: Kryo > Query with Broadcast Hash join fails due to executor OOM in Spark 2.0 > Key: SPARK-16922 > URL: https://issues.apache.org/jira/browse/SPARK-16922 > Project: Spark > Issue Type: Bug > Components: Shuffle > Affects Versions: 2.0.0 > Reporter: Sital Kedia > > A query which used to work in Spark 1.6 fails with executor OOM in 2.0. > Stack trace -
{code}
    at org.apache.spark.unsafe.types.UTF8String.getBytes(UTF8String.java:229)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator$agg_VectorizedHashMap.hash$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator$agg_VectorizedHashMap.findOrInsert(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:161)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
    at org.apache.spark.scheduler.Task.run(Task.scala:85)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
{code}
> Query plan in Spark 1.6
{code}
== Physical Plan ==
TungstenAggregate(key=[field1#101], functions=[(sum((field2#74 / 100.0)),mode=Final,isDistinct=false)], output=[field1#101,field3#3])
+- TungstenExchange hashpartitioning(field1#101,200), None
   +- TungstenAggregate(key=[field1#101], functions=[(sum((field2#74 / 100.0)),mode=Partial,isDistinct=false)], output=[field1#101,sum#111])
      +- Project [field1#101,field2#74]
         +- BroadcastHashJoin [field5#63L], [cast(cast(field4#97 as decimal(20,0)) as bigint)], BuildRight
            :- ConvertToUnsafe
            :  +- HiveTableScan [field2#74,field5#63L], MetastoreRelation foo, table1, Some(a), [(ds#57 >= 2013-10-01),(ds#57 <= 2013-12-31)]
            +- ConvertToUnsafe
               +- HiveTableScan [field1#101,field4#97], MetastoreRelation foo, table2, Some(b)
{code}
> Query plan in 2.0
{code}
== Physical Plan ==
*HashAggregate(keys=[field1#160], functions=[sum((field2#133 / 100.0))])
+- Exchange hashpartitioning(field1#160, 200)
   +- *HashAggregate(keys=[field1#160], functions=[partial_sum((field2#133 / 100.0))])
      +- *Project [field2#133, field1#160]
         +- *BroadcastHashJoin [field5#122L], [cast(cast(field4#156 as decimal(20,0)) as bigint)], Inner, BuildRight
            :- *Filter isnotnull(field5#122L)
            :  +- HiveTableScan [field5#122L, field2#133], MetastoreRelation foo, table1, a, [isnotnull(ds#116), (ds#116 >= 2013-10-01), (ds#116 <= 2013-12-31)]
            +- BroadcastExchange HashedRelationBroadcastMode(List(cast(cast(input[0, string, false] as decimal(20,0)) as bigint)))
               +- *Filter isnotnull(field4#156)
                  +- HiveTableScan [field4#156, field1#160], MetastoreRelation foo, table2, b
{code}
[jira] [Commented] (SPARK-17090) Make tree aggregation level in linear/logistic regression configurable
[ https://issues.apache.org/jira/browse/SPARK-17090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15427425#comment-15427425 ] Qian Huang commented on SPARK-17090: Gotcha. I will do the API first. > Make tree aggregation level in linear/logistic regression configurable > -- > > Key: SPARK-17090 > URL: https://issues.apache.org/jira/browse/SPARK-17090 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Seth Hendrickson >Priority: Minor > > Linear/logistic regression use treeAggregate with default aggregation depth > for collecting coefficient gradient updates to the driver. For high > dimensional problems, this can cause an OOM error on the driver. We should make > it configurable, perhaps via an expert param, so that users can avoid this > problem if their data has many features. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
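A minimal Scala sketch of the mechanism behind this proposal (not the MLlib internals): {{RDD.treeAggregate}} already takes a {{depth}} argument, and the idea is essentially to surface it as an expert param. The object name, partition count and dimension below are made up for illustration.
{code}
import org.apache.spark.sql.SparkSession

object TreeAggDepthSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("treeAggregate-depth").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    val dim = 1000                          // stand-in for the number of features
    val vectors = sc.parallelize(1 to 10000, numSlices = 100)
      .map(i => Array.fill(dim)(i.toDouble))

    // depth = 2 is the RDD default; a larger depth adds intermediate combine
    // stages, so fewer partial aggregates reach the driver at the same time.
    val summed = vectors.treeAggregate(new Array[Double](dim))(
      (acc, v) => { var i = 0; while (i < dim) { acc(i) += v(i); i += 1 }; acc },
      (a, b) => { var i = 0; while (i < dim) { a(i) += b(i); i += 1 }; a },
      depth = 4)

    println(summed.take(3).mkString(", "))
    spark.stop()
  }
}
{code}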
[jira] [Commented] (SPARK-17141) MinMaxScaler behaves weird when min and max have the same value and some values are NaN
[ https://issues.apache.org/jira/browse/SPARK-17141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15427409#comment-15427409 ] Alberto Bonsanto commented on SPARK-17141: -- Raw data:
| id|chicken|jam|roast beef|
|  1|NaN|2.0| 2.0|
|  2|2.0|0.0| 2.0|
|  3|NaN|0.0| 2.0|
|  4|2.0|1.0| -2.0|
|  5|2.0|2.0| 2.0|
|  6|2.0|2.0| NaN|
After assembling and normalizing, as you can see, {{Double.NaN}} values are replaced with {{0.5}}:
|id |chicken|jam|roast beef|features |featuresNorm |
|1 |NaN|2.0|2.0 |[NaN,2.0,2.0] |[0.5,1.0,1.0]|
|2 |2.0|0.0|2.0 |[2.0,0.0,2.0] |[0.5,0.0,1.0]|
|3 |NaN|0.0|2.0 |[NaN,0.0,2.0] |[0.5,0.0,1.0]|
|4 |2.0|1.0|-2.0 |[2.0,1.0,-2.0]|[0.5,0.5,0.0]|
|5 |2.0|2.0|2.0 |[2.0,2.0,2.0] |[0.5,1.0,1.0]|
|6 |2.0|2.0|NaN |[2.0,2.0,NaN] |[0.5,1.0,NaN]|
> MinMaxScaler behaves weird when min and max have the same value and some > values are NaN > --- > > Key: SPARK-17141 > URL: https://issues.apache.org/jira/browse/SPARK-17141 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 1.6.2, 2.0.0 > Environment: Databricks Community, Spark 2.0 + Scala 2.10 >Reporter: Alberto Bonsanto >Priority: Trivial > > When you have a {{DataFrame}} with a column named {{features}}, which is a > {{DenseVector}}, and the *maximum* and *minimum* have the same value and some values are > {{Double.NaN}}, they get replaced by 0.5, and they should remain with the same > value, I believe. > I know how to fix it, but I haven't ever made a pull request. You can check > the bug in this > [notebook|https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/2485090270202665/3126465289264547/8589256059752547/latest.html] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
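A short Scala sketch of the setup from the comment above, for reproducing it outside the linked notebook (column names follow the comment, with the space in "roast beef" replaced by an underscore):
{code}
import org.apache.spark.ml.feature.{MinMaxScaler, VectorAssembler}
import org.apache.spark.sql.SparkSession

object MinMaxScalerNaNSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("minmax-nan").master("local[*]").getOrCreate()
    import spark.implicits._

    // Same values as the raw-data table in the comment.
    val df = Seq(
      (1, Double.NaN, 2.0, 2.0),
      (2, 2.0, 0.0, 2.0),
      (3, Double.NaN, 0.0, 2.0),
      (4, 2.0, 1.0, -2.0),
      (5, 2.0, 2.0, 2.0),
      (6, 2.0, 2.0, Double.NaN)
    ).toDF("id", "chicken", "jam", "roast_beef")

    val assembled = new VectorAssembler()
      .setInputCols(Array("chicken", "jam", "roast_beef"))
      .setOutputCol("features")
      .transform(df)

    val scaler = new MinMaxScaler().setInputCol("features").setOutputCol("featuresNorm")
    scaler.fit(assembled).transform(assembled).show(truncate = false)

    spark.stop()
  }
}
{code}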
[jira] [Created] (SPARK-17146) Add RandomizedSearch to the CrossValidator API
Manoj Kumar created SPARK-17146: --- Summary: Add RandomizedSearch to the CrossValidator API Key: SPARK-17146 URL: https://issues.apache.org/jira/browse/SPARK-17146 Project: Spark Issue Type: Improvement Reporter: Manoj Kumar Hi, I would like to add randomized search support for the Cross-Validator API. It should be quite straightforward to add with the present abstractions. Here is the proposed API (names are up for debate). Proposed classes: "ParamSamplerBuilder" or a "ParamRandomizedBuilder" that returns an Array of ParamMaps. Proposed methods: "addBounds", "addSampler", "setNumIter". Code example:
{code}
def sampler(): Double = { Math.pow(10.0, -5 + Random.nextFloat * (5 - (-5))) }

val paramGrid = new ParamRandomizedBuilder()
  .addSampler(lr.regParam, sampler)
  .addBounds(lr.elasticNetParam, 0.0, 1.0)
  .setNumIter(10)
  .build()
{code}
Let me know your thoughts! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
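For comparison, a hedged sketch of how randomized search can already be expressed with the current tuning API, by sampling {{ParamMap}}s directly and handing them to {{CrossValidator}}; {{ParamRandomizedBuilder}} does not exist yet, and the sampler and bounds below are illustrative stand-ins for what the proposed builder would generate.
{code}
import scala.util.Random

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.tuning.CrossValidator

object RandomizedSearchSketch {
  // Log-uniform sample over roughly [1e-5, 1e5), as in the proposal's sampler.
  def regParamSample(): Double = math.pow(10.0, -5 + Random.nextDouble() * 10)

  def main(args: Array[String]): Unit = {
    val lr = new LogisticRegression()
    val numIter = 10

    // What the proposed builder would emit: randomly sampled ParamMaps.
    val paramMaps: Array[ParamMap] = Array.fill(numIter) {
      ParamMap(
        lr.regParam -> regParamSample(),
        lr.elasticNetParam -> Random.nextDouble())   // uniform in [0, 1)
    }

    val cv = new CrossValidator()
      .setEstimator(lr)
      .setEvaluator(new BinaryClassificationEvaluator())
      .setEstimatorParamMaps(paramMaps)
      .setNumFolds(3)
    // cv.fit(trainingData) would then select the best of the sampled settings.
  }
}
{code}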
[jira] [Commented] (SPARK-17143) pyspark unable to create UDF: java.lang.RuntimeException: org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a directory: /tmp tmp
[ https://issues.apache.org/jira/browse/SPARK-17143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15427394#comment-15427394 ] Andrew Davidson commented on SPARK-17143: - See email from user's group. I was able to find a work around. Not sure how hdfs:///tmp/ got created or how the permissions got messed up ## NICE CATCH!!! Many thanks. I spent all day on this bug The error msg report /tmp. I did not think to look on hdfs. [ec2-user@ip-172-31-22-140 notebooks]$ hadoop fs -ls hdfs:///tmp/ Found 1 items -rw-r--r-- 3 ec2-user supergroup418 2016-04-13 22:49 hdfs:///tmp [ec2-user@ip-172-31-22-140 notebooks]$ I have no idea how hdfs:///tmp got created. I deleted it. This causes a bunch of exceptions. These exceptions has useful message. I was able to fix the problem as follows $ hadoop fs -rmr hdfs:///tmp Now I run the notebook. It creates hdfs:///tmp/hive but the permission are wrong $ hadoop fs -chmod 777 hdfs:///tmp/hive From: Felix Cheung Date: Thursday, August 18, 2016 at 3:37 PM To: Andrew Davidson , "user @spark" Subject: Re: pyspark unable to create UDF: java.lang.RuntimeException: org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a directory: /tmp tmp Do you have a file called tmp at / on HDFS? > pyspark unable to create UDF: java.lang.RuntimeException: > org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a > directory: /tmp tmp > --- > > Key: SPARK-17143 > URL: https://issues.apache.org/jira/browse/SPARK-17143 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.1 > Environment: spark version: 1.6.1 > python version: 3.4.3 (default, Apr 1 2015, 18:10:40) > [GCC 4.8.2 20140120 (Red Hat 4.8.2-16)] >Reporter: Andrew Davidson > Attachments: udfBug.html, udfBug.ipynb > > > For unknown reason I can not create UDF when I run the attached notebook on > my cluster. I get the following error > Py4JJavaError: An error occurred while calling > None.org.apache.spark.sql.hive.HiveContext. > : java.lang.RuntimeException: > org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a > directory: /tmp tmp > The notebook runs fine on my Mac > In general I am able to run non UDF spark code with out any trouble > I start the notebook server as the user “ec2-user" and uses master URL > spark://ec2-51-215-120-63.us-west-1.compute.amazonaws.com:6066 > I found the following message in the notebook server log file. I have log > level set to warn > 16/08/18 21:38:45 WARN ObjectStore: Version information not found in > metastore. 
hive.metastore.schema.verification is not enabled so recording the > schema version 1.2.0 > 16/08/18 21:38:45 WARN ObjectStore: Failed to get database default, returning > NoSuchObjectException > The cluster was originally created using > spark-1.6.1-bin-hadoop2.6/ec2/spark-ec2 > #from pyspark.sql import SQLContext, HiveContext > #sqlContext = SQLContext(sc) > > #from pyspark.sql import DataFrame > #from pyspark.sql import functions > > from pyspark.sql.types import StringType > from pyspark.sql.functions import udf > > print("spark version: {}".format(sc.version)) > > import sys > print("python version: {}".format(sys.version)) > spark version: 1.6.1 > python version: 3.4.3 (default, Apr 1 2015, 18:10:40) > [GCC 4.8.2 20140120 (Red Hat 4.8.2-16)] > # functions.lower() raises > # py4j.Py4JException: Method lower([class java.lang.String]) does not exist > # work around define a UDF > toLowerUDFRetType = StringType() > #toLowerUDF = udf(lambda s : s.lower(), toLowerUDFRetType) > toLowerUDF = udf(lambda s : s.lower(), StringType()) > You must build Spark with Hive. Export 'SPARK_HIVE=true' and run build/sbt > assembly > Py4JJavaErrorTraceback (most recent call last) > in () > 4 toLowerUDFRetType = StringType() > 5 #toLowerUDF = udf(lambda s : s.lower(), toLowerUDFRetType) > > 6 toLowerUDF = udf(lambda s : s.lower(), StringType()) > /root/spark/python/pyspark/sql/functions.py in udf(f, returnType) >1595 [Row(slen=5), Row(slen=3)] >1596 """ > -> 1597 return UserDefinedFunction(f, returnType) >1598 >1599 blacklist = ['map', 'since', 'ignore_unicode_prefix'] > /root/spark/python/pyspark/sql/functions.py in __init__(self, func, > returnType, name) >1556 self.returnType = returnType >1557 self._broadcast = None > -> 1558 self._judf = self._create_judf(name) >1559 >1560 def _create_judf(self, name): > /root/spark/python/pyspark/sql/functions.py in _create_judf(self, name) >1567 pickled_command, broadcast_vars, env, includes =
[jira] [Created] (SPARK-17145) Object with many fields causes Seq Serialization Bug
Abdulla Al-Qawasmeh created SPARK-17145: --- Summary: Object with many fields causes Seq Serialization Bug Key: SPARK-17145 URL: https://issues.apache.org/jira/browse/SPARK-17145 Project: Spark Issue Type: Bug Affects Versions: 2.0.0 Environment: OS: OSX El Capitan 10.11.6 Reporter: Abdulla Al-Qawasmeh The unit test here (https://gist.github.com/abdulla16/433faf7df59fce11a5fff284bac0d945) describes the problem. It looks like Spark is having problems serializing a Scala Seq when it's part of an object with many fields (I'm not 100% sure it's a serialization problem). The deserialized Seq ends up with as many items as the original Seq; however, all the items are copies of the last item in the original Seq. The object that I used in my unit test (as an example) is a Tuple5. However, I've seen this behavior in other types of objects. Reducing MyClass5 to only two fields (field34 and field35) causes the unit test to pass. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
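A rough Scala sketch of the shape being described (the linked gist is the actual unit test); the {{Wide}} class and its field names are invented here, and the comment in the code only restates the reported symptom rather than asserting it:
{code}
import org.apache.spark.sql.SparkSession

// An object with many fields, one of which is a Seq.
case class Wide(f1: Int, f2: Int, f3: Int, f4: Int, f5: Int,
                f6: Int, f7: Int, f8: Int, f9: Int, f10: Int,
                items: Seq[Int])

object SeqEncoderSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("seq-encoder").master("local[*]").getOrCreate()
    import spark.implicits._

    val row = Wide(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, items = Seq(1, 2, 3))
    val back = Seq(row).toDS().collect().head

    // Per the report, the collected Seq can come back as N copies of its last
    // element when the enclosing object carries many fields.
    println(back.items)
    spark.stop()
  }
}
{code}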
[jira] [Assigned] (SPARK-17144) Removal of useless CreateHiveTableAsSelectLogicalPlan
[ https://issues.apache.org/jira/browse/SPARK-17144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17144: Assignee: (was: Apache Spark) > Removal of useless CreateHiveTableAsSelectLogicalPlan > - > > Key: SPARK-17144 > URL: https://issues.apache.org/jira/browse/SPARK-17144 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Xiao Li > > {{CreateHiveTableAsSelectLogicalPlan}} is dead code after refactoring. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17144) Removal of useless CreateHiveTableAsSelectLogicalPlan
[ https://issues.apache.org/jira/browse/SPARK-17144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15427383#comment-15427383 ] Apache Spark commented on SPARK-17144: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/14707 > Removal of useless CreateHiveTableAsSelectLogicalPlan > - > > Key: SPARK-17144 > URL: https://issues.apache.org/jira/browse/SPARK-17144 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Xiao Li > > {{CreateHiveTableAsSelectLogicalPlan}} is dead code after refactoring. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17144) Removal of useless CreateHiveTableAsSelectLogicalPlan
[ https://issues.apache.org/jira/browse/SPARK-17144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17144: Assignee: Apache Spark > Removal of useless CreateHiveTableAsSelectLogicalPlan > - > > Key: SPARK-17144 > URL: https://issues.apache.org/jira/browse/SPARK-17144 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Xiao Li >Assignee: Apache Spark > > {{CreateHiveTableAsSelectLogicalPlan}} is dead code after refactoring. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17144) Removal of useless CreateHiveTableAsSelectLogicalPlan
Xiao Li created SPARK-17144: --- Summary: Removal of useless CreateHiveTableAsSelectLogicalPlan Key: SPARK-17144 URL: https://issues.apache.org/jira/browse/SPARK-17144 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.1.0 Reporter: Xiao Li {{CreateHiveTableAsSelectLogicalPlan}} is dead code after refactoring. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17081) Empty strings not preserved which causes SQLException: mismatching column value count
[ https://issues.apache.org/jira/browse/SPARK-17081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15427380#comment-15427380 ] Xiao Li commented on SPARK-17081: - Can you try to reproduce it in Spark 2.0? Thanks! > Empty strings not preserved which causes SQLException: mismatching column > value count > - > > Key: SPARK-17081 > URL: https://issues.apache.org/jira/browse/SPARK-17081 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.1 >Reporter: Ian Hellstrom > Labels: dataframe, empty, jdbc, null, sql > > When writing a DataFrame that contains empty strings as values to an RDBMS, > the query that is generated does not have the correct column count: > {code} > CREATE TABLE demo(foo INTEGER, bar VARCHAR(10)); > - > case class Record(foo: Int, bar: String) > val data = sc.parallelize(List(Record(1, ""))).toDF > data.write.mode("append").jdbc(...) > {code} > This causes: > {code} > java.sql.SQLException: Column count doesn't match value count at row 1 > {code} > Proposal: leave empty strings as they are or convert these to NULL (although > that may not be what's intended by the user, so make this configurable). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
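A hedged Scala sketch of the NULL-conversion side of the proposal, applied before the JDBC write; the connection URL, table name and credentials are placeholders:
{code}
import java.util.Properties

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, when}

object EmptyStringJdbcSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("empty-string-jdbc").master("local[*]").getOrCreate()
    import spark.implicits._

    val data = Seq((1, "")).toDF("foo", "bar")

    // Convert empty strings to NULL before writing; whether this (or keeping
    // the empty string) is the right behavior is exactly the configurability
    // question raised in the issue.
    val cleaned = data.withColumn("bar", when(col("bar") === "", null).otherwise(col("bar")))

    val props = new Properties()
    props.setProperty("user", "demo")
    props.setProperty("password", "demo")
    cleaned.write.mode("append").jdbc("jdbc:mysql://localhost:3306/demo", "demo", props)
  }
}
{code}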
[jira] [Commented] (SPARK-17143) pyspark unable to create UDF: java.lang.RuntimeException: org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a directory: /tmp tmp
[ https://issues.apache.org/jira/browse/SPARK-17143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15427278#comment-15427278 ] Andrew Davidson commented on SPARK-17143: - given the exception metioned an issue with /tmp I decide to track how /tmp changed when run my cell # no spark jobs are running [ec2-user@ip-172-31-22-140 notebooks]$ !ls ls /tmp/ hsperfdata_ec2-user hsperfdata_root pip_build_ec2-user [ec2-user@ip-172-31-22-140 notebooks]$ # start notebook server $ nohup startIPythonNotebook.sh > startIPythonNotebook.sh.out & [ec2-user@ip-172-31-22-140 notebooks]$ !ls ls /tmp/ hsperfdata_ec2-user hsperfdata_root pip_build_ec2-user [ec2-user@ip-172-31-22-140 notebooks]$ # start the udfBug notebook [ec2-user@ip-172-31-22-140 notebooks]$ ls /tmp/ hsperfdata_ec2-user hsperfdata_root libnetty-transport-native-epoll818283657820702.so pip_build_ec2-user [ec2-user@ip-172-31-22-140 notebooks]$ # execute cell that define UDF [ec2-user@ip-172-31-22-140 notebooks]$ ls /tmp/ hsperfdata_ec2-user hsperfdata_root libnetty-transport-native-epoll818283657820702.so pip_build_ec2-user spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9 [ec2-user@ip-172-31-22-140 notebooks]$ [ec2-user@ip-172-31-22-140 notebooks]$ find /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/ /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/ /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/db.lck /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/log /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/log/log.ctrl /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/log/log1.dat /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/log/README_DO_NOT_TOUCH_FILES.txt /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/log/logmirror.ctrl /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/service.properties /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/README_DO_NOT_TOUCH_FILES.txt /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0 /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c230.dat /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c4b0.dat /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c241.dat /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c3a1.dat /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c180.dat /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c2b1.dat /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c7b1.dat /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c311.dat /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c880.dat /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c541.dat /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c9f1.dat /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c20.dat /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c590.dat /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c721.dat /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c470.dat /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c441.dat /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c8e1.dat /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c361.dat /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/ca1.dat /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c421.dat 
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c331.dat /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c461.dat /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c5d0.dat /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c851.dat /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c621.dat /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c101.dat /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c3d1.dat /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c891.dat /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c1b1.dat /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c641.dat /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c871.dat /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c6a1.dat /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/cb1.dat /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/ca01.dat /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c391.dat /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c7f1.dat /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c1a1.dat /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c41.dat /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c990.dat /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63
[jira] [Commented] (SPARK-16922) Query with Broadcast Hash join fails due to executor OOM in Spark 2.0
[ https://issues.apache.org/jira/browse/SPARK-16922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15427264#comment-15427264 ] Davies Liu commented on SPARK-16922: Which serializer are you using? java serializer or Kyro? > Query with Broadcast Hash join fails due to executor OOM in Spark 2.0 > - > > Key: SPARK-16922 > URL: https://issues.apache.org/jira/browse/SPARK-16922 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 2.0.0 >Reporter: Sital Kedia > > A query which used to work in Spark 1.6 fails with executor OOM in 2.0. > Stack trace - > {code} > at > org.apache.spark.unsafe.types.UTF8String.getBytes(UTF8String.java:229) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator$agg_VectorizedHashMap.hash$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator$agg_VectorizedHashMap.findOrInsert(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:161) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} > Query plan in Spark 1.6 > {code} > == Physical Plan == > TungstenAggregate(key=[field1#101], functions=[(sum((field2#74 / > 100.0)),mode=Final,isDistinct=false)], output=[field1#101,field3#3]) > +- TungstenExchange hashpartitioning(field1#101,200), None >+- TungstenAggregate(key=[field1#101], functions=[(sum((field2#74 / > 100.0)),mode=Partial,isDistinct=false)], output=[field1#101,sum#111]) > +- Project [field1#101,field2#74] > +- BroadcastHashJoin [field5#63L], [cast(cast(field4#97 as > decimal(20,0)) as bigint)], BuildRight > :- ConvertToUnsafe > : +- HiveTableScan [field2#74,field5#63L], MetastoreRelation > foo, table1, Some(a), [(ds#57 >= 2013-10-01),(ds#57 <= 2013-12-31)] > +- ConvertToUnsafe >+- HiveTableScan [field1#101,field4#97], MetastoreRelation > foo, table2, Some(b) > {code} > Query plan in 2.0 > {code} > == Physical Plan == > *HashAggregate(keys=[field1#160], functions=[sum((field2#133 / 100.0))]) > +- Exchange hashpartitioning(field1#160, 200) >+- *HashAggregate(keys=[field1#160], functions=[partial_sum((field2#133 / > 100.0))]) > +- *Project [field2#133, field1#160] > +- *BroadcastHashJoin [field5#122L], [cast(cast(field4#156 as > decimal(20,0)) as bigint)], Inner, BuildRight > :- *Filter isnotnull(field5#122L) > : +- HiveTableScan [field5#122L, field2#133], MetastoreRelation > foo, table1, a, [isnotnull(ds#116), (ds#116 >= 2013-10-01), (ds#116 <= > 2013-12-31)] > +- BroadcastExchange > 
HashedRelationBroadcastMode(List(cast(cast(input[0, string, false] as > decimal(20,0)) as bigint))) >+- *Filter isnotnull(field4#156) > +- HiveTableScan [field4#156, field1#160], > MetastoreRelation foo, table2, b > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
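For context on the serializer question above, a minimal sketch of how the serializer is chosen: {{spark.serializer}} defaults to the Java serializer, and Kryo has to be enabled explicitly when the session (or SparkConf) is built. The app name and master below are illustrative.
{code}
import org.apache.spark.sql.SparkSession

object SerializerConfigSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("serializer-config")
      .master("local[*]")
      .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .getOrCreate()

    // Prints the serializer actually in effect for this application.
    println(spark.sparkContext.getConf.get("spark.serializer"))
    spark.stop()
  }
}
{code}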
[jira] [Updated] (SPARK-17143) pyspark unable to create UDF: java.lang.RuntimeException: org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a directory: /tmp tmp
[ https://issues.apache.org/jira/browse/SPARK-17143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Davidson updated SPARK-17143: Attachment: udfBug.html This html version of the notebook shows the output when run in my data center > pyspark unable to create UDF: java.lang.RuntimeException: > org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a > directory: /tmp tmp > --- > > Key: SPARK-17143 > URL: https://issues.apache.org/jira/browse/SPARK-17143 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.1 > Environment: spark version: 1.6.1 > python version: 3.4.3 (default, Apr 1 2015, 18:10:40) > [GCC 4.8.2 20140120 (Red Hat 4.8.2-16)] >Reporter: Andrew Davidson > Attachments: udfBug.html, udfBug.ipynb > > > For unknown reason I can not create UDF when I run the attached notebook on > my cluster. I get the following error > Py4JJavaError: An error occurred while calling > None.org.apache.spark.sql.hive.HiveContext. > : java.lang.RuntimeException: > org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a > directory: /tmp tmp > The notebook runs fine on my Mac > In general I am able to run non UDF spark code with out any trouble > I start the notebook server as the user “ec2-user" and uses master URL > spark://ec2-51-215-120-63.us-west-1.compute.amazonaws.com:6066 > I found the following message in the notebook server log file. I have log > level set to warn > 16/08/18 21:38:45 WARN ObjectStore: Version information not found in > metastore. hive.metastore.schema.verification is not enabled so recording the > schema version 1.2.0 > 16/08/18 21:38:45 WARN ObjectStore: Failed to get database default, returning > NoSuchObjectException > The cluster was originally created using > spark-1.6.1-bin-hadoop2.6/ec2/spark-ec2 > #from pyspark.sql import SQLContext, HiveContext > #sqlContext = SQLContext(sc) > > #from pyspark.sql import DataFrame > #from pyspark.sql import functions > > from pyspark.sql.types import StringType > from pyspark.sql.functions import udf > > print("spark version: {}".format(sc.version)) > > import sys > print("python version: {}".format(sys.version)) > spark version: 1.6.1 > python version: 3.4.3 (default, Apr 1 2015, 18:10:40) > [GCC 4.8.2 20140120 (Red Hat 4.8.2-16)] > # functions.lower() raises > # py4j.Py4JException: Method lower([class java.lang.String]) does not exist > # work around define a UDF > toLowerUDFRetType = StringType() > #toLowerUDF = udf(lambda s : s.lower(), toLowerUDFRetType) > toLowerUDF = udf(lambda s : s.lower(), StringType()) > You must build Spark with Hive. 
Export 'SPARK_HIVE=true' and run build/sbt > assembly > Py4JJavaErrorTraceback (most recent call last) > in () > 4 toLowerUDFRetType = StringType() > 5 #toLowerUDF = udf(lambda s : s.lower(), toLowerUDFRetType) > > 6 toLowerUDF = udf(lambda s : s.lower(), StringType()) > /root/spark/python/pyspark/sql/functions.py in udf(f, returnType) >1595 [Row(slen=5), Row(slen=3)] >1596 """ > -> 1597 return UserDefinedFunction(f, returnType) >1598 >1599 blacklist = ['map', 'since', 'ignore_unicode_prefix'] > /root/spark/python/pyspark/sql/functions.py in __init__(self, func, > returnType, name) >1556 self.returnType = returnType >1557 self._broadcast = None > -> 1558 self._judf = self._create_judf(name) >1559 >1560 def _create_judf(self, name): > /root/spark/python/pyspark/sql/functions.py in _create_judf(self, name) >1567 pickled_command, broadcast_vars, env, includes = > _prepare_for_python_RDD(sc, command, self) >1568 ctx = SQLContext.getOrCreate(sc) > -> 1569 jdt = ctx._ssql_ctx.parseDataType(self.returnType.json()) >1570 if name is None: >1571 name = f.__name__ if hasattr(f, '__name__') else > f.__class__.__name__ > /root/spark/python/pyspark/sql/context.py in _ssql_ctx(self) > 681 try: > 682 if not hasattr(self, '_scala_HiveContext'): > --> 683 self._scala_HiveContext = self._get_hive_ctx() > 684 return self._scala_HiveContext > 685 except Py4JError as e: > /root/spark/python/pyspark/sql/context.py in _get_hive_ctx(self) > 690 > 691 def _get_hive_ctx(self): > --> 692 return self._jvm.HiveContext(self._jsc.sc()) > 693 > 694 def refreshTable(self, tableName): > /root/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py in > __call__(self, *args) >1062 answer = self._gateway_client.send_command(command) >1063 return_value = get_return_value( > -> 10
[jira] [Commented] (SPARK-16922) Query with Broadcast Hash join fails due to executor OOM in Spark 2.0
[ https://issues.apache.org/jira/browse/SPARK-16922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15427259#comment-15427259 ] Sital Kedia commented on SPARK-16922: - >> Could you also try to disable the dense mode? I tried disabling the dense mode, that did not help either. > Query with Broadcast Hash join fails due to executor OOM in Spark 2.0 > - > > Key: SPARK-16922 > URL: https://issues.apache.org/jira/browse/SPARK-16922 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 2.0.0 >Reporter: Sital Kedia > > A query which used to work in Spark 1.6 fails with executor OOM in 2.0. > Stack trace - > {code} > at > org.apache.spark.unsafe.types.UTF8String.getBytes(UTF8String.java:229) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator$agg_VectorizedHashMap.hash$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator$agg_VectorizedHashMap.findOrInsert(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:161) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} > Query plan in Spark 1.6 > {code} > == Physical Plan == > TungstenAggregate(key=[field1#101], functions=[(sum((field2#74 / > 100.0)),mode=Final,isDistinct=false)], output=[field1#101,field3#3]) > +- TungstenExchange hashpartitioning(field1#101,200), None >+- TungstenAggregate(key=[field1#101], functions=[(sum((field2#74 / > 100.0)),mode=Partial,isDistinct=false)], output=[field1#101,sum#111]) > +- Project [field1#101,field2#74] > +- BroadcastHashJoin [field5#63L], [cast(cast(field4#97 as > decimal(20,0)) as bigint)], BuildRight > :- ConvertToUnsafe > : +- HiveTableScan [field2#74,field5#63L], MetastoreRelation > foo, table1, Some(a), [(ds#57 >= 2013-10-01),(ds#57 <= 2013-12-31)] > +- ConvertToUnsafe >+- HiveTableScan [field1#101,field4#97], MetastoreRelation > foo, table2, Some(b) > {code} > Query plan in 2.0 > {code} > == Physical Plan == > *HashAggregate(keys=[field1#160], functions=[sum((field2#133 / 100.0))]) > +- Exchange hashpartitioning(field1#160, 200) >+- *HashAggregate(keys=[field1#160], functions=[partial_sum((field2#133 / > 100.0))]) > +- *Project [field2#133, field1#160] > +- *BroadcastHashJoin [field5#122L], [cast(cast(field4#156 as > decimal(20,0)) as bigint)], Inner, BuildRight > :- *Filter isnotnull(field5#122L) > : +- HiveTableScan [field5#122L, field2#133], MetastoreRelation > foo, table1, a, [isnotnull(ds#116), (ds#116 >= 2013-10-01), (ds#116 <= > 2013-12-31)] 
> +- BroadcastExchange > HashedRelationBroadcastMode(List(cast(cast(input[0, string, false] as > decimal(20,0)) as bigint))) >+- *Filter isnotnull(field4#156) > +- HiveTableScan [field4#156, field1#160], > MetastoreRelation foo, table2, b > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17143) pyspark unable to create UDF: java.lang.RuntimeException: org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a directory: /tmp tmp
[ https://issues.apache.org/jira/browse/SPARK-17143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Davidson updated SPARK-17143: Attachment: udfBug.ipynb The attached notebook demonstrated the reported bug. Note it includes the output when run on my mac book pro. The bug report contains the stack trace when the same code is run in my data center > pyspark unable to create UDF: java.lang.RuntimeException: > org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a > directory: /tmp tmp > --- > > Key: SPARK-17143 > URL: https://issues.apache.org/jira/browse/SPARK-17143 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.1 > Environment: spark version: 1.6.1 > python version: 3.4.3 (default, Apr 1 2015, 18:10:40) > [GCC 4.8.2 20140120 (Red Hat 4.8.2-16)] >Reporter: Andrew Davidson > Attachments: udfBug.ipynb > > > For unknown reason I can not create UDF when I run the attached notebook on > my cluster. I get the following error > Py4JJavaError: An error occurred while calling > None.org.apache.spark.sql.hive.HiveContext. > : java.lang.RuntimeException: > org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a > directory: /tmp tmp > The notebook runs fine on my Mac > In general I am able to run non UDF spark code with out any trouble > I start the notebook server as the user “ec2-user" and uses master URL > spark://ec2-51-215-120-63.us-west-1.compute.amazonaws.com:6066 > I found the following message in the notebook server log file. I have log > level set to warn > 16/08/18 21:38:45 WARN ObjectStore: Version information not found in > metastore. hive.metastore.schema.verification is not enabled so recording the > schema version 1.2.0 > 16/08/18 21:38:45 WARN ObjectStore: Failed to get database default, returning > NoSuchObjectException > The cluster was originally created using > spark-1.6.1-bin-hadoop2.6/ec2/spark-ec2 > #from pyspark.sql import SQLContext, HiveContext > #sqlContext = SQLContext(sc) > > #from pyspark.sql import DataFrame > #from pyspark.sql import functions > > from pyspark.sql.types import StringType > from pyspark.sql.functions import udf > > print("spark version: {}".format(sc.version)) > > import sys > print("python version: {}".format(sys.version)) > spark version: 1.6.1 > python version: 3.4.3 (default, Apr 1 2015, 18:10:40) > [GCC 4.8.2 20140120 (Red Hat 4.8.2-16)] > # functions.lower() raises > # py4j.Py4JException: Method lower([class java.lang.String]) does not exist > # work around define a UDF > toLowerUDFRetType = StringType() > #toLowerUDF = udf(lambda s : s.lower(), toLowerUDFRetType) > toLowerUDF = udf(lambda s : s.lower(), StringType()) > You must build Spark with Hive. 
Export 'SPARK_HIVE=true' and run build/sbt > assembly > Py4JJavaErrorTraceback (most recent call last) > in () > 4 toLowerUDFRetType = StringType() > 5 #toLowerUDF = udf(lambda s : s.lower(), toLowerUDFRetType) > > 6 toLowerUDF = udf(lambda s : s.lower(), StringType()) > /root/spark/python/pyspark/sql/functions.py in udf(f, returnType) >1595 [Row(slen=5), Row(slen=3)] >1596 """ > -> 1597 return UserDefinedFunction(f, returnType) >1598 >1599 blacklist = ['map', 'since', 'ignore_unicode_prefix'] > /root/spark/python/pyspark/sql/functions.py in __init__(self, func, > returnType, name) >1556 self.returnType = returnType >1557 self._broadcast = None > -> 1558 self._judf = self._create_judf(name) >1559 >1560 def _create_judf(self, name): > /root/spark/python/pyspark/sql/functions.py in _create_judf(self, name) >1567 pickled_command, broadcast_vars, env, includes = > _prepare_for_python_RDD(sc, command, self) >1568 ctx = SQLContext.getOrCreate(sc) > -> 1569 jdt = ctx._ssql_ctx.parseDataType(self.returnType.json()) >1570 if name is None: >1571 name = f.__name__ if hasattr(f, '__name__') else > f.__class__.__name__ > /root/spark/python/pyspark/sql/context.py in _ssql_ctx(self) > 681 try: > 682 if not hasattr(self, '_scala_HiveContext'): > --> 683 self._scala_HiveContext = self._get_hive_ctx() > 684 return self._scala_HiveContext > 685 except Py4JError as e: > /root/spark/python/pyspark/sql/context.py in _get_hive_ctx(self) > 690 > 691 def _get_hive_ctx(self): > --> 692 return self._jvm.HiveContext(self._jsc.sc()) > 693 > 694 def refreshTable(self, tableName): > /root/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py in > __call__(self, *args) >1062 answ
[jira] [Commented] (SPARK-16922) Query with Broadcast Hash join fails due to executor OOM in Spark 2.0
[ https://issues.apache.org/jira/browse/SPARK-16922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15427250#comment-15427250 ] Sital Kedia commented on SPARK-16922: - The failure is deterministic, we are reproducing the issue for every run of the job (Its not only one job, there are multiple jobs that are failing because of this). For now, we have made a change to not use the LongHashedRelation to workaround this issue. > Query with Broadcast Hash join fails due to executor OOM in Spark 2.0 > - > > Key: SPARK-16922 > URL: https://issues.apache.org/jira/browse/SPARK-16922 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 2.0.0 >Reporter: Sital Kedia > > A query which used to work in Spark 1.6 fails with executor OOM in 2.0. > Stack trace - > {code} > at > org.apache.spark.unsafe.types.UTF8String.getBytes(UTF8String.java:229) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator$agg_VectorizedHashMap.hash$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator$agg_VectorizedHashMap.findOrInsert(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:161) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} > Query plan in Spark 1.6 > {code} > == Physical Plan == > TungstenAggregate(key=[field1#101], functions=[(sum((field2#74 / > 100.0)),mode=Final,isDistinct=false)], output=[field1#101,field3#3]) > +- TungstenExchange hashpartitioning(field1#101,200), None >+- TungstenAggregate(key=[field1#101], functions=[(sum((field2#74 / > 100.0)),mode=Partial,isDistinct=false)], output=[field1#101,sum#111]) > +- Project [field1#101,field2#74] > +- BroadcastHashJoin [field5#63L], [cast(cast(field4#97 as > decimal(20,0)) as bigint)], BuildRight > :- ConvertToUnsafe > : +- HiveTableScan [field2#74,field5#63L], MetastoreRelation > foo, table1, Some(a), [(ds#57 >= 2013-10-01),(ds#57 <= 2013-12-31)] > +- ConvertToUnsafe >+- HiveTableScan [field1#101,field4#97], MetastoreRelation > foo, table2, Some(b) > {code} > Query plan in 2.0 > {code} > == Physical Plan == > *HashAggregate(keys=[field1#160], functions=[sum((field2#133 / 100.0))]) > +- Exchange hashpartitioning(field1#160, 200) >+- *HashAggregate(keys=[field1#160], functions=[partial_sum((field2#133 / > 100.0))]) > +- *Project [field2#133, field1#160] > +- *BroadcastHashJoin [field5#122L], [cast(cast(field4#156 as > decimal(20,0)) as bigint)], Inner, BuildRight > :- *Filter isnotnull(field5#122L) > : 
+- HiveTableScan [field5#122L, field2#133], MetastoreRelation > foo, table1, a, [isnotnull(ds#116), (ds#116 >= 2013-10-01), (ds#116 <= > 2013-12-31)] > +- BroadcastExchange > HashedRelationBroadcastMode(List(cast(cast(input[0, string, false] as > decimal(20,0)) as bigint))) >+- *Filter isnotnull(field4#156) > +- HiveTableScan [field4#156, field1#160], > MetastoreRelation foo, table2, b > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
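Not the reporter's fix (their change avoids {{LongHashedRelation}} inside Spark itself); a configuration-level way to sidestep the broadcast hash join path entirely, shown here only as a sketch, is to disable auto-broadcast so the planner falls back to a sort-merge join:
{code}
import org.apache.spark.sql.SparkSession

object DisableBroadcastJoinSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("no-broadcast-join").master("local[*]").getOrCreate()

    // -1 disables auto-broadcast, so joins that would have used
    // BroadcastHashJoin are planned as SortMergeJoin instead.
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1L)

    spark.stop()
  }
}
{code}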
[jira] [Created] (SPARK-17143) pyspark unable to create UDF: java.lang.RuntimeException: org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a directory: /tmp tmp
Andrew Davidson created SPARK-17143: --- Summary: pyspark unable to create UDF: java.lang.RuntimeException: org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a directory: /tmp tmp Key: SPARK-17143 URL: https://issues.apache.org/jira/browse/SPARK-17143 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.6.1 Environment: spark version: 1.6.1 python version: 3.4.3 (default, Apr 1 2015, 18:10:40) [GCC 4.8.2 20140120 (Red Hat 4.8.2-16)] Reporter: Andrew Davidson For unknown reason I can not create UDF when I run the attached notebook on my cluster. I get the following error Py4JJavaError: An error occurred while calling None.org.apache.spark.sql.hive.HiveContext. : java.lang.RuntimeException: org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a directory: /tmp tmp The notebook runs fine on my Mac In general I am able to run non UDF spark code with out any trouble I start the notebook server as the user “ec2-user" and uses master URL spark://ec2-51-215-120-63.us-west-1.compute.amazonaws.com:6066 I found the following message in the notebook server log file. I have log level set to warn 16/08/18 21:38:45 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0 16/08/18 21:38:45 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException The cluster was originally created using spark-1.6.1-bin-hadoop2.6/ec2/spark-ec2 #from pyspark.sql import SQLContext, HiveContext #sqlContext = SQLContext(sc) #from pyspark.sql import DataFrame #from pyspark.sql import functions from pyspark.sql.types import StringType from pyspark.sql.functions import udf print("spark version: {}".format(sc.version)) import sys print("python version: {}".format(sys.version)) spark version: 1.6.1 python version: 3.4.3 (default, Apr 1 2015, 18:10:40) [GCC 4.8.2 20140120 (Red Hat 4.8.2-16)] # functions.lower() raises # py4j.Py4JException: Method lower([class java.lang.String]) does not exist # work around define a UDF toLowerUDFRetType = StringType() #toLowerUDF = udf(lambda s : s.lower(), toLowerUDFRetType) toLowerUDF = udf(lambda s : s.lower(), StringType()) You must build Spark with Hive. 
Export 'SPARK_HIVE=true' and run build/sbt assembly Py4JJavaErrorTraceback (most recent call last) in () 4 toLowerUDFRetType = StringType() 5 #toLowerUDF = udf(lambda s : s.lower(), toLowerUDFRetType) > 6 toLowerUDF = udf(lambda s : s.lower(), StringType()) /root/spark/python/pyspark/sql/functions.py in udf(f, returnType) 1595 [Row(slen=5), Row(slen=3)] 1596 """ -> 1597 return UserDefinedFunction(f, returnType) 1598 1599 blacklist = ['map', 'since', 'ignore_unicode_prefix'] /root/spark/python/pyspark/sql/functions.py in __init__(self, func, returnType, name) 1556 self.returnType = returnType 1557 self._broadcast = None -> 1558 self._judf = self._create_judf(name) 1559 1560 def _create_judf(self, name): /root/spark/python/pyspark/sql/functions.py in _create_judf(self, name) 1567 pickled_command, broadcast_vars, env, includes = _prepare_for_python_RDD(sc, command, self) 1568 ctx = SQLContext.getOrCreate(sc) -> 1569 jdt = ctx._ssql_ctx.parseDataType(self.returnType.json()) 1570 if name is None: 1571 name = f.__name__ if hasattr(f, '__name__') else f.__class__.__name__ /root/spark/python/pyspark/sql/context.py in _ssql_ctx(self) 681 try: 682 if not hasattr(self, '_scala_HiveContext'): --> 683 self._scala_HiveContext = self._get_hive_ctx() 684 return self._scala_HiveContext 685 except Py4JError as e: /root/spark/python/pyspark/sql/context.py in _get_hive_ctx(self) 690 691 def _get_hive_ctx(self): --> 692 return self._jvm.HiveContext(self._jsc.sc()) 693 694 def refreshTable(self, tableName): /root/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py in __call__(self, *args) 1062 answer = self._gateway_client.send_command(command) 1063 return_value = get_return_value( -> 1064 answer, self._gateway_client, None, self._fqn) 1065 1066 for temp_arg in temp_args: /root/spark/python/pyspark/sql/utils.py in deco(*a, **kw) 43 def deco(*a, **kw): 44 try: ---> 45 return f(*a, **kw) 46 except py4j.protocol.Py4JJavaError as e: 47 s = e.java_exception.toString() /root/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name) 306 raise Py4JJavaError( 307 "An error occurred
[jira] [Comment Edited] (SPARK-16922) Query with Broadcast Hash join fails due to executor OOM in Spark 2.0
[ https://issues.apache.org/jira/browse/SPARK-16922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15427241#comment-15427241 ] Davies Liu edited comment on SPARK-16922 at 8/18/16 9:58 PM: - Is this failure determistic or not? Happened on every task or some or them? Could you also try to disable the dense mode? was (Author: davies): Is this failure determistic or not? Happened on every task or some or them? > Query with Broadcast Hash join fails due to executor OOM in Spark 2.0 > - > > Key: SPARK-16922 > URL: https://issues.apache.org/jira/browse/SPARK-16922 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 2.0.0 >Reporter: Sital Kedia > > A query which used to work in Spark 1.6 fails with executor OOM in 2.0. > Stack trace - > {code} > at > org.apache.spark.unsafe.types.UTF8String.getBytes(UTF8String.java:229) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator$agg_VectorizedHashMap.hash$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator$agg_VectorizedHashMap.findOrInsert(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:161) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} > Query plan in Spark 1.6 > {code} > == Physical Plan == > TungstenAggregate(key=[field1#101], functions=[(sum((field2#74 / > 100.0)),mode=Final,isDistinct=false)], output=[field1#101,field3#3]) > +- TungstenExchange hashpartitioning(field1#101,200), None >+- TungstenAggregate(key=[field1#101], functions=[(sum((field2#74 / > 100.0)),mode=Partial,isDistinct=false)], output=[field1#101,sum#111]) > +- Project [field1#101,field2#74] > +- BroadcastHashJoin [field5#63L], [cast(cast(field4#97 as > decimal(20,0)) as bigint)], BuildRight > :- ConvertToUnsafe > : +- HiveTableScan [field2#74,field5#63L], MetastoreRelation > foo, table1, Some(a), [(ds#57 >= 2013-10-01),(ds#57 <= 2013-12-31)] > +- ConvertToUnsafe >+- HiveTableScan [field1#101,field4#97], MetastoreRelation > foo, table2, Some(b) > {code} > Query plan in 2.0 > {code} > == Physical Plan == > *HashAggregate(keys=[field1#160], functions=[sum((field2#133 / 100.0))]) > +- Exchange hashpartitioning(field1#160, 200) >+- *HashAggregate(keys=[field1#160], functions=[partial_sum((field2#133 / > 100.0))]) > +- *Project [field2#133, field1#160] > +- *BroadcastHashJoin [field5#122L], [cast(cast(field4#156 as > decimal(20,0)) as bigint)], Inner, BuildRight > :- *Filter isnotnull(field5#122L) > : +- 
HiveTableScan [field5#122L, field2#133], MetastoreRelation > foo, table1, a, [isnotnull(ds#116), (ds#116 >= 2013-10-01), (ds#116 <= > 2013-12-31)] > +- BroadcastExchange > HashedRelationBroadcastMode(List(cast(cast(input[0, string, false] as > decimal(20,0)) as bigint))) >+- *Filter isnotnull(field4#156) > +- HiveTableScan [field4#156, field1#160], > MetastoreRelation foo, table2, b > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16922) Query with Broadcast Hash join fails due to executor OOM in Spark 2.0
[ https://issues.apache.org/jira/browse/SPARK-16922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15427241#comment-15427241 ] Davies Liu commented on SPARK-16922: Is this failure determistic or not? Happened on every task or some or them? > Query with Broadcast Hash join fails due to executor OOM in Spark 2.0 > - > > Key: SPARK-16922 > URL: https://issues.apache.org/jira/browse/SPARK-16922 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 2.0.0 >Reporter: Sital Kedia > > A query which used to work in Spark 1.6 fails with executor OOM in 2.0. > Stack trace - > {code} > at > org.apache.spark.unsafe.types.UTF8String.getBytes(UTF8String.java:229) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator$agg_VectorizedHashMap.hash$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator$agg_VectorizedHashMap.findOrInsert(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:161) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} > Query plan in Spark 1.6 > {code} > == Physical Plan == > TungstenAggregate(key=[field1#101], functions=[(sum((field2#74 / > 100.0)),mode=Final,isDistinct=false)], output=[field1#101,field3#3]) > +- TungstenExchange hashpartitioning(field1#101,200), None >+- TungstenAggregate(key=[field1#101], functions=[(sum((field2#74 / > 100.0)),mode=Partial,isDistinct=false)], output=[field1#101,sum#111]) > +- Project [field1#101,field2#74] > +- BroadcastHashJoin [field5#63L], [cast(cast(field4#97 as > decimal(20,0)) as bigint)], BuildRight > :- ConvertToUnsafe > : +- HiveTableScan [field2#74,field5#63L], MetastoreRelation > foo, table1, Some(a), [(ds#57 >= 2013-10-01),(ds#57 <= 2013-12-31)] > +- ConvertToUnsafe >+- HiveTableScan [field1#101,field4#97], MetastoreRelation > foo, table2, Some(b) > {code} > Query plan in 2.0 > {code} > == Physical Plan == > *HashAggregate(keys=[field1#160], functions=[sum((field2#133 / 100.0))]) > +- Exchange hashpartitioning(field1#160, 200) >+- *HashAggregate(keys=[field1#160], functions=[partial_sum((field2#133 / > 100.0))]) > +- *Project [field2#133, field1#160] > +- *BroadcastHashJoin [field5#122L], [cast(cast(field4#156 as > decimal(20,0)) as bigint)], Inner, BuildRight > :- *Filter isnotnull(field5#122L) > : +- HiveTableScan [field5#122L, field2#133], MetastoreRelation > foo, table1, a, [isnotnull(ds#116), (ds#116 >= 2013-10-01), (ds#116 <= > 2013-12-31)] > +- BroadcastExchange > 
HashedRelationBroadcastMode(List(cast(cast(input[0, string, false] as > decimal(20,0)) as bigint))) >+- *Filter isnotnull(field4#156) > +- HiveTableScan [field4#156, field1#160], > MetastoreRelation foo, table2, b > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17142) Complex query triggers binding error in HashAggregateExec
[ https://issues.apache.org/jira/browse/SPARK-17142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15427236#comment-15427236 ] Josh Rosen commented on SPARK-17142: Interestingly, this query executes fine if the repeated addition in the SELECT clause is replaced by {{* 2}} instead. > Complex query triggers binding error in HashAggregateExec > - > > Key: SPARK-17142 > URL: https://issues.apache.org/jira/browse/SPARK-17142 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Josh Rosen >Priority: Blocker > > The following example runs successfully on Spark 2.0.0 but fails in the > current master (as of b72bb62d421840f82d663c6b8e3922bd14383fbb, if not > earlier): > {code} > spark.sql("set spark.sql.crossJoin.enabled=true") > sc.parallelize(Seq(0)).toDF("int_col_1").createOrReplaceTempView("table_4") > sc.parallelize(Seq(0)).toDF("bigint_col_2").createOrReplaceTempView("table_2") > val query = """ > SELECT > ((t2.int_col) + (t1.bigint_col_2)) + ((t2.int_col) + (t1.bigint_col_2)) AS > int_col_1 > FROM table_2 t1 > INNER JOIN ( > SELECT > LEAST(IF(False, LAG(0) OVER (ORDER BY t2.int_col_1 DESC), -230), > -991) AS int_col, > (t2.int_col_1) + (t1.int_col_1) AS int_col_2, > (t1.int_col_1) + (t2.int_col_1) AS int_col_3, > t2.int_col_1 > FROM > table_4 t1, > table_4 t2 > GROUP BY > (t1.int_col_1) + (t2.int_col_1), > t2.int_col_1 > ) t2 > WHERE (t2.int_col_3) NOT IN (t2.int_col, t2.int_col_1) > GROUP BY (t2.int_col) + (t1.bigint_col_2) > """ > spark.sql(query).collect() > {code} > This fails with the following exception: > {code} > org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding > attribute, tree: bigint_col_2#65 > at > org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:88) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:87) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:278) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:322) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:320) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:322) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:320) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284) > at > 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:322) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:320) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:268) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:87) > at > org.apache.spark.sql.execution.aggregate.HashAggregateExec$$anonfun$32.apply(HashAggregateExec.scala:455) > at > org.apache.spark.sql.execution.aggregate.HashAggregateExec$$anonfun$32.apply(HashAggregateExec.scala:454) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.Traver
[jira] [Created] (SPARK-17142) Complex query triggers binding error in HashAggregateExec
Josh Rosen created SPARK-17142: -- Summary: Complex query triggers binding error in HashAggregateExec Key: SPARK-17142 URL: https://issues.apache.org/jira/browse/SPARK-17142 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.0 Reporter: Josh Rosen Priority: Blocker The following example runs successfully on Spark 2.0.0 but fails in the current master (as of b72bb62d421840f82d663c6b8e3922bd14383fbb, if not earlier): {code} spark.sql("set spark.sql.crossJoin.enabled=true") sc.parallelize(Seq(0)).toDF("int_col_1").createOrReplaceTempView("table_4") sc.parallelize(Seq(0)).toDF("bigint_col_2").createOrReplaceTempView("table_2") val query = """ SELECT ((t2.int_col) + (t1.bigint_col_2)) + ((t2.int_col) + (t1.bigint_col_2)) AS int_col_1 FROM table_2 t1 INNER JOIN ( SELECT LEAST(IF(False, LAG(0) OVER (ORDER BY t2.int_col_1 DESC), -230), -991) AS int_col, (t2.int_col_1) + (t1.int_col_1) AS int_col_2, (t1.int_col_1) + (t2.int_col_1) AS int_col_3, t2.int_col_1 FROM table_4 t1, table_4 t2 GROUP BY (t1.int_col_1) + (t2.int_col_1), t2.int_col_1 ) t2 WHERE (t2.int_col_3) NOT IN (t2.int_col, t2.int_col_1) GROUP BY (t2.int_col) + (t1.bigint_col_2) """ sql(query).collect() {code} This fails with the following exception: {code} org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute, tree: bigint_col_2#65 at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56) at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:88) at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:87) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:278) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:322) at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179) at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:320) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:322) at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179) at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:320) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:322) at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179) at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:320) at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284) at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:268) at org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:87) at org.apache.spark.sql.execution.aggregate.HashAggregateExec$$anonfun$32.apply(HashAggregateExec.scala:455) at org.apache.spark.sql.execution.aggregate.HashAggregateExec$$anonfun$32.apply(HashAggregateExec.scala:454) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.immutable.List.foreach(List.scala:381) at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at scala.collection.immutable.List.map(List.scala:285) at org.apache.spark.sql.execution.aggregate.HashAggregateExec.generateResultCode(HashAggregateExec.scala:454) at org.apache.spark.sql.execution.aggregate.HashAggregateExec.doProduceWithKeys(HashAggregateExec.scala:538) at org.apache.spark.sql.execution.aggregate.HashAggregateExec.doProduce
[jira] [Updated] (SPARK-17142) Complex query triggers binding error in HashAggregateExec
[ https://issues.apache.org/jira/browse/SPARK-17142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-17142: --- Description: The following example runs successfully on Spark 2.0.0 but fails in the current master (as of b72bb62d421840f82d663c6b8e3922bd14383fbb, if not earlier): {code} spark.sql("set spark.sql.crossJoin.enabled=true") sc.parallelize(Seq(0)).toDF("int_col_1").createOrReplaceTempView("table_4") sc.parallelize(Seq(0)).toDF("bigint_col_2").createOrReplaceTempView("table_2") val query = """ SELECT ((t2.int_col) + (t1.bigint_col_2)) + ((t2.int_col) + (t1.bigint_col_2)) AS int_col_1 FROM table_2 t1 INNER JOIN ( SELECT LEAST(IF(False, LAG(0) OVER (ORDER BY t2.int_col_1 DESC), -230), -991) AS int_col, (t2.int_col_1) + (t1.int_col_1) AS int_col_2, (t1.int_col_1) + (t2.int_col_1) AS int_col_3, t2.int_col_1 FROM table_4 t1, table_4 t2 GROUP BY (t1.int_col_1) + (t2.int_col_1), t2.int_col_1 ) t2 WHERE (t2.int_col_3) NOT IN (t2.int_col, t2.int_col_1) GROUP BY (t2.int_col) + (t1.bigint_col_2) """ spark.sql(query).collect() {code} This fails with the following exception: {code} org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute, tree: bigint_col_2#65 at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56) at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:88) at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:87) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:278) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:322) at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179) at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:320) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:322) at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179) at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:320) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:322) at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179) at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:320) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284) at 
org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:268) at org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:87) at org.apache.spark.sql.execution.aggregate.HashAggregateExec$$anonfun$32.apply(HashAggregateExec.scala:455) at org.apache.spark.sql.execution.aggregate.HashAggregateExec$$anonfun$32.apply(HashAggregateExec.scala:454) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.immutable.List.foreach(List.scala:381) at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at scala.collection.immutable.List.map(List.scala:285) at org.apache.spark.sql.execution.aggregate.HashAggregateExec.generateResultCode(HashAggregateExec.scala:454) at org.apache.spark.sql.execution.aggregate.HashAggregateExec.doProduceWithKeys(HashAggregateExec.scala:538) at org.apache.spark.sql.execution.aggregate.HashAggregateExec.doProduce(HashAggregateExec.scala:145) at org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83) at org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.
[jira] [Commented] (SPARK-17133) Improvements to linear methods in Spark
[ https://issues.apache.org/jira/browse/SPARK-17133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15427144#comment-15427144 ] Xin Ren commented on SPARK-17133: - hi [~sethah] I'd like to help on this, please count me in. Thanks a lot :) > Improvements to linear methods in Spark > --- > > Key: SPARK-17133 > URL: https://issues.apache.org/jira/browse/SPARK-17133 > Project: Spark > Issue Type: Umbrella > Components: ML, MLlib >Reporter: Seth Hendrickson > > This JIRA is for tracking several improvements that we should make to > Linear/Logistic regression in Spark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16508) Fix documentation warnings found by R CMD check
[ https://issues.apache.org/jira/browse/SPARK-16508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15427134#comment-15427134 ] Apache Spark commented on SPARK-16508: -- User 'junyangq' has created a pull request for this issue: https://github.com/apache/spark/pull/14705 > Fix documentation warnings found by R CMD check > --- > > Key: SPARK-16508 > URL: https://issues.apache.org/jira/browse/SPARK-16508 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Shivaram Venkataraman > > A full list of warnings after the fixes in SPARK-16507 is at > https://gist.github.com/shivaram/62866c4ca59c5d34b8963939cf04b5eb -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16904) Removal of Hive Built-in Hash Functions and TestHiveFunctionRegistry
[ https://issues.apache.org/jira/browse/SPARK-16904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15427069#comment-15427069 ] Tejas Patil commented on SPARK-16904: - Is Spark's hashing function semantically equivalent to Hive's? AFAIK, it's not. I think it would be better to have a mode that allows using Hive's hash method. An example case where this would be needed: users running a query in Hive want to switch to Spark. As this happens, they want to verify whether the data produced is the same. Also, for a brief time the pipeline would run in both engines, and upstream consumers of the generated data should not see differences due to running on different engines. > Removal of Hive Built-in Hash Functions and TestHiveFunctionRegistry > > > Key: SPARK-16904 > URL: https://issues.apache.org/jira/browse/SPARK-16904 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li > > Currently, the Hive built-in `hash` function is not being used in Spark since > Spark 2.0. The public interface does not allow users to unregister the Spark > built-in functions. Thus, users will never use Hive's built-in `hash` > function. > The only exception here is `TestHiveFunctionRegistry`, which allows users to > unregister the built-in functions. Thus, we can load Hive's hash function in > the test cases. If we disable it, 10+ test cases will fail because the > results are different from the Hive golden answer files. > This PR is to remove `hash` from the list of `hiveFunctions` in > `HiveSessionCatalog`. It will also remove `TestHiveFunctionRegistry`. This > removal makes us easier to remove `TestHiveSessionState` in the future. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
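As a concrete illustration of the parity-check scenario described above, here is a rough sketch (not Spark's or Hive's internal API; the table name and columns are made up) of how one might compare data produced by the two engines. Because Spark's built-in {{hash}} is Murmur3-based and not semantically equivalent to Hive's {{hash}}, each checksum has to be computed with that engine's own function, which is exactly why a mode exposing Hive's hash inside Spark would help.
{code}
// Spark side of a rough cross-engine checksum. Assumes a table `events`
// with columns (id, payload) that exists in both Hive and Spark.
val sparkChecksum = spark.sql(
  "SELECT SUM(CAST(hash(id, payload) AS BIGINT)) AS checksum FROM events")
sparkChecksum.show()

// Hive side (run in Hive itself, using Hive's own hash()):
//   SELECT SUM(CAST(hash(id, payload) AS BIGINT)) AS checksum FROM events;
// The two numbers are only meaningful to compare if both sides use the same
// hash implementation, hence the request for a Hive-hash mode in Spark.
{code}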
[jira] [Updated] (SPARK-16077) Python UDF may fail because of six
[ https://issues.apache.org/jira/browse/SPARK-16077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-16077: - Fix Version/s: 1.6.3 > Python UDF may fail because of six > -- > > Key: SPARK-16077 > URL: https://issues.apache.org/jira/browse/SPARK-16077 > Project: Spark > Issue Type: Bug > Components: PySpark >Reporter: Davies Liu >Assignee: Davies Liu > Fix For: 1.6.3, 2.0.0 > > > six or other package may break pickle.whichmodule() in pickle: > https://bitbucket.org/gutworth/six/issues/63/importing-six-breaks-pickling -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17141) MinMaxScaler behaves weird when min and max have the same value and some values are NaN
[ https://issues.apache.org/jira/browse/SPARK-17141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15426988#comment-15426988 ] Sean Owen commented on SPARK-17141: --- Summarize the reproduction here? best to put it all here for the record. If you have a small fix and can describe it then someone else can commit it, though I think making a PR is a useful skill and not that hard. Worth taking a shot at it. > MinMaxScaler behaves weird when min and max have the same value and some > values are NaN > --- > > Key: SPARK-17141 > URL: https://issues.apache.org/jira/browse/SPARK-17141 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 1.6.2, 2.0.0 > Environment: Databrick's Community, Spark 2.0 + Scala 2.10 >Reporter: Alberto Bonsanto >Priority: Trivial > > When you have a {{DataFrame}} with a column named {{features}}, which is a > {{DenseVector}} and the *maximum* and *minimum* and some values are > {{Double.NaN}} they get replaced by 0.5, and they should remain with the same > value, I believe. > I know how to fix it, but I haven't ever made a pull request. You can check > the bug in this > [notebook|https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/2485090270202665/3126465289264547/8589256059752547/latest.html] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17141) MinMaxScaler behaves weird when min and max have the same value and some values are NaN
Alberto Bonsanto created SPARK-17141: Summary: MinMaxScaler behaves weird when min and max have the same value and some values are NaN Key: SPARK-17141 URL: https://issues.apache.org/jira/browse/SPARK-17141 Project: Spark Issue Type: Bug Components: ML Affects Versions: 2.0.0, 1.6.2 Environment: Databricks Community, Spark 2.0 + Scala 2.10 Reporter: Alberto Bonsanto Priority: Trivial When you have a {{DataFrame}} with a column named {{features}} that is a {{DenseVector}}, and the *maximum* and *minimum* are the same value while some values are {{Double.NaN}}, the NaN values get replaced by 0.5 when they should remain NaN, I believe. I know how to fix it, but I haven't ever made a pull request. You can check the bug in this [notebook|https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/2485090270202665/3126465289264547/8589256059752547/latest.html] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
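Since the reproduction currently lives only in the linked notebook, here is a minimal self-contained sketch of the reported behavior, responding to the request above to summarize it on the JIRA. The column names and values are illustrative, and it assumes a {{spark}} session as in spark-shell.
{code}
import org.apache.spark.ml.feature.MinMaxScaler
import org.apache.spark.ml.linalg.Vectors

// One feature whose observed min and max are both 1.0, plus a NaN value.
val df = spark.createDataFrame(Seq(
  (0, Vectors.dense(1.0)),
  (1, Vectors.dense(1.0)),
  (2, Vectors.dense(Double.NaN))
)).toDF("id", "features")

val scaler = new MinMaxScaler()
  .setInputCol("features")
  .setOutputCol("scaled")

// Reported behavior: the NaN entry comes back as 0.5 instead of staying NaN,
// because the min == max case maps the feature to the midpoint of the range.
scaler.fit(df).transform(df).show(false)
{code}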
[jira] [Commented] (SPARK-17132) binaryFiles method can't handle paths with embedded commas
[ https://issues.apache.org/jira/browse/SPARK-17132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15426962#comment-15426962 ] Sean Owen commented on SPARK-17132: --- Yeah, that would be a solution. It actually affects all related API methods of SparkContext, not just one. I'm not clear if it's worth adding a bunch to the RDD API now in Spark 2, but it's not out of the question. It should work to escape the commas with \, or at least that's what the Hadoop classes appear to want done. I suppose that's the intended usage, though I also would prefer a more explicit seq argument. > binaryFiles method can't handle paths with embedded commas > -- > > Key: SPARK-17132 > URL: https://issues.apache.org/jira/browse/SPARK-17132 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 1.2.0, 1.2.1, 1.2.2, 1.3.0, 1.3.1, 1.4.0, 1.4.1, 1.5.0, > 1.5.1, 1.5.2, 1.6.0, 1.6.1, 1.6.2, 2.0.0 >Reporter: Maximilian Najork > > A path with an embedded comma is treated as two separate paths by > binaryFiles. Since commas are legal characters in paths, this behavior is > incorrect. I recommend overloading binaryFiles to accept an array of path > strings in addition to a string of comma-separated paths. Since setInputPaths > is already overloaded to accept either form, this should be relatively > low-effort. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
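To make the two usages concrete, a small sketch (made-up paths) of the comma-separated form and of the backslash escaping suggested above; as the next comment notes, the reporter did not get the escaped form to work, so treat it as the theory being discussed rather than a confirmed workaround.
{code}
// Comma-separated list: Hadoop's setInputPaths splits on commas, so this is
// read as two separate paths, /data/a.bin and /data/b.bin.
val two = sc.binaryFiles("/data/a.bin,/data/b.bin")

// A single path that happens to contain a comma, escaped the way the Hadoop
// path-parsing code appears to expect (a literal backslash before the comma).
val one = sc.binaryFiles("/data/file\\,with\\,comma.bin")
{code}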
[jira] [Commented] (SPARK-17132) binaryFiles method can't handle paths with embedded commas
[ https://issues.apache.org/jira/browse/SPARK-17132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15426965#comment-15426965 ] Maximilian Najork commented on SPARK-17132: --- I tried escaping the commas prior to filing this ticket and it still exhibited the behavior. It's possible I was doing something incorrectly. > binaryFiles method can't handle paths with embedded commas > -- > > Key: SPARK-17132 > URL: https://issues.apache.org/jira/browse/SPARK-17132 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 1.2.0, 1.2.1, 1.2.2, 1.3.0, 1.3.1, 1.4.0, 1.4.1, 1.5.0, > 1.5.1, 1.5.2, 1.6.0, 1.6.1, 1.6.2, 2.0.0 >Reporter: Maximilian Najork > > A path with an embedded comma is treated as two separate paths by > binaryFiles. Since commas are legal characters in paths, this behavior is > incorrect. I recommend overloading binaryFiles to accept an array of path > strings in addition to a string of comma-separated paths. Since setInputPaths > is already overloaded to accept either form, this should be relatively > low-effort. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17140) Add initial model to MultinomialLogisticRegression
Seth Hendrickson created SPARK-17140: Summary: Add initial model to MultinomialLogisticRegression Key: SPARK-17140 URL: https://issues.apache.org/jira/browse/SPARK-17140 Project: Spark Issue Type: Sub-task Components: ML Reporter: Seth Hendrickson We should add initial model support to Multinomial logistic regression. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17138) Python API for multinomial logistic regression
Seth Hendrickson created SPARK-17138: Summary: Python API for multinomial logistic regression Key: SPARK-17138 URL: https://issues.apache.org/jira/browse/SPARK-17138 Project: Spark Issue Type: Sub-task Components: ML Reporter: Seth Hendrickson Once [SPARK-7159|https://issues.apache.org/jira/browse/SPARK-7159] is merged, we should make a Python API for it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17139) Add model summary for MultinomialLogisticRegression
Seth Hendrickson created SPARK-17139: Summary: Add model summary for MultinomialLogisticRegression Key: SPARK-17139 URL: https://issues.apache.org/jira/browse/SPARK-17139 Project: Spark Issue Type: Sub-task Components: ML Reporter: Seth Hendrickson Add model summary to multinomial logistic regression using same interface as in other ML models. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17137) Add compressed support for multinomial logistic regression coefficients
Seth Hendrickson created SPARK-17137: Summary: Add compressed support for multinomial logistic regression coefficients Key: SPARK-17137 URL: https://issues.apache.org/jira/browse/SPARK-17137 Project: Spark Issue Type: Sub-task Components: ML Reporter: Seth Hendrickson Priority: Minor For sparse coefficients in MLOR, such as when high L1 regularization, it may be more efficient to store coefficients in compressed format. We can add this option to MLOR and perhaps to do some performance tests to verify improvements. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
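As a rough illustration of the storage trade-off being proposed here, using {{Vector.compressed}} from {{ml.linalg}} (an existing helper, to the best of my knowledge), not the MLOR change itself:
{code}
import org.apache.spark.ml.linalg.{Vector, Vectors}

// A mostly-zero coefficient vector, as strong L1 regularization tends to produce.
val coefficients: Vector = Vectors.dense(0.0, 0.0, 1.5, 0.0, 0.0, 0.0, 0.0, -2.0)

// compressed keeps whichever of the dense or sparse representation is smaller;
// here that is the sparse form, (8,[2,7],[1.5,-2.0]).
val stored = coefficients.compressed
println(stored)
{code}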
[jira] [Created] (SPARK-17136) Design optimizer interface for ML algorithms
Seth Hendrickson created SPARK-17136: Summary: Design optimizer interface for ML algorithms Key: SPARK-17136 URL: https://issues.apache.org/jira/browse/SPARK-17136 Project: Spark Issue Type: Sub-task Components: ML Reporter: Seth Hendrickson We should consider designing an interface that allows users to use their own optimizers in some of the ML algorithms, similar to MLlib. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17133) Improvements to linear methods in Spark
[ https://issues.apache.org/jira/browse/SPARK-17133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Seth Hendrickson updated SPARK-17133: - Description: This JIRA is for tracking several improvements that we should make to Linear/Logistic regression in Spark. (was: This JIRA is for tracking several improvements that we should make to Linear/Logistic regression in Spark. Many of them are follow ups to [SPARK-7159|https://issues.apache.org/jira/browse/SPARK-7159].) > Improvements to linear methods in Spark > --- > > Key: SPARK-17133 > URL: https://issues.apache.org/jira/browse/SPARK-17133 > Project: Spark > Issue Type: Umbrella > Components: ML, MLlib >Reporter: Seth Hendrickson > > This JIRA is for tracking several improvements that we should make to > Linear/Logistic regression in Spark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17135) Consolidate code in linear/logistic regression where possible
Seth Hendrickson created SPARK-17135: Summary: Consolidate code in linear/logistic regression where possible Key: SPARK-17135 URL: https://issues.apache.org/jira/browse/SPARK-17135 Project: Spark Issue Type: Sub-task Components: ML Reporter: Seth Hendrickson Priority: Minor There is shared code between MultinomialLogisticRegression, LogisticRegression, and LinearRegression. We should consolidate where possible. Also, we should move some code out of LogisticRegression.scala into a separate util file or similar. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17090) Make tree aggregation level in linear/logistic regression configurable
[ https://issues.apache.org/jira/browse/SPARK-17090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Seth Hendrickson updated SPARK-17090: - Issue Type: Sub-task (was: Improvement) Parent: SPARK-17133 > Make tree aggregation level in linear/logistic regression configurable > -- > > Key: SPARK-17090 > URL: https://issues.apache.org/jira/browse/SPARK-17090 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Seth Hendrickson >Priority: Minor > > Linear/logistic regression use treeAggregate with default aggregation depth > for collecting coefficient gradient updates to the driver. For high > dimensional problems, this can cause OOM errors on the driver. We should make > it configurable, perhaps via an expert param, so that users can avoid this > problem if their data has many features. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17134) Use level 2 BLAS operations in LogisticAggregator
Seth Hendrickson created SPARK-17134: Summary: Use level 2 BLAS operations in LogisticAggregator Key: SPARK-17134 URL: https://issues.apache.org/jira/browse/SPARK-17134 Project: Spark Issue Type: Sub-task Components: ML Reporter: Seth Hendrickson Multinomial logistic regression uses LogisticAggregator class for gradient updates. We should look into refactoring MLOR to use level 2 BLAS operations for the updates. Performance testing should be done to show improvements. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17133) Improvements to linear methods in Spark
Seth Hendrickson created SPARK-17133: Summary: Improvements to linear methods in Spark Key: SPARK-17133 URL: https://issues.apache.org/jira/browse/SPARK-17133 Project: Spark Issue Type: Umbrella Components: ML, MLlib Reporter: Seth Hendrickson This JIRA is for tracking several improvements that we should make to Linear/Logistic regression in Spark. Many of them are follow ups to [SPARK-7159|https://issues.apache.org/jira/browse/SPARK-7159]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17132) binaryFiles method can't handle paths with embedded commas
Maximilian Najork created SPARK-17132: - Summary: binaryFiles method can't handle paths with embedded commas Key: SPARK-17132 URL: https://issues.apache.org/jira/browse/SPARK-17132 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 2.0.0, 1.6.2, 1.6.1, 1.6.0, 1.5.2, 1.5.1, 1.5.0, 1.4.1, 1.4.0, 1.3.1, 1.3.0, 1.2.2, 1.2.1, 1.2.0 Reporter: Maximilian Najork A path with an embedded comma is treated as two separate paths by binaryFiles. Since commas are legal characters in paths, this behavior is incorrect. I recommend overloading binaryFiles to accept an array of path strings in addition to a string of comma-separated paths. Since setInputPaths is already overloaded to accept either form, this should be relatively low-effort. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16981) For CSV files nullValue is not respected for Date/Time data type
[ https://issues.apache.org/jira/browse/SPARK-16981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lev updated SPARK-16981: Priority: Critical (was: Major) > For CSV files nullValue is not respected for Date/Time data type > > > Key: SPARK-16981 > URL: https://issues.apache.org/jira/browse/SPARK-16981 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Lev >Priority: Critical > > Test case > val struct = StructType(Seq(StructField("col1", StringType, true), > StructField("col2", TimestampType, true), StructField("col3", StringType, true))) > val cq = sqlContext.readStream > .format("csv") > .option("nullValue", " ") > .schema(struct) > .load(s"somepath") > .writeStream > content of the file > "abc", ,"def" > Result: > Exception is thrown: > scala.MatchError: java.lang.IllegalArgumentException: Timestamp format must > be yyyy-mm-dd hh:mm:ss[.fffffffff] (of class > java.lang.IllegalArgumentException) > Code analysis: > Problem is caused by code in castTo method of CSVTypeCast object > For all data types except temporal there is the following check: > if (datum == options.nullValue && nullable) { > null > } > But for temporal types it is missing -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
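For anyone reproducing this without the streaming setup, a batch-mode sketch of the same shape may be easier to run; the file path is hypothetical and the behavior annotations only restate what the report above describes.
{code}
import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("col1", StringType, true),
  StructField("col2", TimestampType, true),
  StructField("col3", StringType, true)))

// The CSV file contains a single line:  "abc", ,"def"
val df = spark.read
  .format("csv")
  .option("nullValue", " ")
  .schema(schema)
  .load("/path/to/somefile.csv")

// Reported behavior: instead of col2 becoming null, parsing the " " cell as a
// timestamp fails, because the nullValue check is skipped for temporal types.
df.show()
{code}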
[jira] [Commented] (SPARK-17090) Make tree aggregation level in linear/logistic regression configurable
[ https://issues.apache.org/jira/browse/SPARK-17090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15426876#comment-15426876 ] DB Tsai commented on SPARK-17090: - Since coming up with a formula for determining the aggregation depth is pretty tricky (it will depend on the driver's memory setting, the dimensionality of the problem, the number of partitions, etc.), this will take longer to discuss and implement properly. Let's get the API done in this PR and set the default value to 2. In a follow-up PR, we can work on the formula part. > Make tree aggregation level in linear/logistic regression configurable > -- > > Key: SPARK-17090 > URL: https://issues.apache.org/jira/browse/SPARK-17090 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Seth Hendrickson >Priority: Minor > > Linear/logistic regression use treeAggregate with default aggregation depth > for collecting coefficient gradient updates to the driver. For high > dimensional problems, this can cause OOM errors on the driver. We should make > it configurable, perhaps via an expert param, so that users can avoid this > problem if their data has many features. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
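For context on the parameter being discussed, a toy sketch of the underlying RDD knob; this is the plain {{treeAggregate}} API, not the expert param this JIRA adds to linear/logistic regression.
{code}
// Toy gradient-like aggregation: sum fixed-size arrays across many partitions.
val grads = sc.parallelize(Seq.fill(1000)(Array.fill(4)(1.0)), numSlices = 200)

val summed = grads.treeAggregate(Array.fill(4)(0.0))(
  seqOp = (acc, g) => { for (i <- acc.indices) acc(i) += g(i); acc },
  combOp = (a, b) => { for (i <- a.indices) a(i) += b(i); a },
  depth = 3 // default is 2; a deeper tree means the driver merges fewer partials at once
)
println(summed.mkString(","))  // 1000.0,1000.0,1000.0,1000.0
{code}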
[jira] [Commented] (SPARK-15694) Implement ScriptTransformation in sql/core
[ https://issues.apache.org/jira/browse/SPARK-15694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15426858#comment-15426858 ] Tejas Patil commented on SPARK-15694: - PR for part #1 : https://github.com/apache/spark/pull/14702 > Implement ScriptTransformation in sql/core > -- > > Key: SPARK-15694 > URL: https://issues.apache.org/jira/browse/SPARK-15694 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin > > ScriptTransformation currently relies on Hive internals. It'd be great if we > can implement a native ScriptTransformation in sql/core module to remove the > extra Hive dependency here. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17130) SparseVectors.apply and SparseVectors.toArray have different returns when creating with a illegal indices
[ https://issues.apache.org/jira/browse/SPARK-17130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15426821#comment-15426821 ] Jon Zhong commented on SPARK-17130: --- Thanks for posting the code. The problem is solved clearly. > SparseVectors.apply and SparseVectors.toArray have different returns when > creating with a illegal indices > - > > Key: SPARK-17130 > URL: https://issues.apache.org/jira/browse/SPARK-17130 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 1.6.2, 2.0.0 > Environment: spark 1.6.1 + scala >Reporter: Jon Zhong >Priority: Minor > > One of my colleagues ran into a bug of SparseVectors. He called the > Vectors.sparse(size: Int, indices: Array[Int], values: Array[Double]) without > noticing that the indices are assumed to be ordered. > The vector he created has all value of 0.0 (without any warning), if we try > to get value via apply method. However, SparseVector.toArray will generates a > array using a method that is order insensitive. Hence, you will get a 0.0 > when you call apply method, while you can get correct result using toArray or > toDense method. The result of SparseVector.toArray is actually misleading. > It could be safer if there is a validation of indices in the constructor or > at least make the returns of apply method and toArray method the same. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15694) Implement ScriptTransformation in sql/core
[ https://issues.apache.org/jira/browse/SPARK-15694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15694: Assignee: (was: Apache Spark) > Implement ScriptTransformation in sql/core > -- > > Key: SPARK-15694 > URL: https://issues.apache.org/jira/browse/SPARK-15694 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin > > ScriptTransformation currently relies on Hive internals. It'd be great if we > can implement a native ScriptTransformation in sql/core module to remove the > extra Hive dependency here. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15694) Implement ScriptTransformation in sql/core
[ https://issues.apache.org/jira/browse/SPARK-15694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15426815#comment-15426815 ] Apache Spark commented on SPARK-15694: -- User 'tejasapatil' has created a pull request for this issue: https://github.com/apache/spark/pull/14702 > Implement ScriptTransformation in sql/core > -- > > Key: SPARK-15694 > URL: https://issues.apache.org/jira/browse/SPARK-15694 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin > > ScriptTransformation currently relies on Hive internals. It'd be great if we > can implement a native ScriptTransformation in sql/core module to remove the > extra Hive dependency here. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15694) Implement ScriptTransformation in sql/core
[ https://issues.apache.org/jira/browse/SPARK-15694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15694: Assignee: Apache Spark > Implement ScriptTransformation in sql/core > -- > > Key: SPARK-15694 > URL: https://issues.apache.org/jira/browse/SPARK-15694 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Apache Spark > > ScriptTransformation currently relies on Hive internals. It'd be great if we > can implement a native ScriptTransformation in sql/core module to remove the > extra Hive dependency here. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16581) Making JVM backend calling functions public
[ https://issues.apache.org/jira/browse/SPARK-16581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15426807#comment-15426807 ] Shivaram Venkataraman commented on SPARK-16581: --- I am not sure the issues are very related though 1. The JVM->R access methods are mostly to call into any Java method (like say in SystemML). I think we have reasonable clarity on what to make public here which is callJMethod and callJStatic. There is also some discussion on supporting custom GC using cleanup.jobj in the SPARK-16611 2. The RDD / RBackend are not directly related to this I think. The RDD ones are about our UDFs not having some features right now and we can continue discussing that in SPARK-16611 or other JIRAs ? > Making JVM backend calling functions public > --- > > Key: SPARK-16581 > URL: https://issues.apache.org/jira/browse/SPARK-16581 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Shivaram Venkataraman > > As described in the design doc in SPARK-15799, to help packages that need to > call into the JVM, it will be good to expose some of the R -> JVM functions > we have. > As a part of this we could also rename, reformat the functions to make them > more user friendly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-16581) Making JVM backend calling functions public
[ https://issues.apache.org/jira/browse/SPARK-16581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman updated SPARK-16581: -- Comment: was deleted (was: I am not sure the issues are very related though 1. The JVM->R access methods are mostly to call into any Java method (like say in SystemML). I think we have reasonable clarity on what to make public here which is callJMethod and callJStatic. There is also some discussion on supporting custom GC using cleanup.jobj in the SPARK-16611 2. The RDD / RBackend are not directly related to this I think. The RDD ones are about our UDFs not having some features right now and we can continue discussing that in SPARK-16611 or other JIRAs ?) > Making JVM backend calling functions public > --- > > Key: SPARK-16581 > URL: https://issues.apache.org/jira/browse/SPARK-16581 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Shivaram Venkataraman > > As described in the design doc in SPARK-15799, to help packages that need to > call into the JVM, it will be good to expose some of the R -> JVM functions > we have. > As a part of this we could also rename, reformat the functions to make them > more user friendly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16581) Making JVM backend calling functions public
[ https://issues.apache.org/jira/browse/SPARK-16581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15426806#comment-15426806 ] Shivaram Venkataraman commented on SPARK-16581: --- I am not sure the issues are very related though 1. The JVM->R access methods are mostly to call into any Java method (like say in SystemML). I think we have reasonable clarity on what to make public here which is callJMethod and callJStatic. There is also some discussion on supporting custom GC using cleanup.jobj in the SPARK-16611 2. The RDD / RBackend are not directly related to this I think. The RDD ones are about our UDFs not having some features right now and we can continue discussing that in SPARK-16611 or other JIRAs ? > Making JVM backend calling functions public > --- > > Key: SPARK-16581 > URL: https://issues.apache.org/jira/browse/SPARK-16581 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Shivaram Venkataraman > > As described in the design doc in SPARK-15799, to help packages that need to > call into the JVM, it will be good to expose some of the R -> JVM functions > we have. > As a part of this we could also rename, reformat the functions to make them > more user friendly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17131) Code generation fails when running SQL expressions against a wide dataset (thousands of columns)
[ https://issues.apache.org/jira/browse/SPARK-17131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15426801#comment-15426801 ] Iaroslav Zeigerman commented on SPARK-17131: Having a different exception when trying to apply mean function to all columns: {code} val allCols = df.columns.map(c => mean(c)) val newDf = df.select(allCols: _*) newDf.show() {code} {noformat} java.io.EOFException at java.io.DataInputStream.readFully(DataInputStream.java:197) at java.io.DataInputStream.readFully(DataInputStream.java:169) at org.codehaus.janino.util.ClassFile.loadAttribute(ClassFile.java:1383) at org.codehaus.janino.util.ClassFile.loadAttributes(ClassFile.java:555) at org.codehaus.janino.util.ClassFile.loadFields(ClassFile.java:518) at org.codehaus.janino.util.ClassFile.(ClassFile.java:185) at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$recordCompilationStats$1.apply(CodeGenerator.scala:914) at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$recordCompilationStats$1.apply(CodeGenerator.scala:912) at scala.collection.Iterator$class.foreach(Iterator.scala:742) at scala.collection.AbstractIterator.foreach(Iterator.scala:1194) at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) at scala.collection.AbstractIterable.foreach(Iterable.scala:54) at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.recordCompilationStats(CodeGenerator.scala:912) at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:884) at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:941) at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:938) at org.spark_project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599) ... {noformat} > Code generation fails when running SQL expressions against a wide dataset > (thousands of columns) > > > Key: SPARK-17131 > URL: https://issues.apache.org/jira/browse/SPARK-17131 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Iaroslav Zeigerman > > When reading the CSV file that contains 1776 columns Spark and Janino fail to > generate the code with message: > {noformat} > Constant pool has grown past JVM limit of 0x > {noformat} > When running a common select with all columns it's fine: > {code} > val allCols = df.columns.map(c => col(c).as(c + "_alias")) > val newDf = df.select(allCols: _*) > newDf.show() > {code} > But when I invoke the describe method: > {code} > newDf.describe(allCols: _*) > {code} > it fails with the following stack trace: > {noformat} > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:889) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:941) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:938) > at > org.spark_project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599) > at > org.spark_project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379) > ... 
30 more > Caused by: org.codehaus.janino.JaninoRuntimeException: Constant pool has > grown past JVM limit of 0x > at > org.codehaus.janino.util.ClassFile.addToConstantPool(ClassFile.java:402) > at > org.codehaus.janino.util.ClassFile.addConstantIntegerInfo(ClassFile.java:300) > at > org.codehaus.janino.UnitCompiler.addConstantIntegerInfo(UnitCompiler.java:10307) > at org.codehaus.janino.UnitCompiler.pushConstant(UnitCompiler.java:8868) > at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4346) > at org.codehaus.janino.UnitCompiler.access$7100(UnitCompiler.java:185) > at > org.codehaus.janino.UnitCompiler$10.visitIntegerLiteral(UnitCompiler.java:3265) > at org.codehaus.janino.Java$IntegerLiteral.accept(Java.java:4321) > at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3290) > at org.codehaus.janino.UnitCompiler.fakeCompile(UnitCompiler.java:2605) > at > org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4362) > at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:3975) > at org.codehaus.janino.UnitCompiler.access$6900(UnitCompiler.ja
[jira] [Commented] (SPARK-6832) Handle partial reads in SparkR JVM to worker communication
[ https://issues.apache.org/jira/browse/SPARK-6832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15426798#comment-15426798 ] Shivaram Venkataraman commented on SPARK-6832: -- I think we can add a new method `readBinFully` and then replace calls to `readBin` with that method. Regarding simulating this -- I think you could try to manually send a signal (using something like kill -s SIGCHLD) to an R process while it is reading a large amount of data using readBin. > Handle partial reads in SparkR JVM to worker communication > -- > > Key: SPARK-6832 > URL: https://issues.apache.org/jira/browse/SPARK-6832 > Project: Spark > Issue Type: Improvement > Components: SparkR >Reporter: Shivaram Venkataraman >Priority: Minor > > After we move to use socket between R worker and JVM, it's possible that > readBin() in R will return partial results (for example, interrupted by > signal). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-17130) SparseVectors.apply and SparseVectors.toArray have different returns when creating with a illegal indices
[ https://issues.apache.org/jira/browse/SPARK-17130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-17130. --- Resolution: Duplicate Oh yeah but along the way the validation is also all moved into the constructor. That was actually the last comment on the PR -- sorry thought that's what you saw and were even responding to. See https://github.com/apache/spark/pull/14555/files#diff-84f492e3a9c1febe833709960069b1b2R553 I think the issue was that Vectors.sparse does validate but new SparseVector() does not? well, both will be validated now. I'll say this is a duplicate because we should definitely resolve both at once. > SparseVectors.apply and SparseVectors.toArray have different returns when > creating with a illegal indices > - > > Key: SPARK-17130 > URL: https://issues.apache.org/jira/browse/SPARK-17130 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 1.6.2, 2.0.0 > Environment: spark 1.6.1 + scala >Reporter: Jon Zhong >Priority: Minor > > One of my colleagues ran into a bug of SparseVectors. He called the > Vectors.sparse(size: Int, indices: Array[Int], values: Array[Double]) without > noticing that the indices are assumed to be ordered. > The vector he created has all value of 0.0 (without any warning), if we try > to get value via apply method. However, SparseVector.toArray will generates a > array using a method that is order insensitive. Hence, you will get a 0.0 > when you call apply method, while you can get correct result using toArray or > toDense method. The result of SparseVector.toArray is actually misleading. > It could be safer if there is a validation of indices in the constructor or > at least make the returns of apply method and toArray method the same. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17130) SparseVectors.apply and SparseVectors.toArray have different returns when creating with a illegal indices
[ https://issues.apache.org/jira/browse/SPARK-17130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15426780#comment-15426780 ] Jon Zhong commented on SPARK-17130: --- Yep, I wrote a comment there but deleted it since I'm not sure whether they are fixing this problem as well. The problem mentioned in SPARK-16965 is more about negative indices. Are they also concerned about unordered indices? > SparseVectors.apply and SparseVectors.toArray have different returns when > creating with a illegal indices > - > > Key: SPARK-17130 > URL: https://issues.apache.org/jira/browse/SPARK-17130 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 1.6.2, 2.0.0 > Environment: spark 1.6.1 + scala >Reporter: Jon Zhong >Priority: Minor > > One of my colleagues ran into a bug of SparseVectors. He called the > Vectors.sparse(size: Int, indices: Array[Int], values: Array[Double]) without > noticing that the indices are assumed to be ordered. > The vector he created has all value of 0.0 (without any warning), if we try > to get value via apply method. However, SparseVector.toArray will generates a > array using a method that is order insensitive. Hence, you will get a 0.0 > when you call apply method, while you can get correct result using toArray or > toDense method. The result of SparseVector.toArray is actually misleading. > It could be safer if there is a validation of indices in the constructor or > at least make the returns of apply method and toArray method the same. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
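To make the unordered-indices case concrete, a small sketch of the discrepancy as described in the report; the behavior annotations reflect the reported 1.6/2.0 behavior before the constructor validation from the PR linked above lands, and the values are made up.
{code}
import org.apache.spark.mllib.linalg.Vectors

// Indices passed out of order; per the report, no warning is raised.
val v = Vectors.sparse(3, Array(2, 0), Array(5.0, 7.0))

// apply() binary-searches the indices array, so it relies on them being sorted:
println(v(0))                    // reported: 0.0, even though index 0 was set to 7.0

// toArray walks the (index, value) pairs directly and is order-insensitive:
println(v.toArray.mkString(",")) // 7.0,0.0,5.0
{code}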
[jira] [Created] (SPARK-17131) Code generation fails when running SQL expressions against a wide dataset (thousands of columns)
Iaroslav Zeigerman created SPARK-17131: -- Summary: Code generation fails when running SQL expressions against a wide dataset (thousands of columns) Key: SPARK-17131 URL: https://issues.apache.org/jira/browse/SPARK-17131 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Iaroslav Zeigerman When reading the CSV file that contains 1776 columns Spark and Janino fail to generate the code with message: {noformat} Constant pool has grown past JVM limit of 0x {noformat} When running a common select with all columns it's fine: {code} val allCols = df.columns.map(c => col(c).as(c + "_alias")) val newDf = df.select(allCols: _*) newDf.show() {code} But when I invoke the describe method: {code} newDf.describe(allCols: _*) {code} it fails with the following stack trace: {noformat} at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:889) at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:941) at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:938) at org.spark_project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599) at org.spark_project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379) ... 30 more Caused by: org.codehaus.janino.JaninoRuntimeException: Constant pool has grown past JVM limit of 0x at org.codehaus.janino.util.ClassFile.addToConstantPool(ClassFile.java:402) at org.codehaus.janino.util.ClassFile.addConstantIntegerInfo(ClassFile.java:300) at org.codehaus.janino.UnitCompiler.addConstantIntegerInfo(UnitCompiler.java:10307) at org.codehaus.janino.UnitCompiler.pushConstant(UnitCompiler.java:8868) at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4346) at org.codehaus.janino.UnitCompiler.access$7100(UnitCompiler.java:185) at org.codehaus.janino.UnitCompiler$10.visitIntegerLiteral(UnitCompiler.java:3265) at org.codehaus.janino.Java$IntegerLiteral.accept(Java.java:4321) at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3290) at org.codehaus.janino.UnitCompiler.fakeCompile(UnitCompiler.java:2605) at org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4362) at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:3975) at org.codehaus.janino.UnitCompiler.access$6900(UnitCompiler.java:185) at org.codehaus.janino.UnitCompiler$10.visitMethodInvocation(UnitCompiler.java:3263) at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:3974) at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3290) at org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4368) at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2662) at org.codehaus.janino.UnitCompiler.access$4400(UnitCompiler.java:185) at org.codehaus.janino.UnitCompiler$7.visitMethodInvocation(UnitCompiler.java:2627) at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:3974) at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2654) at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:1643) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
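The original 1776-column CSV is not attached, but the shape of the failure can be approximated with a synthetic wide DataFrame; this is only a sketch under that assumption, not the reporter's data.
{code}
import org.apache.spark.sql.functions.lit

// Build a DataFrame with 1776 columns; the values are irrelevant, only the
// column count matters for the size of the generated code.
val wide = spark.range(10).select((0 until 1776).map(i => lit(i).as(s"col_$i")): _*)

// A plain select/show works, as in the report:
wide.show(1)

// describe() builds one large aggregation over every column, which is where the
// generated class reportedly exceeds the constant pool limit:
wide.describe(wide.columns: _*).show()
{code}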
[jira] [Commented] (SPARK-17090) Make tree aggregation level in linear/logistic regression configurable
[ https://issues.apache.org/jira/browse/SPARK-17090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15426562#comment-15426562 ] Seth Hendrickson commented on SPARK-17090: -- I'm not working on it. Please feel free to take it! > Make tree aggregation level in linear/logistic regression configurable > -- > > Key: SPARK-17090 > URL: https://issues.apache.org/jira/browse/SPARK-17090 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Seth Hendrickson >Priority: Minor > > Linear/logistic regression use treeAggregate with default aggregation depth > for collecting coefficient gradient updates to the driver. For high > dimensional problems, this can cause OOM errors on the driver. We should make > it configurable, perhaps via an expert param, so that users can avoid this > problem if their data has many features. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17086) QuantileDiscretizer throws InvalidArgumentException (parameter splits given invalid value) on valid data
[ https://issues.apache.org/jira/browse/SPARK-17086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15426549#comment-15426549 ] Barry Becker commented on SPARK-17086: -- I think I agree with the discussion. Here is a summary of the conclusions, just to check my understanding:
- It's fine for approxQuantile to return duplicate splits. It should always return the requested number of quantiles, corresponding to the length of the probabilities array passed to it.
- QuantileDiscretizer, on the other hand, may return fewer than the number of buckets requested. It should not give an error when the number of buckets requested is greater than the number of distinct values. If the call to approxQuantile returns duplicate splits, just discard the duplicates when passing the splits to the Bucketizer. This saves you from having to compute the unique values first in order to check whether that number is less than the requested number of bins.
I think it's fine that QuantileDiscretizer works this way. You want it to be robust and not give errors for edge cases like this. The objective is to return buckets that are as close to equal weight as possible with simple split values. If the data was \[1,1,1,1,1,1,1,1,4,5,10\] and I asked for 10 bins, then I would expect the splits to be \[-Inf, 1, 4, 5, 10, Inf\], even though the median is 1 and approxQuantile returned 1 repeated several times. If I asked for 2 bins, then I think the splits might be \[-Inf, 1, 4, Inf\]. If three bins are requested, would you get \[-Inf, 1, 4, 5, Inf\] or \[-Inf, 1, 4, 10, Inf\]? Maybe in cases like this you should get \[-Inf, 1, 4, 5, 10, Inf\] even though only 3 bins were requested. In other words, if there are only a small number of unique integer values in the data, and the number of bins is slightly less than that number, maybe it should be increased to match it, since that is likely to be more meaningful. For now, just removing duplicates is probably enough.
> QuantileDiscretizer throws InvalidArgumentException (parameter splits given > invalid value) on valid data > > > Key: SPARK-17086 > URL: https://issues.apache.org/jira/browse/SPARK-17086 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.1.0 >Reporter: Barry Becker > > I discovered this bug when working with a build from the master branch (which > I believe is 2.1.0). This used to work fine when running Spark 1.6.2. > I have a dataframe with an "intData" column that has values like
> {code}
> 1 3 2 1 1 2 3 2 2 2 1 3
> {code}
> I have a stage in my pipeline that uses the QuantileDiscretizer to produce equal-weight splits like this:
> {code}
> new QuantileDiscretizer()
>   .setInputCol("intData")
>   .setOutputCol("intData_bin")
>   .setNumBuckets(10)
>   .fit(df)
> {code}
> But when that gets run it (incorrectly) throws this error:
> {code}
> parameter splits given invalid value [-Infinity, 1.0, 1.0, 2.0, 2.0, 3.0, 3.0, Infinity]
> {code}
> I don't think that there should be duplicate splits generated, should there? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
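The deduplication idea from the comment above could look roughly like the sketch below. This is not the actual patch: the helper name and the way QuantileDiscretizer wires its quantile boundaries into the Bucketizer splits are simplified assumptions.
{code}
// Sketch of the proposed behavior: drop duplicate quantile boundaries, then
// build a strictly increasing splits array with -Inf/+Inf sentinels, accepting
// that the result may have fewer buckets than requested.
def toSplits(rawQuantiles: Array[Double]): Array[Double] = {
  val distinctBoundaries = rawQuantiles.distinct.sorted
  Array(Double.NegativeInfinity) ++ distinctBoundaries ++ Array(Double.PositiveInfinity)
}

// e.g. the boundaries (1.0, 1.0, 2.0, 2.0, 3.0, 3.0) from the report collapse to
// splits (-Inf, 1.0, 2.0, 3.0, Inf): fewer buckets than requested, but no error.
{code}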
[jira] [Commented] (SPARK-17130) SparseVectors.apply and SparseVectors.toArray have different returns when created with illegal indices
[ https://issues.apache.org/jira/browse/SPARK-17130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15426490#comment-15426490 ] Sean Owen commented on SPARK-17130: --- Yeah, didn't you just comment on https://github.com/apache/spark/pull/14555? That's already being fixed there. This is a duplicate of SPARK-16965. > SparseVectors.apply and SparseVectors.toArray have different returns when > created with illegal indices > - > > Key: SPARK-17130 > URL: https://issues.apache.org/jira/browse/SPARK-17130 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 1.6.2, 2.0.0 > Environment: Spark 1.6.1 + Scala >Reporter: Jon Zhong >Priority: Minor > > One of my colleagues ran into a bug in SparseVector. He called > Vectors.sparse(size: Int, indices: Array[Int], values: Array[Double]) without > noticing that the indices are assumed to be ordered. > The vector he created returns 0.0 for every element (without any warning) if we > try to get values via the apply method. However, SparseVector.toArray generates an > array using a method that is order-insensitive. Hence, you will get 0.0 > when you call the apply method, while you can get the correct result using the toArray or > toDense methods. The result of SparseVector.toArray is actually misleading. > It would be safer to validate the indices in the constructor, or > at least to make the apply and toArray methods return the same results. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17130) SparseVectors.apply and SparseVectors.toArray have different returns when created with illegal indices
Jon Zhong created SPARK-17130: - Summary: SparseVectors.apply and SparseVectors.toArray have different returns when created with illegal indices Key: SPARK-17130 URL: https://issues.apache.org/jira/browse/SPARK-17130 Project: Spark Issue Type: Bug Components: ML, MLlib Affects Versions: 2.0.0, 1.6.2 Environment: Spark 1.6.1 + Scala Reporter: Jon Zhong Priority: Minor One of my colleagues ran into a bug in SparseVector. He called Vectors.sparse(size: Int, indices: Array[Int], values: Array[Double]) without noticing that the indices are assumed to be ordered. The vector he created returns 0.0 for every element (without any warning) if we try to get values via the apply method. However, SparseVector.toArray generates an array using a method that is order-insensitive. Hence, you will get 0.0 when you call the apply method, while you can get the correct result using the toArray or toDense methods. The result of SparseVector.toArray is actually misleading. It would be safer to validate the indices in the constructor, or at least to make the apply and toArray methods return the same results. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
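To make the mismatch concrete, here is a small sketch. It is illustrative only: the size, indices, and values are assumptions, the exact values returned by apply can vary, and releases that include the SPARK-16965 fix validate the indices up front instead of constructing silently.
{code}
// Illustrative sketch: the indices array is deliberately unsorted, violating
// the documented precondition of Vectors.sparse.
import org.apache.spark.mllib.linalg.Vectors

val v = Vectors.sparse(3, Array(2, 0), Array(5.0, 7.0))

// apply goes through a binary search that assumes sorted indices, so some
// lookups can silently miss and fall back to 0.0 ...
println(v(0))                      // may print 0.0 instead of 7.0
// ... while toArray simply writes each active value at its index, so it
// still reflects what was passed in.
println(v.toArray.mkString(", "))  // 7.0, 0.0, 5.0
{code}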