[jira] [Resolved] (SPARK-7228) SparkR public API for 1.4 release

2015-05-08 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-7228.
--
Resolution: Fixed

Resolving as all the sub-tasks have now been resolved.

> SparkR public API for 1.4 release
> -
>
> Key: SPARK-7228
> URL: https://issues.apache.org/jira/browse/SPARK-7228
> Project: Spark
>  Issue Type: Umbrella
>  Components: SparkR
>Affects Versions: 1.4.0
>Reporter: Shivaram Venkataraman
>Assignee: Shivaram Venkataraman
>Priority: Blocker
>
> This in an umbrella ticket to track the public APIs and documentation to be 
> released as a part of SparkR in the 1.4 release.






[jira] [Commented] (SPARK-3928) Support wildcard matches on Parquet files

2015-05-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14536227#comment-14536227
 ] 

Apache Spark commented on SPARK-3928:
-

User 'tkyaw' has created a pull request for this issue:
https://github.com/apache/spark/pull/6025

> Support wildcard matches on Parquet files
> -
>
> Key: SPARK-3928
> URL: https://issues.apache.org/jira/browse/SPARK-3928
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Reporter: Nicholas Chammas
>Assignee: Cheng Lian
>Priority: Minor
> Fix For: 1.3.0
>
>
> {{SparkContext.textFile()}} supports patterns like {{part-*}} and 
> {{2014-\?\?-\?\?}}. 
> It would be nice if {{SparkContext.parquetFile()}} did the same.
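
For readers unfamiliar with the glob behavior being requested, a minimal
sketch (illustrative paths; `sc` is an assumed live SparkContext):

{code}
// textFile() paths are expanded by Hadoop's glob handling, so shell-style
// patterns match before the RDD is built:
val parts = sc.textFile("hdfs:///logs/part-*")      // part-00000, part-00001, ...
val days  = sc.textFile("hdfs:///logs/2014-??-??")  // ? matches one character

// The request is the same expansion for Parquet input, e.g.
// sqlContext.parquetFile("hdfs:///warehouse/events/part-*")
{code}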






[jira] [Closed] (SPARK-7468) DAG visualization: certain action operators should not be scopes

2015-05-08 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or closed SPARK-7468.

Resolution: Won't Fix

It's hard to apply a general rule that decides whether an operation should be 
marked "withScope". Even an action may be marked as such in case we make it use 
another operation internally in the future (e.g. if we change the `take` 
implementation to use an arbitrary `map` or `filter` somewhere).

Closing as Won't Fix because, in the worst case, we show a higher-level 
construct that is familiar to the user rather than a low-level operation used 
internally (e.g. otherwise `map` shows up in the UI even though the user 
didn't explicitly call `map`).
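
For context, a toy model of the scoping mechanism under discussion (not
Spark's actual classes; a sketch of the idea only):

{code}
// A body executed inside withScope is tagged with one named scope, which the
// DAG visualization uses to group the RDDs created within it.
object ScopeDemo {
  private val current = new ThreadLocal[Option[String]] {
    override def initialValue: Option[String] = None
  }

  def withScope[T](name: String)(body: => T): T = {
    val prev = current.get
    current.set(Some(name))  // operations created inside inherit this scope
    try body finally current.set(prev)
  }

  def scopeOf: Option[String] = current.get
}

// If `take` were wrapped in withScope("take"), a `map` it uses internally
// would be nested under the "take" scope instead of showing up bare in the UI.
{code}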

> DAG visualization: certain action operators should not be scopes
> 
>
> Key: SPARK-7468
> URL: https://issues.apache.org/jira/browse/SPARK-7468
> Project: Spark
>  Issue Type: Sub-task
>  Components: Web UI
>Affects Versions: 1.4.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>
> What does it mean to have a "take" scope and an RDD in it? This is somewhat 
> confusing. Low hanging fruit.






[jira] [Resolved] (SPARK-7498) Params.setDefault should not use varargs annotation

2015-05-08 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-7498.
--
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 6021
[https://github.com/apache/spark/pull/6021]

> Params.setDefault should not use varargs annotation
> ---
>
> Key: SPARK-7498
> URL: https://issues.apache.org/jira/browse/SPARK-7498
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, ML
>Affects Versions: 1.4.0
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
> Fix For: 1.4.0
>
>
> In [SPARK-7429] and PR [https://github.com/apache/spark/pull/5960], I added 
> the varargs annotation to Params.setDefault which takes a variable number of 
> ParamPairs.  It worked locally and on Jenkins for me.
> However, @mengxr reported issues compiling on his machine.  So I'm reverting 
> the change introduced in [https://github.com/apache/spark/pull/5960] by 
> removing varargs.






[jira] [Updated] (SPARK-7262) Binary LogisticRegression with L1/L2 (elastic net) using OWLQN in new ML package

2015-05-08 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-7262:
-
Assignee: DB Tsai

> Binary LogisticRegression with L1/L2 (elastic net) using OWLQN in new ML 
> package
> 
>
> Key: SPARK-7262
> URL: https://issues.apache.org/jira/browse/SPARK-7262
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: DB Tsai
>Assignee: DB Tsai
> Fix For: 1.4.0
>
>
> 1) Handle scaling and addBias internally. 
> 2) L1/L2 elastic net using the OWLQN optimizer.
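
A usage sketch of the resulting API, assuming the Spark 1.4 `ml` package and
an existing DataFrame `training` of (label, features):

{code}
import org.apache.spark.ml.classification.LogisticRegression

// elasticNetParam = 0.0 is pure L2, 1.0 is pure L1; values in between blend
// the two penalties, optimized with OWLQN when an L1 term is present.
val lr = new LogisticRegression()
  .setMaxIter(100)
  .setRegParam(0.01)
  .setElasticNetParam(0.5)

val model = lr.fit(training)
{code}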






[jira] [Resolved] (SPARK-7262) Binary LogisticRegression with L1/L2 (elastic net) using OWLQN in new ML package

2015-05-08 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-7262.
--
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 5967
[https://github.com/apache/spark/pull/5967]

> Binary LogisticRegression with L1/L2 (elastic net) using OWLQN in new ML 
> package
> 
>
> Key: SPARK-7262
> URL: https://issues.apache.org/jira/browse/SPARK-7262
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: DB Tsai
> Fix For: 1.4.0
>
>
> 1) Handle scaling and addBias internally. 
> 2) L1/L2 elastic net using the OWLQN optimizer.






[jira] [Updated] (SPARK-7463) Dag visualization improvements

2015-05-08 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-7463:
-
Target Version/s: 1.4.0, 1.5.0  (was: 1.5.0)

> Dag visualization improvements
> --
>
> Key: SPARK-7463
> URL: https://issues.apache.org/jira/browse/SPARK-7463
> Project: Spark
>  Issue Type: Umbrella
>  Components: Web UI
>Affects Versions: 1.4.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>
> This is the umbrella JIRA for improvements or bug fixes to the DAG 
> visualization.






[jira] [Created] (SPARK-7502) DAG visualization: handle removed stages gracefully

2015-05-08 Thread Andrew Or (JIRA)
Andrew Or created SPARK-7502:


 Summary: DAG visualization: handle removed stages gracefully
 Key: SPARK-7502
 URL: https://issues.apache.org/jira/browse/SPARK-7502
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 1.4.0
Reporter: Andrew Or
Assignee: Andrew Or


Right now we get a blank visualization on the job page if this happens, and 
the JS error message in the developer console looks something like "Warning: 
SVG view box cannot be 'NaN NaN NaN NaN'".






[jira] [Commented] (SPARK-3928) Support wildcard matches on Parquet files

2015-05-08 Thread Yana Kadiyska (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14536114#comment-14536114
 ] 

Yana Kadiyska commented on SPARK-3928:
--

[~tkyaw] Your suggested workaround does work. One question though -- what are 
the implications of turning off "spark.sql.parquet.useDataSourceApi"? My 
particular concern is with predicate pushdown into Parquet -- am I going to 
lose it? (It's hard to tell from the UI whether pushdown is happening 
correctly.)

Also, can you clarify whether you still plan to fix this for 1.4, or whether 
"New parquet implementation does not contain wild card support yet" means we'd 
have to live with turning off spark.sql.parquet.useDataSourceApi until further 
notice?
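
For reference, the workaround under discussion amounts to the following
(assuming a Spark 1.3.x `sqlContext`; the path is illustrative):

{code}
// Fall back to the old Parquet code path, which still expands globs:
sqlContext.setConf("spark.sql.parquet.useDataSourceApi", "false")
val df = sqlContext.parquetFile("hdfs:///warehouse/events/part-*")
{code}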

> Support wildcard matches on Parquet files
> -
>
> Key: SPARK-3928
> URL: https://issues.apache.org/jira/browse/SPARK-3928
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Reporter: Nicholas Chammas
>Assignee: Cheng Lian
>Priority: Minor
> Fix For: 1.3.0
>
>
> {{SparkContext.textFile()}} supports patterns like {{part-*}} and 
> {{2014-\?\?-\?\?}}. 
> It would be nice if {{SparkContext.parquetFile()}} did the same.






[jira] [Updated] (SPARK-7501) DAG visualization: show DStream operations for Streaming

2015-05-08 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-7501:
-
Description: Similar to SQL in SPARK-7469, we should show higher level 
constructs that the user is more familiar with.  (was: Similar to SQL, we 
should show higher level constructs that the user is more familiar with.)

> DAG visualization: show DStream operations for Streaming
> 
>
> Key: SPARK-7501
> URL: https://issues.apache.org/jira/browse/SPARK-7501
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming, Web UI
>Affects Versions: 1.4.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>
> Similar to SQL in SPARK-7469, we should show higher level constructs that the 
> user is more familiar with.






[jira] [Created] (SPARK-7501) DAG visualization: show DStream operations for Streaming

2015-05-08 Thread Andrew Or (JIRA)
Andrew Or created SPARK-7501:


 Summary: DAG visualization: show DStream operations for Streaming
 Key: SPARK-7501
 URL: https://issues.apache.org/jira/browse/SPARK-7501
 Project: Spark
  Issue Type: Bug
  Components: Streaming, Web UI
Affects Versions: 1.4.0
Reporter: Andrew Or
Assignee: Andrew Or


Similar to SQL, we should show higher level constructs that the user is more 
familiar with.






[jira] [Updated] (SPARK-7500) DAG visualization: cluster name bleeds beyond the cluster

2015-05-08 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-7500:
-
Attachment: long names.png

> DAG visualization: cluster name bleeds beyond the cluster
> -
>
> Key: SPARK-7500
> URL: https://issues.apache.org/jira/browse/SPARK-7500
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.4.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>Priority: Minor
> Attachments: long names.png
>
>
> This happens only for long names. See screenshot.






[jira] [Resolved] (SPARK-7375) Avoid defensive copying in SQL exchange operator when sort-based shuffle buffers data in serialized form

2015-05-08 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-7375.
-
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 5948
[https://github.com/apache/spark/pull/5948]

> Avoid defensive copying in SQL exchange operator when sort-based shuffle 
> buffers data in serialized form
> 
>
> Key: SPARK-7375
> URL: https://issues.apache.org/jira/browse/SPARK-7375
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Josh Rosen
>Assignee: Josh Rosen
> Fix For: 1.4.0
>
>
> The original sort-based shuffle buffers shuffle input records in memory while 
> sorting them. This causes problems when mutable records are presented to the 
> shuffle, which happens in Spark SQL's Exchange operator. To work around this 
> issue, SPARK-2967 and SPARK-4479 added defensive copying of shuffle inputs in 
> the Exchange operator when sort-based shuffle is enabled.
> I think that [~sandyr]'s recent patch for enabling serialization of records 
> in sort-based shuffle (SPARK-4550) and my proposed {{unsafe}}-based shuffle 
> path (SPARK-7081) may allow us to avoid this defensive copying in certain 
> cases (since our patches cause records to be serialized one-at-a-time and 
> remove the buffering of deserialized records).
> As mentioned in SPARK-4479, a long-term fix for this issue might be to add 
> hooks for informing the shuffle about object (im)mutability in order to allow 
> the shuffle layer to decide whether to copy. In the meantime, though, I think 
> that we should just extend the checks added in SPARK-4479 to avoid copies 
> when these new serialized sort paths are used.
> /cc [~rxin] [~marmbrus] [~yhuai]
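
A minimal sketch of the defensive copy in question (illustrative; `rows`
stands in for an operator's output of mutable SQL rows):

{code}
// Sort-based shuffle buffers incoming records, but SQL operators reuse one
// mutable row per partition, so every buffered reference would otherwise
// point at the same (latest) row. Hence the defensive copy:
val safeRows = rows.mapPartitions { iter => iter.map(_.copy()) }

// The serialized sort paths write each record out one at a time, which is
// why the copy can be skipped when they are in use.
{code}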






[jira] [Created] (SPARK-7500) DAG visualization: cluster name bleeds beyond the cluster

2015-05-08 Thread Andrew Or (JIRA)
Andrew Or created SPARK-7500:


 Summary: DAG visualization: cluster name bleeds beyond the cluster
 Key: SPARK-7500
 URL: https://issues.apache.org/jira/browse/SPARK-7500
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 1.4.0
Reporter: Andrew Or
Assignee: Andrew Or
Priority: Minor
 Attachments: long names.png

This happens only for long names. See screenshot.






[jira] [Closed] (SPARK-7490) MapOutputTracker: close input streams to free native memory

2015-05-08 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or closed SPARK-7490.

  Resolution: Fixed
Target Version/s: 1.4.0

> MapOutputTracker: close input streams to free native memory
> ---
>
> Key: SPARK-7490
> URL: https://issues.apache.org/jira/browse/SPARK-7490
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Evan Jones
>Assignee: Evan Jones
>Priority: Minor
> Fix For: 1.4.0
>
>
> GZIPInputStream allocates native memory that is not freed until close() or 
> when the finalizer runs. It is best to close() these streams explicitly to 
> avoid native memory leaks.
> Pull request here: https://github.com/apache/spark/pull/5982
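
The pattern being fixed, in miniature (a self-contained sketch; the payload
and its round-trip are placeholders for the real map-status bytes):

{code}
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import java.util.zip.{GZIPInputStream, GZIPOutputStream}

// Round-trip some bytes just to have a valid stream to read:
val baos = new ByteArrayOutputStream()
val gz = new GZIPOutputStream(baos)
gz.write("map output statuses".getBytes("UTF-8")); gz.close()

val in = new GZIPInputStream(new ByteArrayInputStream(baos.toByteArray))
try {
  val payload = Iterator.continually(in.read()).takeWhile(_ != -1).map(_.toByte).toArray
  // ... deserialize `payload` as the real code does ...
} finally {
  // Close explicitly so the native zlib buffers are freed now, rather than
  // whenever the finalizer eventually runs.
  in.close()
}
{code}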






[jira] [Reopened] (SPARK-7490) MapOutputTracker: close input streams to free native memory

2015-05-08 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or reopened SPARK-7490:
--

> MapOutputTracker: close input streams to free native memory
> ---
>
> Key: SPARK-7490
> URL: https://issues.apache.org/jira/browse/SPARK-7490
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Evan Jones
>Assignee: Evan Jones
>Priority: Minor
> Fix For: 1.4.0
>
>
> GZIPInputStream allocates native memory that is not freed until close() or 
> when the finalizer runs. It is best to close() these streams explicitly to 
> avoid native memory leaks.
> Pull request here: https://github.com/apache/spark/pull/5982






[jira] [Updated] (SPARK-7490) MapOutputTracker: close input streams to free native memory

2015-05-08 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-7490:
-
Fix Version/s: (was: 1.2.3)
   (was: 1.3.2)

> MapOutputTracker: close input streams to free native memory
> ---
>
> Key: SPARK-7490
> URL: https://issues.apache.org/jira/browse/SPARK-7490
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Evan Jones
>Assignee: Evan Jones
>Priority: Minor
> Fix For: 1.4.0
>
>
> GZIPInputStream allocates native memory that is not freed until close() or 
> when the finalizer runs. It is best to close() these streams explicitly to 
> avoid native memory leaks.
> Pull request here: https://github.com/apache/spark/pull/5982






[jira] [Assigned] (SPARK-7427) Make sharedParams match in Scala, Python

2015-05-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7427:
---

Assignee: (was: Apache Spark)

> Make sharedParams match in Scala, Python
> 
>
> Key: SPARK-7427
> URL: https://issues.apache.org/jira/browse/SPARK-7427
> Project: Spark
>  Issue Type: Documentation
>  Components: ML, PySpark
>Reporter: Joseph K. Bradley
>Priority: Trivial
>  Labels: starter
>
> The documentation for shared Params differs a little between Scala, Python.  
> The Python docs should be modified to match the Scala ones.  This will 
> require modifying the sharedParamsCodeGen files.






[jira] [Assigned] (SPARK-7427) Make sharedParams match in Scala, Python

2015-05-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7427:
---

Assignee: Apache Spark

> Make sharedParams match in Scala, Python
> 
>
> Key: SPARK-7427
> URL: https://issues.apache.org/jira/browse/SPARK-7427
> Project: Spark
>  Issue Type: Documentation
>  Components: ML, PySpark
>Reporter: Joseph K. Bradley
>Assignee: Apache Spark
>Priority: Trivial
>  Labels: starter
>
> The documentation for shared Params differs a little between Scala, Python.  
> The Python docs should be modified to match the Scala ones.  This will 
> require modifying the sharedParamsCodeGen files.






[jira] [Commented] (SPARK-7427) Make sharedParams match in Scala, Python

2015-05-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14536108#comment-14536108
 ] 

Apache Spark commented on SPARK-7427:
-

User 'gweidner' has created a pull request for this issue:
https://github.com/apache/spark/pull/6023

> Make sharedParams match in Scala, Python
> 
>
> Key: SPARK-7427
> URL: https://issues.apache.org/jira/browse/SPARK-7427
> Project: Spark
>  Issue Type: Documentation
>  Components: ML, PySpark
>Reporter: Joseph K. Bradley
>Priority: Trivial
>  Labels: starter
>
> The documentation for shared Params differs a little between Scala, Python.  
> The Python docs should be modified to match the Scala ones.  This will 
> require modifying the sharedParamsCodeGen files.






[jira] [Updated] (SPARK-7231) Make SparkR DataFrame API more dplyr friendly

2015-05-08 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-7231:
-
Fix Version/s: 1.4.0

> Make SparkR DataFrame API more dplyr friendly
> -
>
> Key: SPARK-7231
> URL: https://issues.apache.org/jira/browse/SPARK-7231
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 1.4.0
>Reporter: Shivaram Venkataraman
>Assignee: Shivaram Venkataraman
>Priority: Critical
> Fix For: 1.4.0
>
>
> This ticket tracks auditing the SparkR dataframe API and ensuring that the 
> API is friendly to existing R users. 
> Mainly we wish to make sure the DataFrame API we expose has functions similar 
> to those which exist on native R data frames and in popular packages like 
> `dplyr`. 






[jira] [Commented] (SPARK-7231) Make SparkR DataFrame API more dplyr friendly

2015-05-08 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14536091#comment-14536091
 ] 

Shivaram Venkataraman commented on SPARK-7231:
--

Fixed by https://github.com/apache/spark/pull/6005

> Make SparkR DataFrame API more dplyr friendly
> -
>
> Key: SPARK-7231
> URL: https://issues.apache.org/jira/browse/SPARK-7231
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 1.4.0
>Reporter: Shivaram Venkataraman
>Assignee: Shivaram Venkataraman
>Priority: Critical
>
> This ticket tracks auditing the SparkR dataframe API and ensuring that the 
> API is friendly to existing R users. 
> Mainly we wish to make sure the DataFrame API we expose has functions similar 
> to those which exist on native R data frames and in popular packages like 
> `dplyr`. 






[jira] [Resolved] (SPARK-7231) Make SparkR DataFrame API more dplyr friendly

2015-05-08 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-7231.
--
Resolution: Fixed

> Make SparkR DataFrame API more dplyr friendly
> -
>
> Key: SPARK-7231
> URL: https://issues.apache.org/jira/browse/SPARK-7231
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 1.4.0
>Reporter: Shivaram Venkataraman
>Assignee: Shivaram Venkataraman
>Priority: Critical
>
> This ticket tracks auditing the SparkR dataframe API and ensuring that the 
> API is friendly to existing R users. 
> Mainly we wish to make sure the DataFrame API we expose has functions similar 
> to those which exist on native R data frames and in popular packages like 
> `dplyr`. 






[jira] [Commented] (SPARK-7497) test_count_by_value_and_window is flaky

2015-05-08 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14536060#comment-14536060
 ] 

Shivaram Venkataraman commented on SPARK-7497:
--

I've seen this a couple of times very recently as well.

> test_count_by_value_and_window is flaky
> ---
>
> Key: SPARK-7497
> URL: https://issues.apache.org/jira/browse/SPARK-7497
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Streaming
>Affects Versions: 1.4.0
>Reporter: Xiangrui Meng
>Assignee: Tathagata Das
>  Labels: flaky-test
>
> Saw this test failure in 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32268/console
> {code}
> ==
> FAIL: test_count_by_value_and_window (__main__.WindowFunctionTests)
> --
> Traceback (most recent call last):
>   File "pyspark/streaming/tests.py", line 418, in 
> test_count_by_value_and_window
> self._test_func(input, func, expected)
>   File "pyspark/streaming/tests.py", line 133, in _test_func
> self.assertEqual(expected, result)
> AssertionError: Lists differ: [[1], [2], [3], [4], [5], [6], [6], [6], [6], 
> [6]] != [[1], [2], [3], [4], [5], [6], [6], [6]]
> First list contains 2 additional elements.
> First extra element 8:
> [6]
> - [[1], [2], [3], [4], [5], [6], [6], [6], [6], [6]]
> ? --
> + [[1], [2], [3], [4], [5], [6], [6], [6]]
> --
> {code}






[jira] [Assigned] (SPARK-7473) Use reservoir sample in RandomForest when choosing features per node

2015-05-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7473:
---

Assignee: (was: Apache Spark)

> Use reservoir sample in RandomForest when choosing features per node
> 
>
> Key: SPARK-7473
> URL: https://issues.apache.org/jira/browse/SPARK-7473
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Joseph K. Bradley
>Priority: Trivial
>
> See sampling in selectNodesToSplit method
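
For reference, a self-contained sketch of reservoir sampling (algorithm R),
the technique being proposed here:

{code}
import scala.reflect.ClassTag
import scala.util.Random

/** Uniformly sample k items from a stream of unknown length in one pass. */
def reservoirSample[T: ClassTag](input: Iterator[T], k: Int, rng: Random): Array[T] = {
  val reservoir = new Array[T](k)
  var i = 0
  while (input.hasNext) {
    val item = input.next()
    if (i < k) {
      reservoir(i) = item
    } else {
      // Replace a random slot with probability k / (i + 1).
      val j = rng.nextInt(i + 1)
      if (j < k) reservoir(j) = item
    }
    i += 1
  }
  if (i < k) reservoir.take(i) else reservoir
}
{code}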






[jira] [Commented] (SPARK-7473) Use reservoir sample in RandomForest when choosing features per node

2015-05-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14536058#comment-14536058
 ] 

Apache Spark commented on SPARK-7473:
-

User 'AiHe' has created a pull request for this issue:
https://github.com/apache/spark/pull/5988

> Use reservoir sample in RandomForest when choosing features per node
> 
>
> Key: SPARK-7473
> URL: https://issues.apache.org/jira/browse/SPARK-7473
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Joseph K. Bradley
>Priority: Trivial
>
> See sampling in selectNodesToSplit method






[jira] [Assigned] (SPARK-7473) Use reservoir sample in RandomForest when choosing features per node

2015-05-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7473:
---

Assignee: Apache Spark

> Use reservoir sample in RandomForest when choosing features per node
> 
>
> Key: SPARK-7473
> URL: https://issues.apache.org/jira/browse/SPARK-7473
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Joseph K. Bradley
>Assignee: Apache Spark
>Priority: Trivial
>
> See sampling in selectNodesToSplit method






[jira] [Commented] (SPARK-7485) Remove python artifacts from the assembly jar

2015-05-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14536026#comment-14536026
 ] 

Apache Spark commented on SPARK-7485:
-

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/6022

> Remove python artifacts from the assembly jar
> -
>
> Key: SPARK-7485
> URL: https://issues.apache.org/jira/browse/SPARK-7485
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.4.0
>Reporter: Thomas Graves
>
> We changed it so that we distribute the Python files via a zip file in 
> SPARK-6869.  With that, we should remove the Python files from the assembly 
> jar.






[jira] [Assigned] (SPARK-7485) Remove python artifacts from the assembly jar

2015-05-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7485:
---

Assignee: (was: Apache Spark)

> Remove python artifacts from the assembly jar
> -
>
> Key: SPARK-7485
> URL: https://issues.apache.org/jira/browse/SPARK-7485
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.4.0
>Reporter: Thomas Graves
>
> We changed it so that we distribute the Python files via a zip file in 
> SPARK-6869.  With that, we should remove the Python files from the assembly 
> jar.






[jira] [Assigned] (SPARK-7485) Remove python artifacts from the assembly jar

2015-05-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7485:
---

Assignee: Apache Spark

> Remove python artifacts from the assembly jar
> -
>
> Key: SPARK-7485
> URL: https://issues.apache.org/jira/browse/SPARK-7485
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.4.0
>Reporter: Thomas Graves
>Assignee: Apache Spark
>
> We changed it so that we distribute the Python files via a zip file in 
> SPARK-6869.  With that, we should remove the Python files from the assembly 
> jar.






[jira] [Resolved] (SPARK-7488) Python API for ml.recommendation

2015-05-08 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-7488.
--
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 6015
[https://github.com/apache/spark/pull/6015]

> Python API for ml.recommendation
> 
>
> Key: SPARK-7488
> URL: https://issues.apache.org/jira/browse/SPARK-7488
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Reporter: Burak Yavuz
> Fix For: 1.4.0
>
>







[jira] [Assigned] (SPARK-7498) Params.setDefault should not use varargs annotation

2015-05-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7498:
---

Assignee: Apache Spark  (was: Joseph K. Bradley)

> Params.setDefault should not use varargs annotation
> ---
>
> Key: SPARK-7498
> URL: https://issues.apache.org/jira/browse/SPARK-7498
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, ML
>Affects Versions: 1.4.0
>Reporter: Joseph K. Bradley
>Assignee: Apache Spark
>
> In [SPARK-7429] and PR [https://github.com/apache/spark/pull/5960], I added 
> the varargs annotation to Params.setDefault which takes a variable number of 
> ParamPairs.  It worked locally and on Jenkins for me.
> However, @mengxr reported issues compiling on his machine.  So I'm reverting 
> the change introduced in [https://github.com/apache/spark/pull/5960] by 
> removing varargs.






[jira] [Assigned] (SPARK-7498) Params.setDefault should not use varargs annotation

2015-05-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7498:
---

Assignee: Joseph K. Bradley  (was: Apache Spark)

> Params.setDefault should not use varargs annotation
> ---
>
> Key: SPARK-7498
> URL: https://issues.apache.org/jira/browse/SPARK-7498
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, ML
>Affects Versions: 1.4.0
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>
> In [SPARK-7429] and PR [https://github.com/apache/spark/pull/5960], I added 
> the varargs annotation to Params.setDefault which takes a variable number of 
> ParamPairs.  It worked locally and on Jenkins for me.
> However, @mengxr reported issues compiling on his machine.  So I'm reverting 
> the change introduced in [https://github.com/apache/spark/pull/5960] by 
> removing varargs.






[jira] [Commented] (SPARK-7498) Params.setDefault should not use varargs annotation

2015-05-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14536019#comment-14536019
 ] 

Apache Spark commented on SPARK-7498:
-

User 'jkbradley' has created a pull request for this issue:
https://github.com/apache/spark/pull/6021

> Params.setDefault should not use varargs annotation
> ---
>
> Key: SPARK-7498
> URL: https://issues.apache.org/jira/browse/SPARK-7498
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, ML
>Affects Versions: 1.4.0
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>
> In [SPARK-7429] and PR [https://github.com/apache/spark/pull/5960], I added 
> the varargs annotation to Params.setDefault which takes a variable number of 
> ParamPairs.  It worked locally and on Jenkins for me.
> However, @mengxr reported issues compiling on his machine.  So I'm reverting 
> the change introduced in [https://github.com/apache/spark/pull/5960] by 
> removing varargs.






[jira] [Updated] (SPARK-7488) Python API for ml.recommendation

2015-05-08 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-7488:
-
Assignee: Burak Yavuz

> Python API for ml.recommendation
> 
>
> Key: SPARK-7488
> URL: https://issues.apache.org/jira/browse/SPARK-7488
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Reporter: Burak Yavuz
>Assignee: Burak Yavuz
> Fix For: 1.4.0
>
>







[jira] [Created] (SPARK-7499) Investigate how to specify columns in SparkR without $ or strings

2015-05-08 Thread Shivaram Venkataraman (JIRA)
Shivaram Venkataraman created SPARK-7499:


 Summary: Investigate how to specify columns in SparkR without $ or 
strings
 Key: SPARK-7499
 URL: https://issues.apache.org/jira/browse/SPARK-7499
 Project: Spark
  Issue Type: Improvement
  Components: SparkR
Reporter: Shivaram Venkataraman


Right now in SparkR we need to specify columns using `$` or strings. For 
example, to run a select we would do

{code}
df1 <- select(df, df$age > 10)
{code}

It would be good to infer the set of columns in a dataframe automatically and 
resolve symbols for column names. For example

{code} 
df1 <- select(df, age > 10)
{code}

One way to do this is to build an environment mapping column names to column 
handles and then use `substitute(arg, env = columnNameEnv)`.






[jira] [Closed] (SPARK-6955) Do not let Yarn Shuffle Server retry its server port.

2015-05-08 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or closed SPARK-6955.

   Resolution: Fixed
Fix Version/s: 1.4.0
 Assignee: Aaron Davidson  (was: SaintBacchus)

> Do not let Yarn Shuffle Server retry its server port.
> -
>
> Key: SPARK-6955
> URL: https://issues.apache.org/jira/browse/SPARK-6955
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, YARN
>Affects Versions: 1.2.0
>Reporter: SaintBacchus
>Assignee: Aaron Davidson
>Priority: Minor
> Fix For: 1.4.0
>
>
>  It's better to let the NodeManager go down rather than retry on another 
> port when `spark.shuffle.service.port` conflicts while starting the Spark 
> Yarn Shuffle Server, because the port retry makes the shuffle port 
> inconsistent and causes clients to fail to find it.






[jira] [Commented] (SPARK-7427) Make sharedParams match in Scala, Python

2015-05-08 Thread Glenn Weidner (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14535905#comment-14535905
 ] 

Glenn Weidner commented on SPARK-7427:
--

Thank you!  I see the mismatch now - for example:

HasMaxIter in Scala:
"max number of iterations (>= 0)"

HasMaxIter in Python:
"max number of iterations"

I'll modify _shared_params_code_gen.py based on doc strings in 
SharedParamsCodeGen.scala and regenerate shared.py.

> Make sharedParams match in Scala, Python
> 
>
> Key: SPARK-7427
> URL: https://issues.apache.org/jira/browse/SPARK-7427
> Project: Spark
>  Issue Type: Documentation
>  Components: ML, PySpark
>Reporter: Joseph K. Bradley
>Priority: Trivial
>  Labels: starter
>
> The documentation for shared Params differs a little between Scala, Python.  
> The Python docs should be modified to match the Scala ones.  This will 
> require modifying the sharedParamsCodeGen files.






[jira] [Created] (SPARK-7498) Params.setDefault should not use varargs annotation

2015-05-08 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-7498:


 Summary: Params.setDefault should not use varargs annotation
 Key: SPARK-7498
 URL: https://issues.apache.org/jira/browse/SPARK-7498
 Project: Spark
  Issue Type: Bug
  Components: Java API, ML
Affects Versions: 1.4.0
Reporter: Joseph K. Bradley
Assignee: Joseph K. Bradley


In [SPARK-7429] and PR [https://github.com/apache/spark/pull/5960], I added the 
varargs annotation to Params.setDefault which takes a variable number of 
ParamPairs.  It worked locally and on Jenkins for me.

However, @mengxr reported issues compiling on his machine.  So I'm reverting 
the change introduced in [https://github.com/apache/spark/pull/5960] by 
removing varargs.
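
For readers unfamiliar with the annotation, a minimal sketch of what is being
reverted (names illustrative, not the real ml.param classes):

{code}
import scala.annotation.varargs

case class ParamPair(name: String, value: Any)  // stand-in for ml.param.ParamPair

class Defaults {
  private var defaults = Map.empty[String, Any]

  // @varargs makes the compiler emit a Java-friendly setDefault(ParamPair...)
  // overload; removing the annotation avoids the reported compile issue at
  // the cost of Java callers seeing a Seq[ParamPair] parameter instead.
  @varargs
  def setDefault(pairs: ParamPair*): this.type = {
    defaults ++= pairs.map(p => p.name -> p.value)
    this
  }
}
{code}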






[jira] [Resolved] (SPARK-5913) Python API for ChiSqSelector

2015-05-08 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-5913.
--
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 5939
[https://github.com/apache/spark/pull/5939]

> Python API for ChiSqSelector
> 
>
> Key: SPARK-5913
> URL: https://issues.apache.org/jira/browse/SPARK-5913
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib, PySpark
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>Assignee: Yanbo Liang
>Priority: Minor
> Fix For: 1.4.0
>
>
> Add a Python API for mllib.feature.ChiSqSelector






[jira] [Commented] (SPARK-7427) Make sharedParams match in Scala, Python

2015-05-08 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14535800#comment-14535800
 ] 

Joseph K. Bradley commented on SPARK-7427:
--

Each parameter has built-in documentation passed to the Param constructor.  
That doc should match between Scala and Python.
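
For example, a toy model of a generated shared param (the real code lives in
sharedParams.scala; names and the Param shape here are illustrative):

{code}
// The doc string below is the Scala text that Python's shared.py should
// mirror after regenerating from _shared_params_code_gen.py:
case class Param(name: String, doc: String)

trait HasMaxIter {
  final val maxIter = Param("maxIter", "max number of iterations (>= 0)")
}
{code}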

> Make sharedParams match in Scala, Python
> 
>
> Key: SPARK-7427
> URL: https://issues.apache.org/jira/browse/SPARK-7427
> Project: Spark
>  Issue Type: Documentation
>  Components: ML, PySpark
>Reporter: Joseph K. Bradley
>Priority: Trivial
>  Labels: starter
>
> The documentation for shared Params differs a little between Scala, Python.  
> The Python docs should be modified to match the Scala ones.  This will 
> require modifying the sharedParamsCodeGen files.






[jira] [Created] (SPARK-7497) test_count_by_value_and_window is flaky

2015-05-08 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-7497:


 Summary: test_count_by_value_and_window is flaky
 Key: SPARK-7497
 URL: https://issues.apache.org/jira/browse/SPARK-7497
 Project: Spark
  Issue Type: Bug
  Components: PySpark, Streaming
Affects Versions: 1.4.0
Reporter: Xiangrui Meng
Assignee: Tathagata Das


Saw this test failure in 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32268/console

{code}
==
FAIL: test_count_by_value_and_window (__main__.WindowFunctionTests)
--
Traceback (most recent call last):
  File "pyspark/streaming/tests.py", line 418, in test_count_by_value_and_window
self._test_func(input, func, expected)
  File "pyspark/streaming/tests.py", line 133, in _test_func
self.assertEqual(expected, result)
AssertionError: Lists differ: [[1], [2], [3], [4], [5], [6], [6], [6], [6], 
[6]] != [[1], [2], [3], [4], [5], [6], [6], [6]]

First list contains 2 additional elements.
First extra element 8:
[6]

- [[1], [2], [3], [4], [5], [6], [6], [6], [6], [6]]
? --

+ [[1], [2], [3], [4], [5], [6], [6], [6]]

--
{code}






[jira] [Resolved] (SPARK-4699) Make caseSensitive configurable in Analyzer.scala

2015-05-08 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-4699.
-
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 5806
[https://github.com/apache/spark/pull/5806]

> Make caseSensitive configurable in Analyzer.scala
> -
>
> Key: SPARK-4699
> URL: https://issues.apache.org/jira/browse/SPARK-4699
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: Jacky Li
> Fix For: 1.4.0
>
>
> Currently, case sensitivity is true by default in Analyzer. It should be 
> configurable via SQLConf in the client application.
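
With the fix, toggling it looks roughly like this (assuming the SQLConf key
introduced by this change is spark.sql.caseSensitive):

{code}
// Analyzer resolution becomes case-insensitive for this context:
sqlContext.setConf("spark.sql.caseSensitive", "false")
{code}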






[jira] [Commented] (SPARK-7407) Use uid and param name to identify a parameter instead of the param object

2015-05-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14535754#comment-14535754
 ] 

Apache Spark commented on SPARK-7407:
-

User 'mengxr' has created a pull request for this issue:
https://github.com/apache/spark/pull/6019

> Use uid and param name to identify a parameter instead of the param object
> --
>
> Key: SPARK-7407
> URL: https://issues.apache.org/jira/browse/SPARK-7407
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.4.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>
> Transferring parameter values from one to another have been the pain point in 
> the ML pipeline implementation. Because we use the param object as the key in 
> the param map, we have to correctly copy them when making a copy of the 
> transformer, estimator, and models. This becomes complicated when 
> meta-algorithms are involved. For example, in cross validation:
> {code}
> val cv = new CrossValidator()
>   .setEstimator(lr)
>   .setEstimatorParamMaps(epm)
> {code}
> When we make a copy of `cv` with extra params that contain estimator params,
> {code}
> cv.copy(ParamMap(cv.numFolds -> 3, lr.maxIter -> 10))
> {code}
> we need to make a copy of the `lr` object as well and map `epm` to use the 
> new param keys from the old `lr`. This is quite error-prone, especially if 
> the estimator itself is another meta-algorithm.
> Using uid + param name as the key in param maps and using the same uid in 
> copy (and between estimator/model pairs) would simplify the implementations. 
> We don't need to change the keys since the copied instance has the same id as 
> the original instance. And it is easier to find models from a fitted pipeline.
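
A toy sketch of the proposed keying (illustrative names only, not the actual
ParamMap implementation):

{code}
// Identify a param by (owner uid, param name) instead of object identity.
case class ParamKey(parentUid: String, name: String)

class UidParamMap {
  private val map = scala.collection.mutable.Map.empty[ParamKey, Any]

  def put(uid: String, name: String, value: Any): this.type = {
    map(ParamKey(uid, name)) = value
    this
  }

  // A copy of an estimator that keeps the same uid can look up its values
  // here without any key rewriting, which is the point of the proposal.
  def get(uid: String, name: String): Option[Any] = map.get(ParamKey(uid, name))
}
{code}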






[jira] [Commented] (SPARK-7427) Make sharedParams match in Scala, Python

2015-05-08 Thread Glenn Weidner (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14535701#comment-14535701
 ] 

Glenn Weidner commented on SPARK-7427:
--

I've also generated the scaladoc by running build/sbt unidoc to compare 
against the generated Python API docs.  When all the individual shared 
parameters (e.g., HasInputCol) in sharedParams.scala created by 
SharedParamsCodeGen.scala are private, no HTML is generated.  If they are 
public, the corresponding HTML is available in the browser along with Param, 
Params, ParamMap, etc. under org.apache.spark.ml.param.

[~josephkb] Can you provide a little more description regarding how the 
"documentation for shared Params differs" between Scala and Python?

I'm double-checking that my forked repository is in sync with the latest from 
master, since I only found sections for the feature, classification, tuning, 
and evaluation modules under pyspark.ml in my generated Python API docs.

> Make sharedParams match in Scala, Python
> 
>
> Key: SPARK-7427
> URL: https://issues.apache.org/jira/browse/SPARK-7427
> Project: Spark
>  Issue Type: Documentation
>  Components: ML, PySpark
>Reporter: Joseph K. Bradley
>Priority: Trivial
>  Labels: starter
>
> The documentation for shared Params differs a little between Scala, Python.  
> The Python docs should be modified to match the Scala ones.  This will 
> require modifying the sharedParamsCodeGen files.






[jira] [Comment Edited] (SPARK-3928) Support wildcard matches on Parquet files

2015-05-08 Thread Thu Kyaw (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14534582#comment-14534582
 ] 

Thu Kyaw edited comment on SPARK-3928 at 5/8/15 9:52 PM:
-

The new Parquet implementation does not contain wildcard support yet, but you 
can still use the old Parquet implementation to get wildcard support: just 
turn off the SQL configuration "spark.sql.parquet.useDataSourceApi" by setting 
it to false (it is true by default).


was (Author: tkyaw):
New parquet implementation does not contain wild card support yet, but you 
could still use old version parquet implementation to get wildcard support. 
Just turn off sql Configuration. "spark.sql.parquet.useDataSourceApi" ( to 
false by default it is true ).

> Support wildcard matches on Parquet files
> -
>
> Key: SPARK-3928
> URL: https://issues.apache.org/jira/browse/SPARK-3928
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Reporter: Nicholas Chammas
>Assignee: Cheng Lian
>Priority: Minor
> Fix For: 1.3.0
>
>
> {{SparkContext.textFile()}} supports patterns like {{part-*}} and 
> {{2014-\?\?-\?\?}}. 
> It would be nice if {{SparkContext.parquetFile()}} did the same.






[jira] [Comment Edited] (SPARK-3928) Support wildcard matches on Parquet files

2015-05-08 Thread Thu Kyaw (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14534582#comment-14534582
 ] 

Thu Kyaw edited comment on SPARK-3928 at 5/8/15 9:50 PM:
-

New parquet implementation does not contain wild card support yet, but you 
could still use old version parquet implementation to get wildcard support. 
Just turn off sql Configuration. "spark.sql.parquet.useDataSourceApi" ( to 
false by default it is true ).


was (Author: tkyaw):
Hello [~lian cheng] please let me know if you want me to work on adding back 
the glob support.

> Support wildcard matches on Parquet files
> -
>
> Key: SPARK-3928
> URL: https://issues.apache.org/jira/browse/SPARK-3928
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Reporter: Nicholas Chammas
>Assignee: Cheng Lian
>Priority: Minor
> Fix For: 1.3.0
>
>
> {{SparkContext.textFile()}} supports patterns like {{part-*}} and 
> {{2014-\?\?-\?\?}}. 
> It would be nice if {{SparkContext.parquetFile()}} did the same.






[jira] [Closed] (SPARK-7486) Add the streaming implementation for estimating quantiles and median

2015-05-08 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley closed SPARK-7486.

Resolution: Duplicate

> Add the streaming implementation for estimating quantiles and median
> 
>
> Key: SPARK-7486
> URL: https://issues.apache.org/jira/browse/SPARK-7486
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, SQL
>Reporter: Liang-Chi Hsieh
>
> Streaming implementations that can estimate quantiles and the median are 
> very useful for ML algorithms and data statistics. 
> Apache DataFu Pig has this kind of implementation. We can port it to Spark. 
> Please refer to: 
> http://datafu.incubator.apache.org/docs/datafu/getting-started.html
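
As a point of comparison, an exact two-heap running median is easy to write,
though unlike the DataFu-style estimators it uses O(n) memory (sketch; assumes
at least one value has been added before calling median):

{code}
import scala.collection.mutable.PriorityQueue

class RunningMedian {
  private val lower = PriorityQueue.empty[Double]                            // max-heap: lower half
  private val upper = PriorityQueue.empty[Double](Ordering[Double].reverse)  // min-heap: upper half

  def add(x: Double): Unit = {
    if (lower.isEmpty || x <= lower.head) lower.enqueue(x) else upper.enqueue(x)
    // Rebalance so sizes differ by at most one, with `lower` holding the extra.
    if (lower.size > upper.size + 1) upper.enqueue(lower.dequeue())
    else if (upper.size > lower.size) lower.enqueue(upper.dequeue())
  }

  def median: Double =
    if (lower.size == upper.size) (lower.head + upper.head) / 2.0 else lower.head
}
{code}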






[jira] [Commented] (SPARK-7486) Add the streaming implementation for estimating quantiles and median

2015-05-08 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14535642#comment-14535642
 ] 

Joseph K. Bradley commented on SPARK-7486:
--

OK, I'll close it as a duplicate.

> Add the streaming implementation for estimating quantiles and median
> 
>
> Key: SPARK-7486
> URL: https://issues.apache.org/jira/browse/SPARK-7486
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, SQL
>Reporter: Liang-Chi Hsieh
>
> Streaming implementations that can estimate quantiles and the median are 
> very useful for ML algorithms and data statistics. 
> Apache DataFu Pig has this kind of implementation. We can port it to Spark. 
> Please refer to: 
> http://datafu.incubator.apache.org/docs/datafu/getting-started.html






[jira] [Updated] (SPARK-1517) Publish nightly snapshots of documentation, maven artifacts, and binary builds

2015-05-08 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-1517:
--
Assignee: (was: Nicholas Chammas)

> Publish nightly snapshots of documentation, maven artifacts, and binary builds
> --
>
> Key: SPARK-1517
> URL: https://issues.apache.org/jira/browse/SPARK-1517
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Project Infra
>Reporter: Patrick Wendell
>Priority: Blocker
>
> Should be pretty easy to do with Jenkins. The only thing I can think of that 
> would be tricky is to set up credentials so that Jenkins can publish this 
> stuff somewhere on Apache infra.
> Ideally we don't want to have to put a private key on every Jenkins box 
> (since they are otherwise pretty stateless). One idea is to encrypt these 
> credentials with a passphrase and post them somewhere publicly visible. Then 
> the Jenkins build can download the credentials provided we set a passphrase 
> in an environment variable in Jenkins. There may be simpler solutions as well.






[jira] [Updated] (SPARK-7390) CovarianceCounter in StatFunctions might calculate incorrect result

2015-05-08 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-7390:
-
Assignee: Liang-Chi Hsieh

> CovarianceCounter in StatFunctions might calculate incorrect result
> ---
>
> Key: SPARK-7390
> URL: https://issues.apache.org/jira/browse/SPARK-7390
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
> Fix For: 1.4.0
>
>
> CovarianceCounter in StatFunctions has a merging stage. In this merge 
> function, the other CovarianceCounter object sometimes has a zero count, 
> which causes the final CovarianceCounter to produce an incorrect result.
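For readers unfamiliar with the failure mode: the standard pairwise merge for
co-moments divides by the combined count, so merging two empty counters yields
0/0 = NaN, which then poisons every later merge. A sketch of the guard, with
hypothetical field names (the real class lives alongside StatFunctions in
Spark SQL):

{code:scala}
class CovCounter {
  var xAvg = 0.0; var yAvg = 0.0
  var ck = 0.0     // co-moment: sum((x - xAvg) * (y - yAvg))
  var count = 0L

  def merge(other: CovCounter): this.type = {
    if (other.count > 0) {            // guard: merging an empty counter is a no-op;
      val total = count + other.count // when both sides are empty, total would be 0
      val dx = xAvg - other.xAvg      // and the averages below become 0/0 = NaN
      val dy = yAvg - other.yAvg
      ck += other.ck + dx * dy * count / total * other.count
      xAvg = (xAvg * count + other.xAvg * other.count) / total
      yAvg = (yAvg * count + other.yAvg * other.count) / total
      count = total
    }
    this
  }
}
{code}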



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-7390) CovarianceCounter in StatFunctions might calculate incorrect result

2015-05-08 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-7390.
--
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 5931
[https://github.com/apache/spark/pull/5931]

> CovarianceCounter in StatFunctions might calculate incorrect result
> ---
>
> Key: SPARK-7390
> URL: https://issues.apache.org/jira/browse/SPARK-7390
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Liang-Chi Hsieh
> Fix For: 1.4.0
>
>
> CovarianceCounter in StatFunctions has a merging stage. In this merge 
> function, the other CovarianceCounter object sometimes has a zero count, 
> which causes the final CovarianceCounter to produce an incorrect result.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7137) Add checkInputColumn back to Params and print more info

2015-05-08 Thread Rekha Joshi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14535624#comment-14535624
 ] 

Rekha Joshi commented on SPARK-7137:


Sorry [~gweidner], [~josephkb], I just saw it was unassigned when I created the 
patch. Thanks.

> Add checkInputColumn back to Params and print more info
> ---
>
> Key: SPARK-7137
> URL: https://issues.apache.org/jira/browse/SPARK-7137
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.4.0
>Reporter: Joseph K. Bradley
>Priority: Trivial
>
> In the PR for [https://issues.apache.org/jira/browse/SPARK-5957], 
> Params.checkInputColumn was moved to SchemaUtils and renamed to 
> checkColumnType.  The downside is that it no longer has access to the 
> parameter info, so it cannot state which input column parameter was incorrect.
> We should keep checkColumnType but also add checkInputColumn back to Params.  
> It should print out the parameter name and description.  Internally, it may 
> call checkColumnType.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2572) Can't delete local dir on executor automatically when running spark over Mesos.

2015-05-08 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-2572.
--
Resolution: Duplicate

> Can't delete local dir on executor automatically when running spark over 
> Mesos.
> ---
>
> Key: SPARK-2572
> URL: https://issues.apache.org/jira/browse/SPARK-2572
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 1.0.0
>Reporter: Yadong Qi
>Priority: Minor
>
> When running Spark over Mesos in “fine-grained” or “coarse-grained” mode, 
> the local dir (/tmp/spark-local-20140718114058-834c) on the executor is not 
> deleted automatically after the application finishes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7486) Add the streaming implementation for estimating quantiles and median

2015-05-08 Thread Burak Yavuz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14535616#comment-14535616
 ] 

Burak Yavuz commented on SPARK-7486:


Yes, this is a clone of SPARK-6760 and SPARK-7246 (kind of). However, it will 
be in Spark 1.5.

> Add the streaming implementation for estimating quantiles and median
> 
>
> Key: SPARK-7486
> URL: https://issues.apache.org/jira/browse/SPARK-7486
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, SQL
>Reporter: Liang-Chi Hsieh
>
> Streaming implementations that can estimate quantiles and the median are very 
> useful for ML algorithms and data statistics. 
> Apache DataFu Pig has this kind of implementation; we can port it to Spark. 
> Please refer to: 
> http://datafu.incubator.apache.org/docs/datafu/getting-started.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-7245) Spearman correlation for DataFrames

2015-05-08 Thread Burak Yavuz (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz resolved SPARK-7245.

   Resolution: Done
Fix Version/s: 1.4.0

> Spearman correlation for DataFrames
> ---
>
> Key: SPARK-7245
> URL: https://issues.apache.org/jira/browse/SPARK-7245
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Xiangrui Meng
> Fix For: 1.4.0
>
>
> Spearman correlation is harder than Pearson to compute.
> ~~~
> df.stat.corr(col1, col2, method="spearman"): Double
> ~~~



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-7399) Master fails on 2.11 with compilation error

2015-05-08 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-7399.
--
   Resolution: Fixed
Fix Version/s: 1.4.0
 Assignee: Tijo Thomas

> Master fails on 2.11 with compilation error
> ---
>
> Key: SPARK-7399
> URL: https://issues.apache.org/jira/browse/SPARK-7399
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.0
>Reporter: Iulian Dragos
>Assignee: Tijo Thomas
> Fix For: 1.4.0
>
>
> The current code in master (and 1.4 branch) fails on 2.11 with the following 
> compilation error:
> {code}
> [error] /home/ubuntu/workspace/Apache Spark (master) on 
> 2.11/core/src/main/scala/org/apache/spark/rdd/RDDOperationScope.scala:78: in 
> object RDDOperationScope, multiple overloaded alternatives of method 
> withScope define default arguments.
> [error] private[spark] object RDDOperationScope {
> [error]   ^
> {code}
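The restriction itself is a plain Scala rule: default arguments may appear on
at most one overloaded alternative of a method. A minimal repro outside Spark:

{code:scala}
object Repro {
  def withScope(name: String = "scope"): String = name
  def withScope(name: String, nested: Boolean = false): String =
    if (nested) name + " (nested)" else name
}
// error: in object Repro, multiple overloaded alternatives of method
// withScope define default arguments.
{code}

The usual fix is to drop the defaults from all but one alternative, or to
collapse the overloads into a single method.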



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-7245) Spearman correlation for DataFrames

2015-05-08 Thread Burak Yavuz (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz reopened SPARK-7245:


Sorry, I mixed this up with Pearson correlation.

> Spearman correlation for DataFrames
> ---
>
> Key: SPARK-7245
> URL: https://issues.apache.org/jira/browse/SPARK-7245
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Xiangrui Meng
> Fix For: 1.4.0
>
>
> Spearman correlation is harder than Pearson to compute.
> ~~~
> df.stat.corr(col1, col2, method="spearman"): Double
> ~~~



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7435) Make DataFrame.show() consistent with that of Scala and pySpark

2015-05-08 Thread Rekha Joshi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14535614#comment-14535614
 ] 

Rekha Joshi commented on SPARK-7435:


Thank you [~shivaram] and [~sunrui] for the quick reply and good discussion. 
Updated the git patch per the review comments. Thanks.

> Make DataFrame.show() consistent with that of Scala and pySpark
> ---
>
> Key: SPARK-7435
> URL: https://issues.apache.org/jira/browse/SPARK-7435
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 1.4.0
>Reporter: Sun Rui
>Priority: Critical
>
> Currently in SparkR, DataFrame has two methods, show() and showDF(). show() 
> prints the DataFrame column names and types, and showDF() prints the first 
> numRows rows of a DataFrame.
> In Scala and pySpark, show() is used to print rows of a DataFrame. 
> We should keep the API consistent unless there is an important reason not to, 
> so the proposal is to interchange the names (show() and showDF()) in SparkR.
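For reference, a sketch of the Scala-side behaviour that SparkR's show() would
match after the rename (df is assumed to be an existing DataFrame):

{code:scala}
df.show(5)        // prints the first 5 rows in tabular form
df.printSchema()  // prints column names and types, roughly what SparkR's
                  // current show() reports
{code}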



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7486) Add the streaming implementation for estimating quantiles and median

2015-05-08 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14535586#comment-14535586
 ] 

Joseph K. Bradley commented on SPARK-7486:
--

Ping [~brkyvz] Aren't you looking at something like this?

> Add the streaming implementation for estimating quantiles and median
> 
>
> Key: SPARK-7486
> URL: https://issues.apache.org/jira/browse/SPARK-7486
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, SQL
>Reporter: Liang-Chi Hsieh
>
> Streaming implementations that can estimate quantiles and the median are very 
> useful for ML algorithms and data statistics. 
> Apache DataFu Pig has this kind of implementation; we can port it to Spark. 
> Please refer to: 
> http://datafu.incubator.apache.org/docs/datafu/getting-started.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7483) [MLLib] Using Kryo with FPGrowth fails with an exception

2015-05-08 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14535577#comment-14535577
 ] 

Joseph K. Bradley commented on SPARK-7483:
--

Does it fix anything if you give Kryo more info, such as explicit registration 
of relevant classes?
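For concreteness, one way to give Kryo that information from user code, using
the string-based registration config available since Spark 1.2 (the mllib.fpm
classes in the trace are package-private, so registering them by name is the
accessible route); whether this avoids the final-field error is exactly what
the question above asks:

{code:scala}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.classesToRegister",
    "scala.collection.mutable.ListBuffer," +
    "scala.collection.mutable.ArrayBuffer," +
    "org.apache.spark.mllib.fpm.FPTree")  // loaded via Class.forName at startup
{code}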

> [MLLib] Using Kryo with FPGrowth fails with an exception
> 
>
> Key: SPARK-7483
> URL: https://issues.apache.org/jira/browse/SPARK-7483
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.3.1
>Reporter: Tomasz Bartczak
>Priority: Minor
>
> When using FPGrowth algorithm with KryoSerializer - Spark fails with
> {code}
> Job aborted due to stage failure: Task 0 in stage 9.0 failed 1 times, most 
> recent failure: Lost task 0.0 in stage 9.0 (TID 16, localhost): 
> com.esotericsoftware.kryo.KryoException: java.lang.IllegalArgumentException: 
> Can not set final scala.collection.mutable.ListBuffer field 
> org.apache.spark.mllib.fpm.FPTree$Summary.nodes to 
> scala.collection.mutable.ArrayBuffer
> Serialization trace:
> nodes (org.apache.spark.mllib.fpm.FPTree$Summary)
> org$apache$spark$mllib$fpm$FPTree$$summaries 
> (org.apache.spark.mllib.fpm.FPTree)
> {code}
> This can be easily reproduced in spark codebase by setting 
> {code}
> conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
> {code} and running FPGrowthSuite.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-7483) [MLLib] Using Kryo with FPGrowth fails with an exception

2015-05-08 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reopened SPARK-7483:
--

> [MLLib] Using Kryo with FPGrowth fails with an exception
> 
>
> Key: SPARK-7483
> URL: https://issues.apache.org/jira/browse/SPARK-7483
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.3.1
>Reporter: Tomasz Bartczak
>Priority: Minor
>
> When using FPGrowth algorithm with KryoSerializer - Spark fails with
> {code}
> Job aborted due to stage failure: Task 0 in stage 9.0 failed 1 times, most 
> recent failure: Lost task 0.0 in stage 9.0 (TID 16, localhost): 
> com.esotericsoftware.kryo.KryoException: java.lang.IllegalArgumentException: 
> Can not set final scala.collection.mutable.ListBuffer field 
> org.apache.spark.mllib.fpm.FPTree$Summary.nodes to 
> scala.collection.mutable.ArrayBuffer
> Serialization trace:
> nodes (org.apache.spark.mllib.fpm.FPTree$Summary)
> org$apache$spark$mllib$fpm$FPTree$$summaries 
> (org.apache.spark.mllib.fpm.FPTree)
> {code}
> This can be easily reproduced in spark codebase by setting 
> {code}
> conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
> {code} and running FPGrowthSuite.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-7483) [MLLib] Using Kryo with FPGrowth fails with an exception

2015-05-08 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14535562#comment-14535562
 ] 

Joseph K. Bradley edited comment on SPARK-7483 at 5/8/15 9:24 PM:
--

(Updated) Maybe this is a bug...will look into it.



was (Author: josephkb):
I believe this is because it would need a custom serializer.  Not all classes 
in Spark work with Kryo out of the box.  But if you want to learn more and 
write your own, please check out: 
[http://spark.apache.org/docs/latest/tuning.html#data-serialization]

Also, this kind of question should probably go to the user list before JIRA.  
I'll close this, but if you think I'm wrong, please bring up the issue on the 
user list!  Thanks

> [MLLib] Using Kryo with FPGrowth fails with an exception
> 
>
> Key: SPARK-7483
> URL: https://issues.apache.org/jira/browse/SPARK-7483
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.3.1
>Reporter: Tomasz Bartczak
>Priority: Minor
>
> When using FPGrowth algorithm with KryoSerializer - Spark fails with
> {code}
> Job aborted due to stage failure: Task 0 in stage 9.0 failed 1 times, most 
> recent failure: Lost task 0.0 in stage 9.0 (TID 16, localhost): 
> com.esotericsoftware.kryo.KryoException: java.lang.IllegalArgumentException: 
> Can not set final scala.collection.mutable.ListBuffer field 
> org.apache.spark.mllib.fpm.FPTree$Summary.nodes to 
> scala.collection.mutable.ArrayBuffer
> Serialization trace:
> nodes (org.apache.spark.mllib.fpm.FPTree$Summary)
> org$apache$spark$mllib$fpm$FPTree$$summaries 
> (org.apache.spark.mllib.fpm.FPTree)
> {code}
> This can be easily reproduced in spark codebase by setting 
> {code}
> conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
> {code} and running FPGrowthSuite.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-7483) [MLLib] Using Kryo with FPGrowth fails with an exception

2015-05-08 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley closed SPARK-7483.

Resolution: Not A Problem

> [MLLib] Using Kryo with FPGrowth fails with an exception
> 
>
> Key: SPARK-7483
> URL: https://issues.apache.org/jira/browse/SPARK-7483
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.3.1
>Reporter: Tomasz Bartczak
>Priority: Minor
>
> When using FPGrowth algorithm with KryoSerializer - Spark fails with
> {code}
> Job aborted due to stage failure: Task 0 in stage 9.0 failed 1 times, most 
> recent failure: Lost task 0.0 in stage 9.0 (TID 16, localhost): 
> com.esotericsoftware.kryo.KryoException: java.lang.IllegalArgumentException: 
> Can not set final scala.collection.mutable.ListBuffer field 
> org.apache.spark.mllib.fpm.FPTree$Summary.nodes to 
> scala.collection.mutable.ArrayBuffer
> Serialization trace:
> nodes (org.apache.spark.mllib.fpm.FPTree$Summary)
> org$apache$spark$mllib$fpm$FPTree$$summaries 
> (org.apache.spark.mllib.fpm.FPTree)
> {code}
> This can be easily reproduced in spark codebase by setting 
> {code}
> conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
> {code} and running FPGrowthSuite.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7483) [MLLib] Using Kryo with FPGrowth fails with an exception

2015-05-08 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14535562#comment-14535562
 ] 

Joseph K. Bradley commented on SPARK-7483:
--

I believe this is because it would need a custom serializer.  Not all classes 
in Spark work with Kryo out of the box.  But if you want to learn more and 
write your own, please check out: 
[http://spark.apache.org/docs/latest/tuning.html#data-serialization]

Also, this kind of question should probably go to the user list before JIRA.  
I'll close this, but if you think I'm wrong, please bring up the issue on the 
user list!  Thanks

> [MLLib] Using Kryo with FPGrowth fails with an exception
> 
>
> Key: SPARK-7483
> URL: https://issues.apache.org/jira/browse/SPARK-7483
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.3.1
>Reporter: Tomasz Bartczak
>Priority: Minor
>
> When using FPGrowth algorithm with KryoSerializer - Spark fails with
> {code}
> Job aborted due to stage failure: Task 0 in stage 9.0 failed 1 times, most 
> recent failure: Lost task 0.0 in stage 9.0 (TID 16, localhost): 
> com.esotericsoftware.kryo.KryoException: java.lang.IllegalArgumentException: 
> Can not set final scala.collection.mutable.ListBuffer field 
> org.apache.spark.mllib.fpm.FPTree$Summary.nodes to 
> scala.collection.mutable.ArrayBuffer
> Serialization trace:
> nodes (org.apache.spark.mllib.fpm.FPTree$Summary)
> org$apache$spark$mllib$fpm$FPTree$$summaries 
> (org.apache.spark.mllib.fpm.FPTree)
> {code}
> This can be easily reproduced in spark codebase by setting 
> {code}
> conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
> {code} and running FPGrowthSuite.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6613) Starting stream from checkpoint causes Streaming tab to throw error

2015-05-08 Thread Tathagata Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14535563#comment-14535563
 ] 

Tathagata Das commented on SPARK-6613:
--

Any update with 1.3.1?

> Starting stream from checkpoint causes Streaming tab to throw error
> ---
>
> Key: SPARK-6613
> URL: https://issues.apache.org/jira/browse/SPARK-6613
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.2.1, 1.2.2
>Reporter: Marius Soutier
>
> When continuing my streaming job from a checkpoint, the job runs, but the 
> Streaming tab in the standard UI initially no longer works (browser just 
> shows HTTP ERROR: 500). Sometimes it gets back to normal after a while, and 
> sometimes it stays in this state permanently.
> Stacktrace:
> WARN org.eclipse.jetty.servlet.ServletHandler: /streaming/
> java.util.NoSuchElementException: key not found: 0
>   at scala.collection.MapLike$class.default(MapLike.scala:228)
>   at scala.collection.AbstractMap.default(Map.scala:58)
>   at scala.collection.MapLike$class.apply(MapLike.scala:141)
>   at scala.collection.AbstractMap.apply(Map.scala:58)
>   at 
> org.apache.spark.streaming.ui.StreamingJobProgressListener$$anonfun$lastReceivedBatchRecords$1$$anonfun$apply$5.apply(StreamingJobProgressListener.scala:151)
>   at 
> org.apache.spark.streaming.ui.StreamingJobProgressListener$$anonfun$lastReceivedBatchRecords$1$$anonfun$apply$5.apply(StreamingJobProgressListener.scala:150)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at scala.collection.immutable.Range.foreach(Range.scala:141)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> org.apache.spark.streaming.ui.StreamingJobProgressListener$$anonfun$lastReceivedBatchRecords$1.apply(StreamingJobProgressListener.scala:150)
>   at 
> org.apache.spark.streaming.ui.StreamingJobProgressListener$$anonfun$lastReceivedBatchRecords$1.apply(StreamingJobProgressListener.scala:149)
>   at scala.Option.map(Option.scala:145)
>   at 
> org.apache.spark.streaming.ui.StreamingJobProgressListener.lastReceivedBatchRecords(StreamingJobProgressListener.scala:149)
>   at 
> org.apache.spark.streaming.ui.StreamingPage.generateReceiverStats(StreamingPage.scala:82)
>   at 
> org.apache.spark.streaming.ui.StreamingPage.render(StreamingPage.scala:43)
>   at org.apache.spark.ui.WebUI$$anonfun$attachPage$1.apply(WebUI.scala:68)
>   at org.apache.spark.ui.WebUI$$anonfun$attachPage$1.apply(WebUI.scala:68)
>   at org.apache.spark.ui.JettyUtils$$anon$1.doGet(JettyUtils.scala:68)
>   at javax.servlet.http.HttpServlet.service(HttpServlet.java:735)
>   at javax.servlet.http.HttpServlet.service(HttpServlet.java:848)
>   at 
> org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:684)
>   at 
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:501)
>   at 
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1086)
>   at 
> org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:428)
>   at 
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1020)
>   at 
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
>   at 
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
>   at 
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
>   at org.eclipse.jetty.server.Server.handle(Server.java:370)
>   at 
> org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494)
>   at 
> org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:971)
>   at 
> org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1033)
>   at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:644)
>   at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
>   at 
> org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82)
>   at 
> org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:667)
>   at 
> org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:52)
>   at 
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
>   at 
> org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
> 
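The trace bottoms out in MapLike.default, i.e. a plain map(key) lookup on a
stream id (0) that has no entry after recovery from the checkpoint. A minimal
illustration of the failure and the defensive alternative (the concrete fix in
StreamingJobProgressListener may differ):

{code:scala}
val recordsPerStream = Map(1 -> 100L)           // stream 0 never reported after restart
// recordsPerStream(0)                          // NoSuchElementException: key not found: 0
val records = recordsPerStream.getOrElse(0, 0L) // treat a missing stream as zero records
{code}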

[jira] [Commented] (SPARK-2572) Can't delete local dir on executor automatically when running spark over Mesos.

2015-05-08 Thread Prasanna Gautam (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14535557#comment-14535557
 ] 

Prasanna Gautam commented on SPARK-2572:


This is still happening as of Spark 1.3.0 with pySpark: when the context is 
closed, the files aren't deleted. Nor does sc.clearFiles() seem to remove the 
/tmp/spark-* directories.

> Can't delete local dir on executor automatically when running spark over 
> Mesos.
> ---
>
> Key: SPARK-2572
> URL: https://issues.apache.org/jira/browse/SPARK-2572
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 1.0.0
>Reporter: Yadong Qi
>Priority: Minor
>
> When running Spark over Mesos in “fine-grained” or “coarse-grained” mode, 
> the local dir (/tmp/spark-local-20140718114058-834c) on the executor is not 
> deleted automatically after the application finishes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7398) Add back-pressure to Spark Streaming

2015-05-08 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-7398:
---
Issue Type: Improvement  (was: Bug)

> Add back-pressure to Spark Streaming
> 
>
> Key: SPARK-7398
> URL: https://issues.apache.org/jira/browse/SPARK-7398
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.3.1
>Reporter: François Garillot
>  Labels: streams
>
> Spark Streaming has trouble dealing with situations where 
>  batch processing time > batch interval
> i.e. where the throughput of input data exceeds Spark's ability to drain the 
> queue. If this throughput is sustained for long enough, it leads to an 
> unstable situation where the memory of the Receiver's Executor overflows.
> This issue aims at transmitting a back-pressure signal back to the data 
> ingestion side to help cope with that high throughput, in a 
> backwards-compatible way.
> The design doc can be found here:
> https://docs.google.com/document/d/1ZhiP_yBHcbjifz8nJEyPJpHqxB1FT6s8-Zk7sAfayQw/edit?usp=sharing



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-7378) HistoryServer does not handle "deep" link when lazy loading app

2015-05-08 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or closed SPARK-7378.

  Resolution: Fixed
   Fix Version/s: 1.4.0
Target Version/s: 1.4.0

> HistoryServer does not handle "deep" link when lazy loading app
> ---
>
> Key: SPARK-7378
> URL: https://issues.apache.org/jira/browse/SPARK-7378
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>Priority: Blocker
> Fix For: 1.4.0
>
>
> This is a regression caused by SPARK-4705. Going to a deep link into an app 
> that is not loaded yet used to work, but now returns a 404. You need to go to 
> the root of the app first for the app to be loaded, which is not the expected 
> behaviour.
> Fix coming up.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-7466) DAG visualization: orphaned nodes are not rendered correctly

2015-05-08 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or closed SPARK-7466.

   Resolution: Fixed
Fix Version/s: 1.4.0

> DAG visualization: orphaned nodes are not rendered correctly
> 
>
> Key: SPARK-7466
> URL: https://issues.apache.org/jira/browse/SPARK-7466
> Project: Spark
>  Issue Type: Sub-task
>  Components: Web UI
>Affects Versions: 1.4.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>Priority: Critical
> Fix For: 1.4.0
>
> Attachments: after.png, before.png
>
>
> If you have an RDD instantiated outside of a scope, it is rendered as a weird 
> badge outside of a stage. This is because we keep the edge but do not inform 
> dagre-d3 of the node, resulting in the library rendering the node for us 
> without the expected styles and labels.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-7489) Spark shell crashes when compiled with scala 2.11 and SPARK_PREPEND_CLASSES=true

2015-05-08 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or closed SPARK-7489.

  Resolution: Fixed
   Fix Version/s: 1.4.0
Assignee: Vinod KC
Target Version/s: 1.4.0

> Spark shell crashes when compiled with scala 2.11 and 
> SPARK_PREPEND_CLASSES=true
> 
>
> Key: SPARK-7489
> URL: https://issues.apache.org/jira/browse/SPARK-7489
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Reporter: Vinod KC
>Assignee: Vinod KC
> Fix For: 1.4.0
>
>
> Steps followed
> >export SPARK_PREPEND_CLASSES=true
> >dev/change-version-to-2.11.sh
> > sbt/sbt -Pyarn -Phadoop-2.4 -Dscala-2.11 -DskipTests clean assembly
> >bin/spark-shell
> 
> 15/05/08 22:31:35 INFO Main: Created spark context..
> Spark context available as sc.
> java.lang.NoClassDefFoundError: org/apache/hadoop/hive/conf/HiveConf
>   at java.lang.Class.getDeclaredConstructors0(Native Method)
>   at java.lang.Class.privateGetDeclaredConstructors(Class.java:2671)
>   at java.lang.Class.getConstructor0(Class.java:3075)
>   at java.lang.Class.getConstructor(Class.java:1825)
>   at org.apache.spark.repl.Main$.createSQLContext(Main.scala:86)
>   ... 45 elided
> Caused by: java.lang.ClassNotFoundException: 
> org.apache.hadoop.hive.conf.HiveConf
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>   at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>   ... 50 more
> :11: error: not found: value sqlContext
>import sqlContext.implicits._
>   ^
> :11: error: not found: value sqlContext
>import sqlContext.sql
> There is a similar resolved JIRA issue, SPARK-7470, and a PR 
> (https://github.com/apache/spark/pull/5997), which handled the same issue 
> only for Scala 2.10.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7443) MLlib 1.4 QA plan

2015-05-08 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-7443:
-
Description: 
TODO: create JIRAs for each task and assign them accordingly.

h2. API

* Check API compliance using java-compliance-checker (SPARK-7458)

* Audit new public APIs (from the generated html doc)
** Scala (do not forget to check the object doc)
** Java compatibility
** Python API coverage

* audit Pipeline APIs
** feature transformers
** tree models
** elastic-net
** ML attributes
** developer APIs

* graduate spark.ml from alpha
** remove AlphaComponent annotations
** remove mima excludes for spark.ml

h2. Algorithms and performance

* list missing performance tests from spark-perf
* LDA online/EM (SPARK-7455)
* ElasticNet for linear regression and logistic regression (SPARK-7456)
* Bernoulli naive Bayes (SPARK-7453)
* PIC (SPARK-7454)
* ALS.recommendAll (SPARK-7457)
* perf-tests in Python

correctness:
* PMML
** scoring using PMML evaluator vs. MLlib models
* save/load

h2. Documentation and example code

* create JIRAs for the user guide to each new algorithm and assign them to the 
corresponding author

* create example code for major components
** cross validation in python
** pipeline with complex feature transformations (scala/java/python)
** elastic-net (possibly with cross validation)

  was:
TODO: create JIRAs for each task and assign them accordingly.

h2. API

* Check API compliance using java-compliance-checker (SPARK-7458)

* Audit new public APIs (from the generated html doc)
** Scala (do not forget to check the object doc)
** Java compatibility
** Python API coverage

* audit Pipeline APIs
** feature transformers
** tree models
** elastic-net
** ML attributes
** developer APIs

* graduate spark.ml from alpha
** remove AlphaComponent annotations
** remove mima excludes for spark.ml

h2. Algorithms and performance

* list missing performance tests from spark-perf
* LDA online/EM (SPARK-7455)
* ElasticNet (SPARK-7456)
* Bernoulli naive Bayes (SPARK-7453)
* PIC (SPARK-7454)
* ALS.recommendAll (SPARK-7457)
* perf-tests in Python

correctness:
* PMML
** scoring using PMML evaluator vs. MLlib models
* save/load

h2. Documentation and example code

* create JIRAs for the user guide to each new algorithm and assign them to the 
corresponding author

* create example code for major components
** cross validation in python
** pipeline with complex feature transformations (scala/java/python)
** elastic-net (possibly with cross validation)


> MLlib 1.4 QA plan
> -
>
> Key: SPARK-7443
> URL: https://issues.apache.org/jira/browse/SPARK-7443
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, MLlib
>Affects Versions: 1.4.0
>Reporter: Xiangrui Meng
>Assignee: Joseph K. Bradley
>Priority: Critical
>
> TODO: create JIRAs for each task and assign them accordingly.
> h2. API
> * Check API compliance using java-compliance-checker (SPARK-7458)
> * Audit new public APIs (from the generated html doc)
> ** Scala (do not forget to check the object doc)
> ** Java compatibility
> ** Python API coverage
> * audit Pipeline APIs
> ** feature transformers
> ** tree models
> ** elastic-net
> ** ML attributes
> ** developer APIs
> * graduate spark.ml from alpha
> ** remove AlphaComponent annotations
> ** remove mima excludes for spark.ml
> h2. Algorithms and performance
> * list missing performance tests from spark-perf
> * LDA online/EM (SPARK-7455)
> * ElasticNet for linear regression and logistic regression (SPARK-7456)
> * Bernoulli naive Bayes (SPARK-7453)
> * PIC (SPARK-7454)
> * ALS.recommendAll (SPARK-7457)
> * perf-tests in Python
> correctness:
> * PMML
> ** scoring using PMML evaluator vs. MLlib models
> * save/load
> h2. Documentation and example code
> * create JIRAs for the user guide to each new algorithm and assign them to 
> the corresponding author
> * create example code for major components
> ** cross validation in python
> ** pipeline with complex feature transformations (scala/java/python)
> ** elastic-net (possibly with cross validation)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7456) Perf test for linear regression and logistic regression with elastic-net

2015-05-08 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-7456:
-
Summary: Perf test for linear regression and logistic regression with 
elastic-net  (was: Perf test for linear regression with elastic-net)

> Perf test for linear regression and logistic regression with elastic-net
> 
>
> Key: SPARK-7456
> URL: https://issues.apache.org/jira/browse/SPARK-7456
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 1.4.0
>Reporter: Xiangrui Meng
>Assignee: DB Tsai
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7490) MapOutputTracker: close input streams to free native memory

2015-05-08 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-7490:
-
Assignee: Evan Jones

> MapOutputTracker: close input streams to free native memory
> ---
>
> Key: SPARK-7490
> URL: https://issues.apache.org/jira/browse/SPARK-7490
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Evan Jones
>Assignee: Evan Jones
>Priority: Minor
> Fix For: 1.2.3, 1.3.2, 1.4.0
>
>
> GZIPInputStream allocates native memory that is not freed until close() or 
> when the finalizer runs. It is best to close() these streams explicitly to 
> avoid native memory leaks.
> Pull request here: https://github.com/apache/spark/pull/5982
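A sketch of the pattern the fix applies (hypothetical helper; the real change
is in MapOutputTracker's deserialization path): close the stream in a finally
block so the Inflater's native buffers are released deterministically instead
of waiting for the finalizer.

{code:scala}
import java.io.ByteArrayInputStream
import java.util.zip.GZIPInputStream

def decompress(bytes: Array[Byte]): Array[Byte] = {
  val in = new GZIPInputStream(new ByteArrayInputStream(bytes))
  try {
    Iterator.continually(in.read()).takeWhile(_ != -1).map(_.toByte).toArray
  } finally {
    in.close() // frees the native Inflater memory immediately
  }
}
{code}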



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-7490) MapOutputTracker: close input streams to free native memory

2015-05-08 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-7490.
--
   Resolution: Fixed
Fix Version/s: 1.4.0
   1.2.3
   1.3.2

Issue resolved by pull request 5982
[https://github.com/apache/spark/pull/5982]

> MapOutputTracker: close input streams to free native memory
> ---
>
> Key: SPARK-7490
> URL: https://issues.apache.org/jira/browse/SPARK-7490
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Evan Jones
>Priority: Minor
> Fix For: 1.3.2, 1.2.3, 1.4.0
>
>
> GZIPInputStream allocates native memory that is not freed until close() or 
> when the finalizer runs. It is best to close() these streams explicitly to 
> avoid native memory leaks.
> Pull request here: https://github.com/apache/spark/pull/5982



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7492) Convert LocalDataFrame to LocalMatrix

2015-05-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7492:
---

Assignee: Apache Spark

> Convert LocalDataFrame to LocalMatrix
> -
>
> Key: SPARK-7492
> URL: https://issues.apache.org/jira/browse/SPARK-7492
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib, SQL
>Reporter: Burak Yavuz
>Assignee: Apache Spark
>
> Having a method like, 
> {code:java}
> Matrices.fromDataFrame(df)
> {code}
> would provide users the ability to perform feature selection with DataFrames.
> Users will be able to chain operations like below:
> {code:java}
> import org.apache.spark.mllib.linalg.Matrices
> import org.apache.spark.mllib.stat.Statistics
> import org.apache.spark.sql.DataFrame
> val df = ... // the DataFrame
> val contingencyTable = df.stat.crosstab(col1, col2)
> val ct = Matrices.fromDataFrame(contingencyTable)
> val result: ChiSqTestResult = Statistics.chiSqTest(ct)
> {code}
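A hypothetical sketch of what such a method could do for a small local
DataFrame whose columns are all of DoubleType (crosstab output starts with a
string label column, which a real implementation would have to drop first):

{code:scala}
import org.apache.spark.mllib.linalg.{Matrices, Matrix}
import org.apache.spark.sql.DataFrame

def fromDataFrame(df: DataFrame): Matrix = {
  val rows = df.collect()                        // local DataFrame: safe to collect
  val m = rows.length
  val n = df.columns.length
  val values = new Array[Double](m * n)
  var i = 0
  while (i < m) {
    var j = 0
    while (j < n) {
      values(j * m + i) = rows(i).getDouble(j)   // pack in column-major order
      j += 1
    }
    i += 1
  }
  Matrices.dense(m, n, values)
}
{code}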



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7492) Convert LocalDataFrame to LocalMatrix

2015-05-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7492:
---

Assignee: (was: Apache Spark)

> Convert LocalDataFrame to LocalMatrix
> -
>
> Key: SPARK-7492
> URL: https://issues.apache.org/jira/browse/SPARK-7492
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib, SQL
>Reporter: Burak Yavuz
>
> Having a method like, 
> {code:java}
> Matrices.fromDataFrame(df)
> {code}
> would provide users the ability to perform feature selection with DataFrames.
> Users will be able to chain operations like below:
> {code:java}
> import org.apache.spark.mllib.linalg.Matrices
> import org.apache.spark.mllib.stat.Statistics
> import org.apache.spark.sql.DataFrame
> val df = ... // the DataFrame
> val contingencyTable = df.stat.crosstab(col1, col2)
> val ct = Matrices.fromDataFrame(contingencyTable)
> val result: ChiSqTestResult = Statistics.chiSqTest(ct)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7492) Convert LocalDataFrame to LocalMatrix

2015-05-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14535515#comment-14535515
 ] 

Apache Spark commented on SPARK-7492:
-

User 'brkyvz' has created a pull request for this issue:
https://github.com/apache/spark/pull/6018

> Convert LocalDataFrame to LocalMatrix
> -
>
> Key: SPARK-7492
> URL: https://issues.apache.org/jira/browse/SPARK-7492
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib, SQL
>Reporter: Burak Yavuz
>
> Having a method like, 
> {code:java}
> Matrices.fromDataFrame(df)
> {code}
> would provide users the ability to perform feature selection with DataFrames.
> Users will be able to chain operations like below:
> {code:java}
> import org.apache.spark.mllib.linalg.Matrices
> import org.apache.spark.mllib.stat.Statistics
> import org.apache.spark.sql.DataFrame
> val df = ... // the DataFrame
> val contingencyTable = df.stat.crosstab(col1, col2)
> val ct = Matrices.fromDataFrame(contingencyTable)
> val result: ChiSqTestResult = Statistics.chiSqTest(ct)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7496) Update Programming guide with Online LDA

2015-05-08 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-7496:


 Summary: Update Programming guide with Online LDA
 Key: SPARK-7496
 URL: https://issues.apache.org/jira/browse/SPARK-7496
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, MLlib
Reporter: Joseph K. Bradley
Priority: Minor


Update the LDA subsection of the clustering section of the MLlib programming 
guide to include OnlineLDA.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7469) DAG visualization: show operators for SQL

2015-05-08 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-7469:
-
Attachment: after.png
before.png

> DAG visualization: show operators for SQL
> -
>
> Key: SPARK-7469
> URL: https://issues.apache.org/jira/browse/SPARK-7469
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Web UI
>Affects Versions: 1.4.0
>Reporter: Andrew Or
>Assignee: Andrew Or
> Attachments: after.png, before.png
>
>
> Right now the DAG shows low level Spark operations when SQL users really care 
> about physical operators. We should show those instead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7495) Improve ML attribute documentation

2015-05-08 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-7495:


 Summary: Improve ML attribute documentation
 Key: SPARK-7495
 URL: https://issues.apache.org/jira/browse/SPARK-7495
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, ML
Reporter: Joseph K. Bradley
Priority: Minor


ML attribute documentation is currently minimal.  This has led to confusion in 
some Spark PRs about how to use attributes.

We should add:
* Scala doc
* examples in the programming guide

The docs should make at least these items clear:
* What the different attribute types are
* How an attribute and attribute group differ
* Example usage creating, modifying, and reading attributes
* Explanation that missing attributes are OK and can be computed/added lazily




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7461) Remove spark.ml Model, and have all Transformers have parent

2015-05-08 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-7461:
-
Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-5874

> Remove spark.ml Model, and have all Transformers have parent
> 
>
> Key: SPARK-7461
> URL: https://issues.apache.org/jira/browse/SPARK-7461
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>
> A recent PR [https://github.com/apache/spark/pull/5980] brought up an issue 
> with the Model abstraction: There are transformers which could be 
> Transformers (created by a user) or Models (created by an Estimator).  This 
> is the first instance, but there will be more such transformers in the future.
> Some possible fixes are:
> * Create 2 separate classes, 1 extending Transformer and 1 extending Model.  
> These would be essentially the same, and they could share code (or have 1 
> wrap the other).  This would bloat the API.
> * Just use Model, with a possibly null parent class.  There is precedent 
> (meta-algorithms like RandomForest producing weak hypothesis Models with no 
> parent).
> * Change Transformer to have a parent which may be null.
> ** *--> Unless there is strong disagreement, I think we should go with this 
> last option.*



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7494) spark.ml Model should call copyValues in construction

2015-05-08 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-7494:


 Summary: spark.ml Model should call copyValues in construction
 Key: SPARK-7494
 URL: https://issues.apache.org/jira/browse/SPARK-7494
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: Joseph K. Bradley
Priority: Minor


Currently, Estimators call Params.copyValues to copy parameters from themselves 
to the Model they create.  The Model has a reference to its Estimator, so it 
could call copyValues itself upon construction.

Note: I'm linking a patch which will remove Model and use Transformer instead, 
but this same fix with copyValues can be applied to Transformer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5980) Add GradientBoostedTrees Python examples to ML guide

2015-05-08 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-5980:
-
Target Version/s: 1.3.0

> Add GradientBoostedTrees Python examples to ML guide
> 
>
> Key: SPARK-5980
> URL: https://issues.apache.org/jira/browse/SPARK-5980
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, MLlib
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Minor
> Fix For: 1.3.0
>
>
> GBT now has a Python API and should have examples in the ML guide



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5980) Add GradientBoostedTrees Python examples to ML guide

2015-05-08 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-5980.
--
   Resolution: Fixed
Fix Version/s: 1.3.0

> Add GradientBoostedTrees Python examples to ML guide
> 
>
> Key: SPARK-5980
> URL: https://issues.apache.org/jira/browse/SPARK-5980
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, MLlib
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Minor
> Fix For: 1.3.0
>
>
> GBT now has a Python API and should have examples in the ML guide



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7410) Add option to avoid broadcasting configuration with newAPIHadoopFile

2015-05-08 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14535386#comment-14535386
 ] 

Josh Rosen commented on SPARK-7410:
---

We should confirm this, but if I recall correctly, the reason we have to 
broadcast these separately has something to do with configuration mutability or 
thread-safety. Based on a quick glance at SPARK-2585, it looks like I tried 
folding this into the RDD broadcast but this caused performance issues for RDDs 
with huge numbers of tasks.  If you're interested in fixing this, I'd take a 
closer look through that old JIRA to try to figure out whether its discussion 
is still relevant.

> Add option to avoid broadcasting configuration with newAPIHadoopFile
> 
>
> Key: SPARK-7410
> URL: https://issues.apache.org/jira/browse/SPARK-7410
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.4.0
>Reporter: Sandy Ryza
>
> I'm working with a Spark application that creates thousands of HadoopRDDs and 
> unions them together.  Certain details of the way the data is stored require 
> this.
> Creating ten thousand of these RDDs takes about 10 minutes, even before any 
> of them is used in an action.  I dug into why this takes so long and it looks 
> like the overhead of broadcasting the Hadoop configuration is taking up most 
> of the time.  In this case, the broadcasting isn't helpful because each 
> HadoopRDD only corresponds to one or two tasks.  When I reverted the original 
> change that switched to broadcasting configurations, the time it took to 
> instantiate these RDDs improved 10x.
> It would be nice if there was a way to turn this broadcasting off.  Either 
> through a Spark configuration option, a Hadoop configuration option, or an 
> argument to hadoopFile / newAPIHadoopFile.
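For reference, the access pattern being described, sketched with illustrative
paths; each newAPIHadoopFile call broadcasts its own Hadoop Configuration,
which is what dominates the ~10 minutes of setup here:

{code:scala}
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

def unionAll(sc: SparkContext, paths: Seq[String]): RDD[(LongWritable, Text)] = {
  val rdds = paths.map { p =>
    sc.newAPIHadoopFile[LongWritable, Text, TextInputFormat](p)
  }
  sc.union(rdds)   // one RDD per path, unioned; thousands of paths in practice
}
{code}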



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7493) ALTER TABLE statement

2015-05-08 Thread Sergey Semichev (JIRA)
Sergey Semichev created SPARK-7493:
--

 Summary: ALTER TABLE statement
 Key: SPARK-7493
 URL: https://issues.apache.org/jira/browse/SPARK-7493
 Project: Spark
  Issue Type: Bug
  Components: SQL
 Environment: Databricks cloud
Reporter: Sergey Semichev
Priority: Minor


A full table name (database_name.table_name) cannot be used with an ALTER TABLE 
statement, although it works with CREATE TABLE:

"ALTER TABLE database_name.table_name ADD PARTITION (source_year='2014', 
source_month='01')."

Error in SQL statement: java.lang.RuntimeException: 
org.apache.spark.sql.AnalysisException: mismatched input 'ADD' expecting 
KW_EXCHANGE near 'test_table' in alter exchange partition;
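A possible workaround, assuming a HiveContext where USE is accepted: switch the
current database first and refer to the bare table name, since it is only the
qualified name that trips the parser here.

{code:scala}
sqlContext.sql("USE database_name")
sqlContext.sql(
  "ALTER TABLE table_name ADD PARTITION (source_year='2014', source_month='01')")
{code}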



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7492) Convert LocalDataFrame to LocalMatrix

2015-05-08 Thread Burak Yavuz (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz updated SPARK-7492:
---
Description: 
Having a method like, 
{code:java}
Matrices.fromDataFrame(df)
{code}
would provide users the ability to perform feature selection with DataFrames.
Users will be able to chain operations like below:
{code:java}
import org.apache.spark.mllib.linalg.Matrices
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.sql.DataFrame

val df = ... // the DataFrame
val contingencyTable = df.stat.crosstab(col1, col2)
val ct = Matrices.fromDataFrame(contingencyTable)
val result: ChiSqTestResult = Statistics.chiSqTest(ct)
{code}

  was:
Having a method like, 
{code: java}
Matrices.fromDataFrame(df)
{code}
would provide users the ability to perform feature selection with DataFrames.
Users will be able to chain operations like below:
{code: java}
import org.apache.spark.mllib.linalg.Matrices
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.sql.DataFrame

val df = ... // the DataFrame
val contingencyTable = df.stat.crosstab(col1, col2)
val ct = Matrices.fromDataFrame(contingencyTable)
val result: ChiSqTestResult = Statistics.chiSqTest(ct)
{code}


> Convert LocalDataFrame to LocalMatrix
> -
>
> Key: SPARK-7492
> URL: https://issues.apache.org/jira/browse/SPARK-7492
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib, SQL
>Reporter: Burak Yavuz
>
> Having a method like, 
> {code:java}
> Matrices.fromDataFrame(df)
> {code}
> would provide users the ability to perform feature selection with DataFrames.
> Users will be able to chain operations like below:
> {code:java}
> import org.apache.spark.mllib.linalg.Matrices
> import org.apache.spark.mllib.stat.Statistics
> import org.apache.spark.sql.DataFrame
> val df = ... // the DataFrame
> val contingencyTable = df.stat.crosstab(col1, col2)
> val ct = Matrices.fromDataFrame(contingencyTable)
> val result: ChiSqTestResult = Statistics.chiSqTest(ct)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7492) Convert LocalDataFrame to LocalMatrix

2015-05-08 Thread Burak Yavuz (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz updated SPARK-7492:
---
Description: 
Having a method like, 
{code: java}
Matrices.fromDataFrame(df)
{code}
would provide users the ability to perform feature selection with DataFrames.
Users will be able to chain operations like below:
{code: java}
import org.apache.spark.mllib.linalg.Matrices
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.sql.DataFrame

val df = ... // the DataFrame
val contingencyTable = df.stat.crosstab(col1, col2)
val ct = Matrices.fromDataFrame(contingencyTable)
val result: ChiSqTestResult = Statistics.chiSqTest(ct)
{code}

  was:
Having a method like, 
{code: scala}
Matrices.fromDataFrame(df)
{code}
would provide users the ability to perform feature selection with DataFrames.
Users will be able to chain operations like below:
{code: scala}
import org.apache.spark.mllib.linalg.Matrices
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.sql.DataFrame

val df = ... // the DataFrame
val contingencyTable = df.stat.crosstab(col1, col2)
val ct = Matrices.fromDataFrame(contingencyTable)
val result: ChiSqTestResult = Statistics.chiSqTest(ct)
{code}


> Convert LocalDataFrame to LocalMatrix
> -
>
> Key: SPARK-7492
> URL: https://issues.apache.org/jira/browse/SPARK-7492
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib, SQL
>Reporter: Burak Yavuz
>
> Having a method like, 
> {code: java}
> Matrices.fromDataFrame(df)
> {code}
> would provide users the ability to perform feature selection with DataFrames.
> Users will be able to chain operations like below:
> {code: java}
> import org.apache.spark.mllib.linalg.Matrices
> import org.apache.spark.mllib.stat.Statistics
> import org.apache.spark.sql.DataFrame
> val df = ... // the DataFrame
> val contingencyTable = df.stat.crosstab("col1", "col2")
> val ct = Matrices.fromDataFrame(contingencyTable)
> val result: ChiSqTestResult = Statistics.chiSqTest(ct)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7492) Convert LocalDataFrame to LocalMatrix

2015-05-08 Thread Burak Yavuz (JIRA)
Burak Yavuz created SPARK-7492:
--

 Summary: Convert LocalDataFrame to LocalMatrix
 Key: SPARK-7492
 URL: https://issues.apache.org/jira/browse/SPARK-7492
 Project: Spark
  Issue Type: New Feature
  Components: MLlib, SQL
Reporter: Burak Yavuz


Having a method like, 
{code: scala}
Matrices.fromDataFrame(df)
{code}
would provide users the ability to perform feature selection with DataFrames.
Users will be able to chain operations like below:
{code: scala}
import org.apache.spark.mllib.linalg.Matrices
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.sql.DataFrame

val df = ... // the DataFrame
val contingencyTable = df.stat.crosstab("col1", "col2")
val ct = Matrices.fromDataFrame(contingencyTable)
val result: ChiSqTestResult = Statistics.chiSqTest(ct)
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7491) Handle drivers for Metastore JDBC

2015-05-08 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-7491:
---

 Summary: Handle drivers for Metastore JDBC
 Key: SPARK-7491
 URL: https://issues.apache.org/jira/browse/SPARK-7491
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Michael Armbrust
Priority: Blocker






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7487) Python API for ml.regression

2015-05-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14535324#comment-14535324
 ] 

Apache Spark commented on SPARK-7487:
-

User 'brkyvz' has created a pull request for this issue:
https://github.com/apache/spark/pull/6016

> Python API for ml.regression
> 
>
> Key: SPARK-7487
> URL: https://issues.apache.org/jira/browse/SPARK-7487
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Reporter: Burak Yavuz
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7487) Python API for ml.regression

2015-05-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7487:
---

Assignee: (was: Apache Spark)

> Python API for ml.regression
> 
>
> Key: SPARK-7487
> URL: https://issues.apache.org/jira/browse/SPARK-7487
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Reporter: Burak Yavuz
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7487) Python API for ml.regression

2015-05-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7487:
---

Assignee: Apache Spark

> Python API for ml.regression
> 
>
> Key: SPARK-7487
> URL: https://issues.apache.org/jira/browse/SPARK-7487
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Reporter: Burak Yavuz
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7448) Implement custom byte array serializer for use in PySpark shuffle

2015-05-08 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14535316#comment-14535316
 ] 

Josh Rosen commented on SPARK-7448:
---

This is a change that would be nice to benchmark for performance. It might 
require a large job, such as a huge flatMap, before we see any significant 
improvement here.

> Implement custom byte array serializer for use in PySpark shuffle
> 
>
> Key: SPARK-7448
> URL: https://issues.apache.org/jira/browse/SPARK-7448
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Shuffle
>Reporter: Josh Rosen
>
> PySpark's shuffle typically shuffles Java RDDs that contain byte arrays. We 
> should implement a custom Serializer for use in these shuffles.  This will 
> allow us to take advantage of shuffle optimizations like SPARK-7311 for 
> PySpark without requiring users to change the default serializer to 
> KryoSerializer (this is useful for JobServer-type applications).
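
The core of such a serializer is that a byte-array record can be written as a length prefix followed by the raw payload, with none of the per-object type metadata a general-purpose serializer emits. A minimal sketch of that framing (the `writeRecord`/`readRecord` helpers are illustrative only; the real change would live behind Spark's `Serializer`/`SerializerInstance` API, omitted here):

{code:java}
import java.io.{DataInputStream, DataOutputStream}

// Length-prefixed framing for byte-array records: the core idea behind a
// dedicated byte-array shuffle serializer. Sketch only, not the eventual
// implementation.
def writeRecord(out: DataOutputStream, bytes: Array[Byte]): Unit = {
  out.writeInt(bytes.length)  // 4-byte length prefix
  out.write(bytes)            // raw payload, no per-object type metadata
}

def readRecord(in: DataInputStream): Array[Byte] = {
  val length = in.readInt()
  val bytes = new Array[Byte](length)
  in.readFully(bytes)         // blocks until the full record is read
  bytes
}
{code}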



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7448) Implement custom byte array serializer for use in PySpark shuffle

2015-05-08 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-7448:
--
Priority: Minor  (was: Major)

> Implement custom byte array serializer for use in PySpark shuffle
> 
>
> Key: SPARK-7448
> URL: https://issues.apache.org/jira/browse/SPARK-7448
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Shuffle
>Reporter: Josh Rosen
>Priority: Minor
>
> PySpark's shuffle typically shuffles Java RDDs that contain byte arrays. We 
> should implement a custom Serializer for use in these shuffles.  This will 
> allow us to take advantage of shuffle optimizations like SPARK-7311 for 
> PySpark without requiring users to change the default serializer to 
> KryoSerializer (this is useful for JobServer-type applications).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7488) Python API for ml.recommendation

2015-05-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7488:
---

Assignee: (was: Apache Spark)

> Python API for ml.recommendation
> 
>
> Key: SPARK-7488
> URL: https://issues.apache.org/jira/browse/SPARK-7488
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Reporter: Burak Yavuz
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7488) Python API for ml.recommendation

2015-05-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14535284#comment-14535284
 ] 

Apache Spark commented on SPARK-7488:
-

User 'brkyvz' has created a pull request for this issue:
https://github.com/apache/spark/pull/6015

> Python API for ml.recommendation
> 
>
> Key: SPARK-7488
> URL: https://issues.apache.org/jira/browse/SPARK-7488
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Reporter: Burak Yavuz
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7488) Python API for ml.recommendation

2015-05-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7488:
---

Assignee: Apache Spark

> Python API for ml.recommendation
> 
>
> Key: SPARK-7488
> URL: https://issues.apache.org/jira/browse/SPARK-7488
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Reporter: Burak Yavuz
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7490) MapOutputTracker: close input streams to free native memory

2015-05-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14535264#comment-14535264
 ] 

Apache Spark commented on SPARK-7490:
-

User 'evanj' has created a pull request for this issue:
https://github.com/apache/spark/pull/5982

> MapOutputTracker: close input streams to free native memory
> ---
>
> Key: SPARK-7490
> URL: https://issues.apache.org/jira/browse/SPARK-7490
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Evan Jones
>Priority: Minor
>
> GZIPInputStream allocates native memory that is not freed until close() or 
> when the finalizer runs. It is best to close() these streams explicitly to 
> avoid native memory leaks.
> Pull request here: https://github.com/apache/spark/pull/5982
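
The fix boils down to closing the stream in a {{finally}} block instead of leaving cleanup to the finalizer. An illustrative sketch of the pattern (the `decompress` helper is hypothetical, not the actual MapOutputTracker code):

{code:java}
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import java.util.zip.GZIPInputStream

// Decompress a gzip payload, releasing zlib's native buffers promptly by
// closing the stream explicitly instead of waiting for finalization.
def decompress(compressed: Array[Byte]): Array[Byte] = {
  val in = new GZIPInputStream(new ByteArrayInputStream(compressed))
  try {
    val out = new ByteArrayOutputStream()
    val buf = new Array[Byte](8192)
    var n = in.read(buf)
    while (n != -1) {
      out.write(buf, 0, n)
      n = in.read(buf)
    }
    out.toByteArray
  } finally {
    in.close()  // frees native memory immediately
  }
}
{code}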



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


