[jira] [Resolved] (SPARK-7976) Add style checker to disallow overriding finalize

2015-05-30 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-7976.

   Resolution: Fixed
Fix Version/s: 1.5.0
   1.4.1

> Add style checker to disallow overriding finalize
> -
>
> Key: SPARK-7976
> URL: https://issues.apache.org/jira/browse/SPARK-7976
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 1.4.1, 1.5.0
>
>
> finalize() is called when the object is garbage collected, and garbage 
> collection is not guaranteed to happen. It is therefore unwise to rely on 
> code in the finalize() method.
> See 
> http://www.scalastyle.org/rules-0.7.0.html#org_scalastyle_scalariform_NoFinalizeChecker






[jira] [Updated] (SPARK-7976) Add style checker to disallow overriding finalize

2015-05-30 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-7976:
---
Summary: Add style checker to disallow overriding finalize  (was: Detect if 
finalize is used)

> Add style checker to disallow overriding finalize
> -
>
> Key: SPARK-7976
> URL: https://issues.apache.org/jira/browse/SPARK-7976
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> finalize() is called when the object is garbage collected, and garbage 
> collection is not guaranteed to happen. It is therefore unwise to rely on 
> code in the finalize() method.
> See 
> http://www.scalastyle.org/rules-0.7.0.html#org_scalastyle_scalariform_NoFinalizeChecker






[jira] [Commented] (SPARK-7197) Join with DataFrame Python API not working properly with more than 1 column

2015-05-30 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14566391#comment-14566391
 ] 

Reynold Xin commented on SPARK-7197:


I'm going to close this ticket since it is not a bug. We can create a separate 
ticket to update the documentation and add support for "on", which should make 
equijoins easier for users.

> Join with DataFrame Python API not working properly with more than 1 column
> ---
>
> Key: SPARK-7197
> URL: https://issues.apache.org/jira/browse/SPARK-7197
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.3.1
>Reporter: Ali Bajwa
>Priority: Critical
>
> It looks like join with the DataFrames API in Python does not return correct 
> results when using 2 or more columns. The example in the documentation
> only shows a single column.
> Here is an example to show the problem:
> Example code
> {code}
> import pandas as pd
> from pyspark.sql import SQLContext
> hc = SQLContext(sc)
> A = pd.DataFrame({'year': ['1993', '2005', '1994'], 'month': ['5',
> '12', '12'], 'value': [100, 200, 300]})
> a = hc.createDataFrame(A)
> B = pd.DataFrame({'year': ['1993', '1993'], 'month': ['12', '12'],
> 'value': [101, 102]})
> b = hc.createDataFrame(B)
> print "Pandas"  # try with Pandas
> print A
> print B
> print pd.merge(A, B, on=['year', 'month'], how='inner')
> print "Spark"
> print a.toPandas()
> print b.toPandas()
> print a.join(b, a.year==b.year and a.month==b.month, 'inner').toPandas()
> {code}
> *Output
> {code}
> Pandas
>   month  value  year
> 0     5    100  1993
> 1    12    200  2005
> 2    12    300  1994
>   month  value  year
> 0    12    101  1993
> 1    12    102  1993
> Empty DataFrame
> Columns: [month, value_x, year, value_y]
> Index: []
> Spark
>   month  value  year
> 0     5    100  1993
> 1    12    200  2005
> 2    12    300  1994
>   month  value  year
> 0    12    101  1993
> 1    12    102  1993
>   month  value  year month  value  year
> 0    12    200  2005    12    102  1993
> 1    12    200  2005    12    101  1993
> 2    12    300  1994    12    102  1993
> 3    12    300  1994    12    101  1993
> {code}
> It looks like Spark returns some results where an inner join should
> return nothing.
> Confirmed on user mailing list as an issue with Ayan Guha.






[jira] [Closed] (SPARK-7197) Join with DataFrame Python API not working properly with more than 1 column

2015-05-30 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin closed SPARK-7197.
--
Resolution: Not A Problem

> Join with DataFrame Python API not working properly with more than 1 column
> ---
>
> Key: SPARK-7197
> URL: https://issues.apache.org/jira/browse/SPARK-7197
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.3.1
>Reporter: Ali Bajwa
>Priority: Critical
>
> It looks like join with the DataFrames API in Python does not return correct 
> results when using 2 or more columns. The example in the documentation
> only shows a single column.
> Here is an example to show the problem:
> Example code
> {code}
> import pandas as pd
> from pyspark.sql import SQLContext
> hc = SQLContext(sc)
> A = pd.DataFrame({'year': ['1993', '2005', '1994'], 'month': ['5',
> '12', '12'], 'value': [100, 200, 300]})
> a = hc.createDataFrame(A)
> B = pd.DataFrame({'year': ['1993', '1993'], 'month': ['12', '12'],
> 'value': [101, 102]})
> b = hc.createDataFrame(B)
> print "Pandas"  # try with Pandas
> print A
> print B
> print pd.merge(A, B, on=['year', 'month'], how='inner')
> print "Spark"
> print a.toPandas()
> print b.toPandas()
> print a.join(b, a.year==b.year and a.month==b.month, 'inner').toPandas()
> {code}
> *Output
> {code}
> Pandas
>   month  value  year
> 0     5    100  1993
> 1    12    200  2005
> 2    12    300  1994
>   month  value  year
> 0    12    101  1993
> 1    12    102  1993
> Empty DataFrame
> Columns: [month, value_x, year, value_y]
> Index: []
> Spark
>   month  value  year
> 0     5    100  1993
> 1    12    200  2005
> 2    12    300  1994
>   month  value  year
> 0    12    101  1993
> 1    12    102  1993
>   month  value  year month  value  year
> 0    12    200  2005    12    102  1993
> 1    12    200  2005    12    101  1993
> 2    12    300  1994    12    102  1993
> 3    12    300  1994    12    101  1993
> {code}
> It looks like Spark returns some results where an inner join should
> return nothing.
> Confirmed on user mailing list as an issue with Ayan Guha.






[jira] [Comment Edited] (SPARK-7197) Join with DataFrame Python API not working properly with more than 1 column

2015-05-30 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14566389#comment-14566389
 ] 

Reynold Xin edited comment on SPARK-7197 at 5/31/15 6:34 AM:
-

Actually I don't think this is a bug. If you do this, then it works: 

{code}
a.join(b, (a.year==b.year) & (a.month==b.month), 'inner').explain(True)
{code}

This is a little bit unfortunate because it's not possible to override && or 
"and" in Python, so we took the same approach as Pandas -- overriding "&".



was (Author: rxin):
Actually I think this is not a bug. If you do this, then it works: 

{code}
a.join(b, (a.year==b.year) & (a.month==b.month), 'inner').explain(True)
{code}

This is a little bit unfortunate because it's not possible to override && or 
"and" in Python, so we took the same approach as Pandas -- overriding "&".


> Join with DataFrame Python API not working properly with more than 1 column
> ---
>
> Key: SPARK-7197
> URL: https://issues.apache.org/jira/browse/SPARK-7197
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.3.1
>Reporter: Ali Bajwa
>Priority: Critical
>
> It looks like join with the DataFrames API in Python does not return correct 
> results when using 2 or more columns. The example in the documentation
> only shows a single column.
> Here is an example to show the problem:
> Example code
> {code}
> import pandas as pd
> from pyspark.sql import SQLContext
> hc = SQLContext(sc)
> A = pd.DataFrame({'year': ['1993', '2005', '1994'], 'month': ['5',
> '12', '12'], 'value': [100, 200, 300]})
> a = hc.createDataFrame(A)
> B = pd.DataFrame({'year': ['1993', '1993'], 'month': ['12', '12'],
> 'value': [101, 102]})
> b = hc.createDataFrame(B)
> print "Pandas"  # try with Pandas
> print A
> print B
> print pd.merge(A, B, on=['year', 'month'], how='inner')
> print "Spark"
> print a.toPandas()
> print b.toPandas()
> print a.join(b, a.year==b.year and a.month==b.month, 'inner').toPandas()
> {code}
> *Output
> {code}
> Pandas
>   month  value  year
> 0     5    100  1993
> 1    12    200  2005
> 2    12    300  1994
>   month  value  year
> 0    12    101  1993
> 1    12    102  1993
> Empty DataFrame
> Columns: [month, value_x, year, value_y]
> Index: []
> Spark
>   month  value  year
> 0     5    100  1993
> 1    12    200  2005
> 2    12    300  1994
>   month  value  year
> 0    12    101  1993
> 1    12    102  1993
>   month  value  year month  value  year
> 0    12    200  2005    12    102  1993
> 1    12    200  2005    12    101  1993
> 2    12    300  1994    12    102  1993
> 3    12    300  1994    12    101  1993
> {code}
> It looks like Spark returns some results where an inner join should
> return nothing.
> Confirmed on user mailing list as an issue with Ayan Guha.






[jira] [Commented] (SPARK-7197) Join with DataFrame Python API not working properly with more than 1 column

2015-05-30 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14566389#comment-14566389
 ] 

Reynold Xin commented on SPARK-7197:


Actually I think this is not a bug. If you do this, then it works: 

{code}
a.join(b, (a.year==b.year) & (a.month==b.month), 'inner').explain(True)
{code}

This is a little bit unfortunate because it's not possible to override && or 
"and" in Python, so we took the same approach as Pandas -- overriding "&".


> Join with DataFrame Python API not working properly with more than 1 column
> ---
>
> Key: SPARK-7197
> URL: https://issues.apache.org/jira/browse/SPARK-7197
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.3.1
>Reporter: Ali Bajwa
>Priority: Critical
>
> It looks like join with the DataFrames API in Python does not return correct 
> results when using 2 or more columns. The example in the documentation
> only shows a single column.
> Here is an example to show the problem:
> Example code
> {code}
> import pandas as pd
> from pyspark.sql import SQLContext
> hc = SQLContext(sc)
> A = pd.DataFrame({'year': ['1993', '2005', '1994'], 'month': ['5',
> '12', '12'], 'value': [100, 200, 300]})
> a = hc.createDataFrame(A)
> B = pd.DataFrame({'year': ['1993', '1993'], 'month': ['12', '12'],
> 'value': [101, 102]})
> b = hc.createDataFrame(B)
> print "Pandas"  # try with Pandas
> print A
> print B
> print pd.merge(A, B, on=['year', 'month'], how='inner')
> print "Spark"
> print a.toPandas()
> print b.toPandas()
> print a.join(b, a.year==b.year and a.month==b.month, 'inner').toPandas()
> {code}
> *Output
> {code}
> Pandas
>   month  value  year
> 0     5    100  1993
> 1    12    200  2005
> 2    12    300  1994
>   month  value  year
> 0    12    101  1993
> 1    12    102  1993
> Empty DataFrame
> Columns: [month, value_x, year, value_y]
> Index: []
> Spark
>   month  value  year
> 0     5    100  1993
> 1    12    200  2005
> 2    12    300  1994
>   month  value  year
> 0    12    101  1993
> 1    12    102  1993
>   month  value  year month  value  year
> 0    12    200  2005    12    102  1993
> 1    12    200  2005    12    101  1993
> 2    12    300  1994    12    102  1993
> 3    12    300  1994    12    101  1993
> {code}
> It looks like Spark returns some results where an inner join should
> return nothing.
> Confirmed on user mailing list as an issue with Ayan Guha.






[jira] [Assigned] (SPARK-7979) Enforce structural type checker

2015-05-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7979:
---

Assignee: Reynold Xin  (was: Apache Spark)

> Enforce structural type checker
> ---
>
> Key: SPARK-7979
> URL: https://issues.apache.org/jira/browse/SPARK-7979
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> Structural types in Scala can use reflection - this can have unexpected 
> performance consequences.
> See 
> http://www.scalastyle.org/rules-0.7.0.html#org_scalastyle_scalariform_StructuralTypeChecker






[jira] [Commented] (SPARK-7979) Enforce structural type checker

2015-05-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14566387#comment-14566387
 ] 

Apache Spark commented on SPARK-7979:
-

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/6536

> Enforce structural type checker
> ---
>
> Key: SPARK-7979
> URL: https://issues.apache.org/jira/browse/SPARK-7979
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> Structural types in Scala can use reflection - this can have unexpected 
> performance consequences.
> See 
> http://www.scalastyle.org/rules-0.7.0.html#org_scalastyle_scalariform_StructuralTypeChecker






[jira] [Assigned] (SPARK-7979) Enforce structural type checker

2015-05-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7979:
---

Assignee: Apache Spark  (was: Reynold Xin)

> Enforce structural type checker
> ---
>
> Key: SPARK-7979
> URL: https://issues.apache.org/jira/browse/SPARK-7979
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Reporter: Reynold Xin
>Assignee: Apache Spark
>
> Structural types in Scala can use reflection - this can have unexpected 
> performance consequences.
> See 
> http://www.scalastyle.org/rules-0.7.0.html#org_scalastyle_scalariform_StructuralTypeChecker






[jira] [Created] (SPARK-7979) Enforce structural type checker

2015-05-30 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-7979:
--

 Summary: Enforce structural type checker
 Key: SPARK-7979
 URL: https://issues.apache.org/jira/browse/SPARK-7979
 Project: Spark
  Issue Type: Sub-task
Reporter: Reynold Xin
Assignee: Reynold Xin


Structural types in Scala can use reflection - this can have unexpected 
performance consequences.

See 
http://www.scalastyle.org/rules-0.7.0.html#org_scalastyle_scalariform_StructuralTypeChecker






[jira] [Commented] (SPARK-7197) Join with DataFrame Python API not working properly with more than 1 column

2015-05-30 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14566382#comment-14566382
 ] 

Reynold Xin commented on SPARK-7197:


Plan from [~rams]

{code}
a.join(b, a.year==b.year and a.month==b.month).explain()
ShuffledHashJoin [month#0], [month#3], BuildRight
 Exchange (HashPartitioning 200)
  PhysicalRDD [month#0,value#1L,year#2], MapPartitionsRDD[12] at applySchemaToPythonRDD at NativeMethodAccessorImpl.java:-2
 Exchange (HashPartitioning 200)
  PhysicalRDD [month#3,value#4L,year#5], MapPartitionsRDD[25] at applySchemaToPythonRDD at NativeMethodAccessorImpl.java:-2
{code}

It's missing the join predicate somehow.


> Join with DataFrame Python API not working properly with more than 1 column
> ---
>
> Key: SPARK-7197
> URL: https://issues.apache.org/jira/browse/SPARK-7197
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.3.1
>Reporter: Ali Bajwa
>Priority: Critical
>
> It looks like join with the DataFrames API in Python does not return correct 
> results when using 2 or more columns. The example in the documentation
> only shows a single column.
> Here is an example to show the problem:
> Example code
> {code}
> import pandas as pd
> from pyspark.sql import SQLContext
> hc = SQLContext(sc)
> A = pd.DataFrame({'year': ['1993', '2005', '1994'], 'month': ['5',
> '12', '12'], 'value': [100, 200, 300]})
> a = hc.createDataFrame(A)
> B = pd.DataFrame({'year': ['1993', '1993'], 'month': ['12', '12'],
> 'value': [101, 102]})
> b = hc.createDataFrame(B)
> print "Pandas"  # try with Pandas
> print A
> print B
> print pd.merge(A, B, on=['year', 'month'], how='inner')
> print "Spark"
> print a.toPandas()
> print b.toPandas()
> print a.join(b, a.year==b.year and a.month==b.month, 'inner').toPandas()
> {code}
> *Output
> {code}
> Pandas
>   month  value  year
> 0     5    100  1993
> 1    12    200  2005
> 2    12    300  1994
>   month  value  year
> 0    12    101  1993
> 1    12    102  1993
> Empty DataFrame
> Columns: [month, value_x, year, value_y]
> Index: []
> Spark
>   month  value  year
> 0     5    100  1993
> 1    12    200  2005
> 2    12    300  1994
>   month  value  year
> 0    12    101  1993
> 1    12    102  1993
>   month  value  year month  value  year
> 0    12    200  2005    12    102  1993
> 1    12    200  2005    12    101  1993
> 2    12    300  1994    12    102  1993
> 3    12    300  1994    12    101  1993
> {code}
> It looks like Spark returns some results where an inner join should
> return nothing.
> Confirmed on user mailing list as an issue with Ayan Guha.






[jira] [Updated] (SPARK-7197) Join with DataFrame Python API not working properly with more than 1 column

2015-05-30 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-7197:
---
Priority: Critical  (was: Major)

> Join with DataFrame Python API not working properly with more than 1 column
> ---
>
> Key: SPARK-7197
> URL: https://issues.apache.org/jira/browse/SPARK-7197
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.3.1
>Reporter: Ali Bajwa
>Priority: Critical
>
> It looks like join with the DataFrames API in Python does not return correct 
> results when using 2 or more columns. The example in the documentation
> only shows a single column.
> Here is an example to show the problem:
> Example code
> {code}
> import pandas as pd
> from pyspark.sql import SQLContext
> hc = SQLContext(sc)
> A = pd.DataFrame({'year': ['1993', '2005', '1994'], 'month': ['5',
> '12', '12'], 'value': [100, 200, 300]})
> a = hc.createDataFrame(A)
> B = pd.DataFrame({'year': ['1993', '1993'], 'month': ['12', '12'],
> 'value': [101, 102]})
> b = hc.createDataFrame(B)
> print "Pandas"  # try with Pandas
> print A
> print B
> print pd.merge(A, B, on=['year', 'month'], how='inner')
> print "Spark"
> print a.toPandas()
> print b.toPandas()
> print a.join(b, a.year==b.year and a.month==b.month, 'inner').toPandas()
> {code}
> *Output
> {code}
> Pandas
>   month  value  year
> 0     5    100  1993
> 1    12    200  2005
> 2    12    300  1994
>   month  value  year
> 0    12    101  1993
> 1    12    102  1993
> Empty DataFrame
> Columns: [month, value_x, year, value_y]
> Index: []
> Spark
>   month  value  year
> 0     5    100  1993
> 1    12    200  2005
> 2    12    300  1994
>   month  value  year
> 0    12    101  1993
> 1    12    102  1993
>   month  value  year month  value  year
> 0    12    200  2005    12    102  1993
> 1    12    200  2005    12    101  1993
> 2    12    300  1994    12    102  1993
> 3    12    300  1994    12    101  1993
> {code}
> It looks like Spark returns some results where an inner join should
> return nothing.
> Confirmed on user mailing list as an issue with Ayan Guha.






[jira] [Commented] (SPARK-7197) Join with DataFrame Python API not working properly with more than 1 column

2015-05-30 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14566379#comment-14566379
 ] 

Reynold Xin commented on SPARK-7197:


[~davies] this seems to be broken only in Python. Can you take a look?

> Join with DataFrame Python API not working properly with more than 1 column
> ---
>
> Key: SPARK-7197
> URL: https://issues.apache.org/jira/browse/SPARK-7197
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.3.1
>Reporter: Ali Bajwa
>
> It looks like join with the DataFrames API in Python does not return correct 
> results when using 2 or more columns. The example in the documentation
> only shows a single column.
> Here is an example to show the problem:
> Example code
> {code}
> import pandas as pd
> from pyspark.sql import SQLContext
> hc = SQLContext(sc)
> A = pd.DataFrame({'year': ['1993', '2005', '1994'], 'month': ['5',
> '12', '12'], 'value': [100, 200, 300]})
> a = hc.createDataFrame(A)
> B = pd.DataFrame({'year': ['1993', '1993'], 'month': ['12', '12'],
> 'value': [101, 102]})
> b = hc.createDataFrame(B)
> print "Pandas"  # try with Pandas
> print A
> print B
> print pd.merge(A, B, on=['year', 'month'], how='inner')
> print "Spark"
> print a.toPandas()
> print b.toPandas()
> print a.join(b, a.year==b.year and a.month==b.month, 'inner').toPandas()
> {code}
> *Output
> {code}
> Pandas
>   month  value  year
> 0     5    100  1993
> 1    12    200  2005
> 2    12    300  1994
>   month  value  year
> 0    12    101  1993
> 1    12    102  1993
> Empty DataFrame
> Columns: [month, value_x, year, value_y]
> Index: []
> Spark
>   month  value  year
> 0     5    100  1993
> 1    12    200  2005
> 2    12    300  1994
>   month  value  year
> 0    12    101  1993
> 1    12    102  1993
>   month  value  year month  value  year
> 0    12    200  2005    12    102  1993
> 1    12    200  2005    12    101  1993
> 2    12    300  1994    12    102  1993
> 3    12    300  1994    12    101  1993
> {code}
> It looks like Spark returns some results where an inner join should
> return nothing.
> Confirmed on user mailing list as an issue with Ayan Guha.






[jira] [Commented] (SPARK-7819) Isolated Hive Client Loader appears to cause Native Library libMapRClient.4.0.2-mapr.so already loaded in another classloader error

2015-05-30 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14566378#comment-14566378
 ] 

Yin Huai commented on SPARK-7819:
-

[~coderfi] Can you provide some details on those other tests?

> Isolated Hive Client Loader appears to cause Native Library 
> libMapRClient.4.0.2-mapr.so already loaded in another classloader error
> ---
>
> Key: SPARK-7819
> URL: https://issues.apache.org/jira/browse/SPARK-7819
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0
>Reporter: Fi
>Priority: Critical
> Attachments: stacktrace.txt, test.py
>
>
> In reference to the pull request: https://github.com/apache/spark/pull/5876
> I have been running the Spark 1.3 branch for some time with no major hiccups, 
> and recently switched to the Spark 1.4 branch.
> I build my spark distribution with the following build command:
> {noformat}
> make-distribution.sh --tgz --skip-java-test --with-tachyon -Phive 
> -Phive-0.13.1 -Pmapr4 -Pspark-ganglia-lgpl -Pkinesis-asl -Phive-thriftserver
> {noformat}
> When running a python script containing a series of smoke tests I use to 
> validate the build, I encountered an error under the following conditions:
> * start a spark context
> * start a hive context
> * run any hive query
> * stop the spark context
> * start a second spark context
> * run any hive query
> ** ERROR
> From what I can tell, the Isolated Class Loader is hitting a MapR class that 
> is loading its native library (presumably as part of a static initializer).
> Unfortunately, the JVM prohibits this the second time around.
> I would think that shutting down the SparkContext would clear out any 
> vestiges from the JVM, so I'm surprised that this would even be a problem.
> Note: all other smoke tests we are running pass fine.
> I will attach the stacktrace and a python script reproducing the issue (at 
> least for my environment and build).
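
For reference, a condensed, hypothetical PySpark sketch of the reproduction steps listed above (the attached test.py is the authoritative script; this assumes a local master and a working Hive setup on the MapR build):

{code}
# Hypothetical condensed repro of the steps above; see the attached test.py
# for the real script.
from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext("local[2]", "smoke-test-1")
hc = HiveContext(sc)
hc.sql("SHOW TABLES").collect()   # any Hive query
sc.stop()

sc = SparkContext("local[2]", "smoke-test-2")
hc = HiveContext(sc)
hc.sql("SHOW TABLES").collect()   # on the MapR build, this is where the
                                  # "already loaded in another classloader"
                                  # error is reported
sc.stop()
{code}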






[jira] [Commented] (SPARK-3850) Scala style: disallow trailing spaces

2015-05-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14566376#comment-14566376
 ] 

Apache Spark commented on SPARK-3850:
-

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/6535

> Scala style: disallow trailing spaces
> -
>
> Key: SPARK-3850
> URL: https://issues.apache.org/jira/browse/SPARK-3850
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Background discussions:
> * https://github.com/apache/spark/pull/2619
> * 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Extending-Scala-style-checks-td8624.html
> If you look at [the PR Cheng 
> opened|https://github.com/apache/spark/pull/2619], you'll see a trailing 
> white space seemed to mess up some SQL test. That's what spurred the creation 
> of this issue.
> [Ted Yu on the dev 
> list|http://mail-archives.apache.org/mod_mbox/spark-dev/201410.mbox/%3ccalte62y7a6wybdufdcguwbf8wcpttvie+pao4pzor+_-nb2...@mail.gmail.com%3E]
>  suggested using this 
> [{{WhitespaceEndOfLineChecker}}|http://www.scalastyle.org/rules-0.1.0.html].






[jira] [Commented] (SPARK-7819) Isolated Hive Client Loader appears to cause Native Library libMapRClient.4.0.2-mapr.so already loaded in another classloader error

2015-05-30 Thread Fi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14566377#comment-14566377
 ] 

Fi commented on SPARK-7819:
---

FYI, seems to work on a basic test, thanks!

However, I am running into an Out of PermGen space error on other tests further 
down in the same process.
I'll try increasing the memory settings in the JVM OPTS to see if it goes away. 
Hopefully it's not due to some sort of resource leak, but simply because more 
classes need to be kept in memory now.

Fi


> Isolated Hive Client Loader appears to cause Native Library 
> libMapRClient.4.0.2-mapr.so already loaded in another classloader error
> ---
>
> Key: SPARK-7819
> URL: https://issues.apache.org/jira/browse/SPARK-7819
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0
>Reporter: Fi
>Priority: Critical
> Attachments: stacktrace.txt, test.py
>
>
> In reference to the pull request: https://github.com/apache/spark/pull/5876
> I have been running the Spark 1.3 branch for some time with no major hiccups, 
> and recently switched to the Spark 1.4 branch.
> I build my spark distribution with the following build command:
> {noformat}
> make-distribution.sh --tgz --skip-java-test --with-tachyon -Phive 
> -Phive-0.13.1 -Pmapr4 -Pspark-ganglia-lgpl -Pkinesis-asl -Phive-thriftserver
> {noformat}
> When running a python script containing a series of smoke tests I use to 
> validate the build, I encountered an error under the following conditions:
> * start a spark context
> * start a hive context
> * run any hive query
> * stop the spark context
> * start a second spark context
> * run any hive query
> ** ERROR
> From what I can tell, the Isolated Class Loader is hitting a MapR class that 
> is loading its native library (presumably as part of a static initializer).
> Unfortunately, the JVM prohibits this the second time around.
> I would think that shutting down the SparkContext would clear out any 
> vestiges from the JVM, so I'm surprised that this would even be a problem.
> Note: all other smoke tests we are running pass fine.
> I will attach the stacktrace and a python script reproducing the issue (at 
> least for my environment and build).






[jira] [Updated] (SPARK-7197) Join with DataFrame Python API not working properly with more than 1 column

2015-05-30 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-7197:
---
Description: 
It looks like join with the DataFrames API in Python does not return correct 
results when using 2 or more columns. The example in the documentation
only shows a single column.

Here is an example to show the problem:

Example code
{code}
import pandas as pd
from pyspark.sql import SQLContext
hc = SQLContext(sc)
A = pd.DataFrame({'year': ['1993', '2005', '1994'], 'month': ['5',
'12', '12'], 'value': [100, 200, 300]})
a = hc.createDataFrame(A)
B = pd.DataFrame({'year': ['1993', '1993'], 'month': ['12', '12'],
'value': [101, 102]})
b = hc.createDataFrame(B)

print "Pandas"  # try with Pandas
print A
print B
print pd.merge(A, B, on=['year', 'month'], how='inner')

print "Spark"
print a.toPandas()
print b.toPandas()
print a.join(b, a.year==b.year and a.month==b.month, 'inner').toPandas()
{code}
*Output
{code}
Pandas
  month  value  year
0     5    100  1993
1    12    200  2005
2    12    300  1994

  month  value  year
0    12    101  1993
1    12    102  1993

Empty DataFrame

Columns: [month, value_x, year, value_y]

Index: []

Spark
  month  value  year
0     5    100  1993
1    12    200  2005
2    12    300  1994

  month  value  year
0    12    101  1993
1    12    102  1993

  month  value  year month  value  year
0    12    200  2005    12    102  1993
1    12    200  2005    12    101  1993
2    12    300  1994    12    102  1993
3    12    300  1994    12    101  1993
{code}

It looks like Spark returns some results where an inner join should
return nothing.

Confirmed on user mailing list as an issue with Ayan Guha.

  was:
It looks like join with the DataFrames API in Python does not return correct 
results when using 2 or more columns. The example in the documentation
only shows a single column.

Here is an example to show the problem:

Example code

import pandas as pd
from pyspark.sql import SQLContext
hc = SQLContext(sc)
A = pd.DataFrame({'year': ['1993', '2005', '1994'], 'month': ['5',
'12', '12'], 'value': [100, 200, 300]})
a = hc.createDataFrame(A)
B = pd.DataFrame({'year': ['1993', '1993'], 'month': ['12', '12'],
'value': [101, 102]})
b = hc.createDataFrame(B)

print "Pandas"  # try with Pandas
print A
print B
print pd.merge(A, B, on=['year', 'month'], how='inner')

print "Spark"
print a.toPandas()
print b.toPandas()
print a.join(b, a.year==b.year and a.month==b.month, 'inner').toPandas()

*Output

Pandas
  month  value  year
0     5    100  1993
1    12    200  2005
2    12    300  1994

  month  value  year
0    12    101  1993
1    12    102  1993

Empty DataFrame

Columns: [month, value_x, year, value_y]

Index: []

Spark
  month  value  year
0     5    100  1993
1    12    200  2005
2    12    300  1994

  month  value  year
0    12    101  1993
1    12    102  1993

  month  value  year month  value  year
0    12    200  2005    12    102  1993
1    12    200  2005    12    101  1993
2    12    300  1994    12    102  1993
3    12    300  1994    12    101  1993

It looks like Spark returns some results where an inner join should
return nothing.

Confirmed on user mailing list as an issue with Ayan Guha.


> Join with DataFrame Python API not working properly with more than 1 column
> ---
>
> Key: SPARK-7197
> URL: https://issues.apache.org/jira/browse/SPARK-7197
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.3.1
>Reporter: Ali Bajwa
>
> It looks like join with the DataFrames API in Python does not return correct 
> results when using 2 or more columns. The example in the documentation
> only shows a single column.
> Here is an example to show the problem:
> Example code
> {code}
> import pandas as pd
> from pyspark.sql import SQLContext
> hc = SQLContext(sc)
> A = pd.DataFrame({'year': ['1993', '2005', '1994'], 'month': ['5',
> '12', '12'], 'value': [100, 200, 300]})
> a = hc.createDataFrame(A)
> B = pd.DataFrame({'year': ['1993', '1993'], 'month': ['12', '12'],
> 'value': [101, 102]})
> b = hc.createDataFrame(B)
> print "Pandas"  # try with Pandas
> print A
> print B
> print pd.merge(A, B, on=['year', 'month'], how='inner')
> print "Spark"
> print a.toPandas()
> print b.toPandas()
> print a.join(b, a.year==b.year and a.month==b.month, 'inner').toPandas()
> {code}
> *Output
> {code}
> Pandas
>   month  value  year
> 0     5    100  1993
> 1    12    200  2005
> 2    12    300  1994
>   month  value  year
> 0    12    101  1993
> 1    12    102  1993
> Empty DataFrame
> Columns: [month, value_x, year, value_y]
> Index: []
> Spark
>   month  value  year
> 0     5    100  1993
> 1    12    200  2005
> 2    12    300  1994
>   month  value  year
> 0

[jira] [Commented] (SPARK-3850) Scala style: disallow trailing spaces

2015-05-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14566373#comment-14566373
 ] 

Apache Spark commented on SPARK-3850:
-

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/6534

> Scala style: disallow trailing spaces
> -
>
> Key: SPARK-3850
> URL: https://issues.apache.org/jira/browse/SPARK-3850
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Background discussions:
> * https://github.com/apache/spark/pull/2619
> * 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Extending-Scala-style-checks-td8624.html
> If you look at [the PR Cheng 
> opened|https://github.com/apache/spark/pull/2619], you'll see a trailing 
> white space seemed to mess up some SQL test. That's what spurred the creation 
> of this issue.
> [Ted Yu on the dev 
> list|http://mail-archives.apache.org/mod_mbox/spark-dev/201410.mbox/%3ccalte62y7a6wybdufdcguwbf8wcpttvie+pao4pzor+_-nb2...@mail.gmail.com%3E]
>  suggested using this 
> [{{WhitespaceEndOfLineChecker}}|http://www.scalastyle.org/rules-0.1.0.html].






[jira] [Commented] (SPARK-3850) Scala style: disallow trailing spaces

2015-05-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14566368#comment-14566368
 ] 

Apache Spark commented on SPARK-3850:
-

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/6533

> Scala style: disallow trailing spaces
> -
>
> Key: SPARK-3850
> URL: https://issues.apache.org/jira/browse/SPARK-3850
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Background discussions:
> * https://github.com/apache/spark/pull/2619
> * 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Extending-Scala-style-checks-td8624.html
> If you look at [the PR Cheng 
> opened|https://github.com/apache/spark/pull/2619], you'll see a trailing 
> white space seemed to mess up some SQL test. That's what spurred the creation 
> of this issue.
> [Ted Yu on the dev 
> list|http://mail-archives.apache.org/mod_mbox/spark-dev/201410.mbox/%3ccalte62y7a6wybdufdcguwbf8wcpttvie+pao4pzor+_-nb2...@mail.gmail.com%3E]
>  suggested using this 
> [{{WhitespaceEndOfLineChecker}}|http://www.scalastyle.org/rules-0.1.0.html].






[jira] [Assigned] (SPARK-7978) DecimalType should not be singleton

2015-05-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7978:
---

Assignee: Davies Liu  (was: Apache Spark)

> DecimalType should not be singleton
> ---
>
> Key: SPARK-7978
> URL: https://issues.apache.org/jira/browse/SPARK-7978
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.4.0
>Reporter: Davies Liu
>Assignee: Davies Liu
>Priority: Blocker
>
> The DecimalType cannot be constructed with parameters. When it's constructed 
> without parameters, we always get the same object, which is wrong.
> {code}
> >>> from pyspark.sql.types import *
> >>> DecimalType(1, 2)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> TypeError: __call__() takes exactly 1 argument (3 given)
> >>> DecimalType()
> DecimalType()
> {code}






[jira] [Assigned] (SPARK-7978) DecimalType should not be singleton

2015-05-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7978:
---

Assignee: Apache Spark  (was: Davies Liu)

> DecimalType should not be singleton
> ---
>
> Key: SPARK-7978
> URL: https://issues.apache.org/jira/browse/SPARK-7978
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.4.0
>Reporter: Davies Liu
>Assignee: Apache Spark
>Priority: Blocker
>
> The DecimalType cannot be constructed with parameters. When it's constructed 
> without parameters, we always get the same object, which is wrong.
> {code}
> >>> from pyspark.sql.types import *
> >>> DecimalType(1, 2)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> TypeError: __call__() takes exactly 1 argument (3 given)
> >>> DecimalType()
> DecimalType()
> {code}






[jira] [Commented] (SPARK-7978) DecimalType should not be singleton

2015-05-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14566364#comment-14566364
 ] 

Apache Spark commented on SPARK-7978:
-

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/6532

> DecimalType should not be singleton
> ---
>
> Key: SPARK-7978
> URL: https://issues.apache.org/jira/browse/SPARK-7978
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.4.0
>Reporter: Davies Liu
>Assignee: Davies Liu
>Priority: Blocker
>
> The DecimalType cannot be constructed with parameters. When it's constructed 
> without parameters, we always get the same object, which is wrong.
> {code}
> >>> from pyspark.sql.types import *
> >>> DecimalType(1, 2)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> TypeError: __call__() takes exactly 1 argument (3 given)
> >>> DecimalType()
> DecimalType()
> {code}






[jira] [Assigned] (SPARK-7710) User guide and example code for math/stat functions in DataFrames

2015-05-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7710:
---

Assignee: Burak Yavuz  (was: Apache Spark)

> User guide and example code for math/stat functions in DataFrames
> -
>
> Key: SPARK-7710
> URL: https://issues.apache.org/jira/browse/SPARK-7710
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, SQL
>Affects Versions: 1.4.0
>Reporter: Xiangrui Meng
>Assignee: Burak Yavuz
>Priority: Critical
>







[jira] [Assigned] (SPARK-7710) User guide and example code for math/stat functions in DataFrames

2015-05-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7710:
---

Assignee: Apache Spark  (was: Burak Yavuz)

> User guide and example code for math/stat functions in DataFrames
> -
>
> Key: SPARK-7710
> URL: https://issues.apache.org/jira/browse/SPARK-7710
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, SQL
>Affects Versions: 1.4.0
>Reporter: Xiangrui Meng
>Assignee: Apache Spark
>Priority: Critical
>







[jira] [Commented] (SPARK-7710) User guide and example code for math/stat functions in DataFrames

2015-05-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14566361#comment-14566361
 ] 

Apache Spark commented on SPARK-7710:
-

User 'brkyvz' has created a pull request for this issue:
https://github.com/apache/spark/pull/6531

> User guide and example code for math/stat functions in DataFrames
> -
>
> Key: SPARK-7710
> URL: https://issues.apache.org/jira/browse/SPARK-7710
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, SQL
>Affects Versions: 1.4.0
>Reporter: Xiangrui Meng
>Assignee: Burak Yavuz
>Priority: Critical
>







[jira] [Created] (SPARK-7978) DecimalType should not be singleton

2015-05-30 Thread Davies Liu (JIRA)
Davies Liu created SPARK-7978:
-

 Summary: DecimalType should not be singleton
 Key: SPARK-7978
 URL: https://issues.apache.org/jira/browse/SPARK-7978
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 1.4.0
Reporter: Davies Liu
Assignee: Davies Liu
Priority: Blocker


The DecimalType cannot be constructed with parameters. When it's constructed 
without parameters, we always get the same object, which is wrong.

{code}
>>> from pyspark.sql.types import *
>>> DecimalType(1, 2)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: __call__() takes exactly 1 argument (3 given)
>>> DecimalType()
DecimalType()
{code}
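
For illustration, a plain-Python sketch (not PySpark's actual source) of how handing out a single cached instance from a parameter-less factory produces exactly this TypeError:

{code}
# Hypothetical singleton metaclass, for illustration only: its __call__ takes
# no extra arguments, so it works for parameter-less types but breaks
# DecimalType(precision, scale).
class Singleton(type):
    _instances = {}
    def __call__(cls):
        if cls not in cls._instances:
            cls._instances[cls] = super(Singleton, cls).__call__()
        return cls._instances[cls]

class DecimalType(object):
    __metaclass__ = Singleton

print DecimalType() is DecimalType()   # True -- always the same cached object
DecimalType(1, 2)                      # TypeError: __call__() takes exactly
                                       #            1 argument (3 given)
{code}

Dropping the singleton behaviour for DecimalType, so that precision and scale can reach the constructor, is presumably what the fix needs to do.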






[jira] [Assigned] (SPARK-3850) Scala style: disallow trailing spaces

2015-05-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-3850:
---

Assignee: (was: Apache Spark)

> Scala style: disallow trailing spaces
> -
>
> Key: SPARK-3850
> URL: https://issues.apache.org/jira/browse/SPARK-3850
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Background discussions:
> * https://github.com/apache/spark/pull/2619
> * 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Extending-Scala-style-checks-td8624.html
> If you look at [the PR Cheng 
> opened|https://github.com/apache/spark/pull/2619], you'll see a trailing 
> white space seemed to mess up some SQL test. That's what spurred the creation 
> of this issue.
> [Ted Yu on the dev 
> list|http://mail-archives.apache.org/mod_mbox/spark-dev/201410.mbox/%3ccalte62y7a6wybdufdcguwbf8wcpttvie+pao4pzor+_-nb2...@mail.gmail.com%3E]
>  suggested using this 
> [{{WhitespaceEndOfLineChecker}}|http://www.scalastyle.org/rules-0.1.0.html].






[jira] [Commented] (SPARK-3850) Scala style: disallow trailing spaces

2015-05-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14566360#comment-14566360
 ] 

Apache Spark commented on SPARK-3850:
-

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/6530

> Scala style: disallow trailing spaces
> -
>
> Key: SPARK-3850
> URL: https://issues.apache.org/jira/browse/SPARK-3850
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Background discussions:
> * https://github.com/apache/spark/pull/2619
> * 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Extending-Scala-style-checks-td8624.html
> If you look at [the PR Cheng 
> opened|https://github.com/apache/spark/pull/2619], you'll see a trailing 
> white space seemed to mess up some SQL test. That's what spurred the creation 
> of this issue.
> [Ted Yu on the dev 
> list|http://mail-archives.apache.org/mod_mbox/spark-dev/201410.mbox/%3ccalte62y7a6wybdufdcguwbf8wcpttvie+pao4pzor+_-nb2...@mail.gmail.com%3E]
>  suggested using this 
> [{{WhitespaceEndOfLineChecker}}|http://www.scalastyle.org/rules-0.1.0.html].






[jira] [Assigned] (SPARK-3850) Scala style: disallow trailing spaces

2015-05-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-3850:
---

Assignee: Apache Spark

> Scala style: disallow trailing spaces
> -
>
> Key: SPARK-3850
> URL: https://issues.apache.org/jira/browse/SPARK-3850
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Reporter: Nicholas Chammas
>Assignee: Apache Spark
>Priority: Minor
>
> Background discussions:
> * https://github.com/apache/spark/pull/2619
> * 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Extending-Scala-style-checks-td8624.html
> If you look at [the PR Cheng 
> opened|https://github.com/apache/spark/pull/2619], you'll see a trailing 
> white space seemed to mess up some SQL test. That's what spurred the creation 
> of this issue.
> [Ted Yu on the dev 
> list|http://mail-archives.apache.org/mod_mbox/spark-dev/201410.mbox/%3ccalte62y7a6wybdufdcguwbf8wcpttvie+pao4pzor+_-nb2...@mail.gmail.com%3E]
>  suggested using this 
> [{{WhitespaceEndOfLineChecker}}|http://www.scalastyle.org/rules-0.1.0.html].






[jira] [Updated] (SPARK-7977) Disallow println

2015-05-30 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-7977:
---
Labels: starter  (was: )

> Disallow println
> 
>
> Key: SPARK-7977
> URL: https://issues.apache.org/jira/browse/SPARK-7977
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Reporter: Reynold Xin
>  Labels: starter
>
> Very often we see pull requests that added a println for debugging, but the 
> author forgot to remove it before code review.
> We can use the regex checker to disallow println. For legitimate uses of 
> println, we can then disable the rule.
> Add to the scalastyle-config.xml file:
> {code}
> <check class="org.scalastyle.scalariform.TokenChecker" enabled="true">
>   <parameters>
>     <parameter name="regex">^println$</parameter>
>   </parameters>
> </check>
> {code}






[jira] [Created] (SPARK-7977) Disallow println

2015-05-30 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-7977:
--

 Summary: Disallow println
 Key: SPARK-7977
 URL: https://issues.apache.org/jira/browse/SPARK-7977
 Project: Spark
  Issue Type: Sub-task
Reporter: Reynold Xin


Very often we see pull requests that added a println for debugging, but the 
author forgot to remove it before code review.

We can use the regex checker to disallow println. For legitimate uses of 
println, we can then disable the rule.

Add to the scalastyle-config.xml file:
{code}
<check class="org.scalastyle.scalariform.TokenChecker" enabled="true">
  <parameters>
    <parameter name="regex">^println$</parameter>
  </parameters>
</check>
{code}







[jira] [Commented] (SPARK-7976) Detect if finalize is used

2015-05-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14566348#comment-14566348
 ] 

Apache Spark commented on SPARK-7976:
-

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/6528

> Detect if finalize is used
> --
>
> Key: SPARK-7976
> URL: https://issues.apache.org/jira/browse/SPARK-7976
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> finalize() is called when the object is garbage collected, and garbage 
> collection is not guaranteed to happen. It is therefore unwise to rely on 
> code in the finalize() method.
> See 
> http://www.scalastyle.org/rules-0.7.0.html#org_scalastyle_scalariform_NoFinalizeChecker






[jira] [Assigned] (SPARK-7976) Detect if finalize is used

2015-05-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7976:
---

Assignee: Reynold Xin  (was: Apache Spark)

> Detect if finalize is used
> --
>
> Key: SPARK-7976
> URL: https://issues.apache.org/jira/browse/SPARK-7976
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> finalize() is called when the object is garbage collected, and garbage 
> collection is not guaranteed to happen. It is therefore unwise to rely on 
> code in the finalize() method.
> See 
> http://www.scalastyle.org/rules-0.7.0.html#org_scalastyle_scalariform_NoFinalizeChecker






[jira] [Assigned] (SPARK-7976) Detect if finalize is used

2015-05-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7976:
---

Assignee: Apache Spark  (was: Reynold Xin)

> Detect if finalize is used
> --
>
> Key: SPARK-7976
> URL: https://issues.apache.org/jira/browse/SPARK-7976
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Reporter: Reynold Xin
>Assignee: Apache Spark
>
> finalize() is called when the object is garbage collected, and garbage 
> collection is not guaranteed to happen. It is therefore unwise to rely on 
> code in the finalize() method.
> See 
> http://www.scalastyle.org/rules-0.7.0.html#org_scalastyle_scalariform_NoFinalizeChecker






[jira] [Created] (SPARK-7976) Detect if finalize is used

2015-05-30 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-7976:
--

 Summary: Detect if finalize is used
 Key: SPARK-7976
 URL: https://issues.apache.org/jira/browse/SPARK-7976
 Project: Spark
  Issue Type: Sub-task
Reporter: Reynold Xin
Assignee: Reynold Xin


finalize() is called when the object is garbage collected, and garbage 
collection is not guaranteed to happen. It is therefore unwise to rely on code 
in the finalize() method.


See 
http://www.scalastyle.org/rules-0.7.0.html#org_scalastyle_scalariform_NoFinalizeChecker






[jira] [Assigned] (SPARK-7975) CovariantEqualsChecker

2015-05-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7975:
---

Assignee: Apache Spark  (was: Reynold Xin)

> CovariantEqualsChecker
> --
>
> Key: SPARK-7975
> URL: https://issues.apache.org/jira/browse/SPARK-7975
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Reporter: Reynold Xin
>Assignee: Apache Spark
>
> Mistakenly defining a covariant equals() method without overriding method 
> equals(java.lang.Object) can produce unexpected runtime behaviour.
> See 
> http://www.scalastyle.org/rules-0.7.0.html#org_scalastyle_scalariform_CovariantEqualsChecker






[jira] [Assigned] (SPARK-7975) CovariantEqualsChecker

2015-05-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7975:
---

Assignee: Reynold Xin  (was: Apache Spark)

> CovariantEqualsChecker
> --
>
> Key: SPARK-7975
> URL: https://issues.apache.org/jira/browse/SPARK-7975
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> Mistakenly defining a covariant equals() method without overriding method 
> equals(java.lang.Object) can produce unexpected runtime behaviour.
> See 
> http://www.scalastyle.org/rules-0.7.0.html#org_scalastyle_scalariform_CovariantEqualsChecker






[jira] [Commented] (SPARK-7975) CovariantEqualsChecker

2015-05-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14566337#comment-14566337
 ] 

Apache Spark commented on SPARK-7975:
-

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/6527

> CovariantEqualsChecker
> --
>
> Key: SPARK-7975
> URL: https://issues.apache.org/jira/browse/SPARK-7975
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> Mistakenly defining a covariant equals() method without overriding method 
> equals(java.lang.Object) can produce unexpected runtime behaviour.
> See 
> http://www.scalastyle.org/rules-0.7.0.html#org_scalastyle_scalariform_CovariantEqualsChecker



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7975) CovariantEqualsChecker

2015-05-30 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-7975:
--

 Summary: CovariantEqualsChecker
 Key: SPARK-7975
 URL: https://issues.apache.org/jira/browse/SPARK-7975
 Project: Spark
  Issue Type: Sub-task
Reporter: Reynold Xin
Assignee: Reynold Xin


Mistakenly defining a covariant equals() method without overriding method 
equals(java.lang.Object) can produce unexpected runtime behaviour.

See 
http://www.scalastyle.org/rules-0.7.0.html#org_scalastyle_scalariform_CovariantEqualsChecker
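
For illustration, a small Scala example (not from the Spark codebase) of the 
pitfall this rule guards against:

{code}
class Point(val x: Int, val y: Int) {
  // Covariant equals: overloads equals for Point but does NOT override equals(Any).
  def equals(other: Point): Boolean = x == other.x && y == other.y
}

val a = new Point(1, 2)
val b = new Point(1, 2)
a.equals(b)        // true  -- resolves statically to the covariant equals(Point)
a == b             // false -- '==' dispatches to equals(Any), i.e. reference equality
Seq(a).contains(b) // false -- collections also compare elements via equals(Any)
{code}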



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7927) Enforce whitespace for more tokens in style checker

2015-05-30 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-7927:
---
Parent Issue: SPARK-3849  (was: SPARK-7974)

> Enforce whitespace for more tokens in style checker
> ---
>
> Key: SPARK-7927
> URL: https://issues.apache.org/jira/browse/SPARK-7927
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 1.4.0
>
>
> Enforce whitespace on comma, colon, if, while, etc ... so we don't need to 
> keep spending time on this in code reviews.
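
As a rough illustration (not from the Spark codebase) of the conventions such 
whitespace rules enforce:

{code}
// Would be flagged: no space after commas/colons, cramped keywords.
def add(a:Int,b:Int):Int = if(a > 0) a + b else b

// Passes: single space after ',' and ':', space after 'if'.
def add2(a: Int, b: Int): Int = if (a > 0) a + b else b
{code}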



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Deleted] (SPARK-7974) Stricter style checker rules

2015-05-30 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin deleted SPARK-7974:
---


> Stricter style checker rules
> 
>
> Key: SPARK-7974
> URL: https://issues.apache.org/jira/browse/SPARK-7974
> Project: Spark
>  Issue Type: Umbrella
>Reporter: Reynold Xin
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7940) Enforce whitespace checking for DO, TRY, CATCH, FINALLY, MATCH

2015-05-30 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-7940:
---
Parent Issue: SPARK-3849  (was: SPARK-7974)

> Enforce whitespace checking for DO, TRY, CATCH, FINALLY, MATCH
> --
>
> Key: SPARK-7940
> URL: https://issues.apache.org/jira/browse/SPARK-7940
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 1.4.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7940) Enforce whitespace checking for DO, TRY, CATCH, FINALLY, MATCH

2015-05-30 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-7940:
---
Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-7974

> Enforce whitespace checking for DO, TRY, CATCH, FINALLY, MATCH
> --
>
> Key: SPARK-7940
> URL: https://issues.apache.org/jira/browse/SPARK-7940
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 1.4.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7927) Enforce whitespace for more tokens in style checker

2015-05-30 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-7927:
---
Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-7974

> Enforce whitespace for more tokens in style checker
> ---
>
> Key: SPARK-7927
> URL: https://issues.apache.org/jira/browse/SPARK-7927
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 1.4.0
>
>
> Enforce whitespace on comma, colon, if, while, etc ... so we don't need to 
> keep spending time on this in code reviews.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7974) Stricter style checker rules

2015-05-30 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-7974:
--

 Summary: Stricter style checker rules
 Key: SPARK-7974
 URL: https://issues.apache.org/jira/browse/SPARK-7974
 Project: Spark
  Issue Type: Umbrella
Reporter: Reynold Xin






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-7971) Add JavaDoc style deprecation for deprecated DataFrame methods

2015-05-30 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-7971.

   Resolution: Fixed
Fix Version/s: 1.4.0

> Add JavaDoc style deprecation for deprecated DataFrame methods
> --
>
> Key: SPARK-7971
> URL: https://issues.apache.org/jira/browse/SPARK-7971
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 1.4.0
>
>
> Scala @deprecated annotation actually doesn't show up in JavaDoc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2873) Support disk spilling in Spark SQL aggregation

2015-05-30 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-2873:

Assignee: (was: Yin Huai)

> Support disk spilling in Spark SQL aggregation
> --
>
> Key: SPARK-2873
> URL: https://issues.apache.org/jira/browse/SPARK-2873
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.0.0
>Reporter: guowei
>Priority: Blocker
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6909) Remove Hive Shim code

2015-05-30 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-6909:

Priority: Critical  (was: Major)

> Remove Hive Shim code
> -
>
> Key: SPARK-6909
> URL: https://issues.apache.org/jira/browse/SPARK-6909
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Michael Armbrust
>Priority: Critical
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6909) Remove Hive Shim code

2015-05-30 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-6909:

Target Version/s: 1.5.0

> Remove Hive Shim code
> -
>
> Key: SPARK-6909
> URL: https://issues.apache.org/jira/browse/SPARK-6909
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Michael Armbrust
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7973) Increase the timeout of CliSuite's "Commands using SerDe provided in --jars"

2015-05-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7973:
---

Assignee: Yin Huai  (was: Apache Spark)

> Increase the timeout of CliSuite's "Commands using SerDe provided in --jars"
> 
>
> Key: SPARK-7973
> URL: https://issues.apache.org/jira/browse/SPARK-7973
> Project: Spark
>  Issue Type: Task
>  Components: SQL, Tests
>Affects Versions: 1.4.0
>Reporter: Yin Huai
>Assignee: Yin Huai
>
> Seems CliSuite's "Commands using SerDe provided in --jars" may time out with 
> the current 1 minute setting. I have seen it a few times. For example,
> https://amplab.cs.berkeley.edu/jenkins/job/Spark-1.4-SBT/226/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.3,label=centos/consoleFull
> Let's increase this timeout and see if this test will be stable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7973) Increase the timeout of CliSuite's "Commands using SerDe provided in --jars"

2015-05-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14566269#comment-14566269
 ] 

Apache Spark commented on SPARK-7973:
-

User 'yhuai' has created a pull request for this issue:
https://github.com/apache/spark/pull/6525

> Increase the timeout of CliSuite's "Commands using SerDe provided in --jars"
> 
>
> Key: SPARK-7973
> URL: https://issues.apache.org/jira/browse/SPARK-7973
> Project: Spark
>  Issue Type: Task
>  Components: SQL, Tests
>Affects Versions: 1.4.0
>Reporter: Yin Huai
>Assignee: Yin Huai
>
> Seems CliSuite's "Commands using SerDe provided in --jars" may time out with 
> the current 1 minute setting. I have seen it a few times. For example,
> https://amplab.cs.berkeley.edu/jenkins/job/Spark-1.4-SBT/226/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.3,label=centos/consoleFull
> Let's increase this timeout and see if this test will be stable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7973) Increase the timeout of CliSuite's "Commands using SerDe provided in --jars"

2015-05-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7973:
---

Assignee: Apache Spark  (was: Yin Huai)

> Increase the timeout of CliSuite's "Commands using SerDe provided in --jars"
> 
>
> Key: SPARK-7973
> URL: https://issues.apache.org/jira/browse/SPARK-7973
> Project: Spark
>  Issue Type: Task
>  Components: SQL, Tests
>Affects Versions: 1.4.0
>Reporter: Yin Huai
>Assignee: Apache Spark
>
> Seems CliSuite's "Commands using SerDe provided in --jars" may time out with 
> the current 1 minute setting. I have seen it a few times. For example,
> https://amplab.cs.berkeley.edu/jenkins/job/Spark-1.4-SBT/226/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.3,label=centos/consoleFull
> Let's increase this timeout and see if this test will be stable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5610) Generate Java docs without package private classes and methods

2015-05-30 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14566268#comment-14566268
 ] 

Reynold Xin commented on SPARK-5610:


As commented on GitHub, private objects/classes are still showing up, even 
though package-private ones are gone.


> Generate Java docs without package private classes and methods
> --
>
> Key: SPARK-5610
> URL: https://issues.apache.org/jira/browse/SPARK-5610
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
> Fix For: 1.4.0
>
>
> The currently generated Java doc is a mix of public and package-private 
> classes and methods. We can update genjavadoc to hide them.
> Upstream PR: https://github.com/typesafehub/genjavadoc/pull/47



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5610) Generate Java docs without package private classes and methods

2015-05-30 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-5610.

   Resolution: Fixed
Fix Version/s: 1.4.0

> Generate Java docs without package private classes and methods
> --
>
> Key: SPARK-5610
> URL: https://issues.apache.org/jira/browse/SPARK-5610
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
> Fix For: 1.4.0
>
>
> The currently generated Java doc is a mix of public and package-private 
> classes and methods. We can update genjavadoc to hide them.
> Upstream PR: https://github.com/typesafehub/genjavadoc/pull/47



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7973) Increase the timeout of CliSuite's "Commands using SerDe provided in --jars"

2015-05-30 Thread Yin Huai (JIRA)
Yin Huai created SPARK-7973:
---

 Summary: Increase the timeout of CliSuite's "Commands using SerDe 
provided in --jars"
 Key: SPARK-7973
 URL: https://issues.apache.org/jira/browse/SPARK-7973
 Project: Spark
  Issue Type: Task
  Components: SQL, Tests
Affects Versions: 1.4.0
Reporter: Yin Huai
Assignee: Yin Huai


Seems CliSuite's "Commands using SerDe provided in --jars" may time out with 
the current 1 minute setting. I have seen it a few times. For example,
https://amplab.cs.berkeley.edu/jenkins/job/Spark-1.4-SBT/226/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.3,label=centos/consoleFull

Let's increase this timeout and see if this test will be stable.
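
A minimal sketch of the kind of change this implies (illustrative only; the 
value name and durations are assumptions, not the actual CliSuite code):

{code}
import scala.concurrent.duration._

// Bump the per-query timeout passed to the test harness, e.g. from 1 to 3 minutes.
val queryTimeout: FiniteDuration = 3.minutes  // previously 1.minute
{code}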



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-7920) Make MLlib ChiSqSelector Serializable (& Fix Related Documentation Example).

2015-05-30 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-7920.
--
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 6462
[https://github.com/apache/spark/pull/6462]

> Make MLlib ChiSqSelector Serializable (& Fix Related Documentation Example).
> 
>
> Key: SPARK-7920
> URL: https://issues.apache.org/jira/browse/SPARK-7920
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.3.1, 1.4.0
>Reporter: Mike Dusenberry
>Priority: Minor
> Fix For: 1.4.0
>
>
> The MLlib ChiSqSelector class is not serializable, and so the example in the 
> ChiSqSelector documentation fails.  Also, that example is missing the import 
> of ChiSqSelector.  ChiSqSelector should just extend Serializable.
> Steps:
> 1. Locate the MLlib ChiSqSelector documentation example.
> 2. Fix the example by adding an import statement for ChiSqSelector.
> 3. Attempt to run -> notice that it will fail due to ChiSqSelector not being 
> serializable. 
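
A brief sketch of the shape of the fix described above (illustrative only; not 
the actual MLlib source):

{code}
// 1. The documentation example needs the missing import:
import org.apache.spark.mllib.feature.ChiSqSelector

// 2. The essence of the fix is that the selector mixes in Serializable so it can
//    be shipped inside closures; sketched here with a hypothetical class name:
class MySelector(val numTopFeatures: Int) extends Serializable
{code}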



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7920) Make MLlib ChiSqSelector Serializable (& Fix Related Documentation Example).

2015-05-30 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-7920:
-
Assignee: Mike Dusenberry

> Make MLlib ChiSqSelector Serializable (& Fix Related Documentation Example).
> 
>
> Key: SPARK-7920
> URL: https://issues.apache.org/jira/browse/SPARK-7920
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.3.1, 1.4.0
>Reporter: Mike Dusenberry
>Assignee: Mike Dusenberry
>Priority: Minor
> Fix For: 1.4.0
>
>
> The MLlib ChiSqSelector class is not serializable, and so the example in the 
> ChiSqSelector documentation fails.  Also, that example is missing the import 
> of ChiSqSelector.  ChiSqSelector should just extend Serializable.
> Steps:
> 1. Locate the MLlib ChiSqSelector documentation example.
> 2. Fix the example by adding an import statement for ChiSqSelector.
> 3. Attempt to run -> notice that it will fail due to ChiSqSelector not being 
> serializable. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7972) When parse window spec frame, we need to do case insensitive matches.

2015-05-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14566253#comment-14566253
 ] 

Apache Spark commented on SPARK-7972:
-

User 'yhuai' has created a pull request for this issue:
https://github.com/apache/spark/pull/6524

> When parse window spec frame, we need to do case insensitive matches.
> -
>
> Key: SPARK-7972
> URL: https://issues.apache.org/jira/browse/SPARK-7972
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0
>Reporter: Yin Huai
>Assignee: Yin Huai
>
> For window frame, PRECEDING, FOLLOWING, and CURRENT ROW do not have 
> pre-defined tokens. So, Hive Parser returns the user input directly (e.g. 
> {{preCeDING}}). We need to do case insensitive matches in {{HiveQl.scala}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7965) Wrong answers for queries with multiple window specs in the same expression

2015-05-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7965:
---

Assignee: Apache Spark  (was: Yin Huai)

> Wrong answers for queries with multiple window specs in the same expression
> ---
>
> Key: SPARK-7965
> URL: https://issues.apache.org/jira/browse/SPARK-7965
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0
>Reporter: Josh Rosen
>Assignee: Apache Spark
>
> I think that Spark SQL may be returning incorrect answers for queries that 
> use multiple window specifications within the same expression.  Here's an 
> example that illustrates the problem.
> Say that I have a table with a single numeric column and that I want to 
> compute a cumulative distribution function over this column.  Let's call this 
> table {{nums}}:
> {code}
> val nums = sc.parallelize(1 to 10).map(x => (x)).toDF("x")
> nums.registerTempTable("nums")
> {code}
> It's easy to compute a running sum over this column:
> {code}
> sqlContext.sql("""
> select sum(x) over (rows between unbounded preceding and current row) 
> from nums
> """).collect()
> nums: org.apache.spark.sql.DataFrame = [x: int]
> res29: Array[org.apache.spark.sql.Row] = Array([1], [3], [6], [10], [15], 
> [21], [28], [36], [45], [55])
> {code}
> It's also easy to compute a total sum over all rows:
> {code}
> sqlContext.sql("""
> select sum(x) over (rows between unbounded preceding and unbounded 
> following) from nums
> """).collect()
> res34: Array[org.apache.spark.sql.Row] = Array([55], [55], [55], [55], [55], 
> [55], [55], [55], [55], [55])
> {code}
> Let's say that I combine these expressions to compute a CDF:
> {code}
> sqlContext.sql("""
>   select (sum(x) over (rows between unbounded preceding and current row))
> /
> (sum(x) over (rows between unbounded preceding and unbounded following)) 
> from nums
> """).collect()
> res31: Array[org.apache.spark.sql.Row] = Array([1.0], [1.0], [1.0], [1.0], 
> [1.0], [1.0], [1.0], [1.0], [1.0], [1.0])
> {code}
> This seems wrong.  Note that if we combine the running total, global total, 
> and combined expression in the same query, then we see that the first two 
> values are computed correctly / but the combined expression seems to be 
> incorrect:
> {code}
> sqlContext.sql("""
> select
> sum(x) over (rows between unbounded preceding and current row) as 
> running_sum,
> (sum(x) over (rows between unbounded preceding and unbounded following)) 
> as total_sum,
> ((sum(x) over (rows between unbounded preceding and current row))
> /
> (sum(x) over (rows between unbounded preceding and unbounded following))) 
> as combined
> from nums 
> """).collect()
> res40: Array[org.apache.spark.sql.Row] = Array([1,55,1.0], [3,55,1.0], 
> [6,55,1.0], [10,55,1.0], [15,55,1.0], [21,55,1.0], [28,55,1.0], [36,55,1.0], 
> [45,55,1.0], [55,55,1.0])
> {code}
> /cc [~yhuai]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7972) When parse window spec frame, we need to do case insensitive matches.

2015-05-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7972:
---

Assignee: Yin Huai  (was: Apache Spark)

> When parse window spec frame, we need to do case insensitive matches.
> -
>
> Key: SPARK-7972
> URL: https://issues.apache.org/jira/browse/SPARK-7972
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0
>Reporter: Yin Huai
>Assignee: Yin Huai
>
> For window frame, PRECEDING, FOLLOWING, and CURRENT ROW do not have 
> pre-defined tokens. So, Hive Parser returns the user input directly (e.g. 
> {{preCeDING}}). We need to do case insensitive matches in {{HiveQl.scala}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7972) When parse window spec frame, we need to do case insensitive matches.

2015-05-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7972:
---

Assignee: Apache Spark  (was: Yin Huai)

> When parse window spec frame, we need to do case insensitive matches.
> -
>
> Key: SPARK-7972
> URL: https://issues.apache.org/jira/browse/SPARK-7972
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0
>Reporter: Yin Huai
>Assignee: Apache Spark
>
> For window frame, PRECEDING, FOLLOWING, and CURRENT ROW do not have 
> pre-defined tokens. So, Hive Parser returns the user input directly (e.g. 
> {{preCeDING}}). We need to do case insensitive matches in {{HiveQl.scala}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7965) Wrong answers for queries with multiple window specs in the same expression

2015-05-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14566252#comment-14566252
 ] 

Apache Spark commented on SPARK-7965:
-

User 'yhuai' has created a pull request for this issue:
https://github.com/apache/spark/pull/6524

> Wrong answers for queries with multiple window specs in the same expression
> ---
>
> Key: SPARK-7965
> URL: https://issues.apache.org/jira/browse/SPARK-7965
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0
>Reporter: Josh Rosen
>Assignee: Yin Huai
>
> I think that Spark SQL may be returning incorrect answers for queries that 
> use multiple window specifications within the same expression.  Here's an 
> example that illustrates the problem.
> Say that I have a table with a single numeric column and that I want to 
> compute a cumulative distribution function over this column.  Let's call this 
> table {{nums}}:
> {code}
> val nums = sc.parallelize(1 to 10).map(x => (x)).toDF("x")
> nums.registerTempTable("nums")
> {code}
> It's easy to compute a running sum over this column:
> {code}
> sqlContext.sql("""
> select sum(x) over (rows between unbounded preceding and current row) 
> from nums
> """).collect()
> nums: org.apache.spark.sql.DataFrame = [x: int]
> res29: Array[org.apache.spark.sql.Row] = Array([1], [3], [6], [10], [15], 
> [21], [28], [36], [45], [55])
> {code}
> It's also easy to compute a total sum over all rows:
> {code}
> sqlContext.sql("""
> select sum(x) over (rows between unbounded preceding and unbounded 
> following) from nums
> """).collect()
> res34: Array[org.apache.spark.sql.Row] = Array([55], [55], [55], [55], [55], 
> [55], [55], [55], [55], [55])
> {code}
> Let's say that I combine these expressions to compute a CDF:
> {code}
> sqlContext.sql("""
>   select (sum(x) over (rows between unbounded preceding and current row))
> /
> (sum(x) over (rows between unbounded preceding and unbounded following)) 
> from nums
> """).collect()
> res31: Array[org.apache.spark.sql.Row] = Array([1.0], [1.0], [1.0], [1.0], 
> [1.0], [1.0], [1.0], [1.0], [1.0], [1.0])
> {code}
> This seems wrong.  Note that if we combine the running total, global total, 
> and combined expression in the same query, then we see that the first two 
> values are computed correctly / but the combined expression seems to be 
> incorrect:
> {code}
> sqlContext.sql("""
> select
> sum(x) over (rows between unbounded preceding and current row) as 
> running_sum,
> (sum(x) over (rows between unbounded preceding and unbounded following)) 
> as total_sum,
> ((sum(x) over (rows between unbounded preceding and current row))
> /
> (sum(x) over (rows between unbounded preceding and unbounded following))) 
> as combined
> from nums 
> """).collect()
> res40: Array[org.apache.spark.sql.Row] = Array([1,55,1.0], [3,55,1.0], 
> [6,55,1.0], [10,55,1.0], [15,55,1.0], [21,55,1.0], [28,55,1.0], [36,55,1.0], 
> [45,55,1.0], [55,55,1.0])
> {code}
> /cc [~yhuai]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7965) Wrong answers for queries with multiple window specs in the same expression

2015-05-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7965:
---

Assignee: Yin Huai  (was: Apache Spark)

> Wrong answers for queries with multiple window specs in the same expression
> ---
>
> Key: SPARK-7965
> URL: https://issues.apache.org/jira/browse/SPARK-7965
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0
>Reporter: Josh Rosen
>Assignee: Yin Huai
>
> I think that Spark SQL may be returning incorrect answers for queries that 
> use multiple window specifications within the same expression.  Here's an 
> example that illustrates the problem.
> Say that I have a table with a single numeric column and that I want to 
> compute a cumulative distribution function over this column.  Let's call this 
> table {{nums}}:
> {code}
> val nums = sc.parallelize(1 to 10).map(x => (x)).toDF("x")
> nums.registerTempTable("nums")
> {code}
> It's easy to compute a running sum over this column:
> {code}
> sqlContext.sql("""
> select sum(x) over (rows between unbounded preceding and current row) 
> from nums
> """).collect()
> nums: org.apache.spark.sql.DataFrame = [x: int]
> res29: Array[org.apache.spark.sql.Row] = Array([1], [3], [6], [10], [15], 
> [21], [28], [36], [45], [55])
> {code}
> It's also easy to compute a total sum over all rows:
> {code}
> sqlContext.sql("""
> select sum(x) over (rows between unbounded preceding and unbounded 
> following) from nums
> """).collect()
> res34: Array[org.apache.spark.sql.Row] = Array([55], [55], [55], [55], [55], 
> [55], [55], [55], [55], [55])
> {code}
> Let's say that I combine these expressions to compute a CDF:
> {code}
> sqlContext.sql("""
>   select (sum(x) over (rows between unbounded preceding and current row))
> /
> (sum(x) over (rows between unbounded preceding and unbounded following)) 
> from nums
> """).collect()
> res31: Array[org.apache.spark.sql.Row] = Array([1.0], [1.0], [1.0], [1.0], 
> [1.0], [1.0], [1.0], [1.0], [1.0], [1.0])
> {code}
> This seems wrong.  Note that if we combine the running total, global total, 
> and combined expression in the same query, then we see that the first two 
> values are computed correctly / but the combined expression seems to be 
> incorrect:
> {code}
> sqlContext.sql("""
> select
> sum(x) over (rows between unbounded preceding and current row) as 
> running_sum,
> (sum(x) over (rows between unbounded preceding and unbounded following)) 
> as total_sum,
> ((sum(x) over (rows between unbounded preceding and current row))
> /
> (sum(x) over (rows between unbounded preceding and unbounded following))) 
> as combined
> from nums 
> """).collect()
> res40: Array[org.apache.spark.sql.Row] = Array([1,55,1.0], [3,55,1.0], 
> [6,55,1.0], [10,55,1.0], [15,55,1.0], [21,55,1.0], [28,55,1.0], [36,55,1.0], 
> [45,55,1.0], [55,55,1.0])
> {code}
> /cc [~yhuai]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7972) When parse window spec frame, we need to do case insensitive matches.

2015-05-30 Thread Yin Huai (JIRA)
Yin Huai created SPARK-7972:
---

 Summary: When parse window spec frame, we need to do case 
insensitive matches.
 Key: SPARK-7972
 URL: https://issues.apache.org/jira/browse/SPARK-7972
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.0
Reporter: Yin Huai
Assignee: Yin Huai


For window frame, PRECEDING, FOLLOWING, and CURRENT ROW do not have pre-defined 
tokens. So, Hive Parser returns the user input directly (e.g. {{preCeDING}}). 
We need to do case insensitive matches in {{HiveQl.scala}}.
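
A minimal sketch of the kind of case-insensitive matching this implies 
(illustrative only, not the actual HiveQl.scala code):

{code}
def frameBoundaryKind(token: String): String = token.toUpperCase match {
  case "PRECEDING"   => "preceding"
  case "FOLLOWING"   => "following"
  case "CURRENT ROW" => "current row"
  case other         => sys.error(s"Unsupported window frame boundary: $other")
}

frameBoundaryKind("preCeDING")  // matches despite the mixed case from the Hive parser
{code}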



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7971) Add JavaDoc style deprecation for deprecated DataFrame methods

2015-05-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7971:
---

Assignee: Reynold Xin  (was: Apache Spark)

> Add JavaDoc style deprecation for deprecated DataFrame methods
> --
>
> Key: SPARK-7971
> URL: https://issues.apache.org/jira/browse/SPARK-7971
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> Scala @deprecated annotation actually doesn't show up in JavaDoc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7971) Add JavaDoc style deprecation for deprecated DataFrame methods

2015-05-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14566249#comment-14566249
 ] 

Apache Spark commented on SPARK-7971:
-

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/6523

> Add JavaDoc style deprecation for deprecated DataFrame methods
> --
>
> Key: SPARK-7971
> URL: https://issues.apache.org/jira/browse/SPARK-7971
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> Scala @deprecated annotation actually doesn't show up in JavaDoc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7971) Add JavaDoc style deprecation for deprecated DataFrame methods

2015-05-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7971:
---

Assignee: Apache Spark  (was: Reynold Xin)

> Add JavaDoc style deprecation for deprecated DataFrame methods
> --
>
> Key: SPARK-7971
> URL: https://issues.apache.org/jira/browse/SPARK-7971
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
>
> Scala @deprecated annotation actually doesn't show up in JavaDoc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-7918) MLlib Python doc parity check for evaluation and feature

2015-05-30 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-7918.
--
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 6461
[https://github.com/apache/spark/pull/6461]

> MLlib Python doc parity check for evaluation and feature
> 
>
> Key: SPARK-7918
> URL: https://issues.apache.org/jira/browse/SPARK-7918
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, MLlib, PySpark
>Affects Versions: 1.4.0
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
> Fix For: 1.4.0
>
>
> Check, then make the MLlib Python evaluation and feature docs as complete as 
> the Scala docs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7918) MLlib Python doc parity check for evaluation and feature

2015-05-30 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-7918:
-
Assignee: Yanbo Liang

> MLlib Python doc parity check for evaluation and feature
> 
>
> Key: SPARK-7918
> URL: https://issues.apache.org/jira/browse/SPARK-7918
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, MLlib, PySpark
>Affects Versions: 1.4.0
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>
> Check, then make the MLlib Python evaluation and feature docs as complete as 
> the Scala docs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7971) Add JavaDoc style deprecation for deprecated DataFrame methods

2015-05-30 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-7971:
--

 Summary: Add JavaDoc style deprecation for deprecated DataFrame 
methods
 Key: SPARK-7971
 URL: https://issues.apache.org/jira/browse/SPARK-7971
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin


Scala @deprecated annotation actually doesn't show up in JavaDoc.
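
For illustration (the method names below are hypothetical, not real DataFrame 
APIs): the Scala annotation deprecates a method for the Scala compiler, but the 
generated JavaDoc page gives no deprecation notice; spelling the deprecation out 
as a @deprecated tag inside the doc comment is one way to surface it there as well.

{code}
object DeprecationExample {
  // Only visible to the Scala compiler; missing from the generated JavaDoc page.
  @deprecated("Use newCount() instead", "1.4.0")
  def oldCount(): Long = 0L

  /**
   * Returns the number of rows.
   *
   * @deprecated As of 1.4.0, replaced by `newCount()`.
   */
  @deprecated("Use newCount() instead", "1.4.0")
  def oldCountDocumented(): Long = 0L

  def newCount(): Long = 0L
}
{code}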




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7667) MLlib Python API consistency check

2015-05-30 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14566233#comment-14566233
 ] 

Joseph K. Bradley commented on SPARK-7667:
--

If it's an old API from a previous Spark release, we can't really change the 
public API unless absolutely necessary.  You're right that we mainly need to 
check (a) Alpha/Experimental/Developer APIs and (b) new APIs in this release.

> MLlib Python API consistency check
> --
>
> Key: SPARK-7667
> URL: https://issues.apache.org/jira/browse/SPARK-7667
> Project: Spark
>  Issue Type: Task
>  Components: MLlib, PySpark
>Affects Versions: 1.4.0
>Reporter: Yanbo Liang
>
> Check and ensure the MLlib Python API (class/method/parameter) is consistent 
> with Scala.
> The following APIs are not consistent:
> * class
> * method
> * parameter
> ** feature.StandardScaler.fit()
> ** many transform() function of feature module



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-7855) Move hash-style shuffle code out of ExternalSorter and into own file

2015-05-30 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-7855.

   Resolution: Fixed
Fix Version/s: 1.5.0

> Move hash-style shuffle code out of ExternalSorter and into own file
> 
>
> Key: SPARK-7855
> URL: https://issues.apache.org/jira/browse/SPARK-7855
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Reporter: Josh Rosen
>Assignee: Josh Rosen
> Fix For: 1.5.0
>
>
> ExternalSorter contains a bunch of code for handling the bypassMergeThreshold 
> / hash-style shuffle path.  I think that it would significantly simplify the 
> code to move this functionality out of ExternalSorter and into a separate 
> class which shares a common interface (insertAll / writePartitionedFile()).  
> This is a stepping-stone towards eventually removing this bypass path (see 
> SPARK-6026)
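
A rough sketch of the common interface described above (names follow the 
description but are simplified; this is not the actual Spark source):

{code}
import java.io.File

trait PartitionedFileWriter[K, V] {
  /** Buffer all records for this task, hashing or spilling as the implementation sees fit. */
  def insertAll(records: Iterator[Product2[K, V]]): Unit

  /** Write the buffered data as one partitioned output file; returns per-partition lengths. */
  def writePartitionedFile(outputFile: File): Array[Long]
}
{code}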



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7038) [Streaming] Spark Sink requires spark assembly in classpath

2015-05-30 Thread Tathagata Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14566223#comment-14566223
 ] 

Tathagata Das commented on SPARK-7038:
--

[~hshreedharan] I am assigning it to you to figure out a solution.

> [Streaming] Spark Sink requires spark assembly in classpath
> ---
>
> Key: SPARK-7038
> URL: https://issues.apache.org/jira/browse/SPARK-7038
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.3.1
>Reporter: Hari Shreedharan
>Assignee: Hari Shreedharan
>
> In Spark 1.3.0, we shaded Guava, which means that the Spark Sink's Guava 
> dependency is not standard Guava anymore - thus the one on Flume's 
> classpath does not work and can throw a NoClassDefFoundError while using 
> Spark Sink.
> We must pull the Guava dependency into the Spark Sink jar.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7038) [Streaming] Spark Sink requires spark assembly in classpath

2015-05-30 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-7038:
-
Assignee: Hari Shreedharan

> [Streaming] Spark Sink requires spark assembly in classpath
> ---
>
> Key: SPARK-7038
> URL: https://issues.apache.org/jira/browse/SPARK-7038
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.3.1
>Reporter: Hari Shreedharan
>Assignee: Hari Shreedharan
>
> In Spark 1.3.0, we shaded Guava, which means that the Spark Sink's Guava 
> dependency is not standard Guava anymore - thus the one on Flume's 
> classpath does not work and can throw a NoClassDefFoundError while using 
> Spark Sink.
> We must pull the Guava dependency into the Spark Sink jar.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7942) Receiver's life cycle is inconsistent with streaming job.

2015-05-30 Thread Tathagata Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14566222#comment-14566222
 ] 

Tathagata Das commented on SPARK-7942:
--

That is a very good idea. In fact, please update the JIRA title to describe that 
feature. If receivers were started and all of them have shut down, then stop the 
StreamingContext and throw an error so that ssc.awaitTermination exits. This will 
be a good feature to add. 

> Receiver's life cycle is inconsistent with streaming job.
> -
>
> Key: SPARK-7942
> URL: https://issues.apache.org/jira/browse/SPARK-7942
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.4.0
>Reporter: SaintBacchus
>
> Streaming considers the receiver an ordinary Spark job; thus, if an error 
> occurs in the receiver's logic (after 4 retries by default), streaming 
> will no longer get any data but the streaming job is still running. 
> A typical scenario: we set `spark.streaming.receiver.writeAheadLog.enable` 
> to true to use the `ReliableKafkaReceiver` but do not set the checkpoint dir. 
> Then the receiver is soon shut down but the streaming job stays alive.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-7931) Do not restart a socket receiver when the receiver is being shutdown

2015-05-30 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-7931.
--
   Resolution: Fixed
Fix Version/s: 1.4.0

> Do not restart a socket receiver when the receiver is being shutdown
> 
>
> Key: SPARK-7931
> URL: https://issues.apache.org/jira/browse/SPARK-7931
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Critical
> Fix For: 1.4.0
>
>
> Attempts to restart the socket receiver when it is supposed to be stopped 
> causes undesirable error messages.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7952) equality check between boolean type and numeric type is broken.

2015-05-30 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-7952:
---
Description: Currently we only support literal numeric values.  (was: for 
now we only support literal numeric values.)

> equality check between boolean type and numeric type is broken.
> ---
>
> Key: SPARK-7952
> URL: https://issues.apache.org/jira/browse/SPARK-7952
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
>
> Currently we only support literal numeric values.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-7849) Update Spark SQL Hive support documentation for 1.4

2015-05-30 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-7849.

   Resolution: Fixed
Fix Version/s: 1.4.0
 Assignee: Cheng Lian

> Update Spark SQL Hive support documentation for 1.4
> ---
>
> Key: SPARK-7849
> URL: https://issues.apache.org/jira/browse/SPARK-7849
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, SQL
>Affects Versions: 1.4.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>Priority: Critical
> Fix For: 1.4.0
>
>
> Hive support contents need to be updated for 1.4. Most importantly, after 
> introducing the isolated classloader mechanism in 1.4, the following 
> questions need to be clarified:
> # How to enable Hive support
> # What versions of Hive are supported
> # How to specify metastore version



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7819) Isolated Hive Client Loader appears to cause Native Library libMapRClient.4.0.2-mapr.so already loaded in another classloader error

2015-05-30 Thread Fi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14566145#comment-14566145
 ] 

Fi commented on SPARK-7819:
---

Will do, thanks.

> Isolated Hive Client Loader appears to cause Native Library 
> libMapRClient.4.0.2-mapr.so already loaded in another classloader error
> ---
>
> Key: SPARK-7819
> URL: https://issues.apache.org/jira/browse/SPARK-7819
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0
>Reporter: Fi
>Priority: Critical
> Attachments: stacktrace.txt, test.py
>
>
> In reference to the pull request: https://github.com/apache/spark/pull/5876
> I have been running the Spark 1.3 branch for some time with no major hiccups, 
> and recently switched to the Spark 1.4 branch.
> I build my spark distribution with the following build command:
> {noformat}
> make-distribution.sh --tgz --skip-java-test --with-tachyon -Phive 
> -Phive-0.13.1 -Pmapr4 -Pspark-ganglia-lgpl -Pkinesis-asl -Phive-thriftserver
> {noformat}
> When running a python script containing a series of smoke tests I use to 
> validate the build, I encountered an error under the following conditions:
> * start a spark context
> * start a hive context
> * run any hive query
> * stop the spark context
> * start a second spark context
> * run any hive query
> ** ERROR
> From what I can tell, the Isolated Class Loader is hitting a MapR class that 
> is loading its native library (presumably as part of a static initializer).
> Unfortunately, the JVM prohibits this the second time around.
> I would think that shutting down the SparkContext would clear out any 
> vestiges from the JVM, so I'm surprised that this would even be a problem.
> Note: all other smoke tests we are running pass fine.
> I will attach the stacktrace and a python script reproducing the issue (at 
> least for my environment and build).
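
For reference, a Scala transcription of the reproduction steps above 
(illustrative only; the attached test.py is the actual reproducer):

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val conf = new SparkConf().setAppName("hive-classloader-repro")

val sc1 = new SparkContext(conf)
val hive1 = new HiveContext(sc1)
hive1.sql("SHOW TABLES").collect()   // works
sc1.stop()

val sc2 = new SparkContext(conf)
val hive2 = new HiveContext(sc2)
hive2.sql("SHOW TABLES").collect()   // fails: native library already loaded in another classloader
{code}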



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7965) Wrong answers for queries with multiple window specs in the same expression

2015-05-30 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai reassigned SPARK-7965:
---

Assignee: Yin Huai

> Wrong answers for queries with multiple window specs in the same expression
> ---
>
> Key: SPARK-7965
> URL: https://issues.apache.org/jira/browse/SPARK-7965
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0
>Reporter: Josh Rosen
>Assignee: Yin Huai
>
> I think that Spark SQL may be returning incorrect answers for queries that 
> use multiple window specifications within the same expression.  Here's an 
> example that illustrates the problem.
> Say that I have a table with a single numeric column and that I want to 
> compute a cumulative distribution function over this column.  Let's call this 
> table {{nums}}:
> {code}
> val nums = sc.parallelize(1 to 10).map(x => (x)).toDF("x")
> nums.registerTempTable("nums")
> {code}
> It's easy to compute a running sum over this column:
> {code}
> sqlContext.sql("""
> select sum(x) over (rows between unbounded preceding and current row) 
> from nums
> """).collect()
> nums: org.apache.spark.sql.DataFrame = [x: int]
> res29: Array[org.apache.spark.sql.Row] = Array([1], [3], [6], [10], [15], 
> [21], [28], [36], [45], [55])
> {code}
> It's also easy to compute a total sum over all rows:
> {code}
> sqlContext.sql("""
> select sum(x) over (rows between unbounded preceding and unbounded 
> following) from nums
> """).collect()
> res34: Array[org.apache.spark.sql.Row] = Array([55], [55], [55], [55], [55], 
> [55], [55], [55], [55], [55])
> {code}
> Let's say that I combine these expressions to compute a CDF:
> {code}
> sqlContext.sql("""
>   select (sum(x) over (rows between unbounded preceding and current row))
> /
> (sum(x) over (rows between unbounded preceding and unbounded following)) 
> from nums
> """).collect()
> res31: Array[org.apache.spark.sql.Row] = Array([1.0], [1.0], [1.0], [1.0], 
> [1.0], [1.0], [1.0], [1.0], [1.0], [1.0])
> {code}
> This seems wrong.  Note that if we combine the running total, global total, 
> and combined expression in the same query, then we see that the first two 
> values are computed correctly / but the combined expression seems to be 
> incorrect:
> {code}
> sqlContext.sql("""
> select
> sum(x) over (rows between unbounded preceding and current row) as 
> running_sum,
> (sum(x) over (rows between unbounded preceding and unbounded following)) 
> as total_sum,
> ((sum(x) over (rows between unbounded preceding and current row))
> /
> (sum(x) over (rows between unbounded preceding and unbounded following))) 
> as combined
> from nums 
> """).collect()
> res40: Array[org.apache.spark.sql.Row] = Array([1,55,1.0], [3,55,1.0], 
> [6,55,1.0], [10,55,1.0], [15,55,1.0], [21,55,1.0], [28,55,1.0], [36,55,1.0], 
> [45,55,1.0], [55,55,1.0])
> {code}
> /cc [~yhuai]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7970) Optimize code for SQL queries fired on Union of RDDs (closure cleaner)

2015-05-30 Thread Nitin Goyal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nitin Goyal updated SPARK-7970:
---
Description: 
Closure cleaner slows down the execution of Spark SQL queries fired on a union of 
RDDs. The time increases linearly on the driver side with the number of RDDs 
unioned. Refer to the following thread for more context:

http://apache-spark-developers-list.1001551.n3.nabble.com/ClosureCleaner-slowing-down-Spark-SQL-queries-tt12466.html

As can be seen in the attached JProfiler screenshots, a lot of time is spent in 
ClosureCleaner's "getClassReader" method and the rest in "ensureSerializable" (at 
least in my case).

This can be fixed in two ways (as per my current understanding):

1. Fix at the Spark SQL level - As pointed out by yhuai, we can create 
MapPartitionsRDD directly instead of calling rdd.mapPartitions, which invokes 
ClosureCleaner's clean method (see PR - 
https://github.com/apache/spark/pull/6256).

2. Fix at the Spark core level -
  (i) Make "checkSerializable" property-driven in SparkContext's clean method
  (ii) Somehow cache the class reader for the last 'n' classes

  was:
Closure cleaner slows down the execution of Spark SQL queries fired on a union of 
RDDs. The time increases linearly on the driver side with the number of RDDs 
unioned. Refer to the following thread for more context:

http://apache-spark-developers-list.1001551.n3.nabble.com/ClosureCleaner-slowing-down-Spark-SQL-queries-tt12466.html

As can be seen in the attached JProfiler screenshots, a lot of time is spent in 
ClosureCleaner's "getClassReader" method and the rest in "ensureSerializable" (at 
least in my case).

This can be fixed in two ways (as per my current understanding):

1. Fix at the Spark SQL level - As pointed out by yhuai, we can create 
MapPartitionsRDD directly instead of calling rdd.mapPartitions, which invokes 
ClosureCleaner's clean method.

2. Fix at the Spark core level -
  (i) Make "checkSerializable" property-driven in SparkContext's clean method
  (ii) Somehow cache the class reader for the last 'n' classes


> Optimize code for SQL queries fired on Union of RDDs (closure cleaner)
> --
>
> Key: SPARK-7970
> URL: https://issues.apache.org/jira/browse/SPARK-7970
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 1.2.0, 1.3.0
>Reporter: Nitin Goyal
> Attachments: Screen Shot 2015-05-27 at 11.01.03 pm.png, Screen Shot 
> 2015-05-27 at 11.07.02 pm.png
>
>
> Closure cleaner slows down the execution of Spark SQL queries fired on a union 
> of RDDs. The time increases linearly on the driver side with the number of RDDs 
> unioned. Refer to the following thread for more context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/ClosureCleaner-slowing-down-Spark-SQL-queries-tt12466.html
> As can be seen in the attached JProfiler screenshots, a lot of time is spent in 
> ClosureCleaner's "getClassReader" method and the rest in "ensureSerializable" 
> (at least in my case).
> This can be fixed in two ways (as per my current understanding):
> 1. Fix at the Spark SQL level - As pointed out by yhuai, we can create 
> MapPartitionsRDD directly instead of calling rdd.mapPartitions, which invokes 
> ClosureCleaner's clean method (see PR - 
> https://github.com/apache/spark/pull/6256).
> 2. Fix at the Spark core level -
>   (i) Make "checkSerializable" property-driven in SparkContext's clean method
>   (ii) Somehow cache the class reader for the last 'n' classes
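
A minimal sketch of fix (1) above (MapPartitionsRDD is private[spark], so this 
only applies inside Spark's own code, e.g. the SQL operators; the helper below 
is illustrative, not the actual change):

{code}
import scala.reflect.ClassTag
import org.apache.spark.rdd.{MapPartitionsRDD, RDD}

def mapPartitionsNoClean[T: ClassTag, U: ClassTag](
    rdd: RDD[T],
    f: Iterator[T] => Iterator[U]): RDD[U] = {
  // rdd.mapPartitions(f) would run the closure cleaner on the driver for every
  // unioned RDD; constructing the MapPartitionsRDD directly skips that step.
  new MapPartitionsRDD[U, T](rdd, (context, index, iter) => f(iter))
}
{code}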



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7970) Optimize code for SQL queries fired on Union of RDDs (closure cleaner)

2015-05-30 Thread Nitin Goyal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nitin Goyal updated SPARK-7970:
---
Attachment: Screen Shot 2015-05-27 at 11.01.03 pm.png

> Optimize code for SQL queries fired on Union of RDDs (closure cleaner)
> --
>
> Key: SPARK-7970
> URL: https://issues.apache.org/jira/browse/SPARK-7970
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 1.2.0, 1.3.0
>Reporter: Nitin Goyal
> Attachments: Screen Shot 2015-05-27 at 11.01.03 pm.png, Screen Shot 
> 2015-05-27 at 11.07.02 pm.png
>
>
> Closure cleaner slows down the execution of Spark SQL queries fired on a union 
> of RDDs. The time increases linearly on the driver side with the number of RDDs 
> unioned. Refer to the following thread for more context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/ClosureCleaner-slowing-down-Spark-SQL-queries-tt12466.html
> As can be seen in the attached JProfiler screenshots, a lot of time is spent in 
> ClosureCleaner's "getClassReader" method and the rest in "ensureSerializable" 
> (at least in my case).
> This can be fixed in two ways (as per my current understanding):
> 1. Fix at the Spark SQL level - As pointed out by yhuai, we can create 
> MapPartitionsRDD directly instead of calling rdd.mapPartitions, which invokes 
> ClosureCleaner's clean method.
> 2. Fix at the Spark core level -
>   (i) Make "checkSerializable" property driven in SparkContext's clean method
>   (ii) Somehow cache the class reader for the last 'n' classes



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7970) Optimize code for SQL queries fired on Union of RDDs (closure cleaner)

2015-05-30 Thread Nitin Goyal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nitin Goyal updated SPARK-7970:
---
Attachment: Screen Shot 2015-05-27 at 11.07.02 pm.png

> Optimize code for SQL queries fired on Union of RDDs (closure cleaner)
> --
>
> Key: SPARK-7970
> URL: https://issues.apache.org/jira/browse/SPARK-7970
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 1.2.0, 1.3.0
>Reporter: Nitin Goyal
> Attachments: Screen Shot 2015-05-27 at 11.07.02 pm.png
>
>
> Closure cleaner slows down the execution of Spark SQL queries fired on a union 
> of RDDs. The time increases linearly on the driver side with the number of RDDs 
> unioned. Refer to the following thread for more context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/ClosureCleaner-slowing-down-Spark-SQL-queries-tt12466.html
> As can be seen in the attached JProfiler screenshots, a lot of time is spent in 
> ClosureCleaner's "getClassReader" method and the rest in "ensureSerializable" 
> (at least in my case).
> This can be fixed in two ways (as per my current understanding):
> 1. Fix at the Spark SQL level - As pointed out by yhuai, we can create 
> MapPartitionsRDD directly instead of calling rdd.mapPartitions, which invokes 
> ClosureCleaner's clean method.
> 2. Fix at the Spark core level -
>   (i) Make "checkSerializable" property driven in SparkContext's clean method
>   (ii) Somehow cache the class reader for the last 'n' classes



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7970) Optimize code for SQL queries fired on Union of RDDs (closure cleaner)

2015-05-30 Thread Nitin Goyal (JIRA)
Nitin Goyal created SPARK-7970:
--

 Summary: Optimize code for SQL queries fired on Union of RDDs 
(closure cleaner)
 Key: SPARK-7970
 URL: https://issues.apache.org/jira/browse/SPARK-7970
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, SQL
Affects Versions: 1.3.0, 1.2.0
Reporter: Nitin Goyal


Closure cleaner slows down the execution of Spark SQL queries fired on a union of 
RDDs. The time increases linearly on the driver side with the number of RDDs 
unioned. Refer to the following thread for more context:

http://apache-spark-developers-list.1001551.n3.nabble.com/ClosureCleaner-slowing-down-Spark-SQL-queries-tt12466.html

As can be seen in the attached JProfiler screenshots, a lot of time is spent in 
ClosureCleaner's "getClassReader" method and the rest in "ensureSerializable" 
(at least in my case).

This can be fixed in two ways (as per my current understanding):

1. Fix at the Spark SQL level - As pointed out by yhuai, we can create 
MapPartitionsRDD directly instead of calling rdd.mapPartitions, which invokes 
ClosureCleaner's clean method.

2. Fix at the Spark core level -
  (i) Make "checkSerializable" property driven in SparkContext's clean method
  (ii) Somehow cache the class reader for the last 'n' classes



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7849) Update Spark SQL Hive support documentation for 1.4

2015-05-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7849:
---

Assignee: Apache Spark

> Update Spark SQL Hive support documentation for 1.4
> ---
>
> Key: SPARK-7849
> URL: https://issues.apache.org/jira/browse/SPARK-7849
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, SQL
>Affects Versions: 1.4.0
>Reporter: Cheng Lian
>Assignee: Apache Spark
>Priority: Critical
>
> Hive support contents need to be updated for 1.4. Most importantly, after 
> introducing the isolated classloader mechanism in 1.4, the following 
> questions need to be clarified:
> # How to enable Hive support
> # What versions of Hive are supported
> # How to specify metastore version



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7849) Update Spark SQL Hive support documentation for 1.4

2015-05-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14566119#comment-14566119
 ] 

Apache Spark commented on SPARK-7849:
-

User 'liancheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/6520

> Update Spark SQL Hive support documentation for 1.4
> ---
>
> Key: SPARK-7849
> URL: https://issues.apache.org/jira/browse/SPARK-7849
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, SQL
>Affects Versions: 1.4.0
>Reporter: Cheng Lian
>Priority: Critical
>
> Hive support contents need to be updated for 1.4. Most importantly, after 
> introducing the isolated classloader mechanism in 1.4, the following 
> questions need to be clarified:
> # How to enable Hive support
> # What versions of Hive are supported
> # How to specify metastore version



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7849) Update Spark SQL Hive support documentation for 1.4

2015-05-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7849:
---

Assignee: (was: Apache Spark)

> Update Spark SQL Hive support documentation for 1.4
> ---
>
> Key: SPARK-7849
> URL: https://issues.apache.org/jira/browse/SPARK-7849
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, SQL
>Affects Versions: 1.4.0
>Reporter: Cheng Lian
>Priority: Critical
>
> Hive support contents need to be updated for 1.4. Most importantly, after 
> introducing the isolated classloader mechanism in 1.4, the following 
> questions need to be clarified:
> # How to enable Hive support
> # What versions of Hive are supported
> # How to specify metastore version



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7947) Serdes Command not working

2015-05-30 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14566101#comment-14566101
 ] 

Liang-Chi Hsieh commented on SPARK-7947:


Are you sure the syntax is correct? According to the [Hive 
manual|https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-AddSerDeProperties],
it looks like adding SerDe properties only supports specifying table_name:
{code}
ALTER TABLE table_name SET SERDEPROPERTIES serde_properties;
{code}

You have specified db_name. I checked the detailed error; it is caused by the Hive 
parser, not Spark. It works if only table_name is used.
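
If it helps, one possible workaround from the Spark shell is to select the database 
first and then use the bare table name. This is only a sketch (reusing event_db and 
sample from the report above) and has not been verified on the reporter's setup:

{code}
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
// Switch to the database first so that ALTER TABLE only needs the bare table name,
// which the Hive parser accepts.
hiveContext.sql("USE event_db")
hiveContext.sql("ALTER TABLE sample SET SERDEPROPERTIES ('serialization.encoding'='GBK')")
{code}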


> Serdes Command not working 
> ---
>
> Key: SPARK-7947
> URL: https://issues.apache.org/jira/browse/SPARK-7947
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, Windows
>Affects Versions: 1.3.1
> Environment: windows 8.1, hadoop 2.5.2, hive 1.1.0, spark 1.3.1, 
> scala 2.10.4
>Reporter: Mallieswari
>
> I have configured Spark SQL and executed the *hive serde* command as below:
> {code}
> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
> hiveContext.sql("ALTER TABLE event_db.sample SET SERDEPROPERTIES 
> ('serialization.encoding'='GBK')")
> {code}
> The above command works fine in the Hive shell, but it is not supported in the 
> Spark shell.
> Got the error below in the Spark shell:
> {code}
>   org.apache.hadoop.hive.ql.parse.HiveParser.statement(HiveParser.java:
> 1036)
> at 
> org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:19
> 9)
> at 
> org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:16
> 6)
> at org.apache.spark.sql.hive.HiveQl$.getAst(HiveQl.scala:227)
> at org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:241)
> at 
> org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.appl
> y(ExtendedHiveQlParser.scala:41)
> at 
> org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.appl
> y(ExtendedHiveQlParser.scala:40)
> at 
> scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
> at 
> scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135)
> at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Par
> sers.scala:242)
> at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Par
> sers.scala:242)
> at 
> scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222
> )
> at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonf
> un$apply$2.apply(Parsers.scala:254)
> at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonf
> un$apply$2.apply(Parsers.scala:254)
> at 
> scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:20
> 2)
> at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(
> Parsers.scala:254)
> at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(
> Parsers.scala:254)
> at 
> scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222
> )
> at 
> scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply
> (Parsers.scala:891)
> at 
> scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply
> (Parsers.scala:891)
> at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
> at 
> scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:890
> )
> at 
> scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratPar
> sers.scala:110)
> at 
> org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(AbstractSp
> arkSQLParser.scala:38)
> at org.apache.spark.sql.hive.HiveQl$$anonfun$3.apply(HiveQl.scala:138)
> at org.apache.spark.sql.hive.HiveQl$$anonfun$3.apply(HiveQl.scala:138)
> at 
> org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$Spa
> rkSQLParser$$others$1.apply(SparkSQLParser.scala:96)
> at 
> org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$Spa
> rkSQLParser$$others$1.apply(SparkSQLParser.scala:95)
> at 
> scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
> at 
> scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135)
> at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Par
> sers.scala:242)
> at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Par
> sers.scala:242)
> at 
> scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222
> )
> at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonf
> un$apply$2.apply(Parsers.scala:254)
> at 
> scala.util.parsing.combinator.Parsers

[jira] [Commented] (SPARK-7899) PySpark sql/tests breaks pylint validation

2015-05-30 Thread Justin Uang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14566089#comment-14566089
 ] 

Justin Uang commented on SPARK-7899:


Can we get this backported into Spark 1.4, or is it too late for that?

> PySpark sql/tests breaks pylint validation
> --
>
> Key: SPARK-7899
> URL: https://issues.apache.org/jira/browse/SPARK-7899
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Tests
>Affects Versions: 1.4.0
>Reporter: Michael Nazario
>Assignee: Michael Nazario
> Fix For: 1.5.0
>
>
> The pyspark.sql.types module is dynamically renamed from {{_types}} to {{types}}, 
> which breaks pylint validation.
> From [~justin.uang] below:
> In commit 04e44b37 (the migration to Python 3), {{pyspark/sql/types.py}} was 
> renamed to {{pyspark/sql/\_types.py}}, and then some magic in 
> {{pyspark/sql/\_\_init\_\_.py}} dynamically renamed the module back to 
> {{types}}. I imagine that this is some naming conflict with Python 3, but 
> what was the error that showed up?
> The reason why I'm asking about this is because it's messing with pylint, 
> since pylint cannot now statically find the module. I tried also importing 
> the package so that {{\_\_init\_\_}} would be run in a init-hook, but that 
> isn't what the discovery mechanism is using. I imagine it's probably just 
> crawling the directory structure.
> One way to work around this would be something akin to this 
> (http://stackoverflow.com/questions/9602811/how-to-tell-pylint-to-ignore-certain-imports),
>  where I would have to create a fake module, but I would probably be missing 
> a ton of pylint features on users of that module, and it's pretty hacky.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7969) Drop method on Dataframes should handle Column

2015-05-30 Thread Olivier Girardot (JIRA)
Olivier Girardot created SPARK-7969:
---

 Summary: Drop method on Dataframes should handle Column
 Key: SPARK-7969
 URL: https://issues.apache.org/jira/browse/SPARK-7969
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, SQL
Affects Versions: 1.4.0
Reporter: Olivier Girardot
Priority: Minor


For now, the drop method available on DataFrame since Spark 1.4.0 only accepts a 
column name (as a string); it should also accept a Column as input.
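
A minimal sketch of the current behaviour versus the requested overload (the 
DataFrame and column names below are made up for illustration):

{code}
import org.apache.spark.sql.functions.col
import sqlContext.implicits._

val df = sc.parallelize(Seq((1, "a"), (2, "b"))).toDF("id", "name")

df.drop("name")          // Spark 1.4.0: only drop(colName: String) is available
// df.drop(col("name"))  // requested: an overload that accepts a Column directly
{code}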



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7937) Cannot compare Hive named_struct. (when using argmax, argmin)

2015-05-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7937:
---

Assignee: Apache Spark

> Cannot compare Hive named_struct. (when using argmax, argmin)
> -
>
> Key: SPARK-7937
> URL: https://issues.apache.org/jira/browse/SPARK-7937
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0
>Reporter: Jianshi Huang
>Assignee: Apache Spark
>
> Imagine the following SQL:
> Intention: get last used bank account country.
>  
> {code:sql}
> select bank_account_id, 
>   max(named_struct(
> 'src_row_update_ts', unix_timestamp(src_row_update_ts,'/M/D 
> HH:mm:ss'), 
> 'bank_country', bank_country)).bank_country 
> from bank_account_monthly
> where year_month='201502' 
> group by bank_account_id
> {code}
> => 
> {noformat}
> Error: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 94 in stage 96.0 failed 4 times, most recent failure: Lost task 94.3 in 
> stage 96.0 (TID 22281, ): java.lang.RuntimeException: Type 
> StructType(StructField(src_row_update_ts,LongType,true), 
> StructField(bank_country,StringType,true)) does not support ordered operations
> at scala.sys.package$.error(package.scala:27)
> at 
> org.apache.spark.sql.catalyst.expressions.LessThan.ordering$lzycompute(predicates.scala:222)
> at 
> org.apache.spark.sql.catalyst.expressions.LessThan.ordering(predicates.scala:215)
> at 
> org.apache.spark.sql.catalyst.expressions.LessThan.eval(predicates.scala:235)
> at 
> org.apache.spark.sql.catalyst.expressions.MaxFunction.update(aggregates.scala:147)
> at 
> org.apache.spark.sql.execution.Aggregate$$anonfun$doExecute$1$$anonfun$7.apply(Aggregate.scala:165)
> at 
> org.apache.spark.sql.execution.Aggregate$$anonfun$doExecute$1$$anonfun$7.apply(Aggregate.scala:149)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:70)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:724)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7937) Cannot compare Hive named_struct. (when using argmax, argmin)

2015-05-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14565993#comment-14565993
 ] 

Apache Spark commented on SPARK-7937:
-

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/6519

> Cannot compare Hive named_struct. (when using argmax, argmin)
> -
>
> Key: SPARK-7937
> URL: https://issues.apache.org/jira/browse/SPARK-7937
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0
>Reporter: Jianshi Huang
>
> Imagine the following SQL:
> Intention: get last used bank account country.
>  
> {code:sql}
> select bank_account_id, 
>   max(named_struct(
> 'src_row_update_ts', unix_timestamp(src_row_update_ts,'/M/D 
> HH:mm:ss'), 
> 'bank_country', bank_country)).bank_country 
> from bank_account_monthly
> where year_month='201502' 
> group by bank_account_id
> {code}
> => 
> {noformat}
> Error: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 94 in stage 96.0 failed 4 times, most recent failure: Lost task 94.3 in 
> stage 96.0 (TID 22281, ): java.lang.RuntimeException: Type 
> StructType(StructField(src_row_update_ts,LongType,true), 
> StructField(bank_country,StringType,true)) does not support ordered operations
> at scala.sys.package$.error(package.scala:27)
> at 
> org.apache.spark.sql.catalyst.expressions.LessThan.ordering$lzycompute(predicates.scala:222)
> at 
> org.apache.spark.sql.catalyst.expressions.LessThan.ordering(predicates.scala:215)
> at 
> org.apache.spark.sql.catalyst.expressions.LessThan.eval(predicates.scala:235)
> at 
> org.apache.spark.sql.catalyst.expressions.MaxFunction.update(aggregates.scala:147)
> at 
> org.apache.spark.sql.execution.Aggregate$$anonfun$doExecute$1$$anonfun$7.apply(Aggregate.scala:165)
> at 
> org.apache.spark.sql.execution.Aggregate$$anonfun$doExecute$1$$anonfun$7.apply(Aggregate.scala:149)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:70)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:724)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7937) Cannot compare Hive named_struct. (when using argmax, argmin)

2015-05-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7937:
---

Assignee: (was: Apache Spark)

> Cannot compare Hive named_struct. (when using argmax, argmin)
> -
>
> Key: SPARK-7937
> URL: https://issues.apache.org/jira/browse/SPARK-7937
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0
>Reporter: Jianshi Huang
>
> Imagine the following SQL:
> Intention: get last used bank account country.
>  
> {code:sql}
> select bank_account_id, 
>   max(named_struct(
> 'src_row_update_ts', unix_timestamp(src_row_update_ts,'/M/D 
> HH:mm:ss'), 
> 'bank_country', bank_country)).bank_country 
> from bank_account_monthly
> where year_month='201502' 
> group by bank_account_id
> {code}
> => 
> {noformat}
> Error: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 94 in stage 96.0 failed 4 times, most recent failure: Lost task 94.3 in 
> stage 96.0 (TID 22281, ): java.lang.RuntimeException: Type 
> StructType(StructField(src_row_update_ts,LongType,true), 
> StructField(bank_country,StringType,true)) does not support ordered operations
> at scala.sys.package$.error(package.scala:27)
> at 
> org.apache.spark.sql.catalyst.expressions.LessThan.ordering$lzycompute(predicates.scala:222)
> at 
> org.apache.spark.sql.catalyst.expressions.LessThan.ordering(predicates.scala:215)
> at 
> org.apache.spark.sql.catalyst.expressions.LessThan.eval(predicates.scala:235)
> at 
> org.apache.spark.sql.catalyst.expressions.MaxFunction.update(aggregates.scala:147)
> at 
> org.apache.spark.sql.execution.Aggregate$$anonfun$doExecute$1$$anonfun$7.apply(Aggregate.scala:165)
> at 
> org.apache.spark.sql.execution.Aggregate$$anonfun$doExecute$1$$anonfun$7.apply(Aggregate.scala:149)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:70)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:724)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-1517) Publish nightly snapshots of documentation, maven artifacts, and binary builds

2015-05-30 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-1517:
-
Priority: Critical  (was: Blocker)

> Publish nightly snapshots of documentation, maven artifacts, and binary builds
> --
>
> Key: SPARK-1517
> URL: https://issues.apache.org/jira/browse/SPARK-1517
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Project Infra
>Reporter: Patrick Wendell
>Assignee: Patrick Wendell
>Priority: Critical
>
> Should be pretty easy to do with Jenkins. The only thing I can think of that 
> would be tricky is to set up credentials so that Jenkins can publish this 
> stuff somewhere on Apache infra.
> Ideally we don't want to have to put a private key on every Jenkins box 
> (since they are otherwise pretty stateless). One idea is to encrypt these 
> credentials with a passphrase and post them somewhere publicly visible. Then 
> the Jenkins build can download the credentials, provided we set a passphrase 
> in an environment variable in Jenkins. There may be simpler solutions as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7354) Flaky test: o.a.s.deploy.SparkSubmitSuite --jars

2015-05-30 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-7354:
-
Priority: Critical  (was: Blocker)

> Flaky test: o.a.s.deploy.SparkSubmitSuite --jars
> 
>
> Key: SPARK-7354
> URL: https://issues.apache.org/jira/browse/SPARK-7354
> Project: Spark
>  Issue Type: Test
>  Components: Spark Core, Tests
>Reporter: Tathagata Das
>Assignee: Andrew Or
>Priority: Critical
>  Labels: flaky-test
>
> https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/2271/AMPLAB_JENKINS_BUILD_PROFILE=hadoop1.0,label=centos/testReport/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4114) Use stable Hive API (if one exists) for communication with Metastore

2015-05-30 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-4114:
-
Priority: Critical  (was: Blocker)

> Use stable Hive API (if one exists) for communication with Metastore
> 
>
> Key: SPARK-4114
> URL: https://issues.apache.org/jira/browse/SPARK-4114
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Patrick Wendell
>Priority: Critical
>
> If one exists, we should use a stable API for our communication with the Hive 
> metastore. Specifically, we don't want to have to support compiling against 
> multiple versions of the Hive library to support users with different 
> versions of the Hive metastore.
> I think this is what HCatalog API's are intended for, but I don't know enough 
> about Hive and HCatalog to be sure.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3702) Standardize MLlib classes for learners, models

2015-05-30 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-3702:
-
Priority: Critical  (was: Blocker)

> Standardize MLlib classes for learners, models
> --
>
> Key: SPARK-3702
> URL: https://issues.apache.org/jira/browse/SPARK-3702
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Critical
>
> Summary: Create a class hierarchy for learning algorithms and the models 
> those algorithms produce.
> This is a super-task of several sub-tasks (but JIRA does not allow subtasks 
> of subtasks).  See the "requires" links below for subtasks.
> Goals:
> * give intuitive structure to API, both for developers and for generated 
> documentation
> * support meta-algorithms (e.g., boosting)
> * support generic functionality (e.g., evaluation)
> * reduce code duplication across classes
> [Design doc for class hierarchy | 
> https://docs.google.com/document/d/1BH9el33kBX8JiDdgUJXdLW14CA2qhTCWIG46eXZVoJs]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7899) PySpark sql/tests breaks pylint validation

2015-05-30 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-7899:
-
Assignee: Michael Nazario

> PySpark sql/tests breaks pylint validation
> --
>
> Key: SPARK-7899
> URL: https://issues.apache.org/jira/browse/SPARK-7899
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Tests
>Affects Versions: 1.4.0
>Reporter: Michael Nazario
>Assignee: Michael Nazario
> Fix For: 1.5.0
>
>
> The pyspark.sql.types module is dynamically renamed from {{_types}} to {{types}}, 
> which breaks pylint validation.
> From [~justin.uang] below:
> In commit 04e44b37 (the migration to Python 3), {{pyspark/sql/types.py}} was 
> renamed to {{pyspark/sql/\_types.py}}, and then some magic in 
> {{pyspark/sql/\_\_init\_\_.py}} dynamically renamed the module back to 
> {{types}}. I imagine that this is some naming conflict with Python 3, but 
> what was the error that showed up?
> The reason why I'm asking about this is because it's messing with pylint, 
> since pylint cannot now statically find the module. I tried also importing 
> the package so that {{\_\_init\_\_}} would be run in a init-hook, but that 
> isn't what the discovery mechanism is using. I imagine it's probably just 
> crawling the directory structure.
> One way to work around this would be something akin to this 
> (http://stackoverflow.com/questions/9602811/how-to-tell-pylint-to-ignore-certain-imports),
>  where I would have to create a fake module, but I would probably be missing 
> a ton of pylint features on users of that module, and it's pretty hacky.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6690) spark-sql script ends up throwing Exception when event logging is enabled.

2015-05-30 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6690:
-
Assignee: Marcelo Vanzin

> spark-sql script ends up throwing Exception when event logging is enabled.
> --
>
> Key: SPARK-6690
> URL: https://issues.apache.org/jira/browse/SPARK-6690
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0
>Reporter: Kousuke Saruta
>Assignee: Marcelo Vanzin
>Priority: Minor
> Fix For: 1.4.0
>
>
> When event logging is enabled, the spark-sql script ends up throwing an exception 
> like the following.
> {code}
> 15/04/03 13:51:49 INFO handler.ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/jobs,null}
> 15/04/03 13:51:49 ERROR scheduler.LiveListenerBus: Listener 
> EventLoggingListener threw an exception
> java.lang.reflect.InvocationTargetException
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.apache.spark.scheduler.EventLoggingListener$$anonfun$logEvent$3.apply(EventLoggingListener.scala:144)
>   at 
> org.apache.spark.scheduler.EventLoggingListener$$anonfun$logEvent$3.apply(EventLoggingListener.scala:144)
>   at scala.Option.foreach(Option.scala:236)
>   at 
> org.apache.spark.scheduler.EventLoggingListener.logEvent(EventLoggingListener.scala:144)
>   at 
> org.apache.spark.scheduler.EventLoggingListener.onApplicationEnd(EventLoggingListener.scala:188)
>   at 
> org.apache.spark.scheduler.SparkListenerBus$class.onPostEvent(SparkListenerBus.scala:54)
>   at 
> org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
>   at 
> org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
>   at 
> org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:53)
>   at 
> org.apache.spark.util.AsynchronousListenerBus.postToAll(AsynchronousListenerBus.scala:37)
>   at 
> org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(AsynchronousListenerBus.scala:79)
>   at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1171)
>   at 
> org.apache.spark.util.AsynchronousListenerBus$$anon$1.run(AsynchronousListenerBus.scala:63)
> Caused by: java.io.IOException: Filesystem closed
>   at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:707)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream.flushOrSync(DFSOutputStream.java:1843)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream.hflush(DFSOutputStream.java:1804)
>   at 
> org.apache.hadoop.fs.FSDataOutputStream.hflush(FSDataOutputStream.java:127)
>   ... 17 more
> 15/04/03 13:51:49 INFO ui.SparkUI: Stopped Spark web UI at 
> http://sarutak-devel:4040
> 15/04/03 13:51:49 INFO scheduler.DAGScheduler: Stopping DAGScheduler
> Exception in thread "Thread-6" java.io.IOException: Filesystem closed
>   at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:707)
>   at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1760)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1124)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1120)
>   at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1120)
>   at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1398)
>   at 
> org.apache.spark.scheduler.EventLoggingListener.stop(EventLoggingListener.scala:209)
>   at 
> org.apache.spark.SparkContext$$anonfun$stop$3.apply(SparkContext.scala:1408)
>   at 
> org.apache.spark.SparkContext$$anonfun$stop$3.apply(SparkContext.scala:1408)
>   at scala.Option.foreach(Option.scala:236)
>   at org.apache.spark.SparkContext.stop(SparkContext.scala:1408)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.stop(SparkSQLEnv.scala:66)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$$anon$1.run(SparkSQLCLIDriver.scala:107)
> {code}
> This is because FileSystem#close is called by the shutdown hook registered in 
> SparkSQLCLIDriver.
> {code}
> Runtime.getRuntime.addShutdownHook(
>   new Thread() {
> override def run() {
>   SparkSQLEnv.stop()
> }
>   }
> )
> {code}
> This issue was resolved by SPARK-3062, but I think it was reintroduced by 
> SPARK-2261.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Updated] (SPARK-7717) Spark Standalone Web UI showing incorrect total memory, workers and cores

2015-05-30 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-7717:
-
Assignee: zhichao-li

> Spark Standalone Web UI showing incorrect total memory, workers and cores
> -
>
> Key: SPARK-7717
> URL: https://issues.apache.org/jira/browse/SPARK-7717
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.3.1
> Environment: RedHat
>Reporter: Swaranga Sarma
>Assignee: zhichao-li
>Priority: Minor
>  Labels: web-ui
> Fix For: 1.5.0
>
> Attachments: JIRA.PNG
>
>
> I launched a Spark master in standalone mode on one of my hosts and then 
> launched 3 workers on three different hosts. The workers successfully 
> connected to my master and the Web UI showed the correct details. 
> Specifically, the Web UI correctly shows the total memory and the total 
> cores available for the cluster.
> However, on one of the workers, I did a "kill -9 " and 
> restarted the worker again. This time, though, the master's Web UI shows 
> incorrect total memory and number of cores. The total memory is shown as 
> 4*n, where "n" is the memory of each worker. Also, the total number of workers 
> is shown as 4, and the total number of cores is incorrect: it shows 4*c, where 
> "c" is the number of cores on each worker.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


