[jira] [Created] (SPARK-13697) TransformFunctionSerializer.loads doesn't restore the function's module name if it's '__main__'

2016-03-05 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-13697:


 Summary: TransformFunctionSerializer.loads doesn't restore the 
function's module name if it's '__main__'
 Key: SPARK-13697
 URL: https://issues.apache.org/jira/browse/SPARK-13697
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.6.0
Reporter: Shixiong Zhu


Here is a reproducer
{code}
>>> from pyspark.streaming import StreamingContext
>>> from pyspark.streaming.util import TransformFunction
>>> ssc = StreamingContext(sc, 1)
>>> func = TransformFunction(sc, lambda x: x, sc.serializer)
>>> func.rdd_wrapper(lambda x: x)
TransformFunction(<function <lambda> at 0x106ac8b18>)
>>> bytes = bytearray(ssc._transformerSerializer.serializer.dumps((func.func, func.rdd_wrap_func, func.deserializers)))
>>> func2 = ssc._transformerSerializer.loads(bytes)
>>> print(func2.func.__module__)
None
>>> print(func2.rdd_wrap_func.__module__)
None
>>> 
{code}
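
Both deserialized functions come back with __module__ set to None instead of '__main__'. As a rough illustration of the underlying idea only (this is not the actual SPARK-13697 patch, and the helper below is hypothetical), a loader could fall back to '__main__' whenever the module name was lost during pickling:

{code}
# Hypothetical sketch, not Spark's fix: re-attach a module name lost in pickling.
def restore_main_module(f, default="__main__"):
    # Functions defined in the driver script or REPL live in '__main__';
    # if the deserialized function has no module, fall back to that default.
    if getattr(f, "__module__", None) is None:
        f.__module__ = default
    return f

# e.g. after loads():
#   func2.func = restore_main_module(func2.func)
#   func2.rdd_wrap_func = restore_main_module(func2.rdd_wrap_func)
{code}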






[jira] [Assigned] (SPARK-13697) TransformFunctionSerializer.loads doesn't restore the function's module name if it's '__main__'

2016-03-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13697:


Assignee: Apache Spark

> TransformFunctionSerializer.loads doesn't restore the function's module name 
> if it's '__main__'
> ---
>
> Key: SPARK-13697
> URL: https://issues.apache.org/jira/browse/SPARK-13697
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.0
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>
> Here is a reproducer
> {code}
> >>> from pyspark.streaming import StreamingContext
> >>> from pyspark.streaming.util import TransformFunction
> >>> ssc = StreamingContext(sc, 1)
> >>> func = TransformFunction(sc, lambda x: x, sc.serializer)
> >>> func.rdd_wrapper(lambda x: x)
> TransformFunction(<function <lambda> at 0x106ac8b18>)
> >>> bytes = bytearray(ssc._transformerSerializer.serializer.dumps((func.func, func.rdd_wrap_func, func.deserializers)))
> >>> func2 = ssc._transformerSerializer.loads(bytes)
> >>> print(func2.func.__module__)
> None
> >>> print(func2.rdd_wrap_func.__module__)
> None
> >>> 
> {code}






[jira] [Commented] (SPARK-13697) TransformFunctionSerializer.loads doesn't restore the function's module name if it's '__main__'

2016-03-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15181632#comment-15181632
 ] 

Apache Spark commented on SPARK-13697:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/11535

> TransformFunctionSerializer.loads doesn't restore the function's module name 
> if it's '__main__'
> ---
>
> Key: SPARK-13697
> URL: https://issues.apache.org/jira/browse/SPARK-13697
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.0
>Reporter: Shixiong Zhu
>
> Here is a reproducer
> {code}
> >>> from pyspark.streaming import StreamingContext
> >>> from pyspark.streaming.util import TransformFunction
> >>> ssc = StreamingContext(sc, 1)
> >>> func = TransformFunction(sc, lambda x: x, sc.serializer)
> >>> func.rdd_wrapper(lambda x: x)
> TransformFunction(<function <lambda> at 0x106ac8b18>)
> >>> bytes = bytearray(ssc._transformerSerializer.serializer.dumps((func.func, func.rdd_wrap_func, func.deserializers)))
> >>> func2 = ssc._transformerSerializer.loads(bytes)
> >>> print(func2.func.__module__)
> None
> >>> print(func2.rdd_wrap_func.__module__)
> None
> >>> 
> {code}






[jira] [Assigned] (SPARK-13523) Reuse the exchanges in a query

2016-03-05 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu reassigned SPARK-13523:
--

Assignee: Davies Liu

> Reuse the exchanges in a query
> --
>
> Key: SPARK-13523
> URL: https://issues.apache.org/jira/browse/SPARK-13523
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> In an Exchange, the RDD is materialized (shuffled or collected), so it is a 
> good point at which to eliminate common parts of a query.
> In some TPCDS queries (for example, Q64), the same exchange (ShuffleExchange 
> or BroadcastExchange) could be used multiple times; we should reuse them to 
> avoid duplicated work and to reduce the memory needed for broadcasts.
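
For a concrete picture of the kind of plan this targets, here is a toy PySpark query (a sketch of the scenario only, not of the optimizer change; it assumes an existing SQLContext named sqlContext) in which both branches of a self-union need the identical shuffle behind the aggregation:

{code}
# The same aggregate feeds both branches of the plan, so the ShuffleExchange
# underneath it is a natural candidate for reuse.
df = sqlContext.createDataFrame([(1, 2.0), (1, 3.0), (2, 5.0)], ["k", "v"])
agg = df.groupBy("k").sum("v")   # requires a shuffle on "k"
both = agg.unionAll(agg)         # each branch repeats the identical exchange
both.explain()                   # with reuse, the shuffle would run only once
{code}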






[jira] [Reopened] (SPARK-13523) Reuse the exchanges in a query

2016-03-05 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu reopened SPARK-13523:


SPARK-11838 is different from this one; that issue is about caching fragments across queries.

> Reuse the exchanges in a query
> --
>
> Key: SPARK-13523
> URL: https://issues.apache.org/jira/browse/SPARK-13523
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Davies Liu
>
> In an Exchange, the RDD is materialized (shuffled or collected), so it is a 
> good point at which to eliminate common parts of a query.
> In some TPCDS queries (for example, Q64), the same exchange (ShuffleExchange 
> or BroadcastExchange) could be used multiple times; we should reuse them to 
> avoid duplicated work and to reduce the memory needed for broadcasts.






[jira] [Assigned] (SPARK-13697) TransformFunctionSerializer.loads doesn't restore the function's module name if it's '__main__'

2016-03-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13697:


Assignee: (was: Apache Spark)

> TransformFunctionSerializer.loads doesn't restore the function's module name 
> if it's '__main__'
> ---
>
> Key: SPARK-13697
> URL: https://issues.apache.org/jira/browse/SPARK-13697
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.0
>Reporter: Shixiong Zhu
>
> Here is a reproducer
> {code}
> >>> from pyspark.streaming import StreamingContext
> >>> from pyspark.streaming.util import TransformFunction
> >>> ssc = StreamingContext(sc, 1)
> >>> func = TransformFunction(sc, lambda x: x, sc.serializer)
> >>> func.rdd_wrapper(lambda x: x)
> TransformFunction(<function <lambda> at 0x106ac8b18>)
> >>> bytes = bytearray(ssc._transformerSerializer.serializer.dumps((func.func, func.rdd_wrap_func, func.deserializers)))
> >>> func2 = ssc._transformerSerializer.loads(bytes)
> >>> print(func2.func.__module__)
> None
> >>> print(func2.rdd_wrap_func.__module__)
> None
> >>> 
> {code}






[jira] [Assigned] (SPARK-13697) TransformFunctionSerializer.loads doesn't restore the function's module name if it's '__main__'

2016-03-05 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu reassigned SPARK-13697:


Assignee: Shixiong Zhu

> TransformFunctionSerializer.loads doesn't restore the function's module name 
> if it's '__main__'
> ---
>
> Key: SPARK-13697
> URL: https://issues.apache.org/jira/browse/SPARK-13697
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>
> Here is a reproducer
> {code}
> >>> from pyspark.streaming import StreamingContext
> >>> from pyspark.streaming.util import TransformFunction
> >>> ssc = StreamingContext(sc, 1)
> >>> func = TransformFunction(sc, lambda x: x, sc.serializer)
> >>> func.rdd_wrapper(lambda x: x)
> TransformFunction(<function <lambda> at 0x106ac8b18>)
> >>> bytes = bytearray(ssc._transformerSerializer.serializer.dumps((func.func, func.rdd_wrap_func, func.deserializers)))
> >>> func2 = ssc._transformerSerializer.loads(bytes)
> >>> print(func2.func.__module__)
> None
> >>> print(func2.rdd_wrap_func.__module__)
> None
> >>> 
> {code}






[jira] [Commented] (SPARK-13691) Scala and Python generate inconsistent results

2016-03-05 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15181640#comment-15181640
 ] 

Sean Owen commented on SPARK-13691:
---

I think this is just a language difference? Although changing it might bring 
Pyspark closer to Scala Spark, would it just make it behave less like Python?
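
For anyone hitting this from PySpark, a short sketch of the behavior and a common workaround (assuming an existing SparkContext named sc): the pickled closure is cached with the mapped RDD, so re-creating that RDD after changing the variable picks up the new value.

{code}
i = 0
base = sc.parallelize(range(1, 11))
rdd = base.map(lambda x: x + i)
print(rdd.collect())              # [1, 2, ..., 10]; closure pickled with i = 0
i += 1
print(rdd.collect())              # still [1, 2, ..., 10]: the cached closure is reused
rdd = base.map(lambda x: x + i)   # workaround: rebuild the mapped RDD
print(rdd.collect())              # [2, 3, ..., 11]
{code}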

> Scala and Python generate inconsistent results
> --
>
> Key: SPARK-13691
> URL: https://issues.apache.org/jira/browse/SPARK-13691
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.4.1, 1.5.2, 1.6.0
>Reporter: Shixiong Zhu
>
> Here is an example that Scala and Python generate different results
> {code}
> Scala:
> scala> var i = 0
> i: Int = 0
> scala> val rdd = sc.parallelize(1 to 10).map(_ + i)
> scala> rdd.collect()
> res0: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
> scala> i += 1
> scala> rdd.collect()
> res2: Array[Int] = Array(2, 3, 4, 5, 6, 7, 8, 9, 10, 11)
> Python:
> >>> i = 0
> >>> rdd = sc.parallelize(range(1, 10)).map(lambda x: x + i)
> >>> rdd.collect()
> [1, 2, 3, 4, 5, 6, 7, 8, 9]
> >>> i += 1
> >>> rdd.collect()
> [1, 2, 3, 4, 5, 6, 7, 8, 9]
> {code}
> The difference is that Scala captures the variables' current values every time 
> a job runs, whereas Python captures the variables' values only once (when the 
> closure is first serialized) and reuses them for all subsequent jobs.






[jira] [Commented] (SPARK-13684) Possible unsafe bytesRead increment in StreamInterceptor

2016-03-05 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15181647#comment-15181647
 ] 

Sean Owen commented on SPARK-13684:
---

Hm, OK, this turns out to be a non-issue. I guess it's worth removing the 
volatile then. Like the scanner, I'd also read this code assuming the volatile 
was there for correctness under concurrent access, and then be surprised that 
it isn't used that way. @holdenk, I'd say it's your call what to do.

> Possible unsafe bytesRead increment in StreamInterceptor
> 
>
> Key: SPARK-13684
> URL: https://issues.apache.org/jira/browse/SPARK-13684
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: holdenk
>Priority: Trivial
>
> We unsafely increment a volatile (bytesRead) in a callback; if two callbacks 
> are triggered concurrently we may under-count bytesRead. This issue was found 
> using Coverity.
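
As a generic illustration of the lost-update pattern the report describes (plain Python threading, not Spark's StreamInterceptor code), concurrent read-modify-write updates on a shared counter can drop increments unless they are synchronized:

{code}
import threading

bytes_read = 0                       # shared counter, analogous to bytesRead
lock = threading.Lock()

def on_chunk(n, synchronized=True):
    # Simulated callback adding n bytes to the shared counter.
    global bytes_read
    if synchronized:
        with lock:                   # makes the read-modify-write atomic
            bytes_read += n
    else:
        bytes_read += n              # unsynchronized: updates can be lost

threads = [threading.Thread(target=on_chunk, args=(1,)) for _ in range(1000)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(bytes_read)                    # 1000 with the lock; may be lower without it
{code}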






[jira] [Assigned] (SPARK-13629) Add binary toggle Param to CountVectorizer

2016-03-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13629:


Assignee: Apache Spark

> Add binary toggle Param to CountVectorizer
> --
>
> Key: SPARK-13629
> URL: https://issues.apache.org/jira/browse/SPARK-13629
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Apache Spark
>Priority: Minor
>
> It would be handy to add a binary toggle Param to CountVectorizer, as in the 
> scikit-learn one: 
> [http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html]
> If set, then all non-zero counts will be set to 1.
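
For reference, this is the scikit-learn behaviour the proposal mirrors (a quick Python illustration, assuming scikit-learn is installed; the eventual Spark Param name is up to the implementation):

{code}
from sklearn.feature_extraction.text import CountVectorizer

docs = ["spark spark ml", "ml mllib mllib mllib"]
print(CountVectorizer().fit_transform(docs).toarray())
# [[1 0 2]    columns: ml, mllib, spark (raw term counts)
#  [1 3 0]]
print(CountVectorizer(binary=True).fit_transform(docs).toarray())
# [[1 0 1]    with binary=True every non-zero count becomes 1
#  [1 1 0]]
{code}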






[jira] [Commented] (SPARK-13629) Add binary toggle Param to CountVectorizer

2016-03-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15181648#comment-15181648
 ] 

Apache Spark commented on SPARK-13629:
--

User 'hhbyyh' has created a pull request for this issue:
https://github.com/apache/spark/pull/11536

> Add binary toggle Param to CountVectorizer
> --
>
> Key: SPARK-13629
> URL: https://issues.apache.org/jira/browse/SPARK-13629
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> It would be handy to add a binary toggle Param to CountVectorizer, as in the 
> scikit-learn one: 
> [http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html]
> If set, then all non-zero counts will be set to 1.






[jira] [Assigned] (SPARK-13629) Add binary toggle Param to CountVectorizer

2016-03-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13629:


Assignee: (was: Apache Spark)

> Add binary toggle Param to CountVectorizer
> --
>
> Key: SPARK-13629
> URL: https://issues.apache.org/jira/browse/SPARK-13629
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> It would be handy to add a binary toggle Param to CountVectorizer, as in the 
> scikit-learn one: 
> [http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html]
> If set, then all non-zero counts will be set to 1.






[jira] [Commented] (SPARK-13616) Let SQLBuilder convert logical plan without a Project on top of it

2016-03-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15181651#comment-15181651
 ] 

Apache Spark commented on SPARK-13616:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/11537

> Let SQLBuilder convert logical plan without a Project on top of it
> --
>
> Key: SPARK-13616
> URL: https://issues.apache.org/jira/browse/SPARK-13616
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
> Fix For: 2.0.0
>
>
> It is possible that a logical plan has had the Project removed from its top, 
> or that the plan never had a top Project to begin with. Currently the 
> SQLBuilder can't convert such plans back to SQL. This issue is opened to add 
> this feature.






[jira] [Created] (SPARK-13698) Fix Analysis Exceptions when Using Backticks in Generate

2016-03-05 Thread Dilip Biswal (JIRA)
Dilip Biswal created SPARK-13698:


 Summary: Fix Analysis Exceptions when Using Backticks in Generate
 Key: SPARK-13698
 URL: https://issues.apache.org/jira/browse/SPARK-13698
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Dilip Biswal


An AnalysisException occurs while running the following query.
{code}
SELECT ints FROM nestedArray LATERAL VIEW explode(a.b) `a` AS `ints`
{code}
{code}
Failed to analyze query: org.apache.spark.sql.AnalysisException: cannot resolve 
'`ints`' given input columns: [a, `ints`]; line 1 pos 7
'Project ['ints]
+- Generate explode(a#0.b), true, false, Some(a), [`ints`#8]
   +- SubqueryAlias nestedarray
  +- LocalRelation [a#0], 1,2,3
{code}
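
For context, a minimal PySpark setup that produces such a table and query (a sketch assuming a SQL context named sqlContext whose dialect accepts LATERAL VIEW, e.g. a HiveContext; before a fix the query fails with the AnalysisException above):

{code}
from pyspark.sql import Row

# Column `a` is a struct whose array field `b` is being exploded.
df = sqlContext.createDataFrame([Row(a=Row(b=[1, 2, 3]))])
df.registerTempTable("nestedArray")
sqlContext.sql(
    "SELECT ints FROM nestedArray "
    "LATERAL VIEW explode(a.b) `a` AS `ints`"
).show()
{code}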






[jira] [Commented] (SPARK-13698) Fix Analysis Exceptions when Using Backticks in Generate

2016-03-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15181657#comment-15181657
 ] 

Apache Spark commented on SPARK-13698:
--

User 'dilipbiswal' has created a pull request for this issue:
https://github.com/apache/spark/pull/11538

> Fix Analysis Exceptions when Using Backticks in Generate
> 
>
> Key: SPARK-13698
> URL: https://issues.apache.org/jira/browse/SPARK-13698
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Dilip Biswal
>
> Analysis exception occurs while running the following query.
> {code}
> SELECT ints FROM nestedArray LATERAL VIEW explode(a.b) `a` AS `ints`
> {code}
> {code}
> Failed to analyze query: org.apache.spark.sql.AnalysisException: cannot 
> resolve '`ints`' given input columns: [a, `ints`]; line 1 pos 7
> 'Project ['ints]
> +- Generate explode(a#0.b), true, false, Some(a), [`ints`#8]
>+- SubqueryAlias nestedarray
>   +- LocalRelation [a#0], 1,2,3
> {code}






[jira] [Assigned] (SPARK-13698) Fix Analysis Exceptions when Using Backticks in Generate

2016-03-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13698:


Assignee: (was: Apache Spark)

> Fix Analysis Exceptions when Using Backticks in Generate
> 
>
> Key: SPARK-13698
> URL: https://issues.apache.org/jira/browse/SPARK-13698
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Dilip Biswal
>
> Analysis exception occurs while running the following query.
> {code}
> SELECT ints FROM nestedArray LATERAL VIEW explode(a.b) `a` AS `ints`
> {code}
> {code}
> Failed to analyze query: org.apache.spark.sql.AnalysisException: cannot 
> resolve '`ints`' given input columns: [a, `ints`]; line 1 pos 7
> 'Project ['ints]
> +- Generate explode(a#0.b), true, false, Some(a), [`ints`#8]
>+- SubqueryAlias nestedarray
>   +- LocalRelation [a#0], 1,2,3
> {code}






[jira] [Assigned] (SPARK-13698) Fix Analysis Exceptions when Using Backticks in Generate

2016-03-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13698:


Assignee: Apache Spark

> Fix Analysis Exceptions when Using Backticks in Generate
> 
>
> Key: SPARK-13698
> URL: https://issues.apache.org/jira/browse/SPARK-13698
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Dilip Biswal
>Assignee: Apache Spark
>
> Analysis exception occurs while running the following query.
> {code}
> SELECT ints FROM nestedArray LATERAL VIEW explode(a.b) `a` AS `ints`
> {code}
> {code}
> Failed to analyze query: org.apache.spark.sql.AnalysisException: cannot 
> resolve '`ints`' given input columns: [a, `ints`]; line 1 pos 7
> 'Project ['ints]
> +- Generate explode(a#0.b), true, false, Some(a), [`ints`#8]
>+- SubqueryAlias nestedarray
>   +- LocalRelation [a#0], 1,2,3
> {code}






[jira] [Commented] (SPARK-13616) Let SQLBuilder convert logical plan without a Project on top of it

2016-03-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15181674#comment-15181674
 ] 

Apache Spark commented on SPARK-13616:
--

User 'liancheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/11539

> Let SQLBuilder convert logical plan without a Project on top of it
> --
>
> Key: SPARK-13616
> URL: https://issues.apache.org/jira/browse/SPARK-13616
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
> Fix For: 2.0.0
>
>
> It is possible that a logical plan has had the Project removed from its top, 
> or that the plan never had a top Project to begin with. Currently the 
> SQLBuilder can't convert such plans back to SQL. This issue is opened to add 
> this feature.






[jira] [Updated] (SPARK-12720) SQL generation support for cube, rollup, and grouping set

2016-03-05 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-12720:
---
Fix Version/s: 2.0.0

> SQL generation support for cube, rollup, and grouping set
> -
>
> Key: SPARK-12720
> URL: https://issues.apache.org/jira/browse/SPARK-12720
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Assignee: Xiao Li
> Fix For: 2.0.0
>
>
> {{HiveCompatibilitySuite}} can be useful for bootstrapping test coverage. 
> Please refer to SPARK-11012 for more details.






[jira] [Resolved] (SPARK-12720) SQL generation support for cube, rollup, and grouping set

2016-03-05 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-12720.

Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/11283

> SQL generation support for cube, rollup, and grouping set
> -
>
> Key: SPARK-12720
> URL: https://issues.apache.org/jira/browse/SPARK-12720
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Assignee: Xiao Li
>
> {{HiveCompatibilitySuite}} can be useful for bootstrapping test coverage. 
> Please refer to SPARK-11012 for more details.






[jira] [Updated] (SPARK-13692) Fix trivial Coverity/Checkstyle defects

2016-03-05 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-13692:
--
Description: 
This issue fixes the following potential bugs and Java coding style detected by 
Coverity and Checkstyle.

  * Implement both null and type checking in equals functions.
  * Fix wrong type casting logic in SimpleJavaBean2.equals.
  * Add `implement Cloneable` to `UTF8String` and `SortedIterator`.
  * Remove dereferencing before null check in `AbstractBytesToBytesMapSuite`.
  * Fix coding style: Add '{}' to single `for` statement in mllib examples.
  * Remove unused imports in `ColumnarBatch`.
  * Remove unused fields in `ChunkFetchIntegrationSuite`.
  * Add `close()` to prevent resource leak.


Please note that the last two checkstyle errors exist on newly added commits 
after [SPARK-13583].

  was:
This issue fixes the following potential bugs and Java coding style detected by 
Coverity and Checkstyle.

  * Implement both null and type checking in equals functions.
  * Fix wrong type casting logic in SimpleJavaBean2.equals.
  * Add `implement Cloneable` to `UTF8String` and `SortedIterator`.
  * Remove dereferencing before null check in `AbstractBytesToBytesMapSuite`.
  * Fix coding style: Add '{}' to single `for` statement in mllib examples.
  * Remove unused imports in `ColumnarBatch`.

Please note that the last two checkstyle errors exist on newly added commits 
after [SPARK-13583].


> Fix trivial Coverity/Checkstyle defects
> ---
>
> Key: SPARK-13692
> URL: https://issues.apache.org/jira/browse/SPARK-13692
> Project: Spark
>  Issue Type: Bug
>  Components: Examples, Spark Core, SQL
>Reporter: Dongjoon Hyun
>Priority: Trivial
>
> This issue fixes the following potential bugs and Java coding style detected 
> by Coverity and Checkstyle.
>   * Implement both null and type checking in equals functions.
>   * Fix wrong type casting logic in SimpleJavaBean2.equals.
>   * Add `implement Cloneable` to `UTF8String` and `SortedIterator`.
>   * Remove dereferencing before null check in `AbstractBytesToBytesMapSuite`.
>   * Fix coding style: Add '{}' to single `for` statement in mllib examples.
>   * Remove unused imports in `ColumnarBatch`.
>   * Remove unused fields in `ChunkFetchIntegrationSuite`.
>   * Add `close()` to prevent resource leak.
> Please note that the last two checkstyle errors exist on newly added commits 
> after [SPARK-13583].






[jira] [Created] (SPARK-13699) Spark SQL drop the table in "overwrite" mode while writing into table

2016-03-05 Thread Dhaval Modi (JIRA)
Dhaval Modi created SPARK-13699:
---

 Summary: Spark SQL drop the table in "overwrite" mode while 
writing into table
 Key: SPARK-13699
 URL: https://issues.apache.org/jira/browse/SPARK-13699
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.6.0
Reporter: Dhaval Modi
Priority: Blocker


Hi,

When writing a DataFrame to a Hive table with the "SaveMode.Overwrite" option, e.g.

tgtFinal.write.mode(SaveMode.Overwrite).saveAsTable("tgt_table")

sqlContext drops the table instead of truncating it.


Thanks & Regards,
Dhaval
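
For reference, the equivalent call from PySpark, plus a commonly used alternative that keeps the existing table definition instead of dropping and re-creating it (a sketch; tgt_final stands in for the reporter's tgtFinal DataFrame, the target table must already exist, and insertInto matches columns by position):

{code}
# Same operation from PySpark: "overwrite" drops and re-creates the table.
tgt_final.write.mode("overwrite").saveAsTable("tgt_table")

# Alternative: overwrite only the data, keeping the existing table definition.
tgt_final.write.insertInto("tgt_table", overwrite=True)
{code}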









[jira] [Updated] (SPARK-13699) Spark SQL drops the HIVE table in "overwrite" mode while writing into table

2016-03-05 Thread Dhaval Modi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dhaval Modi updated SPARK-13699:

Summary: Spark SQL drops the HIVE table in "overwrite" mode while writing 
into table  (was: Spark SQL drop the table in "overwrite" mode while writing 
into table)

> Spark SQL drops the HIVE table in "overwrite" mode while writing into table
> ---
>
> Key: SPARK-13699
> URL: https://issues.apache.org/jira/browse/SPARK-13699
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Dhaval Modi
>Priority: Blocker
>
> Hi,
> While writing the dataframe to HIVE table with "SaveMode.Overwrite" option.
> E.g.
> tgtFinal.write.mode(SaveMode.Overwrite).saveAsTable("tgt_table")
> sqlContext drop the table instead of truncating.
> Thanks & Regards,
> Dhaval






[jira] [Commented] (SPARK-13699) Spark SQL drops the HIVE table in "overwrite" mode while writing into table

2016-03-05 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15181750#comment-15181750
 ] 

Xiao Li commented on SPARK-13699:
-

In RDBMS, I know truncate is much faster than drop-and-then-recreate due to 
logging issues. How about HIVE?

> Spark SQL drops the HIVE table in "overwrite" mode while writing into table
> ---
>
> Key: SPARK-13699
> URL: https://issues.apache.org/jira/browse/SPARK-13699
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Dhaval Modi
>Priority: Blocker
>
> Hi,
> While writing the dataframe to HIVE table with "SaveMode.Overwrite" option.
> E.g.
> tgtFinal.write.mode(SaveMode.Overwrite).saveAsTable("tgt_table")
> sqlContext drop the table instead of truncating.
> Thanks & Regards,
> Dhaval






[jira] [Issue Comment Deleted] (SPARK-13699) Spark SQL drops the HIVE table in "overwrite" mode while writing into table

2016-03-05 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-13699:

Comment: was deleted

(was: In RDBMS, I know truncate is much faster than drop-and-then-recreate due 
to logging issues. How about HIVE?)

> Spark SQL drops the HIVE table in "overwrite" mode while writing into table
> ---
>
> Key: SPARK-13699
> URL: https://issues.apache.org/jira/browse/SPARK-13699
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Dhaval Modi
>Priority: Blocker
>
> Hi,
> While writing the dataframe to HIVE table with "SaveMode.Overwrite" option.
> E.g.
> tgtFinal.write.mode(SaveMode.Overwrite).saveAsTable("tgt_table")
> sqlContext drop the table instead of truncating.
> Thanks & Regards,
> Dhaval






[jira] [Commented] (SPARK-13699) Spark SQL drops the HIVE table in "overwrite" mode while writing into table

2016-03-05 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15181752#comment-15181752
 ] 

Xiao Li commented on SPARK-13699:
-

So far, this is by design. Thus, it is hard to say it is a bug.

However, I think this idea is reasonable, especially for RDBMS users. Let me 
try this. Thanks!

> Spark SQL drops the HIVE table in "overwrite" mode while writing into table
> ---
>
> Key: SPARK-13699
> URL: https://issues.apache.org/jira/browse/SPARK-13699
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Dhaval Modi
>Priority: Blocker
>
> Hi,
> While writing the dataframe to HIVE table with "SaveMode.Overwrite" option.
> E.g.
> tgtFinal.write.mode(SaveMode.Overwrite).saveAsTable("tgt_table")
> sqlContext drop the table instead of truncating.
> Thanks & Regards,
> Dhaval






[jira] [Updated] (SPARK-13699) Spark SQL drops the HIVE table in "overwrite" mode while writing into table

2016-03-05 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-13699:

Priority: Major  (was: Blocker)

> Spark SQL drops the HIVE table in "overwrite" mode while writing into table
> ---
>
> Key: SPARK-13699
> URL: https://issues.apache.org/jira/browse/SPARK-13699
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Dhaval Modi
>
> Hi,
> While writing the dataframe to HIVE table with "SaveMode.Overwrite" option.
> E.g.
> tgtFinal.write.mode(SaveMode.Overwrite).saveAsTable("tgt_table")
> sqlContext drop the table instead of truncating.
> Thanks & Regards,
> Dhaval






[jira] [Updated] (SPARK-13699) Spark SQL drops the HIVE table in "overwrite" mode while writing into table

2016-03-05 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-13699:

Issue Type: Improvement  (was: Bug)

> Spark SQL drops the HIVE table in "overwrite" mode while writing into table
> ---
>
> Key: SPARK-13699
> URL: https://issues.apache.org/jira/browse/SPARK-13699
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Dhaval Modi
>Priority: Blocker
>
> Hi,
> While writing the dataframe to HIVE table with "SaveMode.Overwrite" option.
> E.g.
> tgtFinal.write.mode(SaveMode.Overwrite).saveAsTable("tgt_table")
> sqlContext drop the table instead of truncating.
> Thanks & Regards,
> Dhaval






[jira] [Updated] (SPARK-13699) Spark SQL drops the table in "overwrite" mode while writing into table

2016-03-05 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-13699:

Summary: Spark SQL drops the table in "overwrite" mode while writing into 
table  (was: Spark SQL drops the HIVE table in "overwrite" mode while writing 
into table)

> Spark SQL drops the table in "overwrite" mode while writing into table
> --
>
> Key: SPARK-13699
> URL: https://issues.apache.org/jira/browse/SPARK-13699
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Dhaval Modi
>
> Hi,
> While writing the dataframe to HIVE table with "SaveMode.Overwrite" option.
> E.g.
> tgtFinal.write.mode(SaveMode.Overwrite).saveAsTable("tgt_table")
> sqlContext drop the table instead of truncating.
> Thanks & Regards,
> Dhaval






[jira] [Commented] (SPARK-7505) Update PySpark DataFrame docs: encourage __getitem__, mark as experimental, etc.

2016-03-05 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15181776#comment-15181776
 ] 

Nicholas Chammas commented on SPARK-7505:
-

I believe items 1, 3, and 4 still apply. They're minor documentation issues, 
but I think they should still be addressed.

> Update PySpark DataFrame docs: encourage __getitem__, mark as experimental, 
> etc.
> 
>
> Key: SPARK-7505
> URL: https://issues.apache.org/jira/browse/SPARK-7505
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, PySpark, SQL
>Affects Versions: 1.3.1, 1.4.1, 1.5.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> The PySpark docs for DataFrame need the following fixes and improvements:
> # Per [SPARK-7035], we should encourage the use of {{\_\_getitem\_\_}} over 
> {{\_\_getattr\_\_}} and change all our examples accordingly.
> # *We should say clearly that the API is experimental.* (That is currently 
> not the case for the PySpark docs.)
> # We should provide an example of how to join and select from 2 DataFrames 
> that have identically named columns, because it is not obvious:
>   {code}
> >>> df1 = sqlContext.jsonRDD(sc.parallelize(['{"a": 4, "other": "I know"}']))
> >>> df2 = sqlContext.jsonRDD(sc.parallelize(['{"a": 4, "other": "I dunno"}']))
> >>> df12 = df1.join(df2, df1['a'] == df2['a'])
> >>> df12.select(df1['a'], df2['other']).show()
> a other   
> 
> 4 I dunno  {code}
> # 
> [{{DF.orderBy}}|https://spark.apache.org/docs/1.3.1/api/python/pyspark.sql.html#pyspark.sql.DataFrame.orderBy]
>  and 
> [{{DF.sort}}|https://spark.apache.org/docs/1.3.1/api/python/pyspark.sql.html#pyspark.sql.DataFrame.sort]
>  should be marked as aliases if that's what they are.






[jira] [Updated] (SPARK-13699) Spark SQL drops the table in "overwrite" mode while writing into table

2016-03-05 Thread Dhaval Modi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dhaval Modi updated SPARK-13699:

Description: 
Hi,

While writing the dataframe to HIVE table with "SaveMode.Overwrite" option.

E.g.
tgtFinal.write.mode(SaveMode.Overwrite).saveAsTable("tgt_table")

sqlContext drop the table instead of truncating.

This is causing error while overwriting.

Adding stacktrace & commands to reproduce the issue,

Thanks & Regards,
Dhaval




  was:
Hi,

While writing the dataframe to HIVE table with "SaveMode.Overwrite" option.

E.g.
tgtFinal.write.mode(SaveMode.Overwrite).saveAsTable("tgt_table")

sqlContext drop the table instead of truncating.


Thanks & Regards,
Dhaval





> Spark SQL drops the table in "overwrite" mode while writing into table
> --
>
> Key: SPARK-13699
> URL: https://issues.apache.org/jira/browse/SPARK-13699
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Dhaval Modi
>
> Hi,
> While writing the dataframe to HIVE table with "SaveMode.Overwrite" option.
> E.g.
> tgtFinal.write.mode(SaveMode.Overwrite).saveAsTable("tgt_table")
> sqlContext drop the table instead of truncating.
> This is causing error while overwriting.
> Adding stacktrace & commands to reproduce the issue,
> Thanks & Regards,
> Dhaval






[jira] [Updated] (SPARK-13699) Spark SQL drops the table in "overwrite" mode while writing into table

2016-03-05 Thread Dhaval Modi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dhaval Modi updated SPARK-13699:

Attachment: stackTrace.txt

> Spark SQL drops the table in "overwrite" mode while writing into table
> --
>
> Key: SPARK-13699
> URL: https://issues.apache.org/jira/browse/SPARK-13699
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Dhaval Modi
> Attachments: stackTrace.txt
>
>
> Hi,
> While writing the dataframe to HIVE table with "SaveMode.Overwrite" option.
> E.g.
> tgtFinal.write.mode(SaveMode.Overwrite).saveAsTable("tgt_table")
> sqlContext drop the table instead of truncating.
> This is causing error while overwriting.
> Adding stacktrace & commands to reproduce the issue,
> Thanks & Regards,
> Dhaval






[jira] [Commented] (SPARK-13699) Spark SQL drops the table in "overwrite" mode while writing into table

2016-03-05 Thread Dhaval Modi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15181777#comment-15181777
 ] 

Dhaval Modi commented on SPARK-13699:
-

This should be a bug, as it fails to overwrite the table and throws an error.

> Spark SQL drops the table in "overwrite" mode while writing into table
> --
>
> Key: SPARK-13699
> URL: https://issues.apache.org/jira/browse/SPARK-13699
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Dhaval Modi
> Attachments: stackTrace.txt
>
>
> Hi,
> While writing the dataframe to HIVE table with "SaveMode.Overwrite" option.
> E.g.
> tgtFinal.write.mode(SaveMode.Overwrite).saveAsTable("tgt_table")
> sqlContext drop the table instead of truncating.
> This is causing error while overwriting.
> Adding stacktrace & commands to reproduce the issue,
> Thanks & Regards,
> Dhaval






[jira] [Commented] (SPARK-13699) Spark SQL drops the table in "overwrite" mode while writing into table

2016-03-05 Thread Dhaval Modi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15181780#comment-15181780
 ] 

Dhaval Modi commented on SPARK-13699:
-


== Code Snippet ===
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc);
val src=sqlContext.sql("select * from src_table");
val tgt=sqlContext.sql("select * from tgt_table");

var tgtFinal=tgt.filter("currind = 'N'"); //Add to final table
val tgtActive=tgt.filter("currind = 'Y'");



#src.select("col1").except(src.select("col1").as('a).join(tgtActive.select("col1").as('b),"col1"))


val newTgt1 = tgtActive.as('a).join(src.as('b),$"a.col1" === $"b.col1")

#val newTgt2 = tgtActive.except(newTgt1.select("a.*"));
tgtFinal = tgtFinal.unionAll(tgtActive.except(newTgt1.select("a.*")));

var srcInsert = src.except(newTgt1.select("b.*"))

import org.apache.spark.sql._

val inBatchID = udf((t:String) => "13" )
val inCurrInd = udf((t:String) => "Y" )
val NCurrInd = udf((t:String) => "N" )
val endDate = udf((t:String) => "-12-31 23:59:59")

tgtFinal = tgtFinal.unionAll(newTgt1.select("a.*").withColumn("currInd", 
NCurrInd(col("col1"))).withColumn("endDate", 
current_timestamp()).withColumn("updateDate", current_timestamp()))


srcInsert = src.withColumn("batchId", 
inBatchID(col("col1"))).withColumn("currInd", 
inCurrInd(col("col1"))).withColumn("startDate", 
current_timestamp()).withColumn("endDate", 
date_format(endDate(col("col1")),"-MM-dd 
HH:mm:ss")).withColumn("updateDate", current_timestamp())

tgtFinal = tgtFinal.unionAll(srcInsert)

tgtFinal.write().mode(SaveMode.Append).saveAsTable(tgt_table)

=== Code Snippet =

> Spark SQL drops the table in "overwrite" mode while writing into table
> --
>
> Key: SPARK-13699
> URL: https://issues.apache.org/jira/browse/SPARK-13699
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Dhaval Modi
> Attachments: stackTrace.txt
>
>
> Hi,
> While writing the dataframe to HIVE table with "SaveMode.Overwrite" option.
> E.g.
> tgtFinal.write.mode(SaveMode.Overwrite).saveAsTable("tgt_table")
> sqlContext drop the table instead of truncating.
> This is causing error while overwriting.
> Adding stacktrace & commands to reproduce the issue,
> Thanks & Regards,
> Dhaval






[jira] [Comment Edited] (SPARK-13699) Spark SQL drops the table in "overwrite" mode while writing into table

2016-03-05 Thread Dhaval Modi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15181780#comment-15181780
 ] 

Dhaval Modi edited comment on SPARK-13699 at 3/5/16 5:29 PM:
-

== Code Snippet ===
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc);
val src=sqlContext.sql("select * from src_table");
val tgt=sqlContext.sql("select * from tgt_table");

var tgtFinal=tgt.filter("currind = 'N'"); //Add to final table
val tgtActive=tgt.filter("currind = 'Y'");



#src.select("col1").except(src.select("col1").as('a).join(tgtActive.select("col1").as('b),"col1"))


val newTgt1 = tgtActive.as('a).join(src.as('b),$"a.col1" === $"b.col1")

#val newTgt2 = tgtActive.except(newTgt1.select("a.*"));
tgtFinal = tgtFinal.unionAll(tgtActive.except(newTgt1.select("a.*")));

var srcInsert = src.except(newTgt1.select("b.*"))

import org.apache.spark.sql._

val inBatchID = udf((t:String) => "13" )
val inCurrInd = udf((t:String) => "Y" )
val NCurrInd = udf((t:String) => "N" )
val endDate = udf((t:String) => "-12-31 23:59:59")

tgtFinal = tgtFinal.unionAll(newTgt1.select("a.*").withColumn("currInd", 
NCurrInd(col("col1"))).withColumn("endDate", 
current_timestamp()).withColumn("updateDate", current_timestamp()))


srcInsert = src.withColumn("batchId", 
inBatchID(col("col1"))).withColumn("currInd", 
inCurrInd(col("col1"))).withColumn("startDate", 
current_timestamp()).withColumn("endDate", 
date_format(endDate(col("col1")),"-MM-dd 
HH:mm:ss")).withColumn("updateDate", current_timestamp())

tgtFinal = tgtFinal.unionAll(srcInsert)

tgtFinal.write().mode(SaveMode.Overwrite).saveAsTable(tgt_table)

=== Code Snippet =


was (Author: mysti):

== Code Snippet ===
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc);
val src=sqlContext.sql("select * from src_table");
val tgt=sqlContext.sql("select * from tgt_table");

var tgtFinal=tgt.filter("currind = 'N'"); //Add to final table
val tgtActive=tgt.filter("currind = 'Y'");



#src.select("col1").except(src.select("col1").as('a).join(tgtActive.select("col1").as('b),"col1"))


val newTgt1 = tgtActive.as('a).join(src.as('b),$"a.col1" === $"b.col1")

#val newTgt2 = tgtActive.except(newTgt1.select("a.*"));
tgtFinal = tgtFinal.unionAll(tgtActive.except(newTgt1.select("a.*")));

var srcInsert = src.except(newTgt1.select("b.*"))

import org.apache.spark.sql._

val inBatchID = udf((t:String) => "13" )
val inCurrInd = udf((t:String) => "Y" )
val NCurrInd = udf((t:String) => "N" )
val endDate = udf((t:String) => "-12-31 23:59:59")

tgtFinal = tgtFinal.unionAll(newTgt1.select("a.*").withColumn("currInd", 
NCurrInd(col("col1"))).withColumn("endDate", 
current_timestamp()).withColumn("updateDate", current_timestamp()))


srcInsert = src.withColumn("batchId", 
inBatchID(col("col1"))).withColumn("currInd", 
inCurrInd(col("col1"))).withColumn("startDate", 
current_timestamp()).withColumn("endDate", 
date_format(endDate(col("col1")),"-MM-dd 
HH:mm:ss")).withColumn("updateDate", current_timestamp())

tgtFinal = tgtFinal.unionAll(srcInsert)

tgtFinal.write().mode(SaveMode.Append).saveAsTable(tgt_table)

=== Code Snippet =

> Spark SQL drops the table in "overwrite" mode while writing into table
> --
>
> Key: SPARK-13699
> URL: https://issues.apache.org/jira/browse/SPARK-13699
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Dhaval Modi
> Attachments: stackTrace.txt
>
>
> Hi,
> While writing the dataframe to HIVE table with "SaveMode.Overwrite" option.
> E.g.
> tgtFinal.write.mode(SaveMode.Overwrite).saveAsTable("tgt_table")
> sqlContext drop the table instead of truncating.
> This is causing error while overwriting.
> Adding stacktrace & commands to reproduce the issue,
> Thanks & Regards,
> Dhaval






[jira] [Comment Edited] (SPARK-13699) Spark SQL drops the table in "overwrite" mode while writing into table

2016-03-05 Thread Dhaval Modi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15181780#comment-15181780
 ] 

Dhaval Modi edited comment on SPARK-13699 at 3/5/16 5:30 PM:
-

== Code Snippet ===
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc);
val src=sqlContext.sql("select * from src_table");
val tgt=sqlContext.sql("select * from tgt_table");

var tgtFinal=tgt.filter("currind = 'N'"); //Add to final table
val tgtActive=tgt.filter("currind = 'Y'");



#src.select("col1").except(src.select("col1").as('a).join(tgtActive.select("col1").as('b),"col1"))


val newTgt1 = tgtActive.as('a).join(src.as('b),$"a.col1" === $"b.col1")

#val newTgt2 = tgtActive.except(newTgt1.select("a.*"));
tgtFinal = tgtFinal.unionAll(tgtActive.except(newTgt1.select("a.*")));

var srcInsert = src.except(newTgt1.select("b.*"))

import org.apache.spark.sql._

val inBatchID = udf((t:String) => "13" )
val inCurrInd = udf((t:String) => "Y" )
val NCurrInd = udf((t:String) => "N" )
val endDate = udf((t:String) => "-12-31 23:59:59")

tgtFinal = tgtFinal.unionAll(newTgt1.select("a.*").withColumn("currInd", 
NCurrInd(col("col1"))).withColumn("endDate", 
current_timestamp()).withColumn("updateDate", current_timestamp()))


srcInsert = src.withColumn("batchId", 
inBatchID(col("col1"))).withColumn("currInd", 
inCurrInd(col("col1"))).withColumn("startDate", 
current_timestamp()).withColumn("endDate", 
date_format(endDate(col("col1")),"-MM-dd 
HH:mm:ss")).withColumn("updateDate", current_timestamp())

tgtFinal = tgtFinal.unionAll(srcInsert)

tgtFinal.write.mode(SaveMode.Overwrite).saveAsTable(tgt_table)

=== Code Snippet =


was (Author: mysti):
== Code Snippet ===
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc);
val src=sqlContext.sql("select * from src_table");
val tgt=sqlContext.sql("select * from tgt_table");

var tgtFinal=tgt.filter("currind = 'N'"); //Add to final table
val tgtActive=tgt.filter("currind = 'Y'");



#src.select("col1").except(src.select("col1").as('a).join(tgtActive.select("col1").as('b),"col1"))


val newTgt1 = tgtActive.as('a).join(src.as('b),$"a.col1" === $"b.col1")

#val newTgt2 = tgtActive.except(newTgt1.select("a.*"));
tgtFinal = tgtFinal.unionAll(tgtActive.except(newTgt1.select("a.*")));

var srcInsert = src.except(newTgt1.select("b.*"))

import org.apache.spark.sql._

val inBatchID = udf((t:String) => "13" )
val inCurrInd = udf((t:String) => "Y" )
val NCurrInd = udf((t:String) => "N" )
val endDate = udf((t:String) => "-12-31 23:59:59")

tgtFinal = tgtFinal.unionAll(newTgt1.select("a.*").withColumn("currInd", 
NCurrInd(col("col1"))).withColumn("endDate", 
current_timestamp()).withColumn("updateDate", current_timestamp()))


srcInsert = src.withColumn("batchId", 
inBatchID(col("col1"))).withColumn("currInd", 
inCurrInd(col("col1"))).withColumn("startDate", 
current_timestamp()).withColumn("endDate", 
date_format(endDate(col("col1")),"-MM-dd 
HH:mm:ss")).withColumn("updateDate", current_timestamp())

tgtFinal = tgtFinal.unionAll(srcInsert)

tgtFinal.write().mode(SaveMode.Overwrite).saveAsTable(tgt_table)

=== Code Snippet =

> Spark SQL drops the table in "overwrite" mode while writing into table
> --
>
> Key: SPARK-13699
> URL: https://issues.apache.org/jira/browse/SPARK-13699
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Dhaval Modi
> Attachments: stackTrace.txt
>
>
> Hi,
> While writing the dataframe to HIVE table with "SaveMode.Overwrite" option.
> E.g.
> tgtFinal.write.mode(SaveMode.Overwrite).saveAsTable("tgt_table")
> sqlContext drop the table instead of truncating.
> This is causing error while overwriting.
> Adding stacktrace & commands to reproduce the issue,
> Thanks & Regards,
> Dhaval






[jira] [Comment Edited] (SPARK-13699) Spark SQL drops the table in "overwrite" mode while writing into table

2016-03-05 Thread Dhaval Modi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15181780#comment-15181780
 ] 

Dhaval Modi edited comment on SPARK-13699 at 3/5/16 5:30 PM:
-

== Code Snippet ===
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc);
val src=sqlContext.sql("select * from src_table");
val tgt=sqlContext.sql("select * from tgt_table");

var tgtFinal=tgt.filter("currind = 'N'"); //Add to final table
val tgtActive=tgt.filter("currind = 'Y'");



#src.select("col1").except(src.select("col1").as('a).join(tgtActive.select("col1").as('b),"col1"))


val newTgt1 = tgtActive.as('a).join(src.as('b),$"a.col1" === $"b.col1")

#val newTgt2 = tgtActive.except(newTgt1.select("a.*"));
tgtFinal = tgtFinal.unionAll(tgtActive.except(newTgt1.select("a.*")));

var srcInsert = src.except(newTgt1.select("b.*"))

import org.apache.spark.sql._

val inBatchID = udf((t:String) => "13" )
val inCurrInd = udf((t:String) => "Y" )
val NCurrInd = udf((t:String) => "N" )
val endDate = udf((t:String) => "-12-31 23:59:59")

tgtFinal = tgtFinal.unionAll(newTgt1.select("a.*").withColumn("currInd", 
NCurrInd(col("col1"))).withColumn("endDate", 
current_timestamp()).withColumn("updateDate", current_timestamp()))


srcInsert = src.withColumn("batchId", 
inBatchID(col("col1"))).withColumn("currInd", 
inCurrInd(col("col1"))).withColumn("startDate", 
current_timestamp()).withColumn("endDate", 
date_format(endDate(col("col1")),"-MM-dd 
HH:mm:ss")).withColumn("updateDate", current_timestamp())

tgtFinal = tgtFinal.unionAll(srcInsert)

tgtFinal.write.mode(SaveMode.Overwrite).saveAsTable("tgt_table")

=== Code Snippet =


was (Author: mysti):
== Code Snippet ===
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc);
val src=sqlContext.sql("select * from src_table");
val tgt=sqlContext.sql("select * from tgt_table");

var tgtFinal=tgt.filter("currind = 'N'"); //Add to final table
val tgtActive=tgt.filter("currind = 'Y'");



#src.select("col1").except(src.select("col1").as('a).join(tgtActive.select("col1").as('b),"col1"))


val newTgt1 = tgtActive.as('a).join(src.as('b),$"a.col1" === $"b.col1")

#val newTgt2 = tgtActive.except(newTgt1.select("a.*"));
tgtFinal = tgtFinal.unionAll(tgtActive.except(newTgt1.select("a.*")));

var srcInsert = src.except(newTgt1.select("b.*"))

import org.apache.spark.sql._

val inBatchID = udf((t:String) => "13" )
val inCurrInd = udf((t:String) => "Y" )
val NCurrInd = udf((t:String) => "N" )
val endDate = udf((t:String) => "-12-31 23:59:59")

tgtFinal = tgtFinal.unionAll(newTgt1.select("a.*").withColumn("currInd", 
NCurrInd(col("col1"))).withColumn("endDate", 
current_timestamp()).withColumn("updateDate", current_timestamp()))


srcInsert = src.withColumn("batchId", 
inBatchID(col("col1"))).withColumn("currInd", 
inCurrInd(col("col1"))).withColumn("startDate", 
current_timestamp()).withColumn("endDate", 
date_format(endDate(col("col1")),"-MM-dd 
HH:mm:ss")).withColumn("updateDate", current_timestamp())

tgtFinal = tgtFinal.unionAll(srcInsert)

tgtFinal.write.mode(SaveMode.Overwrite).saveAsTable(tgt_table)

=== Code Snippet =

> Spark SQL drops the table in "overwrite" mode while writing into table
> --
>
> Key: SPARK-13699
> URL: https://issues.apache.org/jira/browse/SPARK-13699
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Dhaval Modi
> Attachments: stackTrace.txt
>
>
> Hi,
> While writing the dataframe to HIVE table with "SaveMode.Overwrite" option.
> E.g.
> tgtFinal.write.mode(SaveMode.Overwrite).saveAsTable("tgt_table")
> sqlContext drop the table instead of truncating.
> This is causing error while overwriting.
> Adding stacktrace & commands to reproduce the issue,
> Thanks & Regards,
> Dhaval



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13230) HashMap.merged not working properly with Spark

2016-03-05 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-13230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15181790#comment-15181790
 ] 

Łukasz Gieroń commented on SPARK-13230:
---

A good workaround is to use Kryo serializer. I've checked and the code works 
with Kryo.

I've created a Scala ticket for this issue and a pull request fixing it. With 
any luck, the fix will be included in Scala 2.11.9.
https://issues.scala-lang.org/browse/SI-9687
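
A minimal sketch of the workaround (the app name and master are illustrative): switching the job to Kryo via SparkConf so the merged HashMaps are not round-tripped through Java serialization, which is what appears to trigger the NPE above.

{code}
// Register Kryo as the serializer for the MergeTest job from the description.
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("MergeTest")
  .setMaster("local[*]")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
val sc = new SparkContext(conf)
{code}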

> HashMap.merged not working properly with Spark
> --
>
> Key: SPARK-13230
> URL: https://issues.apache.org/jira/browse/SPARK-13230
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
> Environment: Ubuntu 14.04.3, Scala 2.11.7, Spark 1.6.0
>Reporter: Alin Treznai
>
> Using HashMap.merged with Spark fails with NullPointerException.
> {noformat}
> import org.apache.spark.{SparkConf, SparkContext}
> import scala.collection.immutable.HashMap
> object MergeTest {
>   def mergeFn:(HashMap[String, Long], HashMap[String, Long]) => 
> HashMap[String, Long] = {
> case (m1, m2) => m1.merged(m2){ case (x,y) => (x._1, x._2 + y._2) }
>   }
>   def main(args: Array[String]) = {
> val input = Seq(HashMap("A" -> 1L), HashMap("A" -> 2L, "B" -> 
> 3L),HashMap("A" -> 2L, "C" -> 4L))
> val conf = new SparkConf().setAppName("MergeTest").setMaster("local[*]")
> val sc = new SparkContext(conf)
> val result = sc.parallelize(input).reduce(mergeFn)
> println(s"Result=$result")
> sc.stop()
>   }
> }
> {noformat}
> Error message:
> org.apache.spark.SparkDriverExecutionException: Execution error
> at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1169)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1637)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> at 
> org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1952)
> at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:1025)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
> at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
> at org.apache.spark.rdd.RDD.reduce(RDD.scala:1007)
> at MergeTest$.main(MergeTest.scala:21)
> at MergeTest.main(MergeTest.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> Caused by: java.lang.NullPointerException
> at 
> MergeTest$$anonfun$mergeFn$1$$anonfun$apply$1.apply(MergeTest.scala:12)
> at 
> MergeTest$$anonfun$mergeFn$1$$anonfun$apply$1.apply(MergeTest.scala:12)
> at scala.collection.immutable.HashMap$$anon$2.apply(HashMap.scala:148)
> at 
> scala.collection.immutable.HashMap$HashMap1.updated0(HashMap.scala:200)
> at 
> scala.collection.immutable.HashMap$HashTrieMap.updated0(HashMap.scala:322)
> at 
> scala.collection.immutable.HashMap$HashTrieMap.merge0(HashMap.scala:463)
> at scala.collection.immutable.HashMap.merged(HashMap.scala:117)
> at MergeTest$$anonfun$mergeFn$1.apply(MergeTest.scala:12)
> at MergeTest$$anonfun$mergeFn$1.apply(MergeTest.scala:11)
> at 
> org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$15.apply(RDD.scala:1020)
> at 
> org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$15.apply(RDD.scala:1017)
> at 
> org.apache.spark.scheduler.JobWaiter.taskSucceeded(JobWaiter.scala:56)
> at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1165)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1637)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-

[jira] [Commented] (SPARK-13699) Spark SQL drops the table in "overwrite" mode while writing into table

2016-03-05 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15181791#comment-15181791
 ] 

Xiao Li commented on SPARK-13699:
-

Now, I see your points. Will take a look at it. Thanks!

> Spark SQL drops the table in "overwrite" mode while writing into table
> --
>
> Key: SPARK-13699
> URL: https://issues.apache.org/jira/browse/SPARK-13699
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Dhaval Modi
> Attachments: stackTrace.txt
>
>
> Hi,
> While writing the dataframe to a Hive table with the "SaveMode.Overwrite" option, e.g.
> tgtFinal.write.mode(SaveMode.Overwrite).saveAsTable("tgt_table")
> sqlContext drops the table instead of truncating it.
> This causes an error while overwriting.
> Adding the stacktrace & commands to reproduce the issue.
> Thanks & Regards,
> Dhaval



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13699) Spark SQL drops the table in "overwrite" mode while writing into table

2016-03-05 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15181804#comment-15181804
 ] 

Xiao Li commented on SPARK-13699:
-

After some research: we can NOT truncate the table if it was created with the 
EXTERNAL keyword, because all of its data resides outside of the Hive metastore. 

[~yhuai] Is that the reason why we chose to drop and then recreate the Hive table 
instead of truncating it when the mode is SaveMode.Overwrite?
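
To make that concrete, a hedged illustration (hypothetical table name and location): Hive refuses to truncate EXTERNAL tables because their data lives outside the warehouse, which is why an overwrite has to drop and recreate instead.

{code}
// Illustration only, not Spark code: truncating an EXTERNAL table is rejected by Hive.
sqlContext.sql(
  "CREATE EXTERNAL TABLE tgt_table_ext (col1 STRING) LOCATION '/tmp/tgt_table_ext'")
sqlContext.sql("TRUNCATE TABLE tgt_table_ext")  // fails: external tables cannot be truncated
{code}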

> Spark SQL drops the table in "overwrite" mode while writing into table
> --
>
> Key: SPARK-13699
> URL: https://issues.apache.org/jira/browse/SPARK-13699
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Dhaval Modi
> Attachments: stackTrace.txt
>
>
> Hi,
> While writing the dataframe to a Hive table with the "SaveMode.Overwrite" option, e.g.
> tgtFinal.write.mode(SaveMode.Overwrite).saveAsTable("tgt_table")
> sqlContext drops the table instead of truncating it.
> This causes an error while overwriting.
> Adding the stacktrace & commands to reproduce the issue.
> Thanks & Regards,
> Dhaval



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13700) Rdd.mapAsync(): Easily mix Spark and asynchroneous transformation

2016-03-05 Thread Paulo Costa (JIRA)
Paulo Costa created SPARK-13700:
---

 Summary: Rdd.mapAsync(): Easily mix Spark and asynchroneous 
transformation
 Key: SPARK-13700
 URL: https://issues.apache.org/jira/browse/SPARK-13700
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.6.0
Reporter: Paulo Costa
Priority: Minor


Spark is great for synchronous operations.

But sometimes I need to call a web database/web server/etc from my transform, 
and the Spark pipeline stalls waiting for it.

Avoiding that would be great!

I suggest we add a new method RDD.mapAsync(), which can execute these 
operations concurrently, avoiding the bottleneck.

I've written a quick'n'dirty implementation of what I have in mind: 
https://gist.github.com/paulo-raca/d121cf27905cfb1fafc3

What do you think?

If you agree with this feature, I can work on a pull request.
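
To make the idea concrete, here is a rough sketch of one possible shape for it (this is not the gist above; the name mapAsync and the per-batch strategy are just illustrative):

{code}
// Sketch: fire `batchSize` futures at a time per partition and wait for each
// batch, so slow external calls overlap instead of running one by one.
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

def mapAsync[A, B: ClassTag](rdd: RDD[A], batchSize: Int)(f: A => Future[B]): RDD[B] =
  rdd.mapPartitions { iter =>
    iter.grouped(batchSize).flatMap { batch =>
      Await.result(Future.sequence(batch.map(f)), Duration.Inf)
    }
  }

// Usage with a hypothetical async lookup:
// val enriched = mapAsync(urlsRdd, batchSize = 32)(url => Future(fetchFromService(url)))
{code}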



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13700) Rdd.mapAsync(): Easily mix Spark and asynchroneous transformation

2016-03-05 Thread Paulo Costa (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paulo Costa updated SPARK-13700:

Description: 
Spark is great for synchronous operations.

But sometimes I need to call a database/web server/etc from my transform, and 
the Spark pipeline stalls waiting for it.

Avoiding that would be great!

I suggest we add a new method RDD.mapAsync(), which can execute these 
operations concurrently, avoiding the bottleneck.

I've written a quick'n'dirty implementation of what I have in mind: 
https://gist.github.com/paulo-raca/d121cf27905cfb1fafc3

What do you think?

If you agree with this feature, I can work on a pull request.

  was:
Spark is great for synchronous operations.

But sometimes I need to call a web database/web server/etc from my transform, 
and the Spark pipeline stalls waiting for it.

Avoiding that would be great!

I suggest we add a new method RDD.mapAsync(), which can execute these 
operations concurrently, avoiding the bottleneck.

I've written a quick'n'dirty implementation of what I have in mind: 
https://gist.github.com/paulo-raca/d121cf27905cfb1fafc3

What do you think?

If you agree with this feature, I can work on a pull request.


> Rdd.mapAsync(): Easily mix Spark and asynchroneous transformation
> -
>
> Key: SPARK-13700
> URL: https://issues.apache.org/jira/browse/SPARK-13700
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Paulo Costa
>Priority: Minor
>  Labels: async, features, rdd, transform
>
> Spark is great for synchronous operations.
> But sometimes I need to call a database/web server/etc from my transform, and 
> the Spark pipeline stalls waiting for it.
> Avoiding that would be great!
> I suggest we add a new method RDD.mapAsync(), which can execute these 
> operations concurrently, avoiding the bottleneck.
> I've written a quick'n'dirty implementation of what I have in mind: 
> https://gist.github.com/paulo-raca/d121cf27905cfb1fafc3
> What do you think?
> If you agree with this feature, I can work on a pull request.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13641) getModelFeatures of ml.api.r.SparkRWrapper cannot (always) reveal the original column names

2016-03-05 Thread Gayathri Murali (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15181819#comment-15181819
 ] 

Gayathri Murali commented on SPARK-13641:
-

[~xusen] Can you list the steps to reproduce the bug? 

> getModelFeatures of ml.api.r.SparkRWrapper cannot (always) reveal the 
> original column names
> ---
>
> Key: SPARK-13641
> URL: https://issues.apache.org/jira/browse/SPARK-13641
> Project: Spark
>  Issue Type: Bug
>  Components: ML, SparkR
>Reporter: Xusen Yin
>Priority: Minor
>
> getModelFeatures of ml.api.r.SparkRWrapper cannot (always) reveal the 
> original column names. Let's take the HouseVotes84 data set as an example:
> {code}
> case m: XXXModel =>
>   val attrs = AttributeGroup.fromStructField(
> m.summary.predictions.schema(m.summary.featuresCol))
>   attrs.attributes.get.map(_.name.get)
> {code}
> The code above gets the features' names from the features column. Usually, the 
> features column is generated by RFormula. The latter has a VectorAssembler in 
> it, which leads to output attributes that are not equal to the original ones.
> E.g., we want to learn the HouseVotes84 features' names "V1, V2, ..., V16". 
> But with RFormula, we can only get "V1_n, V2_y, ..., V16_y" because [the 
> transform function of 
> VectorAssembler|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala#L75]
>  adds salts to the column names.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13700) Rdd.mapAsync(): Easily mix Spark and asynchroneous transformation

2016-03-05 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-13700:
---
Affects Version/s: (was: 1.6.0)

> Rdd.mapAsync(): Easily mix Spark and asynchroneous transformation
> -
>
> Key: SPARK-13700
> URL: https://issues.apache.org/jira/browse/SPARK-13700
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Paulo Costa
>Priority: Minor
>  Labels: async, features, rdd, transform
>
> Spark is great for synchronous operations.
> But sometimes I need to call a database/web server/etc from my transform, and 
> the Spark pipeline stalls waiting for it.
> Avoiding that would be great!
> I suggest we add a new method RDD.mapAsync(), which can execute these 
> operations concurrently, avoiding the bottleneck.
> I've written a quick'n'dirty implementation of what I have in mind: 
> https://gist.github.com/paulo-raca/d121cf27905cfb1fafc3
> What do you think?
> If you agree with this feature, I can work on a pull request.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13700) Rdd.mapAsync(): Easily mix Spark and asynchroneous transformation

2016-03-05 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-13700:
---
Target Version/s:   (was: 1.6.0)

> Rdd.mapAsync(): Easily mix Spark and asynchroneous transformation
> -
>
> Key: SPARK-13700
> URL: https://issues.apache.org/jira/browse/SPARK-13700
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Paulo Costa
>Priority: Minor
>  Labels: async, features, rdd, transform
>
> Spark is great for synchronous operations.
> But sometimes I need to call a database/web server/etc from my transform, and 
> the Spark pipeline stalls waiting for it.
> Avoiding that would be great!
> I suggest we add a new method RDD.mapAsync(), which can execute these 
> operations concurrently, avoiding the bottleneck.
> I've written a quick'n'dirty implementation of what I have in mind: 
> https://gist.github.com/paulo-raca/d121cf27905cfb1fafc3
> What do you think?
> If you agree with this feature, I can work on a pull request.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13352) BlockFetch does not scale well on large block

2016-03-05 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-13352:
---
Component/s: Block Manager

> BlockFetch does not scale well on large block
> -
>
> Key: SPARK-13352
> URL: https://issues.apache.org/jira/browse/SPARK-13352
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, Spark Core
>Reporter: Davies Liu
>
> BlockManager.getRemoteBytes() performs poorly on large blocks:
> {code}
>   test("block manager") {
> val N = 500 << 20
> val bm = sc.env.blockManager
> val blockId = TaskResultBlockId(0)
> val buffer = ByteBuffer.allocate(N)
> buffer.limit(N)
> bm.putBytes(blockId, buffer, StorageLevel.MEMORY_AND_DISK_SER)
> val result = bm.getRemoteBytes(blockId)
> assert(result.isDefined)
> assert(result.get.limit() === (N))
>   }
> {code}
> Here are the runtimes for different block sizes:
> {code}
> 50M    3 seconds
> 100M   7 seconds
> 250M   33 seconds
> 500M   2 min
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13365) should coalesce do anything if coalescing to same number of partitions without shuffle

2016-03-05 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15181827#comment-15181827
 ] 

Josh Rosen commented on SPARK-13365:


If coalesce is called with {{shuffle == true}} then we might actually want to 
run the coalesce because the user's intent might be to produce more 
evenly-balanced partitions. If {{shuffle == false}}, though, then it seems fine 
to skip the coalesce since it would be a no-op. I believe that Spark SQL 
performs a similar optimization.
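
A hedged sketch of that short-circuit (the helper name is made up; this is not the actual RDD.coalesce code):

{code}
// Skip the no-op case only when no shuffle was requested; a shuffled coalesce
// may still be worth running to rebalance partition sizes.
import org.apache.spark.rdd.RDD

def coalesceIfNeeded[T](rdd: RDD[T], numPartitions: Int, shuffle: Boolean = false): RDD[T] =
  if (!shuffle && rdd.partitions.length == numPartitions) rdd
  else rdd.coalesce(numPartitions, shuffle)
{code}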

> should coalesce do anything if coalescing to same number of partitions 
> without shuffle
> --
>
> Key: SPARK-13365
> URL: https://issues.apache.org/jira/browse/SPARK-13365
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Thomas Graves
>
> Currently, if a user does a coalesce to the same number of partitions as 
> already exist, it spends a bunch of time doing work when it seems like it 
> shouldn't do anything.
> For instance, if I have an RDD with 100 partitions and I run coalesce(100), it 
> seems like it should skip any computation since it already has 100 
> partitions. One case where I've seen this is when users do coalesce(1000) 
> without the shuffle, which really turns into a coalesce(100).
> I'm presenting this as a question as I'm not sure if there are use cases I 
> haven't thought of where this would break.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12076) countDistinct behaves inconsistently

2016-03-05 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-12076:
---
Component/s: (was: Spark Core)
 SQL

> countDistinct behaves inconsistently
> 
>
> Key: SPARK-12076
> URL: https://issues.apache.org/jira/browse/SPARK-12076
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Paul Zaczkieiwcz
>Priority: Minor
>
> Assume:
> {code:java}
> val slicePlayed:DataFrame = _
> val joinKeys:DataFrame = _
> {code}
> Also assume that all columns beginning with "cdnt_" are from {{slicePlayed}} 
> and all columns beginning with "join_" are from {{joinKeys}}.  The following 
> queries can return different values for slice_count_distinct:
> {code:java}
> slicePlayed.join(
>   joinKeys,
>   ( 
> $"join_session_id" === $"cdnt_session_id" &&
> $"join_asset_id" === $"cdnt_asset_id" &&
> $"join_euid" === $"cdnt_euid"
>   ),
>   "inner"
> ).groupBy(
>   $"cdnt_session_id".as("slice_played_session_id"),
>   $"cdnt_asset_id".as("slice_played_asset_id"),
>   $"cdnt_euid".as("slice_played_euid")
> ).agg(
>   countDistinct($"cdnt_slice_number").as("slice_count_distinct"),
>   count($"cdnt_slice_number").as("slice_count_total"),
>   min($"cdnt_slice_number").as("min_slice_number"),
>   max($"cdnt_slice_number").as("max_slice_number")
> ).show(false)
> {code}
> {code:java}
> slicePlayed.join(
>   joinKeys,
>   ( 
> $"join_session_id" === $"cdnt_session_id" &&
> $"join_asset_id" === $"cdnt_asset_id" &&
> $"join_euid" === $"cdnt_euid"
>   ),
>   "inner"
> ).groupBy(
>   $"cdnt_session_id".as("slice_played_session_id"),
>   $"cdnt_asset_id".as("slice_played_asset_id"),
>   $"cdnt_euid".as("slice_played_euid")
> ).agg(
>   min($"cdnt_event_time").as("slice_start_time"),
>   min($"cdnt_playing_owner_id").as("slice_played_playing_owner_id"),
>   min($"cdnt_user_ip").as("slice_played_user_ip"),
>   min($"cdnt_user_agent").as("slice_played_user_agent"),
>   min($"cdnt_referer").as("slice_played_referer"),
>   max($"cdnt_event_time").as("slice_end_time"),
>   countDistinct($"cdnt_slice_number").as("slice_count_distinct"),
>   count($"cdnt_slice_number").as("slice_count_total"),
>   min($"cdnt_slice_number").as("min_slice_number"),
>   max($"cdnt_slice_number").as("max_slice_number"),
>   min($"cdnt_is_live").as("is_live")
> ).show(false)
> {code}
> The +only+ difference between the two queries are that I'm adding more 
> columns to the {{agg}} method.
> I can't reproduce by manually creating a dataFrame from 
> {{DataFrame.parallelize}}. The original sources of the dataFrames are parquet 
> files.
> The explain plans for the two queries are slightly different.
> {code}
> == Physical Plan ==
> TungstenAggregate(key=[cdnt_session_id#23,cdnt_asset_id#5,cdnt_euid#13], 
> functions=[(count(cdnt_slice_number#24L),mode=Final,isDistinct=false),(min(cdnt_slice_number#24L),mode=Final,isDistinct=false),(max(cdnt_slice_number#24L),mode=Final,isDistinct=false),(count(cdnt_slice_number#24L),mode=Complete,isDistinct=true)],
>  
> output=[slice_played_session_id#780,slice_played_asset_id#781,slice_played_euid#782,slice_count_distinct#783L,slice_count_total#784L,min_slice_number#785L,max_slice_number#786L])
>  
> TungstenAggregate(key=[cdnt_session_id#23,cdnt_asset_id#5,cdnt_euid#13,cdnt_slice_number#24L],
>  
> functions=[(count(cdnt_slice_number#24L),mode=PartialMerge,isDistinct=false),(min(cdnt_slice_number#24L),mode=PartialMerge,isDistinct=false),(max(cdnt_slice_number#24L),mode=PartialMerge,isDistinct=false)],
>  
> output=[cdnt_session_id#23,cdnt_asset_id#5,cdnt_euid#13,cdnt_slice_number#24L,currentCount#795L,min#797L,max#799L])
>   
> TungstenAggregate(key=[cdnt_session_id#23,cdnt_asset_id#5,cdnt_euid#13,cdnt_slice_number#24L],
>  
> functions=[(count(cdnt_slice_number#24L),mode=Partial,isDistinct=false),(min(cdnt_slice_number#24L),mode=Partial,isDistinct=false),(max(cdnt_slice_number#24L),mode=Partial,isDistinct=false)],
>  
> output=[cdnt_session_id#23,cdnt_asset_id#5,cdnt_euid#13,cdnt_slice_number#24L,currentCount#795L,min#797L,max#799L])
>TungstenProject 
> [cdnt_session_id#23,cdnt_asset_id#5,cdnt_euid#13,cdnt_slice_number#24L]
> SortMergeJoin [cdnt_session_id#23,cdnt_asset_id#5,cdnt_euid#13], 
> [join_session_id#41,join_asset_id#42,join_euid#43]
>  TungstenSort [cdnt_session_id#23 ASC,cdnt_asset_id#5 ASC,cdnt_euid#13 
> ASC], false, 0
>   TungstenExchange 
> hashpartitioning(cdnt_session_id#23,cdnt_asset_id#5,cdnt_euid#13)
>ConvertToUnsafe
> Scan 
> ParquetRelation[hdfs://hadoop-namenode1:8020/user/hive/warehouse/src_cdn_events][cdnt_slice_number#24L,cdnt_euid#13,cdnt_asset_id#5,cdnt_session_id#23]
>  TungstenSort [join_session_id#41 ASC,join_asset_id#42 ASC,join_euid#43 
> ASC], fal

[jira] [Updated] (SPARK-1762) Add functionality to pin RDDs in cache

2016-03-05 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-1762:
--
Component/s: Block Manager

> Add functionality to pin RDDs in cache
> --
>
> Key: SPARK-1762
> URL: https://issues.apache.org/jira/browse/SPARK-1762
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager, Spark Core
>Affects Versions: 1.0.0
>Reporter: Andrew Or
>
> Right now, all RDDs are created equal, and there is no mechanism to mark a 
> certain RDD as more important than the rest. This is a problem if the RDD 
> storage fraction is small, because just caching a few RDDs can evict more 
> important ones.
> A side effect of this feature is that we can now more safely allocate a 
> smaller spark.storage.memoryFraction if we know how large our important RDDs 
> are, without having to worry about them being evicted. This allows us to use 
> more memory for shuffles, for instance, and avoid disk spills.
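
For illustration, a hedged sketch of today's options versus the requested behaviour (the pinned flag below is hypothetical; only the config key and persist call exist):

{code}
// What exists today: one global storage budget plus best-effort caching.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val conf = new SparkConf()
  .setAppName("PinExample")                     // illustrative app name
  .set("spark.storage.memoryFraction", "0.4")   // global knob, not per-RDD
val sc = new SparkContext(conf)

val importantRdd = sc.textFile("/data/important")   // illustrative input path
importantRdd.persist(StorageLevel.MEMORY_ONLY)      // can still be evicted by later caching
// Proposed (hypothetical API): importantRdd.persist(StorageLevel.MEMORY_ONLY, pinned = true)
{code}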



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8790) BlockManager.reregister cause OOM

2016-03-05 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-8790:
--
Component/s: Block Manager

> BlockManager.reregister cause OOM
> -
>
> Key: SPARK-8790
> URL: https://issues.apache.org/jira/browse/SPARK-8790
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, Spark Core
>Reporter: Patrick Liu
> Attachments: driver.log, executor.log, webui-executor.png, 
> webui-slow-task.png
>
>
> We run SparkSQL 1.2.1 on Yarn.
> A SQL query consists of 100 tasks; most of them finish in < 10s, but one lasts 
> for 16m.
> The web UI shows that the executor had been running GC for 15m before the OOM.
> The log shows that the executor first tries to connect to the master to report a 
> broadcast value; however, the network is not available, so the executor loses 
> its heartbeat to the master. 
> Then the master requires the executor to reregister. When the executor reports 
> all of its blocks back to the master, the network is still unstable and the 
> requests sometimes time out.
> Finally, the executor OOMs.
> Please take a look.
> Attached is the detailed log.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3151) DiskStore attempts to map any size BlockId without checking MappedByteBuffer limit

2016-03-05 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-3151:
--
Component/s: Block Manager

> DiskStore attempts to map any size BlockId without checking MappedByteBuffer 
> limit
> --
>
> Key: SPARK-3151
> URL: https://issues.apache.org/jira/browse/SPARK-3151
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, Spark Core
>Affects Versions: 1.0.2
> Environment: IBM 64-bit JVM PPC64
>Reporter: Damon Brown
>Priority: Minor
>
> [DiskStore|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/DiskStore.scala]
>  attempts to memory map the block file in {{def getBytes}}.  If the file is 
> larger than 2GB (Integer.MAX_VALUE) as specified by 
> [FileChannel.map|http://docs.oracle.com/javase/7/docs/api/java/nio/channels/FileChannel.html#map%28java.nio.channels.FileChannel.MapMode,%20long,%20long%29],
>  then the memory map fails.
> {code}
> Some(channel.map(MapMode.READ_ONLY, segment.offset, segment.length)) # line 
> 104
> {code}
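
As an aside, a hedged sketch of the kind of guard the report implies (the helper is made up, not DiskStore's actual code): FileChannel.map can map at most Integer.MAX_VALUE bytes per call, so larger segments have to be split across several mapped buffers or read another way.

{code}
// Map a segment in <2GB chunks so a single FileChannel.map call never exceeds
// the MappedByteBuffer limit.
import java.nio.MappedByteBuffer
import java.nio.channels.FileChannel
import java.nio.channels.FileChannel.MapMode

def mapSegment(channel: FileChannel, offset: Long, length: Long): Seq[MappedByteBuffer] = {
  val chunk = Integer.MAX_VALUE.toLong
  (0L until length by chunk).map { relative =>
    channel.map(MapMode.READ_ONLY, offset + relative, math.min(chunk, length - relative))
  }
}
{code}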



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13641) getModelFeatures of ml.api.r.SparkRWrapper cannot (always) reveal the original column names

2016-03-05 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15181872#comment-15181872
 ] 

Xusen Yin commented on SPARK-13641:
---

You can check out the code from https://github.com/apache/spark/pull/11486.

Run ./bin/sparkR with this [test 
example|https://github.com/yinxusen/spark/blob/SPARK-13449/R/pkg/inst/tests/testthat/test_mllib.R#L145].
With summary(model) you can see that the column names are not the original ones.
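
The renaming can also be seen directly from Scala with a toy frame (a hedged sketch; the data and column names below are made up, not the HouseVotes84 set):

{code}
// RFormula one-hot encodes string features through a VectorAssembler, so the
// ML attributes on "features" come out salted, e.g. "V1_y", not "V1".
import org.apache.spark.ml.attribute.AttributeGroup
import org.apache.spark.ml.feature.RFormula

val df = sqlContext.createDataFrame(Seq(
  ("y", "n", 1.0),
  ("n", "y", 0.0)
)).toDF("V1", "V2", "label")

val output = new RFormula().setFormula("label ~ V1 + V2").fit(df).transform(df)
val attrs = AttributeGroup.fromStructField(output.schema("features"))
attrs.attributes.get.map(_.name.get).foreach(println)  // e.g. V1_y, V2_y instead of V1, V2
{code}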

> getModelFeatures of ml.api.r.SparkRWrapper cannot (always) reveal the 
> original column names
> ---
>
> Key: SPARK-13641
> URL: https://issues.apache.org/jira/browse/SPARK-13641
> Project: Spark
>  Issue Type: Bug
>  Components: ML, SparkR
>Reporter: Xusen Yin
>Priority: Minor
>
> getModelFeatures of ml.api.r.SparkRWrapper cannot (always) reveal the 
> original column names. Let's take the HouseVotes84 data set as an example:
> {code}
> case m: XXXModel =>
>   val attrs = AttributeGroup.fromStructField(
> m.summary.predictions.schema(m.summary.featuresCol))
>   attrs.attributes.get.map(_.name.get)
> {code}
> The code above gets the features' names from the features column. Usually, the 
> features column is generated by RFormula. The latter has a VectorAssembler in 
> it, which leads to output attributes that are not equal to the original ones.
> E.g., we want to learn the HouseVotes84 features' names "V1, V2, ..., V16". 
> But with RFormula, we can only get "V1_n, V2_y, ..., V16_y" because [the 
> transform function of 
> VectorAssembler|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala#L75]
>  adds salts to the column names.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13693) Flaky test: o.a.s.streaming.MapWithStateSuite

2016-03-05 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-13693.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

> Flaky test: o.a.s.streaming.MapWithStateSuite
> -
>
> Key: SPARK-13693
> URL: https://issues.apache.org/jira/browse/SPARK-13693
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Minor
> Fix For: 2.0.0
>
>
> Fixed the following flaky test:
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7/256/testReport/junit/org.apache.spark.streaming/MapWithStateSuite/_It_is_not_a_test_/
> {code}
> sbt.ForkMain$ForkError: java.io.IOException: Failed to delete: 
> /home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/streaming/checkpoint/spark-e97794a8-b940-4b21-8685-bf1221f9444d
>   at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:934)
>   at 
> org.apache.spark.streaming.MapWithStateSuite$$anonfun$2.apply$mcV$sp(MapWithStateSuite.scala:47)
>   at 
> org.apache.spark.streaming.MapWithStateSuite$$anonfun$2.apply(MapWithStateSuite.scala:45)
>   at 
> org.apache.spark.streaming.MapWithStateSuite$$anonfun$2.apply(MapWithStateSuite.scala:45)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13701) MLlib ALS fails on arm64 (java.lang.UnsatisfiedLinkError: org.jblas.NativeBlas.dgemm))

2016-03-05 Thread Santiago M. Mola (JIRA)
Santiago M. Mola created SPARK-13701:


 Summary: MLlib ALS fails on arm64 (java.lang.UnsatisfiedLinkError: 
org.jblas.NativeBlas.dgemm))
 Key: SPARK-13701
 URL: https://issues.apache.org/jira/browse/SPARK-13701
 Project: Spark
  Issue Type: Bug
  Components: MLlib
 Environment: Ubuntu 14.04 on aarch64
Reporter: Santiago M. Mola
Priority: Minor


jblas fails on arm64.

{code}
ALSSuite:
Exception encountered when attempting to run a suite with class name: 
org.apache.spark.mllib.recommendation.ALSSuite *** ABORTED *** (112 
milliseconds)
  java.lang.UnsatisfiedLinkError: 
org.jblas.NativeBlas.dgemm(CCIIID[DII[DIID[DII)V
  at org.jblas.NativeBlas.dgemm(Native Method)
  at org.jblas.SimpleBlas.gemm(SimpleBlas.java:247)
  at org.jblas.DoubleMatrix.mmuli(DoubleMatrix.java:1781)
  at org.jblas.DoubleMatrix.mmul(DoubleMatrix.java:3138)
  at 
org.apache.spark.mllib.recommendation.ALSSuite$.generateRatings(ALSSuite.scala:74)
{code}




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13702) Use diamond operator for generic instance creation in Java code

2016-03-05 Thread Dongjoon Hyun (JIRA)
Dongjoon Hyun created SPARK-13702:
-

 Summary: Use diamond operator for generic instance creation in 
Java code
 Key: SPARK-13702
 URL: https://issues.apache.org/jira/browse/SPARK-13702
 Project: Spark
  Issue Type: Improvement
Reporter: Dongjoon Hyun
Priority: Trivial


Java 7 or higher supports the `diamond` operator, which replaces the type arguments 
required to invoke the constructor of a generic class with an empty set of type 
parameters (<>). Currently, Spark Java code mixes both styles. This issue replaces 
the existing code to use the `diamond` operator and adds a Checkstyle rule.

{code}
-List<JavaPairDStream<String, String>> kafkaStreams = new ArrayList<JavaPairDStream<String, String>>(numStreams);
+List<JavaPairDStream<String, String>> kafkaStreams = new ArrayList<>(numStreams);
{code}

{code}
-Set<Tuple2<Integer, Integer>> edges = new HashSet<Tuple2<Integer, Integer>>(numEdges);
+Set<Tuple2<Integer, Integer>> edges = new HashSet<>(numEdges);
{code}

*Reference*
https://docs.oracle.com/javase/8/docs/technotes/guides/language/type-inference-generic-instance-creation.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13702) Use diamond operator for generic instance creation in Java code

2016-03-05 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-13702:
--
Description: 
Java 7 or higher supports the `diamond` operator, which replaces the type arguments 
required to invoke the constructor of a generic class with an empty set of type 
parameters (<>). Currently, Spark Java code mixes both styles. This issue replaces 
the existing code to use the `diamond` operator and adds a Checkstyle rule.

{code}
-List<JavaPairDStream<String, String>> kafkaStreams = new ArrayList<JavaPairDStream<String, String>>(numStreams);
+List<JavaPairDStream<String, String>> kafkaStreams = new ArrayList<>(numStreams);
{code}

{code}
-Set<Tuple2<Integer, Integer>> edges = new HashSet<Tuple2<Integer, Integer>>(numEdges);
+Set<Tuple2<Integer, Integer>> edges = new HashSet<>(numEdges);
{code}

*Reference*
https://docs.oracle.com/javase/8/docs/technotes/guides/language/type-inference-generic-instance-creation.html

  was:
Java 7 or higher supports the `diamond` operator, which replaces the type arguments 
required to invoke the constructor of a generic class with an empty set of type 
parameters (<>). Currently, Spark Java code mixes both styles. This issue replaces 
the existing code to use the `diamond` operator and adds a Checkstyle rule.

{code}
-List<JavaPairDStream<String, String>> kafkaStreams = new ArrayList<JavaPairDStream<String, String>>(numStreams);
+List<JavaPairDStream<String, String>> kafkaStreams = new ArrayList<>(numStreams);
{code}

{code}
-Set<Tuple2<Integer, Integer>> edges = new HashSet<Tuple2<Integer, Integer>>(numEdges);
+Set<Tuple2<Integer, Integer>> edges = new HashSet<>(numEdges);
{code}

*Reference*
https://docs.oracle.com/javase/8/docs/technotes/guides/language/type-inference-generic-instance-creation.html


> Use diamond operator for generic instance creation in Java code
> ---
>
> Key: SPARK-13702
> URL: https://issues.apache.org/jira/browse/SPARK-13702
> Project: Spark
>  Issue Type: Improvement
>Reporter: Dongjoon Hyun
>Priority: Trivial
>
> Java 7 or higher supports the `diamond` operator, which replaces the type 
> arguments required to invoke the constructor of a generic class with an empty 
> set of type parameters (<>). Currently, Spark Java code mixes both styles. This 
> issue replaces the existing code to use the `diamond` operator and adds a 
> Checkstyle rule.
> {code}
> -List<JavaPairDStream<String, String>> kafkaStreams = new ArrayList<JavaPairDStream<String, String>>(numStreams);
> +List<JavaPairDStream<String, String>> kafkaStreams = new ArrayList<>(numStreams);
> {code}
> {code}
> -Set<Tuple2<Integer, Integer>> edges = new HashSet<Tuple2<Integer, Integer>>(numEdges);
> +Set<Tuple2<Integer, Integer>> edges = new HashSet<>(numEdges);
> {code}
> *Reference*
> https://docs.oracle.com/javase/8/docs/technotes/guides/language/type-inference-generic-instance-creation.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13702) Use diamond operator for generic instance creation in Java code

2016-03-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13702:


Assignee: (was: Apache Spark)

> Use diamond operator for generic instance creation in Java code
> ---
>
> Key: SPARK-13702
> URL: https://issues.apache.org/jira/browse/SPARK-13702
> Project: Spark
>  Issue Type: Improvement
>Reporter: Dongjoon Hyun
>Priority: Trivial
>
> Java 7 or higher supports the `diamond` operator, which replaces the type 
> arguments required to invoke the constructor of a generic class with an empty 
> set of type parameters (<>). Currently, Spark Java code mixes both styles. This 
> issue replaces the existing code to use the `diamond` operator and adds a 
> Checkstyle rule.
> {code}
> -List<JavaPairDStream<String, String>> kafkaStreams = new ArrayList<JavaPairDStream<String, String>>(numStreams);
> +List<JavaPairDStream<String, String>> kafkaStreams = new ArrayList<>(numStreams);
> {code}
> {code}
> -Set<Tuple2<Integer, Integer>> edges = new HashSet<Tuple2<Integer, Integer>>(numEdges);
> +Set<Tuple2<Integer, Integer>> edges = new HashSet<>(numEdges);
> {code}
> *Reference*
> https://docs.oracle.com/javase/8/docs/technotes/guides/language/type-inference-generic-instance-creation.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13702) Use diamond operator for generic instance creation in Java code

2016-03-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15181913#comment-15181913
 ] 

Apache Spark commented on SPARK-13702:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/11541

> Use diamond operator for generic instance creation in Java code
> ---
>
> Key: SPARK-13702
> URL: https://issues.apache.org/jira/browse/SPARK-13702
> Project: Spark
>  Issue Type: Improvement
>Reporter: Dongjoon Hyun
>Priority: Trivial
>
> Java 7 or higher supports the `diamond` operator, which replaces the type 
> arguments required to invoke the constructor of a generic class with an empty 
> set of type parameters (<>). Currently, Spark Java code mixes both styles. This 
> issue replaces the existing code to use the `diamond` operator and adds a 
> Checkstyle rule.
> {code}
> -List<JavaPairDStream<String, String>> kafkaStreams = new ArrayList<JavaPairDStream<String, String>>(numStreams);
> +List<JavaPairDStream<String, String>> kafkaStreams = new ArrayList<>(numStreams);
> {code}
> {code}
> -Set<Tuple2<Integer, Integer>> edges = new HashSet<Tuple2<Integer, Integer>>(numEdges);
> +Set<Tuple2<Integer, Integer>> edges = new HashSet<>(numEdges);
> {code}
> *Reference*
> https://docs.oracle.com/javase/8/docs/technotes/guides/language/type-inference-generic-instance-creation.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13702) Use diamond operator for generic instance creation in Java code

2016-03-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13702:


Assignee: Apache Spark

> Use diamond operator for generic instance creation in Java code
> ---
>
> Key: SPARK-13702
> URL: https://issues.apache.org/jira/browse/SPARK-13702
> Project: Spark
>  Issue Type: Improvement
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Trivial
>
> Java 7 or higher supports the `diamond` operator, which replaces the type 
> arguments required to invoke the constructor of a generic class with an empty 
> set of type parameters (<>). Currently, Spark Java code mixes both styles. This 
> issue replaces the existing code to use the `diamond` operator and adds a 
> Checkstyle rule.
> {code}
> -List<JavaPairDStream<String, String>> kafkaStreams = new ArrayList<JavaPairDStream<String, String>>(numStreams);
> +List<JavaPairDStream<String, String>> kafkaStreams = new ArrayList<>(numStreams);
> {code}
> {code}
> -Set<Tuple2<Integer, Integer>> edges = new HashSet<Tuple2<Integer, Integer>>(numEdges);
> +Set<Tuple2<Integer, Integer>> edges = new HashSet<>(numEdges);
> {code}
> *Reference*
> https://docs.oracle.com/javase/8/docs/technotes/guides/language/type-inference-generic-instance-creation.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13699) Spark SQL drops the table in "overwrite" mode while writing into table

2016-03-05 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15181916#comment-15181916
 ] 

Xiao Li commented on SPARK-13699:
-

[~mysti] Could you show the script you used to create the original tables, 
especially `tgt_table`?

> Spark SQL drops the table in "overwrite" mode while writing into table
> --
>
> Key: SPARK-13699
> URL: https://issues.apache.org/jira/browse/SPARK-13699
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Dhaval Modi
> Attachments: stackTrace.txt
>
>
> Hi,
> While writing the dataframe to a Hive table with the "SaveMode.Overwrite" option, e.g.
> tgtFinal.write.mode(SaveMode.Overwrite).saveAsTable("tgt_table")
> sqlContext drops the table instead of truncating it.
> This causes an error while overwriting.
> Adding the stacktrace & commands to reproduce the issue.
> Thanks & Regards,
> Dhaval



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13701) MLlib ALS fails on arm64 (java.lang.UnsatisfiedLinkError: org.jblas.NativeBlas.dgemm))

2016-03-05 Thread Santiago M. Mola (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15181921#comment-15181921
 ] 

Santiago M. Mola commented on SPARK-13701:
--

This is probably just gfortran not being installed? I'll test as soon as 
possible.

> MLlib ALS fails on arm64 (java.lang.UnsatisfiedLinkError: 
> org.jblas.NativeBlas.dgemm))
> --
>
> Key: SPARK-13701
> URL: https://issues.apache.org/jira/browse/SPARK-13701
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
> Environment: Ubuntu 14.04 on aarch64
>Reporter: Santiago M. Mola
>Priority: Minor
>  Labels: arm64, porting
>
> jblas fails on arm64.
> {code}
> ALSSuite:
> Exception encountered when attempting to run a suite with class name: 
> org.apache.spark.mllib.recommendation.ALSSuite *** ABORTED *** (112 
> milliseconds)
>   java.lang.UnsatisfiedLinkError: 
> org.jblas.NativeBlas.dgemm(CCIIID[DII[DIID[DII)V
>   at org.jblas.NativeBlas.dgemm(Native Method)
>   at org.jblas.SimpleBlas.gemm(SimpleBlas.java:247)
>   at org.jblas.DoubleMatrix.mmuli(DoubleMatrix.java:1781)
>   at org.jblas.DoubleMatrix.mmul(DoubleMatrix.java:3138)
>   at 
> org.apache.spark.mllib.recommendation.ALSSuite$.generateRatings(ALSSuite.scala:74)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13703) Remove obsolete scala-2.10 source files

2016-03-05 Thread Luciano Resende (JIRA)
Luciano Resende created SPARK-13703:
---

 Summary: Remove obsolete scala-2.10 source files
 Key: SPARK-13703
 URL: https://issues.apache.org/jira/browse/SPARK-13703
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 2.0.0
Reporter: Luciano Resende
Priority: Minor






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13703) Remove obsolete scala-2.10 source files

2016-03-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13703:


Assignee: (was: Apache Spark)

> Remove obsolete scala-2.10 source files
> ---
>
> Key: SPARK-13703
> URL: https://issues.apache.org/jira/browse/SPARK-13703
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.0.0
>Reporter: Luciano Resende
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13703) Remove obsolete scala-2.10 source files

2016-03-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15181929#comment-15181929
 ] 

Apache Spark commented on SPARK-13703:
--

User 'lresende' has created a pull request for this issue:
https://github.com/apache/spark/pull/11542

> Remove obsolete scala-2.10 source files
> ---
>
> Key: SPARK-13703
> URL: https://issues.apache.org/jira/browse/SPARK-13703
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.0.0
>Reporter: Luciano Resende
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13703) Remove obsolete scala-2.10 source files

2016-03-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13703:


Assignee: Apache Spark

> Remove obsolete scala-2.10 source files
> ---
>
> Key: SPARK-13703
> URL: https://issues.apache.org/jira/browse/SPARK-13703
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.0.0
>Reporter: Luciano Resende
>Assignee: Apache Spark
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10380) Confusing examples in pyspark SQL docs

2016-03-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15181942#comment-15181942
 ] 

Apache Spark commented on SPARK-10380:
--

User 'GayathriMurali' has created a pull request for this issue:
https://github.com/apache/spark/pull/11543

> Confusing examples in pyspark SQL docs
> --
>
> Key: SPARK-10380
> URL: https://issues.apache.org/jira/browse/SPARK-10380
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL
>Reporter: Michael Armbrust
>Priority: Minor
>  Labels: docs, starter
>
> There’s an error in the astype() documentation, as it uses cast instead of 
> astype. It should probably include a mention that astype is an alias for cast 
> (and vice versa in the cast documentation): 
> https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Column.astype
>  
> The same error occurs with drop_duplicates and dropDuplicates: 
> https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.drop_duplicates
>  
> The issue here is that we are copying the code. According to [~davies], the 
> easiest way is to copy the method and just add new docs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-12718) SQL generation support for window functions

2016-03-05 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15181967#comment-15181967
 ] 

Xiao Li edited comment on SPARK-12718 at 3/6/16 2:59 AM:
-

{code}
select key, value, avg(c_int) over (partition by key), sum(c_float) 
over(partition by value) from t1
{code}
->
{code}
select key, value, avg(c_int), t2._w0 over (partition by key) from (select key, 
value, sum(c_float) over(partition by value) as _w0 from t1 where value < 10) t2
{code}
When window specifications are different, we will split the whole one to 
multiple. However, to do it, we need to have the corresponding optimizer rule 
to combine them back. Let me check if we need to add a rule. 

In addition, I plan to add the predicate pushdown for window into Optimizer 
first. This is still missing now. I also found Hive just added one. Thanks!


was (Author: smilegator):
select key, value, avg(c_int) over (partition by key), sum(c_float) 
over(partition by value) from t1
->
select key, value, avg(c_int), t2._w0 over (partition by key) from (select key, 
value, sum(c_float) over(partition by value) as _w0 from t1 where value < 10) t2

When window specifications are different, we will split the whole one to 
multiple. However, to do it, we need to have the corresponding optimizer rule 
to combine them back. Let me check if we need to add a rule. 

In addition, I plan to add the predicate pushdown for window into Optimizer 
first. This is still missing now. I also found Hive just added one. Thanks!

> SQL generation support for window functions
> ---
>
> Key: SPARK-12718
> URL: https://issues.apache.org/jira/browse/SPARK-12718
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Assignee: Xiao Li
>
> {{HiveWindowFunctionQuerySuite}} and {{HiveWindowFunctionQueryFileSuite}} can 
> be useful for bootstrapping test coverage. Please refer to SPARK-11012 for 
> more details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12718) SQL generation support for window functions

2016-03-05 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15181967#comment-15181967
 ] 

Xiao Li commented on SPARK-12718:
-

select key, value, avg(c_int) over (partition by key), sum(c_float) 
over(partition by value) from t1
->
select key, value, avg(c_int), t2._w0 over (partition by key) from (select key, 
value, sum(c_float) over(partition by value) as _w0 from t1 where value < 10) t2

When window specifications are different, we will split the whole one to 
multiple. However, to do it, we need to have the corresponding optimizer rule 
to combine them back. Let me check if we need to add a rule. 

In addition, I plan to add the predicate pushdown for window into Optimizer 
first. This is still missing now. I also found Hive just added one. Thanks!

> SQL generation support for window functions
> ---
>
> Key: SPARK-12718
> URL: https://issues.apache.org/jira/browse/SPARK-12718
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Assignee: Xiao Li
>
> {{HiveWindowFunctionQuerySuite}} and {{HiveWindowFunctionQueryFileSuite}} can 
> be useful for bootstrapping test coverage. Please refer to SPARK-11012 for 
> more details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12718) SQL generation support for window functions

2016-03-05 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15181973#comment-15181973
 ] 

Xiao Li commented on SPARK-12718:
-

Just confirmed that no optimizer rule needs to be added. 
{code}
SELECT  t.p_mfgr, 
t.p_name, 
t.p_size, 
t.dr, 
rank() OVER (distribute BY p_mfgr sort BY p_name, p_mfgr) AS r 
FROM( 
SELECT  p_mfgr, 
p_name, 
p_size, 
dense_rank() OVER (distribute BY p_mfgr sort BY p_name) 
AS dr 
FROMpart) t
{code}

{code}
== Analyzed Logical Plan ==
p_mfgr: string, p_name: string, p_size: int, dr: int, r: int
Project [p_mfgr#60,p_name#59,p_size#63,dr#28,r#29]
+- Project [p_mfgr#60,p_name#59,p_size#63,dr#28,r#29,r#29]
   +- Window [p_mfgr#60,p_name#59,p_size#63,dr#28], [rank(p_name#59, p_mfgr#60) 
windowspecdefinition(p_mfgr#60, p_name#59 ASC, p_mfgr#60 ASC, ROWS BETWEEN 
UNBOUNDED PRECEDING AND CURRENT ROW) AS r#29], [p_mfgr#60], [p_name#59 
ASC,p_mfgr#60 ASC]
  +- Project [p_mfgr#60,p_name#59,p_size#63,dr#28]
 +- SubqueryAlias t
+- Project [p_mfgr#60,p_name#59,p_size#63,dr#28]
   +- Project [p_mfgr#60,p_name#59,p_size#63,dr#28,dr#28]
  +- Window [p_mfgr#60,p_name#59,p_size#63], 
[denserank(p_name#59) windowspecdefinition(p_mfgr#60, p_name#59 ASC, ROWS 
BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS dr#28], [p_mfgr#60], [p_name#59 
ASC]
 +- Project [p_mfgr#60,p_name#59,p_size#63]
+- MetastoreRelation default, part, None
{code}

{code}
== Optimized Logical Plan ==
Window [p_mfgr#60,p_name#59,p_size#63,dr#28], [rank(p_name#59, p_mfgr#60) 
windowspecdefinition(p_mfgr#60, p_name#59 ASC, p_mfgr#60 ASC, ROWS BETWEEN 
UNBOUNDED PRECEDING AND CURRENT ROW) AS r#29], [p_mfgr#60], [p_name#59 
ASC,p_mfgr#60 ASC]
+- Window [p_mfgr#60,p_name#59,p_size#63], [denserank(p_name#59) 
windowspecdefinition(p_mfgr#60, p_name#59 ASC, ROWS BETWEEN UNBOUNDED PRECEDING 
AND CURRENT ROW) AS dr#28], [p_mfgr#60], [p_name#59 ASC]
   +- Project [p_mfgr#60,p_name#59,p_size#63]
  +- MetastoreRelation default, part, None
{code}


> SQL generation support for window functions
> ---
>
> Key: SPARK-12718
> URL: https://issues.apache.org/jira/browse/SPARK-12718
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Assignee: Xiao Li
>
> {{HiveWindowFunctionQuerySuite}} and {{HiveWindowFunctionQueryFileSuite}} can 
> be useful for bootstrapping test coverage. Please refer to SPARK-11012 for 
> more details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-12718) SQL generation support for window functions

2016-03-05 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15181967#comment-15181967
 ] 

Xiao Li edited comment on SPARK-12718 at 3/6/16 3:28 AM:
-

{code}
select key, value, avg(c_int) over (partition by key), sum(c_float) 
over(partition by value) from t1
{code}
->
{code}
select key, value, avg(c_int), t2._w0 over (partition by key) from (select key, 
value, sum(c_float) over(partition by value) as _w0 from t1) t2
{code}
When window specifications are different, we will split the whole one to 
multiple. However, to do it, we need to have the corresponding optimizer rule 
to combine them back. Let me check if we need to add a rule. 

In addition, I plan to add the predicate pushdown for window into Optimizer 
first. This is still missing now. I also found Hive just added one. Thanks!


was (Author: smilegator):
{code}
select key, value, avg(c_int) over (partition by key), sum(c_float) 
over(partition by value) from t1
{code}
->
{code}
select key, value, avg(c_int), t2._w0 over (partition by key) from (select key, 
value, sum(c_float) over(partition by value) as _w0 from t1 where value < 10) t2
{code}
When window specifications are different, we will split the whole one to 
multiple. However, to do it, we need to have the corresponding optimizer rule 
to combine them back. Let me check if we need to add a rule. 

In addition, I plan to add the predicate pushdown for window into Optimizer 
first. This is still missing now. I also found Hive just added one. Thanks!

> SQL generation support for window functions
> ---
>
> Key: SPARK-12718
> URL: https://issues.apache.org/jira/browse/SPARK-12718
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Assignee: Xiao Li
>
> {{HiveWindowFunctionQuerySuite}} and {{HiveWindowFunctionQueryFileSuite}} can 
> be useful for bootstrapping test coverage. Please refer to SPARK-11012 for 
> more details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-12718) SQL generation support for window functions

2016-03-05 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15181967#comment-15181967
 ] 

Xiao Li edited comment on SPARK-12718 at 3/6/16 3:33 AM:
-

{code}
select key, value, avg(c_int) over (partition by key), sum(c_float) 
over(partition by value) from t1
{code}
->
{code}
select key, value, avg(c_int), t2._w0 over (partition by key) from (select key, 
value, sum(c_float) over(partition by value) as _w0 from t1) t2
{code}
When the window specifications are different, we will split the single Window 
into multiple ones. However, to do that, we need a corresponding optimizer rule 
to combine them back. Let me check if we need to add such a rule. 

In addition, I plan to add the predicate pushdown for Window into the Optimizer. 
This is still missing now. I also found that Hive just added one. Thanks!
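
As a rough sketch of why that pushdown is safe when the predicate only 
references the partitioning column (same assumed table t1 as above): filtering 
before or after the window produces the same rows, because such a filter never 
removes part of a surviving partition.
{code}
// Sketch only; assumes sqlContext and a table t1 with columns key, value, c_float.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, sum}

val w = Window.partitionBy("value")

// Filter applied above the window ...
val filterAboveWindow = sqlContext.table("t1")
  .withColumn("_w0", sum(col("c_float")).over(w))
  .filter(col("value") < 10)

// ... is equivalent to filtering first, which is what the pushdown rule would do.
val filterBelowWindow = sqlContext.table("t1")
  .filter(col("value") < 10)
  .withColumn("_w0", sum(col("c_float")).over(w))
{code}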


was (Author: smilegator):
{code}
select key, value, avg(c_int) over (partition by key), sum(c_float) 
over(partition by value) from t1
{code}
->
{code}
select key, value, avg(c_int), t2._w0 over (partition by key) from (select key, 
value, sum(c_float) over(partition by value) as _w0 from t1) t2
{code}
When window specifications are different, we will split the whole one to 
multiple. However, to do it, we need to have the corresponding optimizer rule 
to combine them back. Let me check if we need to add a rule. 

In addition, I plan to add the predicate pushdown for window into Optimizer 
first. This is still missing now. I also found Hive just added one. Thanks!

> SQL generation support for window functions
> ---
>
> Key: SPARK-12718
> URL: https://issues.apache.org/jira/browse/SPARK-12718
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Assignee: Xiao Li
>
> {{HiveWindowFunctionQuerySuite}} and {{HiveWindowFunctionQueryFileSuite}} can 
> be useful for bootstrapping test coverage. Please refer to SPARK-11012 for 
> more details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13396) Stop using our internal deprecated .metrics on ExceptionFailure instead use accumUpdates

2016-03-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13396:


Assignee: Apache Spark

> Stop using our internal deprecated .metrics on ExceptionFailure instead use 
> accumUpdates
> 
>
> Key: SPARK-13396
> URL: https://issues.apache.org/jira/browse/SPARK-13396
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: holdenk
>Assignee: Apache Spark
>Priority: Minor
>
> src/main/scala/org/apache/spark/ui/jobs/JobProgressListener.scala:385: value 
> metrics in class ExceptionFailure is deprecated: use accumUpdates instead



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13396) Stop using our internal deprecated .metrics on ExceptionFailure instead use accumUpdates

2016-03-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15182029#comment-15182029
 ] 

Apache Spark commented on SPARK-13396:
--

User 'GayathriMurali' has created a pull request for this issue:
https://github.com/apache/spark/pull/11544

> Stop using our internal deprecated .metrics on ExceptionFailure instead use 
> accumUpdates
> 
>
> Key: SPARK-13396
> URL: https://issues.apache.org/jira/browse/SPARK-13396
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: holdenk
>Priority: Minor
>
> src/main/scala/org/apache/spark/ui/jobs/JobProgressListener.scala:385: value 
> metrics in class ExceptionFailure is deprecated: use accumUpdates instead



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13396) Stop using our internal deprecated .metrics on ExceptionFailure instead use accumUpdates

2016-03-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13396:


Assignee: (was: Apache Spark)

> Stop using our internal deprecated .metrics on ExceptionFailure instead use 
> accumUpdates
> 
>
> Key: SPARK-13396
> URL: https://issues.apache.org/jira/browse/SPARK-13396
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: holdenk
>Priority: Minor
>
> src/main/scala/org/apache/spark/ui/jobs/JobProgressListener.scala:385: value 
> metrics in class ExceptionFailure is deprecated: use accumUpdates instead



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13699) Spark SQL drops the table in "overwrite" mode while writing into table

2016-03-05 Thread Dhaval Modi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15182031#comment-15182031
 ] 

Dhaval Modi commented on SPARK-13699:
-

TGT_TABLE DDL:
{code}
CREATE TABLE IF NOT EXISTS tgt_table (col1 string, col2 int, col3 timestamp, col4 decimal(4,1), batchId string, currInd string, startDate timestamp, endDate timestamp, updateDate timestamp) stored as orc;
{code}

SRC_TABLE DDL:
{code}
CREATE TABLE IF NOT EXISTS src_table (col1 int, col2 int, col3 timestamp, col4 decimal(4,1)) stored as orc;
{code}

INSERT statements:
{code}
insert into table src_table values('1',1,'2016-2-3 00:00:00',23.1);
insert into table src_table values('2',1,'2016-2-3 00:00:00',23.1);
insert into table tgt_table values('1',2,'2016-2-3 00:00:00',23.1, '13', 'Y', '2016-2-3 00:00:00', '2016-2-3 00:00:00', '2016-2-3 00:00:00');
insert into table tgt_table values('1',3,'2016-2-3 00:00:00',23.1, '13', 'N', '2016-2-1 00:00:00', '2016-2-1 00:00:00', '2016-2-3 00:00:00');
insert into table tgt_table values('3',3,'2016-2-3 00:00:00',23.1, '13', 'Y', '2016-2-1 00:00:00', '2016-2-1 00:00:00', '2016-2-3 00:00:00');
{code}
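
A minimal Scala sketch of the write that triggers this, plus a possible 
workaround, assuming the tables above exist; tgtFinal below is only a stand-in 
built from src_table so the snippet is self-contained, not the real dataframe 
from the report.
{code}
// Sketch only, against the 1.6 API; assumes sqlContext and the tables above.
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.{col, lit}

// Stand-in DataFrame padded out to tgt_table's nine columns.
val tgtFinal = sqlContext.table("src_table").select(
  col("col1").cast("string"), col("col2"), col("col3"), col("col4"),
  lit("13").as("batchid"), lit("Y").as("currind"),
  col("col3").as("startdate"), col("col3").as("enddate"), col("col3").as("updatedate"))

// Reported behaviour: drops and re-creates tgt_table rather than truncating it.
// tgtFinal.write.mode(SaveMode.Overwrite).saveAsTable("tgt_table")

// Possible workaround: keep the existing table definition and overwrite only
// the data (columns are matched by position against tgt_table).
tgtFinal.write.mode(SaveMode.Overwrite).insertInto("tgt_table")
{code}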


> Spark SQL drops the table in "overwrite" mode while writing into a table
> --
>
> Key: SPARK-13699
> URL: https://issues.apache.org/jira/browse/SPARK-13699
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Dhaval Modi
> Attachments: stackTrace.txt
>
>
> Hi,
> While writing a dataframe to a Hive table with the "SaveMode.Overwrite" option, 
> e.g.
> tgtFinal.write.mode(SaveMode.Overwrite).saveAsTable("tgt_table")
> sqlContext drops the table instead of truncating it.
> This causes an error while overwriting.
> Adding the stack trace & commands to reproduce the issue.
> Thanks & Regards,
> Dhaval



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13704) TaskSchedulerImpl.createTaskSetManager can be expensive, and result in lost executors due to blocked heartbeats

2016-03-05 Thread Zhong Wang (JIRA)
Zhong Wang created SPARK-13704:
--

 Summary: TaskSchedulerImpl.createTaskSetManager can be expensive, 
and result in lost executors due to blocked heartbeats
 Key: SPARK-13704
 URL: https://issues.apache.org/jira/browse/SPARK-13704
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.6.0, 1.5.2, 1.4.1, 1.3.1
Reporter: Zhong Wang


In some cases, TaskSchedulerImpl.createTaskSetManager can be expensive. For 
example, in a Yarn cluster it may call the topology script for rack awareness. 
When submitting a very large job in a very large Yarn cluster, the topology 
script may take significant time to run, and this blocks receiving executors' 
heartbeats, which may result in lost executors.
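
A toy sketch of the blocking pattern described above (plain Scala, no Spark 
classes; all names here are made up and only model the lock contention): an 
expensive external call made while holding the scheduler lock delays any 
heartbeat handling that needs the same lock.
{code}
import scala.sys.process._

// Toy model only: neither the object nor the method names come from Spark.
object BlockedHeartbeatDemo {
  private val schedulerLock = new Object

  // Stand-in for submitTasks -> createTaskSetManager: resolves rack info by
  // shelling out to a slow external script while the scheduler lock is held.
  def submitTasks(hosts: Seq[String]): Unit = schedulerLock.synchronized {
    hosts.foreach { _ => Seq("sleep", "1").! }  // pretend topology-script call
  }

  // Stand-in for heartbeat handling, which needs the same lock.
  def receiveHeartbeat(executorId: String): Unit = schedulerLock.synchronized {
    println(s"heartbeat from $executorId handled at ${System.currentTimeMillis}")
  }

  def main(args: Array[String]): Unit = {
    val submitter = new Thread(new Runnable {
      def run(): Unit = submitTasks((1 to 10).map(i => s"host-$i"))
    })
    submitter.start()
    Thread.sleep(100)
    receiveHeartbeat("exec-1") // observably delayed until submitTasks releases the lock
    submitter.join()
  }
}
{code}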

Stack traces we observed that are related to this issue:
{code}
"dag-scheduler-event-loop" daemon prio=10 tid=0x7f8392875800 nid=0x26e8 
runnable [0x7f83576f4000]
   java.lang.Thread.State: RUNNABLE
at java.io.FileInputStream.readBytes(Native Method)
at java.io.FileInputStream.read(FileInputStream.java:272)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:273)
at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
- locked <0xf551f460> (a 
java.lang.UNIXProcess$ProcessPipeInputStream)
at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:283)
at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:325)
at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:177)
- locked <0xf5529740> (a java.io.InputStreamReader)
at java.io.InputStreamReader.read(InputStreamReader.java:184)
at java.io.BufferedReader.fill(BufferedReader.java:154)
at java.io.BufferedReader.read1(BufferedReader.java:205)
at java.io.BufferedReader.read(BufferedReader.java:279)
- locked <0xf5529740> (a java.io.InputStreamReader)
at 
org.apache.hadoop.util.Shell$ShellCommandExecutor.parseExecResult(Shell.java:728)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:524)
at org.apache.hadoop.util.Shell.run(Shell.java:455)
at 
org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
at 
org.apache.hadoop.net.ScriptBasedMapping$RawScriptBasedMapping.runResolveCommand(ScriptBasedMapping.java:251)
at 
org.apache.hadoop.net.ScriptBasedMapping$RawScriptBasedMapping.resolve(ScriptBasedMapping.java:188)
at 
org.apache.hadoop.net.CachedDNSToSwitchMapping.resolve(CachedDNSToSwitchMapping.java:119)
at 
org.apache.hadoop.yarn.util.RackResolver.coreResolve(RackResolver.java:101)
at 
org.apache.hadoop.yarn.util.RackResolver.resolve(RackResolver.java:81)
at 
org.apache.spark.scheduler.cluster.YarnScheduler.getRackForHost(YarnScheduler.scala:38)
at 
org.apache.spark.scheduler.TaskSetManager$$anonfun$org$apache$spark$scheduler$TaskSetManager$$addPendingTask$1.apply(TaskSetManager.scala:210)
at 
org.apache.spark.scheduler.TaskSetManager$$anonfun$org$apache$spark$scheduler$TaskSetManager$$addPendingTask$1.apply(TaskSetManager.scala:189)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at 
org.apache.spark.scheduler.TaskSetManager.org$apache$spark$scheduler$TaskSetManager$$addPendingTask(TaskSetManager.scala:189)
at 
org.apache.spark.scheduler.TaskSetManager$$anonfun$1.apply$mcVI$sp(TaskSetManager.scala:158)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
at 
org.apache.spark.scheduler.TaskSetManager.<init>(TaskSetManager.scala:157)
at 
org.apache.spark.scheduler.TaskSchedulerImpl.createTaskSetManager(TaskSchedulerImpl.scala:187)
at 
org.apache.spark.scheduler.TaskSchedulerImpl.submitTasks(TaskSchedulerImpl.scala:161)
- locked <0xea3b8a88> (a 
org.apache.spark.scheduler.cluster.YarnScheduler)
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:872)
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:778)
at 
org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:762)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1362)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)

"sparkDriver-akka.actor.default-dispatcher-15" daemon prio=10 
tid=0x7f829c02 nid=0x2737 waiting for monitor entry [0x7f8355ebd000]
   java.lang.Thread.State: BLOCKED (on object monitor)
at 
org.apache.spark.scheduler.TaskSchedule